BookmarkSubscribeRSS Feed
river1
Obsidian | Level 7

Hello

I have a large dataset and I have used a where statement to search for a keyword that is contained in a character variable. The keyword is in less than 5% of the rows. When I find a keyword I'd like to also copy other rows that have the same ID. The ID is not unique and each ID is duplicated in around six rows. 

I've thought of using an IF THEN statement but I've not been able to make anything work that would simply output the other rows that contain the same ID. I'd really appreciate any advice.

 

I am using SAS Enterprise Guide (Ver 8.3.8.206).

 

 

Code used

data want (compress=yes) ;
infile "C:\file_location.txt" dlm="|" dsd firstobs=2 truncover;
length
  ID $20.
  Num_1 4.
  Num_2 5.
  CHAR_1 $5.
  CHAR_2 $32767.
  CHAR_3 $32767.
  CHAR_4 $32767.
  CHAR_5 $32767.
 ;
input
  ID
  Num_1
  Num_2
  Num_1
  CHAR_1
  CHAR_2
  CHAR_3 
  CHAR_4
  CHAR_5
;
run;

DATA output1;
SET want;
   where CHAR_2 contains 'keyword'  ;
run;

Example of the dataset

 

 

 

Example dataset							
EVENT_ID	Num_1	Num_2	Char_1	Char_2	Char_3	Char_4	Char_5
145896555	19	25	A	Text	Text	Text	Text
145896555	19	25	B	Text	Text	Text	Text
145896555	19	25	B	Text	Text	Text	Text
145896555	19	25	B	Text	Text	Text	Text
145896555	19	25	B	Text	Text	Text	Text
145896555	19	25	B	Text	Text	Text	Text
145896555	19	25	A	Text	Text	Text	Text
145896555	19	25	B	Text	Text	Text	Text
145896555	19	25	B	Text	Text	Text	Text
145896555	19	25	B	Text	Text	Text	Text
145896555	19	25	B	Text	Text	Text	Text
145896555	19	25	B	Text	Text	Text	Text
145896555	19	25	B	Text	Text	Text	Text
145896556	19	25	A	Text	Text	Text	Text
145896556	19	25	B	Text	Text	Text	Text
145896556	19	25	B	Text Keyword	Text	Text	Text
145896556	19	25	B	Text	Text	Text	Text
145896556	19	25	B	Text	Text	Text	Text
145896556	19	25	B	Text	Text	Text	Text
145896557	19	25	B	Text	Text	Text	Text
145896557	19	25	B	Text	Text	Text	Text
145896557	19	25	B	Text	Text	Text	Text
145896557	19	25	B	Text	Text	Text	Text
							
							
I'd like the output dataset to read:							
145896556	19	25	A	Text	Text	Text	Text
145896556	19	25	B	Text	Text	Text	Text
145896556	19	25	B	Text Keyword	Text	Text	Text
145896556	19	25	B	Text	Text	Text	Text
145896556	19	25	B	Text	Text	Text	Text
145896556	19	25	B	Text	Text	Text	Text

 

 

 

11 REPLIES 11
yabwon
Onyx | Level 15

How about like that:

1) read the data in loop one

2) check every CHAR_* variable for a keyword (inner loop)

3) if keyword is found store ID in a hash table

4) read data second time (the second loop)

5) check if ID in in hash table, if "yes" do output;

 

data want (compress=yes) ;
infile cards4 dlm="|" dsd truncover;
length
  ID $20.
  Num_1 8
  Num_2 8
  CHAR_1 $5.
  CHAR_2 $32.
  CHAR_3 $32.
  CHAR_4 $32.
  CHAR_5 $32.
 ;
input
  ID
  Num_1
  Num_2
  CHAR_1
  CHAR_2
  CHAR_3 
  CHAR_4
  CHAR_5
;
cards4;
145896555|19|25|A|Text|Text|Text|Text
145896555|19|25|B|Text|Text|Text|Text
145896555|19|25|B|Text|Text|Text|Text
145896555|19|25|B|Text|Text|Text|Text
145896555|19|25|B|Text|Text|Text|Text
145896555|19|25|B|Text|Text|Text|Text
145896555|19|25|A|Text|Text|Text|Text
145896555|19|25|B|Text|Text|Text|Text
145896555|19|25|B|Text|Text|Text|Text
145896555|19|25|B|Text|Text|Text|Text
145896555|19|25|B|Text|Text|Text|Text
145896555|19|25|B|Text|Text|Text|Text
145896555|19|25|B|Text|Text|Text|Text
145896556|19|25|A|Text|Text|Text|Text
145896556|19|25|B|Text|Text|Text|Text
145896556|19|25|B|Text Keyword|Text|Text|Text
145896556|19|25|B|Text|Text|Text|Text
145896556|19|25|B|Text|Text|Text|Text
145896556|19|25|B|Text|Text|Text|Text
145896557|19|25|B|Text|Text|Text|Text
145896557|19|25|B|Text|Text|Text|Text
145896557|19|25|B|Text|Text|Text|Text
145896557|19|25|B|Text|Text|Text|Text
145891111|19|25|B|Text|Text|Text Keyword|Text
;;;;
run;

DATA output1;

declare hash H();
H.defineKey("ID");
H.defineDone();

do until(end1);
  SET want end=end1;
  
  array c[*] CHAR_:;

  do i=lbound(c) to hbound(c); drop i;
    if index(c[i], 'Keyword') then
      do;
        H.replace();
        put id=;
        leave;
      end; 
  end;
end;


do until(end2);
   SET want end=end2;

   if 0=H.check() then output;
end;


stop;
run;
_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation



river1
Obsidian | Level 7

Thanks, that works perfectly however I have one problem. Some of the cells start with a tilde ~. I thought I'd solved the problem but the final output is still missing entries that started with the tilde ie if my keyword was 'elephantine' then the output would not include the IDs containing '~elephantine..'

 

I was only searching for part of the word to scoop up any variations ie I was searching for 'phant' instead of 'elephantine' . 

 

I can just search for '~' on its own and it displays all instances where the tilde ~ is included. There are no error messages. 

 

If you have any suggestions for why this is happening?

 

Many thanks

Patrick
Opal | Level 21

@river1 If you just want to search for a substring in a string then the find() function will do this job for you. Using your data the code could look as simple as below:

data work.want;
  if _n_=1 then 
    do;
      dcl hash h1(dataset:'work.have(where=(find(cats(char_1,char_2,char_3,char_4,char_5),"keyword","i")))');
      h1.defineKey('event_id');
      h1.defineDone();
    end;
  set work.have;
  if h1.check()=0;
run;

proc print data=work.want;
run;

Patrick_0-1700133098550.png

 

yabwon
Onyx | Level 15

With the assumption that 

total length of char_1, char_2, char_3, char_4, and char_5

does not exceeds 32767.

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation



Patrick
Opal | Level 21

Yes, that's the assumption I've made. Else we would need OR conditions ...which potentially would also perform a bit better.

data work.want;
  if _n_=1 then 
    do;
      dcl hash h1(dataset:'work.have(where=(
                                              find(char_1,"keyword","i") or
                                              find(char_2,"keyword","i") or
                                              find(char_3,"keyword","i") or
                                              find(char_4,"keyword","i") or
                                              find(char_5,"keyword","i")
                                            ))');
      h1.defineKey('event_id');
      h1.defineDone();
    end;
  set work.have;
  if h1.check()=0;
run;

proc print data=work.want;
run;

 

river1
Obsidian | Level 7

Thanks that works perfectly! It outputs the rows and includes the tilde characters.

 

I've been trying to add a line to exclude keywords to reduce the dataset. I've tried the following but this is not working, do you have suggestions for an alternative? 

 

data work.new;
  if _n_=1 then 
    do;
      dcl hash h1(dataset:'work.have(where=(find(cats(var_1),"keyword","i") and not find(cats(var_1), "keyword_2", "i")))'); ))');
      h1.defineKey('event_id');
      h1.defineDone();
    end;
  set work.have;
  if h1.check()=0;
run;
Example dataset							
ID	Num_1	Num_2	Char_1	Char_2	Char_3	Char_4	Char_5
145896555	19	25	A	Text	Text	Text	Text
145896555	19	25	B	Text	Text	Text	Text
145896555	19	25	B	Text	Text	Text	Text
145896555	19	25	B	Text	Text	Text	Text
145896555	19	25	B	Keyword_1	Text	Text	Text
145896555	19	25	B	Text	Text	Text	Text
145896556	19	25	A	Text	Text	Text	Text
145896556	19	25	B	Text	Text	Text	Text
145896556	19	25	B	Text	Text	Text	Text
145896556	19	25	B	Text	Text	Text	Text
145896557	19	25	B	Text	Text	Text	Text
145896557	19	25	B	Text	Text	Text	Text
145896557	19	25	B	Text	Text	Text	Text
145896557	19	25	A	Text	Text	Text	Text
145896557	19	25	B	Text	Text	Text	Text
145896558	19	25	B	Keyword_1	Text	Text	Text
145896558	19	25	B	Keyword_2	Text	Text	Text
145896558	19	25	B	Text	Text	Text	Text
145896558	19	25	B	Text	Text	Text	Text
145896559	19	25	B	Text	Text	Text	Text
145896559	19	25	B	Text	Text	Text	Text
145896559	19	25	B	Text	Text	Text	Text
145896559	19	25	B	Text	Text	Text	Text
							
							
I'd like the output dataset to read:							
145896555	19	25	A	Text	Text	Text	Text
145896555	19	25	B	Text	Text	Text	Text
145896555	19	25	B	Text	Text	Text	Text
145896555	19	25	B	Text	Text	Text	Text
145896555	19	25	B	Keyword_1	Text	Text	Text
145896555	19	25	B	Text	Text	Text	Text
Patrick
Opal | Level 21

Please provide sample data via fully working SAS datastep code as done below.

Spoiler
data have;
  infile datalines truncover dlm='|' dsd;
  input ID  Num_1 Num_2 (Char_1  Char_2  Char_3  Char_4  Char_5) (:$10.);
datalines;
145896555|19|25|A|Text|Text|Text|Text
145896555|19|25|B|Text|Text|Text|Text
145896555|19|25|B|Text|Text|Text|Text
145896555|19|25|B|Text|Text|Text|Text
145896555|19|25|B|Keyword_1|Text|Text|Text
145896555|19|25|B|Text|Text|Text|Text
145896556|19|25|A|Text|Text|Text|Text
145896556|19|25|B|Text|Text|Text|Text
145896556|19|25|B|Text|Text|Text|Text
145896556|19|25|B|Text|Text|Text|Text
145896557|19|25|B|Text|Text|Text|Text
145896557|19|25|B|Text|Text|Text|Text
145896557|19|25|B|Text|Text|Text|Text
145896557|19|25|A|Text|Text|Text|Text
145896557|19|25|B|Text|Text|Text|Text
145896558|19|25|B|Keyword_1|Text|Text|Text
145896558|19|25|B|Keyword_2|Text|Text|Text
145896558|19|25|B|Text|Text|Text|Text
145896559|19|25|B|Text|Text|Text|Text
145896559|19|25|B|Text|Text|Text|Text
145896559|19|25|B|Text|Text|Text|Text
145896559|19|25|B|Text|Text|Text|Text
;

Your initial requirement was to select all rows with the same ID if at least one row matches your selection criteria. What gets loaded into the hash are all IDs that match this criteria with a single entry per selected ID.

The extension to this requirement is now to exclude specific rows from the selected ones based on additional criteria.

 

What you can do is further filter the selected rows from your base table with a matching ID in the hash table. Code like below should work.

data work.want;
  if _n_=1 then 
    do;
      dcl hash h1(dataset:'work.have(where=(
                                              find(char_1,"keyword","i") or
                                              find(char_2,"keyword","i") or
                                              find(char_3,"keyword","i") or
                                              find(char_4,"keyword","i") or
                                              find(char_5,"keyword","i")
                                            ))');
      h1.defineKey('ID');
      h1.defineDone();
    end;
  set work.have;
  if  h1.check()=0 and
      find(char_1,"keyword_2","i")=0 and
      find(char_2,"keyword_2","i")=0 and
      find(char_3,"keyword_2","i")=0 and
      find(char_4,"keyword_2","i")=0 and
      find(char_5,"keyword_2","i")=0
      ;
run;

Patrick_0-1700304492530.png

 

 

 

 

river1
Obsidian | Level 7

Thank you. This looks much neater than what I ended up with. I ran the code three times. First to select the keyword of interest then I ran it twice to exclude the keywords I did not want ie

 

data work.WANT;

  if _n_ = 1 then

    do;

      dcl hash h1(dataset: WORK.HAVE(where=(find(cats(VAR_1), "KEYWORD_2", "i") > 0))');

      h1.defineKey(VAR_2);

      h1.defineDone();

    end;

  set WORK.HAVE;

  if h1.check()ne 0;

run;

 

 

yabwon
Onyx | Level 15

Alternatively, if you assume:

1) you know names of char variables in advance

2) sum of lengths of that variables is less than 32767

you can go with SQL:

proc sql;
  create table output2 as
  select *
  from want
  where ID in
    (
      select ID 
      from want
      where index(catx("|",CHAR_1,CHAR_2,CHAR_3,CHAR_4,CHAR_5), 'Keyword')
    )
  ;
quit;

without assumption 2) it would be:

proc sql;
  create table output3 as
  select *
  from want
  where ID in
    (
      select ID 
      from want
      where 
           index(CHAR_1, 'Keyword')
        or index(CHAR_2, 'Keyword')
        or index(CHAR_3, 'Keyword')
        or index(CHAR_4, 'Keyword')
        or index(CHAR_5, 'Keyword')
    )
  ;
quit;

 

 

Bart

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation



river1
Obsidian | Level 7

Thanks, this code was great. But I still had the same problem as above, when there was a tilde ~ character at the start of the cell it did not output

ie '~keyword' would not cause the row to not output.


The tilde ~ characters are present in the want dataset but they are not outputting in the output3 dataset.

I've tried to search for '~keyword' and I've tried to remove the tilde character before running the SQL code but this did not work either.

 

Data want2 ;
set want;
Code=compress(Code,'~');
run;

Thanks for your help!

 

yabwon
Onyx | Level 15
data test;
length x $ 20;
x = "~keyword"; output;
x = "keyword"; output;
x = "~~keyword"; output;
run;

data _null_;
 set test;
 y=index(x, 'keyword');
 put (_all_) (=);
run;

returns:

x=~keyword y=2
x=keyword y=1
x=~~keyword y=3

How does it looks with your data? 

 

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation



sas-innovate-wordmark-2025-midnight.png

Register Today!

Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.


Register now!

SAS Enterprise Guide vs. SAS Studio

What’s the difference between SAS Enterprise Guide and SAS Studio? How are they similar? Just ask SAS’ Danny Modlin.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 11 replies
  • 3299 views
  • 6 likes
  • 3 in conversation