Re: Search for a keyword contained within text and output other rows t...

river1 · Posted 11-15-2023 04:26 AM

Hello

I have a large dataset and I have used a where statement to search for a keyword that is contained in a character variable. The keyword is in less than 5% of the rows. When I find a keyword I'd like to also copy other rows that have the same ID. The ID is not unique and each ID is duplicated in around six rows.

I've thought of using an IF THEN statement but I've not been able to make anything work that would simply output the other rows that contain the same ID. I'd really appreciate any advice.

I am using SAS Enterprise Guide (Ver 8.3.8.206).

Code used

data want (compress=yes) ;
infile "C:\file_location.txt" dlm="|" dsd firstobs=2 truncover;
length
  ID $20.
  Num_1 4.
  Num_2 5.
  CHAR_1 $5.
  CHAR_2 $32767.
  CHAR_3 $32767.
  CHAR_4 $32767.
  CHAR_5 $32767.
 ;
input
  ID
  Num_1
  Num_2
  Num_1
  CHAR_1
  CHAR_2
  CHAR_3 
  CHAR_4
  CHAR_5
;
run;

DATA output1;
SET want;
   where CHAR_2 contains 'keyword'  ;
run;

Example of the dataset

Example dataset							
EVENT_ID	Num_1	Num_2	Char_1	Char_2	Char_3	Char_4	Char_5
145896555	19	25	A	Text	Text	Text	Text
145896555	19	25	B	Text	Text	Text	Text
145896555	19	25	B	Text	Text	Text	Text
145896555	19	25	B	Text	Text	Text	Text
145896555	19	25	B	Text	Text	Text	Text
145896555	19	25	B	Text	Text	Text	Text
145896555	19	25	A	Text	Text	Text	Text
145896555	19	25	B	Text	Text	Text	Text
145896555	19	25	B	Text	Text	Text	Text
145896555	19	25	B	Text	Text	Text	Text
145896555	19	25	B	Text	Text	Text	Text
145896555	19	25	B	Text	Text	Text	Text
145896555	19	25	B	Text	Text	Text	Text
145896556	19	25	A	Text	Text	Text	Text
145896556	19	25	B	Text	Text	Text	Text
145896556	19	25	B	Text Keyword	Text	Text	Text
145896556	19	25	B	Text	Text	Text	Text
145896556	19	25	B	Text	Text	Text	Text
145896556	19	25	B	Text	Text	Text	Text
145896557	19	25	B	Text	Text	Text	Text
145896557	19	25	B	Text	Text	Text	Text
145896557	19	25	B	Text	Text	Text	Text
145896557	19	25	B	Text	Text	Text	Text
							
							
I'd like the output dataset to read:							
145896556	19	25	A	Text	Text	Text	Text
145896556	19	25	B	Text	Text	Text	Text
145896556	19	25	B	Text Keyword	Text	Text	Text
145896556	19	25	B	Text	Text	Text	Text
145896556	19	25	B	Text	Text	Text	Text
145896556	19	25	B	Text	Text	Text	Text

yabwon · Posted 11-15-2023 05:22 AM

How about like that:

1) read the data in loop one

2) check every CHAR_* variable for a keyword (inner loop)

3) if keyword is found store ID in a hash table

4) read data second time (the second loop)

5) check if ID in in hash table, if "yes" do output;

data want (compress=yes) ;
infile cards4 dlm="|" dsd truncover;
length
  ID $20.
  Num_1 8
  Num_2 8
  CHAR_1 $5.
  CHAR_2 $32.
  CHAR_3 $32.
  CHAR_4 $32.
  CHAR_5 $32.
 ;
input
  ID
  Num_1
  Num_2
  CHAR_1
  CHAR_2
  CHAR_3 
  CHAR_4
  CHAR_5
;
cards4;
145896555|19|25|A|Text|Text|Text|Text
145896555|19|25|B|Text|Text|Text|Text
145896555|19|25|B|Text|Text|Text|Text
145896555|19|25|B|Text|Text|Text|Text
145896555|19|25|B|Text|Text|Text|Text
145896555|19|25|B|Text|Text|Text|Text
145896555|19|25|A|Text|Text|Text|Text
145896555|19|25|B|Text|Text|Text|Text
145896555|19|25|B|Text|Text|Text|Text
145896555|19|25|B|Text|Text|Text|Text
145896555|19|25|B|Text|Text|Text|Text
145896555|19|25|B|Text|Text|Text|Text
145896555|19|25|B|Text|Text|Text|Text
145896556|19|25|A|Text|Text|Text|Text
145896556|19|25|B|Text|Text|Text|Text
145896556|19|25|B|Text Keyword|Text|Text|Text
145896556|19|25|B|Text|Text|Text|Text
145896556|19|25|B|Text|Text|Text|Text
145896556|19|25|B|Text|Text|Text|Text
145896557|19|25|B|Text|Text|Text|Text
145896557|19|25|B|Text|Text|Text|Text
145896557|19|25|B|Text|Text|Text|Text
145896557|19|25|B|Text|Text|Text|Text
145891111|19|25|B|Text|Text|Text Keyword|Text
;;;;
run;

DATA output1;

declare hash H();
H.defineKey("ID");
H.defineDone();

do until(end1);
  SET want end=end1;
  
  array c[*] CHAR_:;

  do i=lbound(c) to hbound(c); drop i;
    if index(c[i], 'Keyword') then
      do;
        H.replace();
        put id=;
        leave;
      end; 
  end;
end;


do until(end2);
   SET want end=end2;

   if 0=H.check() then output;
end;


stop;
run;

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation

river1 · Posted 11-16-2023 04:59 AM

Thanks, that works perfectly however I have one problem. Some of the cells start with a tilde ~. I thought I'd solved the problem but the final output is still missing entries that started with the tilde ie if my keyword was 'elephantine' then the output would not include the IDs containing '~elephantine..'

I was only searching for part of the word to scoop up any variations ie I was searching for 'phant' instead of 'elephantine' .

I can just search for '~' on its own and it displays all instances where the tilde ~ is included. There are no error messages.

If you have any suggestions for why this is happening?

Many thanks

Patrick · Posted 11-16-2023 06:11 AM

@river1 If you just want to search for a substring in a string then the find() function will do this job for you. Using your data the code could look as simple as below:

data work.want;
  if _n_=1 then 
    do;
      dcl hash h1(dataset:'work.have(where=(find(cats(char_1,char_2,char_3,char_4,char_5),"keyword","i")))');
      h1.defineKey('event_id');
      h1.defineDone();
    end;
  set work.have;
  if h1.check()=0;
run;

proc print data=work.want;
run;

yabwon · Posted 11-16-2023 06:15 AM

With the assumption that

total length of char_1, char_2, char_3, char_4, and char_5

does not exceeds 32767.

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation

Patrick · Posted 11-16-2023 07:13 AM

Yes, that's the assumption I've made. Else we would need OR conditions ...which potentially would also perform a bit better.

data work.want;
  if _n_=1 then 
    do;
      dcl hash h1(dataset:'work.have(where=(
                                              find(char_1,"keyword","i") or
                                              find(char_2,"keyword","i") or
                                              find(char_3,"keyword","i") or
                                              find(char_4,"keyword","i") or
                                              find(char_5,"keyword","i")
                                            ))');
      h1.defineKey('event_id');
      h1.defineDone();
    end;
  set work.have;
  if h1.check()=0;
run;

proc print data=work.want;
run;

river1 · Posted 11-17-2023 07:02 AM

Thanks that works perfectly! It outputs the rows and includes the tilde characters.

I've been trying to add a line to exclude keywords to reduce the dataset. I've tried the following but this is not working, do you have suggestions for an alternative?

data work.new;
  if _n_=1 then 
    do;
      dcl hash h1(dataset:'work.have(where=(find(cats(var_1),"keyword","i") and not find(cats(var_1), "keyword_2", "i")))'); ))');
      h1.defineKey('event_id');
      h1.defineDone();
    end;
  set work.have;
  if h1.check()=0;
run;

Example dataset							
ID	Num_1	Num_2	Char_1	Char_2	Char_3	Char_4	Char_5
145896555	19	25	A	Text	Text	Text	Text
145896555	19	25	B	Text	Text	Text	Text
145896555	19	25	B	Text	Text	Text	Text
145896555	19	25	B	Text	Text	Text	Text
145896555	19	25	B	Keyword_1	Text	Text	Text
145896555	19	25	B	Text	Text	Text	Text
145896556	19	25	A	Text	Text	Text	Text
145896556	19	25	B	Text	Text	Text	Text
145896556	19	25	B	Text	Text	Text	Text
145896556	19	25	B	Text	Text	Text	Text
145896557	19	25	B	Text	Text	Text	Text
145896557	19	25	B	Text	Text	Text	Text
145896557	19	25	B	Text	Text	Text	Text
145896557	19	25	A	Text	Text	Text	Text
145896557	19	25	B	Text	Text	Text	Text
145896558	19	25	B	Keyword_1	Text	Text	Text
145896558	19	25	B	Keyword_2	Text	Text	Text
145896558	19	25	B	Text	Text	Text	Text
145896558	19	25	B	Text	Text	Text	Text
145896559	19	25	B	Text	Text	Text	Text
145896559	19	25	B	Text	Text	Text	Text
145896559	19	25	B	Text	Text	Text	Text
145896559	19	25	B	Text	Text	Text	Text
							
							
I'd like the output dataset to read:							
145896555	19	25	A	Text	Text	Text	Text
145896555	19	25	B	Text	Text	Text	Text
145896555	19	25	B	Text	Text	Text	Text
145896555	19	25	B	Text	Text	Text	Text
145896555	19	25	B	Keyword_1	Text	Text	Text
145896555	19	25	B	Text	Text	Text	Text

Patrick · Posted 11-18-2023 05:48 AM

Please provide sample data via fully working SAS datastep code as done below.

Spoiler

data have;
  infile datalines truncover dlm='|' dsd;
  input ID  Num_1 Num_2 (Char_1  Char_2  Char_3  Char_4  Char_5) (:$10.);
datalines;
145896555|19|25|A|Text|Text|Text|Text
145896555|19|25|B|Text|Text|Text|Text
145896555|19|25|B|Text|Text|Text|Text
145896555|19|25|B|Text|Text|Text|Text
145896555|19|25|B|Keyword_1|Text|Text|Text
145896555|19|25|B|Text|Text|Text|Text
145896556|19|25|A|Text|Text|Text|Text
145896556|19|25|B|Text|Text|Text|Text
145896556|19|25|B|Text|Text|Text|Text
145896556|19|25|B|Text|Text|Text|Text
145896557|19|25|B|Text|Text|Text|Text
145896557|19|25|B|Text|Text|Text|Text
145896557|19|25|B|Text|Text|Text|Text
145896557|19|25|A|Text|Text|Text|Text
145896557|19|25|B|Text|Text|Text|Text
145896558|19|25|B|Keyword_1|Text|Text|Text
145896558|19|25|B|Keyword_2|Text|Text|Text
145896558|19|25|B|Text|Text|Text|Text
145896559|19|25|B|Text|Text|Text|Text
145896559|19|25|B|Text|Text|Text|Text
145896559|19|25|B|Text|Text|Text|Text
145896559|19|25|B|Text|Text|Text|Text
;

data have; infile datalines truncover dlm='|' dsd; input ID Num_1 Num_2 (Char_1 Char_2 Char_3 Char_4 Char_5) (:$10.); datalines; 145896555|19|25|A|Text|Text|Text|Text 145896555|19|25|B|Text|Text|Text|Text 145896555|19|25|B|Text|Text|Text|Text 145896555|19|25|B|Text|Text|Text|Text 145896555|19|25|B|Keyword_1|Text|Text|Text 145896555|19|25|B|Text|Text|Text|Text 145896556|19|25|A|Text|Text|Text|Text 145896556|19|25|B|Text|Text|Text|Text 145896556|19|25|B|Text|Text|Text|Text 145896556|19|25|B|Text|Text|Text|Text 145896557|19|25|B|Text|Text|Text|Text 145896557|19|25|B|Text|Text|Text|Text 145896557|19|25|B|Text|Text|Text|Text 145896557|19|25|A|Text|Text|Text|Text 145896557|19|25|B|Text|Text|Text|Text 145896558|19|25|B|Keyword_1|Text|Text|Text 145896558|19|25|B|Keyword_2|Text|Text|Text 145896558|19|25|B|Text|Text|Text|Text 145896559|19|25|B|Text|Text|Text|Text 145896559|19|25|B|Text|Text|Text|Text 145896559|19|25|B|Text|Text|Text|Text 145896559|19|25|B|Text|Text|Text|Text ;

Your initial requirement was to select all rows with the same ID if at least one row matches your selection criteria. What gets loaded into the hash are all IDs that match this criteria with a single entry per selected ID.

The extension to this requirement is now to exclude specific rows from the selected ones based on additional criteria.

What you can do is further filter the selected rows from your base table with a matching ID in the hash table. Code like below should work.

data work.want;
  if _n_=1 then 
    do;
      dcl hash h1(dataset:'work.have(where=(
                                              find(char_1,"keyword","i") or
                                              find(char_2,"keyword","i") or
                                              find(char_3,"keyword","i") or
                                              find(char_4,"keyword","i") or
                                              find(char_5,"keyword","i")
                                            ))');
      h1.defineKey('ID');
      h1.defineDone();
    end;
  set work.have;
  if  h1.check()=0 and
      find(char_1,"keyword_2","i")=0 and
      find(char_2,"keyword_2","i")=0 and
      find(char_3,"keyword_2","i")=0 and
      find(char_4,"keyword_2","i")=0 and
      find(char_5,"keyword_2","i")=0
      ;
run;

river1 · Posted 11-20-2023 06:39 PM

Thank you. This looks much neater than what I ended up with. I ran the code three times. First to select the keyword of interest then I ran it twice to exclude the keywords I did not want ie

data work.WANT;

if _n_ = 1 then

do;

dcl hash h1(dataset: WORK.HAVE(where=(find(cats(VAR_1), "KEYWORD_2", "i") > 0))');

h1.defineKey(VAR_2);

h1.defineDone();

end;

set WORK.HAVE;

if h1.check()ne 0;

run;

yabwon · Posted 11-15-2023 05:34 AM

Alternatively, if you assume:

1) you know names of char variables in advance

2) sum of lengths of that variables is less than 32767

you can go with SQL:

proc sql;
  create table output2 as
  select *
  from want
  where ID in
    (
      select ID 
      from want
      where index(catx("|",CHAR_1,CHAR_2,CHAR_3,CHAR_4,CHAR_5), 'Keyword')
    )
  ;
quit;

without assumption 2) it would be:

proc sql;
  create table output3 as
  select *
  from want
  where ID in
    (
      select ID 
      from want
      where 
           index(CHAR_1, 'Keyword')
        or index(CHAR_2, 'Keyword')
        or index(CHAR_3, 'Keyword')
        or index(CHAR_4, 'Keyword')
        or index(CHAR_5, 'Keyword')
    )
  ;
quit;

Bart

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation

river1 · Posted 11-16-2023 05:26 AM

Thanks, this code was great. But I still had the same problem as above, when there was a tilde ~ character at the start of the cell it did not output

ie '~keyword' would not cause the row to not output.

The tilde ~ characters are present in the want dataset but they are not outputting in the output3 dataset.

I've tried to search for '~keyword' and I've tried to remove the tilde character before running the SQL code but this did not work either.

Data want2 ;
set want;
Code=compress(Code,'~');
run;

Thanks for your help!

yabwon · Posted 11-16-2023 05:31 AM

data test;
length x $ 20;
x = "~keyword"; output;
x = "keyword"; output;
x = "~~keyword"; output;
run;

data _null_;
 set test;
 y=index(x, 'keyword');
 put (_all_) (=);
run;

returns:

x=~keyword y=2
x=keyword y=1
x=~~keyword y=3

How does it looks with your data?

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation

Search for a keyword contained within text and output other rows that have the same ID variable

Re: Search for a keyword contained within text and output other rows that have the same ID variable

Re: Search for a keyword contained within text and output other rows that have the same ID variable

Re: Search for a keyword contained within text and output other rows that have the same ID variable

Re: Search for a keyword contained within text and output other rows that have the same ID variable

Re: Search for a keyword contained within text and output other rows that have the same ID variable

Re: Search for a keyword contained within text and output other rows that have the same ID variable

Re: Search for a keyword contained within text and output other rows that have the same ID variable

Re: Search for a keyword contained within text and output other rows that have the same ID variable

Re: Search for a keyword contained within text and output other rows that have the same ID variable

Re: Search for a keyword contained within text and output other rows that have the same ID variable

Re: Search for a keyword contained within text and output other rows that have the same ID variable

Registration is open

Registration is open

SAS Training: Just a Click Away