Hello
I have a large dataset and I have used a where statement to search for a keyword that is contained in a character variable. The keyword is in less than 5% of the rows. When I find a keyword I'd like to also copy other rows that have the same ID. The ID is not unique and each ID is duplicated in around six rows.
I've thought of using an IF THEN statement but I've not been able to make anything work that would simply output the other rows that contain the same ID. I'd really appreciate any advice.
I am using SAS Enterprise Guide (Ver 8.3.8.206).
Code used
data want (compress=yes) ; infile "C:\file_location.txt" dlm="|" dsd firstobs=2 truncover; length ID $20. Num_1 4. Num_2 5. CHAR_1 $5. CHAR_2 $32767. CHAR_3 $32767. CHAR_4 $32767. CHAR_5 $32767. ; input ID Num_1 Num_2 Num_1 CHAR_1 CHAR_2 CHAR_3 CHAR_4 CHAR_5 ; run; DATA output1; SET want; where CHAR_2 contains 'keyword' ; run;
Example of the dataset
Example dataset
EVENT_ID Num_1 Num_2 Char_1 Char_2 Char_3 Char_4 Char_5
145896555 19 25 A Text Text Text Text
145896555 19 25 B Text Text Text Text
145896555 19 25 B Text Text Text Text
145896555 19 25 B Text Text Text Text
145896555 19 25 B Text Text Text Text
145896555 19 25 B Text Text Text Text
145896555 19 25 A Text Text Text Text
145896555 19 25 B Text Text Text Text
145896555 19 25 B Text Text Text Text
145896555 19 25 B Text Text Text Text
145896555 19 25 B Text Text Text Text
145896555 19 25 B Text Text Text Text
145896555 19 25 B Text Text Text Text
145896556 19 25 A Text Text Text Text
145896556 19 25 B Text Text Text Text
145896556 19 25 B Text Keyword Text Text Text
145896556 19 25 B Text Text Text Text
145896556 19 25 B Text Text Text Text
145896556 19 25 B Text Text Text Text
145896557 19 25 B Text Text Text Text
145896557 19 25 B Text Text Text Text
145896557 19 25 B Text Text Text Text
145896557 19 25 B Text Text Text Text
I'd like the output dataset to read:
145896556 19 25 A Text Text Text Text
145896556 19 25 B Text Text Text Text
145896556 19 25 B Text Keyword Text Text Text
145896556 19 25 B Text Text Text Text
145896556 19 25 B Text Text Text Text
145896556 19 25 B Text Text Text Text
How about like that:
1) read the data in loop one
2) check every CHAR_* variable for a keyword (inner loop)
3) if keyword is found store ID in a hash table
4) read data second time (the second loop)
5) check if ID in in hash table, if "yes" do output;
data want (compress=yes) ;
infile cards4 dlm="|" dsd truncover;
length
ID $20.
Num_1 8
Num_2 8
CHAR_1 $5.
CHAR_2 $32.
CHAR_3 $32.
CHAR_4 $32.
CHAR_5 $32.
;
input
ID
Num_1
Num_2
CHAR_1
CHAR_2
CHAR_3
CHAR_4
CHAR_5
;
cards4;
145896555|19|25|A|Text|Text|Text|Text
145896555|19|25|B|Text|Text|Text|Text
145896555|19|25|B|Text|Text|Text|Text
145896555|19|25|B|Text|Text|Text|Text
145896555|19|25|B|Text|Text|Text|Text
145896555|19|25|B|Text|Text|Text|Text
145896555|19|25|A|Text|Text|Text|Text
145896555|19|25|B|Text|Text|Text|Text
145896555|19|25|B|Text|Text|Text|Text
145896555|19|25|B|Text|Text|Text|Text
145896555|19|25|B|Text|Text|Text|Text
145896555|19|25|B|Text|Text|Text|Text
145896555|19|25|B|Text|Text|Text|Text
145896556|19|25|A|Text|Text|Text|Text
145896556|19|25|B|Text|Text|Text|Text
145896556|19|25|B|Text Keyword|Text|Text|Text
145896556|19|25|B|Text|Text|Text|Text
145896556|19|25|B|Text|Text|Text|Text
145896556|19|25|B|Text|Text|Text|Text
145896557|19|25|B|Text|Text|Text|Text
145896557|19|25|B|Text|Text|Text|Text
145896557|19|25|B|Text|Text|Text|Text
145896557|19|25|B|Text|Text|Text|Text
145891111|19|25|B|Text|Text|Text Keyword|Text
;;;;
run;
DATA output1;
declare hash H();
H.defineKey("ID");
H.defineDone();
do until(end1);
SET want end=end1;
array c[*] CHAR_:;
do i=lbound(c) to hbound(c); drop i;
if index(c[i], 'Keyword') then
do;
H.replace();
put id=;
leave;
end;
end;
end;
do until(end2);
SET want end=end2;
if 0=H.check() then output;
end;
stop;
run;
Thanks, that works perfectly however I have one problem. Some of the cells start with a tilde ~. I thought I'd solved the problem but the final output is still missing entries that started with the tilde ie if my keyword was 'elephantine' then the output would not include the IDs containing '~elephantine..'
I was only searching for part of the word to scoop up any variations ie I was searching for 'phant' instead of 'elephantine' .
I can just search for '~' on its own and it displays all instances where the tilde ~ is included. There are no error messages.
If you have any suggestions for why this is happening?
Many thanks
@river1 If you just want to search for a substring in a string then the find() function will do this job for you. Using your data the code could look as simple as below:
data work.want;
if _n_=1 then
do;
dcl hash h1(dataset:'work.have(where=(find(cats(char_1,char_2,char_3,char_4,char_5),"keyword","i")))');
h1.defineKey('event_id');
h1.defineDone();
end;
set work.have;
if h1.check()=0;
run;
proc print data=work.want;
run;
With the assumption that
total length of char_1, char_2, char_3, char_4, and char_5
does not exceeds 32767.
Yes, that's the assumption I've made. Else we would need OR conditions ...which potentially would also perform a bit better.
data work.want;
if _n_=1 then
do;
dcl hash h1(dataset:'work.have(where=(
find(char_1,"keyword","i") or
find(char_2,"keyword","i") or
find(char_3,"keyword","i") or
find(char_4,"keyword","i") or
find(char_5,"keyword","i")
))');
h1.defineKey('event_id');
h1.defineDone();
end;
set work.have;
if h1.check()=0;
run;
proc print data=work.want;
run;
Thanks that works perfectly! It outputs the rows and includes the tilde characters.
I've been trying to add a line to exclude keywords to reduce the dataset. I've tried the following but this is not working, do you have suggestions for an alternative?
data work.new; if _n_=1 then do; dcl hash h1(dataset:'work.have(where=(find(cats(var_1),"keyword","i") and not find(cats(var_1), "keyword_2", "i")))'); ))'); h1.defineKey('event_id'); h1.defineDone(); end; set work.have; if h1.check()=0; run;
Example dataset ID Num_1 Num_2 Char_1 Char_2 Char_3 Char_4 Char_5 145896555 19 25 A Text Text Text Text 145896555 19 25 B Text Text Text Text 145896555 19 25 B Text Text Text Text 145896555 19 25 B Text Text Text Text 145896555 19 25 B Keyword_1 Text Text Text 145896555 19 25 B Text Text Text Text 145896556 19 25 A Text Text Text Text 145896556 19 25 B Text Text Text Text 145896556 19 25 B Text Text Text Text 145896556 19 25 B Text Text Text Text 145896557 19 25 B Text Text Text Text 145896557 19 25 B Text Text Text Text 145896557 19 25 B Text Text Text Text 145896557 19 25 A Text Text Text Text 145896557 19 25 B Text Text Text Text 145896558 19 25 B Keyword_1 Text Text Text 145896558 19 25 B Keyword_2 Text Text Text 145896558 19 25 B Text Text Text Text 145896558 19 25 B Text Text Text Text 145896559 19 25 B Text Text Text Text 145896559 19 25 B Text Text Text Text 145896559 19 25 B Text Text Text Text 145896559 19 25 B Text Text Text Text I'd like the output dataset to read: 145896555 19 25 A Text Text Text Text 145896555 19 25 B Text Text Text Text 145896555 19 25 B Text Text Text Text 145896555 19 25 B Text Text Text Text 145896555 19 25 B Keyword_1 Text Text Text 145896555 19 25 B Text Text Text Text
Please provide sample data via fully working SAS datastep code as done below.
data have;
infile datalines truncover dlm='|' dsd;
input ID Num_1 Num_2 (Char_1 Char_2 Char_3 Char_4 Char_5) (:$10.);
datalines;
145896555|19|25|A|Text|Text|Text|Text
145896555|19|25|B|Text|Text|Text|Text
145896555|19|25|B|Text|Text|Text|Text
145896555|19|25|B|Text|Text|Text|Text
145896555|19|25|B|Keyword_1|Text|Text|Text
145896555|19|25|B|Text|Text|Text|Text
145896556|19|25|A|Text|Text|Text|Text
145896556|19|25|B|Text|Text|Text|Text
145896556|19|25|B|Text|Text|Text|Text
145896556|19|25|B|Text|Text|Text|Text
145896557|19|25|B|Text|Text|Text|Text
145896557|19|25|B|Text|Text|Text|Text
145896557|19|25|B|Text|Text|Text|Text
145896557|19|25|A|Text|Text|Text|Text
145896557|19|25|B|Text|Text|Text|Text
145896558|19|25|B|Keyword_1|Text|Text|Text
145896558|19|25|B|Keyword_2|Text|Text|Text
145896558|19|25|B|Text|Text|Text|Text
145896559|19|25|B|Text|Text|Text|Text
145896559|19|25|B|Text|Text|Text|Text
145896559|19|25|B|Text|Text|Text|Text
145896559|19|25|B|Text|Text|Text|Text
;
Your initial requirement was to select all rows with the same ID if at least one row matches your selection criteria. What gets loaded into the hash are all IDs that match this criteria with a single entry per selected ID.
The extension to this requirement is now to exclude specific rows from the selected ones based on additional criteria.
What you can do is further filter the selected rows from your base table with a matching ID in the hash table. Code like below should work.
data work.want;
if _n_=1 then
do;
dcl hash h1(dataset:'work.have(where=(
find(char_1,"keyword","i") or
find(char_2,"keyword","i") or
find(char_3,"keyword","i") or
find(char_4,"keyword","i") or
find(char_5,"keyword","i")
))');
h1.defineKey('ID');
h1.defineDone();
end;
set work.have;
if h1.check()=0 and
find(char_1,"keyword_2","i")=0 and
find(char_2,"keyword_2","i")=0 and
find(char_3,"keyword_2","i")=0 and
find(char_4,"keyword_2","i")=0 and
find(char_5,"keyword_2","i")=0
;
run;
Thank you. This looks much neater than what I ended up with. I ran the code three times. First to select the keyword of interest then I ran it twice to exclude the keywords I did not want ie
data work.WANT;
if _n_ = 1 then
do;
dcl hash h1(dataset: WORK.HAVE(where=(find(cats(VAR_1), "KEYWORD_2", "i") > 0))');
h1.defineKey(VAR_2);
h1.defineDone();
end;
set WORK.HAVE;
if h1.check()ne 0;
run;
Alternatively, if you assume:
1) you know names of char variables in advance
2) sum of lengths of that variables is less than 32767
you can go with SQL:
proc sql;
create table output2 as
select *
from want
where ID in
(
select ID
from want
where index(catx("|",CHAR_1,CHAR_2,CHAR_3,CHAR_4,CHAR_5), 'Keyword')
)
;
quit;
without assumption 2) it would be:
proc sql;
create table output3 as
select *
from want
where ID in
(
select ID
from want
where
index(CHAR_1, 'Keyword')
or index(CHAR_2, 'Keyword')
or index(CHAR_3, 'Keyword')
or index(CHAR_4, 'Keyword')
or index(CHAR_5, 'Keyword')
)
;
quit;
Bart
Thanks, this code was great. But I still had the same problem as above, when there was a tilde ~ character at the start of the cell it did not output
ie '~keyword' would not cause the row to not output.
The tilde ~ characters are present in the want dataset but they are not outputting in the output3 dataset.
I've tried to search for '~keyword' and I've tried to remove the tilde character before running the SQL code but this did not work either.
Data want2 ; set want; Code=compress(Code,'~'); run;
Thanks for your help!
data test;
length x $ 20;
x = "~keyword"; output;
x = "keyword"; output;
x = "~~keyword"; output;
run;
data _null_;
set test;
y=index(x, 'keyword');
put (_all_) (=);
run;
returns:
x=~keyword y=2
x=keyword y=1
x=~~keyword y=3
How does it looks with your data?
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.
What’s the difference between SAS Enterprise Guide and SAS Studio? How are they similar? Just ask SAS’ Danny Modlin.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.