I have a dataset with variable called drug_name. I'm want to subset to observations in which drug_name contains one of 200 strings corresponding to individual molecules. The brute force method is to use index repeatedly. For example, if there were only 3 strings of interest, I could do something like this:
data want; set have; where
index(drug_name, "ACETAMINOPHEN") > 0 or
index(drug_name, "IBUPROFEN") > 0 or
index(drug_name, "DIPHENHYDRAMINE") > 0;
run;
What is the more efficient way to do this? Thanks.
Do you have the list of strings as dataset? Could you post some observations, so that we have something to play with?
SAS can create such a where-expression for you, the following code is hardly tested:
data DrugList;
input name $upcase20.;
datalines;
ACETAMiNOPhEN
IBUProFEN
DIPHENHyDRAMiNE
;
data _null_;
set DrugList end=jobDone;
file "drug_filter.sas";
length Buffer $ 100;
if _n_ = 1 then do;
put 'where';
end;
Buffer = cats('index(drug_name, ', quote(trim(name)), ')');
put ' ' Buffer @;
if jobDone then do;
put;
put ';';
end;
else do;
put ' or ';
end;
run;
data want;
set have;
%include "drug_filter.sas";
run;
It depends on what you mean by "contains". Doe sit mean the variable's value is an exact match to one of the 200 strings, or does it mean that the one of the 200 strings appears somewhere within the variable's value (possibly with other characters before or after).
To clarify, it does not have to be an exact match. I just want to subset to observations where one of the 200 strings appears somewhere within the variable's value.
Then I think you are stuck. You could remove all the comparison operators. This would give you a valid comparison for one of the strings:
where index(var, 'ACETAMINOPHEN') or .....;
Also some functions have additional parameters that could ignore upper vs. lower case which might be a help.
Do you have the list of strings as dataset? Could you post some observations, so that we have something to play with?
SAS can create such a where-expression for you, the following code is hardly tested:
data DrugList;
input name $upcase20.;
datalines;
ACETAMiNOPhEN
IBUProFEN
DIPHENHyDRAMiNE
;
data _null_;
set DrugList end=jobDone;
file "drug_filter.sas";
length Buffer $ 100;
if _n_ = 1 then do;
put 'where';
end;
Buffer = cats('index(drug_name, ', quote(trim(name)), ')');
put ' ' Buffer @;
if jobDone then do;
put;
put ';';
end;
else do;
put ' or ';
end;
run;
data want;
set have;
%include "drug_filter.sas";
run;
This code works, thanks very much!
You can speed things up and simplify your program by using a format or an informat, e.g.:
proc format lib=work;
invalue drug (upcase)
"ACETAMINOPHEN",
"IBUPROFEN",
"DIPHENHYDRAMINE" =1
other=0;
run;
data have;
infile cards truncover;
input text $200.;
cards;
erew ACETAmINOPHEN werwerrewwerrwe
sdfælø klgdsklæ klgsdf
rewqv IBUPROFEN 3432 4
sgsfdsgfdsgfdgs
werreerw DIPHENHYDRAMINE 3422323442
;run;
data want;
set have;
do _N_=1 to countw(text) until(found);
found=input(scan(text,_N_),drug.);
end;
run;
You may need to put some options on the SCAN function to get the words right.
Thanks. I think this would work, but the challenge would be that I would have to create a format with 200 values since I'm looking for 200 strings.
data want;
set have;
if prxmatch('/\b(ACETAMINOPHEN|IBUPROFEN|DIPHENHYDRAMINE)\b/',drug_name );
run;
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.