Hi SAS Friends,
Need to remove variable junk text from a long list of varying strings within a single text variable.
Here is a sample of the text variable with the junk text (HAVE),
What needs to be removed (REMOVE),
and what needs to be kept (WANT).
/***********************************************************/
data Have;
infile datalines truncover;
input Drug_Phrases $100.;
datalines;
2 days Ivermectin
2 mg moxidectin
20% deet insect repellent
3 days Ivermectin
3 drug dose - IDA with second dose of ivermectin
4 mg moxidectin
40 mg Atorvastatin/day for 120 days P.O.
400 μg/kg Ivermectin + 400 mg Albendazole
8 mg moxidectin
ABBV-4083
ALBENDAZOLE 400 Mg ORAL TABLET [ZENTEL]
;
run;
data Remove;
infile datalines truncover;
input Remove $100.;
datalines;
2 days
2 mg
20%
insect repellent
3 days
3 drug dose - IDA with second dose of
4 mg
40 mg
/day for 120 days P.O.
400 μg/kg
+ 400 mg
8 mg
400 Mg ORAL TABLET [ZENTEL]
;
run;
data Want;
infile datalines truncover;
input Drug_Names $100.;
datalines;
Ivermectin
moxidectin
deet
Ivermectin
ivermectin
moxidectin
Atorvastatin
Ivermectin Albendazole
moxidectin
ABBV-4083
ALBENDAZOLE
;
run;
/***************************************************/
I think, given the long list of strings involved, it might be easier to create a "REMOVE" file, listing the text to be removed, then somehow use that file to extract those strings from the HAVE text file. But beyond that don't know where to start.
Appreciate any suggestions,
Thank you !
This seems to work for most of your provided example:
You may need to modify the "remove" values so that each "phrase" is a single value. If you phrase text looks like
"blah blah Drugname other text" then the Remove values should be two lines like
blah blah
other text
as the TRANWRD function used below replace exact matches and cannot infer you may have a value in the middle to work around.
data _null_; set remove end=lastone; if _n_=1 then call execute('data want; set have;'); call execute ('drug_phrases= tranwrd(drug_phrases,'||quote(strip(remove))||',"*");'); if lastone then call execute('drug_phrases=strip(compress(drug_phrases,"*"));run;'); run;
Call Execute creates statements that are placed in an execution buffer that runs after the data step calling it completes. This creates a data step. The _n_=1 is executed only one time to write the boiler plate to start the data step.
The main bit repeated for each value in remove is to write a TRANWRD call replacing the text in the remove with an *. The Strip is needed because otherwise the || concatenate operator will pad the length of the value in remove with spaces and not actually match the value in the Drug_phrases variable. Quote is so the value of removed is valid for the syntax.
When the last record is read from Remove, the Set statement option sets a variable to indicate this, then all the * are removed with the Compress function and the Strip makes sure the remaining text is left justified (no leading spaces).
You might consider modifying the code to actually write a Program file using the File statement and PUT instead of Call execute so you have a reference of the code created and then execute that program file.
This seems to work for most of your provided example:
You may need to modify the "remove" values so that each "phrase" is a single value. If you phrase text looks like
"blah blah Drugname other text" then the Remove values should be two lines like
blah blah
other text
as the TRANWRD function used below replace exact matches and cannot infer you may have a value in the middle to work around.
data _null_; set remove end=lastone; if _n_=1 then call execute('data want; set have;'); call execute ('drug_phrases= tranwrd(drug_phrases,'||quote(strip(remove))||',"*");'); if lastone then call execute('drug_phrases=strip(compress(drug_phrases,"*"));run;'); run;
Call Execute creates statements that are placed in an execution buffer that runs after the data step calling it completes. This creates a data step. The _n_=1 is executed only one time to write the boiler plate to start the data step.
The main bit repeated for each value in remove is to write a TRANWRD call replacing the text in the remove with an *. The Strip is needed because otherwise the || concatenate operator will pad the length of the value in remove with spaces and not actually match the value in the Drug_phrases variable. Quote is so the value of removed is valid for the syntax.
When the last record is read from Remove, the Set statement option sets a variable to indicate this, then all the * are removed with the Compress function and the Strip makes sure the remaining text is left justified (no leading spaces).
You might consider modifying the code to actually write a Program file using the File statement and PUT instead of Call execute so you have a reference of the code created and then execute that program file.
Hi ballardw,
Thank you, this works great !
It will take some time to understand all the parts, but the output is correct
A very cool solution.
Thank you !
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.