I am using the SCAN function to count the number of drugs in a regimen, but I need to differentiate between ' - ' and '-'. In other words, a hyphen without a space on both sides is part of a single drug name, not a separator. I also have a list of drugs that should not be counted. For example, I want initialChemoCount = 0 for sipuleucel-t and ziv-afilbercept, and initialChemoCount = 2 for 'fluorouracil - oxaliplatin'. I also want to make sure a drug name that contains spaces is counted as one drug (e.g. string1 = 'paclitaxel albumin bound').
The code below is close but does not work: the scan reads 'sipuleucel-t' as two separate drugs and counts 'fluorouracil - irinotecan - ziv-aflibercept' as 3 when it should be 2 after exclusions. Any suggestions?
data test ;
*string1 = 'sipuleucel-t' ; /*initialchemocount = 0 because I want to exclude them in the count*/
*string1 = 'ziv-afilbercept' ; /*initialchemocount = 0 because I want to exclude them in the count*/
*string1 = 'paclitaxel albumin bound' ; /*initialchemocount = 1*/
*string1 = 'fluorouracil - oxaliplatin' ; /*initialchemocount = 2*/
string1 = 'fluorouracil - irinotecan - ziv-aflibercept' ; /*initialchemocount = 2 because I want to exclude ziv-aflibercept*/
/*Count the number of drugs in original drug combo*/
ComboCount = count(string1," - " ) + 1;
/*Count the number of chemo drugs in original drug combo*/
array initchemo{20} $200 ;
initialChemoCount = comboCount ;
do j = 1 to combocount until (p <= 0) ;
call scan(string1,j,p,l," - ") ;
if p=1 then initchemo[j] = substrn(string1, p, l-1); /*use p and l to account for legitimate space e.g. paclitaxel albumin bound */
else initchemo[j] = substrn(string1,p+1,l-1) ;
if index(initchemo[j],'investigational' ) then initialchemocount=. ;
else if initchemo[j] IN ('sipuleucel-t', 'ziv-afilbercept', 'abiraterone', 'enzalutamide', 'interferon alfa-2b', 'radium Ra 223 dichloride' )
then initialchemocount=0 ;
else if prxmatch("m/sipuleucel-t|ziv-afilbercept|ado-trastuzumab|interferon|mab|abiraterone|enzalutamide|radium Ra 223 dichloride/oi",initchemo[j]) > 0
then initialChemoCount = initialchemocount - 1 ;
end ;
run;
PROC PRINT DATA=test ;
VAR string1 initialchemocount combocount initchemo:;
run;
If I correctly understand what you're trying to do, then all you need is:
data have;
length string1 $255;
string1 = 'sipuleucel-t' ; output;
string1 = 'ziv-afilbercept' ; output;
string1 = 'paclitaxel albumin bound' ; output;
string1 = 'fluorouracil - oxaliplatin' ; output;
string1 = 'fluorouracil - irinotecan - ziv-aflibercept' ; output;
run;
data test;
set have;
string1=TRANWRD(string1,'- sipuleucel-t','');
string1=TRANWRD(string1,'sipuleucel-t -','');
string1=TRANWRD(string1,'sipuleucel-t','');
string1=TRANWRD(string1,'- ziv-afilbercept','');
string1=TRANWRD(string1,'ziv-afilbercept -','');
string1=TRANWRD(string1,'ziv-afilbercept','');
/*Count the number of drugs*/
if missing(string1) then ComboCount=0;
else ComboCount = count(string1," - " ) + 1;
run;
Art, CEO, AnalystFinder.com
Like this?
data HAVE ;
length STRING1 $80;
STRING1 = 'sipuleucel-t ' ; output;
STRING1 = 'ziv-afilbercept' ; output;
STRING1 = 'paclitaxel albumin bound' ;output;
STRING1 = 'fluorouracil - oxaliplatin' ; output;
STRING1 = 'fluorouracil - irinotecan - ziv-afilbercept' ;output;
run;
data WANT;
set HAVE;
%* remove unwanted drugs from list;
STRING2=prxchange('s/(\b(sipuleucel-t|ziv-afilbercept|ado-trastuzumab|interferon|mab|abiraterone|enzalutamide|radium Ra 223 dichloride)\b)/ /oi',-1,STRING1);
%* replace drug names with tilde;
STRING3=prxchange('s/(\w[a-z -]*? (?=[ -]))/~ /oi',-1,STRING2);
%* count tildes;
COUNT=countc(STRING3,'~');
run;
STRING1 | COUNT |
---|---|
sipuleucel-t | 0 |
ziv-afilbercept | 0 |
paclitaxel albumin bound | 1 |
fluorouracil - oxaliplatin | 2 |
fluorouracil - irinotecan - ziv-afilbercept | 2 |
Thank you. This looks like a much simpler way to accomplish what I want. When removing unwanted drugs, does prxchange use partial matches, or do the drugs in the list have to be exact matches? I want to omit all mabs from this count, so any drug that has 'mab' in the name would be excluded (e.g. bevacizumab). There are also various 'interferon' drugs that need to be excluded.
prxchange('s/(\w[a-z -]*? (?=[ -]))/~ /oi',-1,STRING2);
Also, what does the -1 do in the prxchange function?
To remove all drugs with mab:
STRING2=prxchange('s/( ?\b(sipuleucel-t|ziv-afilbercept|[a-z-]*mab[a-z-]*|interferon|mab|abiraterone|enzalutamide|radium Ra 223 dichloride)\b ?)/ /oi',-1,STRING1);
-1 makes as many replacements as possible, i.e. every match is replaced.
1 would make just one replacement, and 2 would make at most two.
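To make the effect of that second argument concrete, here is a minimal sketch. The test string and the variable names (s, first_only, every_one, whole_word) are made up for illustration and are not from the posts above. It also shows why the bare `mab` alternative between `\b` anchors is not enough to catch a name like bevacizumab, which is what the `[a-z-]*mab[a-z-]*` alternative is for.

```sas
data _null_;
   length s $60;
   s = 'bevacizumab - cetuximab - fluorouracil';

   /* times = 1: only the first matching drug name is replaced */
   first_only = prxchange('s/\b[a-z-]*mab[a-z-]*\b/X/i', 1, s);

   /* times = -1: every matching drug name is replaced */
   every_one  = prxchange('s/\b[a-z-]*mab[a-z-]*\b/X/i', -1, s);

   /* \bmab\b needs a word boundary on both sides, so it does not
      match the 'mab' inside bevacizumab or cetuximab */
   whole_word = prxchange('s/\bmab\b/X/i', -1, s);

   put first_only= / every_one= / whole_word=;
run;
```

In the log, first_only should still contain cetuximab, every_one should have both mab names replaced by X, and whole_word should be the same as the input.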
You can also do the count using the method inspired by @mkeintz to reduce the usage of RegEx.
data HAVE ;
length STRING1 $80;
STRING1 = 'sipuleucel-t ' ; output;
STRING1 = 'ziv-afilbercept' ; output;
STRING1 = 'paclitaxel albumin bound' ;output;
STRING1 = 'fluorouracil - oxaliplatin - bevaci-zumab - gmabn' ; output;
STRING1 = 'fluorouracil - irinotecan - ziv-afilbercept' ;output;
run;
data WANT;
set HAVE;
%* replace drug separator ;
STRING2=transtrn(string1,' - ',':');
%* remove unwanted drugs from list;
STRING3=prxchange('s/(\b(sipuleucel-t|ziv-afilbercept|[a-z-]*mab[a-z-]*|interferon|mab|abiraterone|enzalutamide|radium Ra 223 dichloride)\b:?)//oi',-1,STRING2);
%* count remaining drugs;
COUNT=countw(trimn(STRING3),':');
run;
STRING1 | COUNT |
---|---|
sipuleucel-t | 0 |
ziv-afilbercept | 0 |
paclitaxel albumin bound | 1 |
fluorouracil - oxaliplatin - bevaci-zumab - gmabn | 2 |
fluorouracil - irinotecan - ziv-afilbercept | 2 |
BTW, you misspelled aflibercept as afilbercept in a couple of places (they are all supposed to be the same, right?).
data test;
input string1 $60.;
put string1=;
datalines;
sipuleucel-t
ziv-aflibercept
paclitaxel albumin bound
fluorouracil - oxaliplatin
fluorouracil - irinotecan - ziv-aflibercept
run;
data want;
set test;
strng2=transtrn(string1,' - ',':');
strng2=transtrn(strng2,'sipuleucel-t',trimn(''));
strng2=transtrn(strng2,'ziv-aflibercept',trimn(''));
if strng2='' then combo=0;
else combo=countw(trim(strng2),':');
drop strng2;
run;
This program counts the desired terms. I leave the rest of the tasks to you.
I would like to focus on "I need to differentiate between ' - ' and '-'":
You can use the tranwrd function to replace ' - ' with a single-character delimiter such as '#' (or any other character that does not occur in a drug name).
Then use your code with scan(text,n,'#') to count.
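A minimal sketch of that idea, assuming '#' never occurs in a drug name. The variable names (text, n, i, drug) are only for illustration, and countw is used here to get the number of pieces before scanning them one by one.

```sas
data _null_;
   length text drug $80;
   text = 'fluorouracil - irinotecan - ziv-aflibercept';

   /* Replace the spaced separator with '#'. The hyphen inside
      ziv-aflibercept has no surrounding spaces, so it is untouched. */
   text = tranwrd(text, ' - ', '#');

   /* Count the '#'-delimited pieces, then pull each one out */
   n = countw(text, '#');
   do i = 1 to n;
      drug = scan(text, i, '#');
      put i= drug=;
   end;
   put n=;
run;
```

Because '#' is the only delimiter passed to countw and scan, spaces inside a name such as 'paclitaxel albumin bound' stay part of a single piece. The exclusion list would still need to be applied to each piece afterwards.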