Re: scan function

CP2 · Posted 05-22-2017 09:43 PM

I am using the SCAN function to count the number of drugs in a regimen; however, I need to differentiate between ' - ' and '-'. In other words, a hyphen without a space before and after it is part of a single drug name and should not split the count. I also have a list of drugs that should not be counted. For example, I want initialChemoCount = 0 for sipuleucel-t and ziv-afilbercept, and initialChemoCount = 2 for 'fluorouracil - oxaliplatin'. I also want to make sure a drug name containing spaces is counted as 1 drug (e.g. string1 = 'paclitaxel albumin bound').

The code below is close but not working because the scan reads 'sipuleucel-t' as two separate drugs and 'fluorouracil - irinotecan - ziv-aflibercept' as 3 when it should be 2 based on exclusions. Any suggestions?

data test ;
*string1 = 'sipuleucel-t' ; /*initialchemocount = 0 because I want to exclude them in the count*/

*string1 = 'ziv-afilbercept' ; /*initialchemocount = 0 because I want to exclude them in the count*/

*string1 = 'paclitaxel albumin bound' ; /*initialchemocount = 1*/
*string1 = 'fluorouracil - oxaliplatin' ; /*initialchemocount = 2*/

string1 = 'fluorouracil - irinotecan - ziv-aflibercept' ; /*initialchemocount = 2 because I want to exclude ziv-aflibercept*/

/*Count the number of drugs in original drug combo*/
ComboCount = count(string1," - " ) + 1;

/*Count the number of chemo drugs in original drug combo*/
array initchemo{20} $200 ;

initialChemoCount = comboCount ;
do j = 1 to combocount until (p <= 0) ;
call scan(string1,j,p,l," - ") ;
if p=1 then initchemo[j] = substrn(string1, p, l-1); /*use p and l to account for legitimate space e.g. paclitaxel albumin bound */
else initchemo[j] = substrn(string1,p+1,l-1) ;

if index(initchemo[j],'investigational' ) then initialchemocount=. ;
else if initchemo[j] IN ('sipuleucel-t', 'ziv-afilbercept', 'abiraterone', 'enzalutamide', 'interferon alfa-2b', 'radium Ra 223 dichloride' )
then initialchemocount=0 ;
else if prxmatch("m/sipuleucel-t|ziv-afilbercept|ado-trastuzumab|interferon|mab|abiraterone|enzalutamide|radium Ra 223 dichloride/oi",initchemo[j]) > 0
then initialChemoCount = initialchemocount - 1 ;
end ;

run;

PROC PRINT DATA=test ;
VAR string1 initialchemocount combocount initchemo:;
run;

art297 · Posted 05-22-2017 10:36 PM

If I correctly understand what you're trying to do, then all you need is:

data have;
  length string1 $255;
  string1 = 'sipuleucel-t' ; output;
  string1 = 'ziv-afilbercept' ;output;
  string1 = 'paclitaxel albumin bound';output;
  string1 = 'fluorouracil - oxaliplatin' ;output;
  string1 = 'fluorouracil - irinotecan - ziv-aflibercept';output;
 run;

data test;
  set have;
  string1=TRANWRD(string1,'- sipuleucel-t','');
  string1=TRANWRD(string1,'sipuleucel-t -','');
  string1=TRANWRD(string1,'sipuleucel-t','');
  string1=TRANWRD(string1,'- ziv-afilbercept','');
  string1=TRANWRD(string1,'ziv-afilbercept -','');
  string1=TRANWRD(string1,'ziv-afilbercept','');

  /*Count the number of drugs*/
  if missing(string1) then ComboCount=0;
  else ComboCount = count(string1," - " ) + 1;
run;

Art, CEO, AnalystFinder.com

ChrisNZ · Posted 05-22-2017 10:37 PM

Like this?

data HAVE ;
  length STRING1 $80;
  STRING1 = 'sipuleucel-t   ' ; output;
  STRING1 = 'ziv-afilbercept' ;   output;
  STRING1 = 'paclitaxel albumin bound' ;output; 
  STRING1 = 'fluorouracil - oxaliplatin' ; output; 
  STRING1 = 'fluorouracil - irinotecan - ziv-afilbercept' ;output;
run;

data WANT;
  set HAVE;
  %* remove unwanted drugs from list;
  STRING2=prxchange('s/(\b(sipuleucel-t|ziv-afilbercept|ado-trastuzumab|interferon|mab|abiraterone|enzalutamide|radium Ra 223 dichloride)\b)/ /oi',-1,STRING1);
  %* replace drug names with tilde;
  STRING3=prxchange('s/(\w[a-z -]*? (?=[ -]))/~ /oi',-1,STRING2);
  %* count tildes;
  COUNT=countc(STRING3,'~');
run;

STRING1	COUNT
sipuleucel-t	0
ziv-afilbercept	0
paclitaxel albumin bound	1
fluorouracil - oxaliplatin	2
fluorouracil - irinotecan - ziv-afilbercept	2

High-Performance SAS Coding - Third Edition

CP2 · Posted 05-23-2017 09:15 AM

Thank you. This looks like a much simpler way to accomplish what I want. When removing unwanted drugs, does prxchange match partial names, or do the drugs in the list have to be exact matches? I want to omit all mabs from this count, so any drug that has 'mab' in its name would be excluded (e.g. bevacizumab). There are also various 'interferon' drugs that need to be excluded.

prxchange('s/(\w[a-z -]*? (?=[ -]))/~ /oi',-1,STRING2);

Also what does the -1 do in the prxchange function ?

ChrisNZ · Posted 05-23-2017 07:11 PM

To remove all drugs with mab:

  STRING2=prxchange('s/( ?\b(sipuleucel-t|ziv-afilbercept|[a-z-]*mab[a-z-]*|interferon|mab|abiraterone|enzalutamide|radium Ra 223 dichloride)\b ?)/ /oi',-1,STRING1);
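On the partial-match question, a minimal sketch of the difference, assuming a single test value (the variable names drug, whole_word and partial are made up for illustration). A bare mab between two \b word boundaries only matches 'mab' as a standalone word, while the [a-z-]*mab[a-z-]* alternative matches any name that contains it:

```sas
data _null_;
  drug = 'bevacizumab';
  /* \bmab\b needs a word boundary right before the m, so it fails here */
  whole_word = prxmatch('/\bmab\b/i', drug);
  /* letters or hyphens are allowed on either side of mab */
  partial = prxmatch('/\b[a-z-]*mab[a-z-]*\b/i', drug);
  put whole_word= partial=;   /* whole_word=0 partial=1 */
run;
```

PRXMATCH returns the position of the first match, or 0 when there is none, so the second pattern matches from the first character of the name.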

-1 makes as many replacements as possible: it keeps replacing until the end of the string is reached.

1 would do just one replacement, and 2 would do at most two.
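A small sketch of how that second argument behaves, using a made-up string and variable names (s, once, twice, all_):

```sas
data _null_;
  length s once twice all_ $20;
  s = 'a-b-c-d';
  once  = prxchange('s/-/+/',  1, s);  /* replace the first match only */
  twice = prxchange('s/-/+/',  2, s);  /* replace at most two matches  */
  all_  = prxchange('s/-/+/', -1, s);  /* replace every match          */
  put once= twice= all_=;   /* once=a+b-c-d twice=a+b+c-d all_=a+b+c+d */
run;
```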

ChrisNZ · Posted 05-23-2017 07:21 PM

You can also do the count using the method inspired by @mkeintz to reduce the usage of RegEx.

data HAVE ;
  length STRING1 $80;
  STRING1 = 'sipuleucel-t   ' ; output;
  STRING1 = 'ziv-afilbercept' ;   output;
  STRING1 = 'paclitaxel albumin bound' ;output; 
  STRING1 = 'fluorouracil - oxaliplatin - bevaci-zumab - gmabn' ; output; 
  STRING1 = 'fluorouracil - irinotecan - ziv-afilbercept' ;output;
run;

data WANT;
  set HAVE;
  %* replace drug separator ;
  STRING2=transtrn(string1,' - ',':');
  %* remove unwanted drugs from list;
  STRING3=prxchange('s/(\b(sipuleucel-t|ziv-afilbercept|[a-z-]*mab[a-z-]*|interferon|mab|abiraterone|enzalutamide|radium Ra 223 dichloride)\b:?)//oi',-1,STRING2);
  %* count remaining drugs;
  COUNT=countw(trimn(STRING3),':');
run;

STRING1	COUNT
sipuleucel-t	0
ziv-afilbercept	0
paclitaxel albumin bound	1
fluorouracil - oxaliplatin - bevaci-zumab - gmabn	2
fluorouracil - irinotecan - ziv-afilbercept	2

mkeintz · Posted 05-22-2017 10:39 PM

1. Copy the string to a temporary variable, changing all instances of ' - ' to ':' (space-surrounded dashes to unsurrounded colons). This assumes there are no colons in any drug name.
2. Remove the unwanted text ('sipuleucel-t' and/or 'ziv-aflibercept').
3. Use the COUNTW function to count "words", where the word separator is a ':'.

BTW, you misspelled aflibercept in a couple of places as afilbercept (they are all supposed to be the same, right?).

data test;
  input string1 $60.;
  put string1=;
datalines;
sipuleucel-t
ziv-aflibercept
paclitaxel albumin bound
fluorouracil - oxaliplatin
fluorouracil - irinotecan - ziv-aflibercept
run;
data want;
  set test;

  strng2=transtrn(string1,' - ',':');

  strng2=transtrn(strng2,'sipuleucel-t',trimn(''));
  strng2=transtrn(strng2,'ziv-aflibercept',trimn(''));

  if strng2='' then combo=0;
  else combo=countw(trim(strng2),':');
  drop strng2;  
run;

This program counts the desired terms. I leave the rest of the tasks to you.

Shmuel · Posted 05-23-2017 12:26 AM

I would like to focus on " I need to differentiate between ' - ' and '-'. ":

You can use the TRANWRD function to replace ' - ' with a single-character delimiter such as '#' (any character that never appears in a drug name will do).

Then use your code with scan(text,n,'#') to pick out each drug and count them.
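A rough sketch of that idea, assuming '#' never occurs in a drug name (the variable names text, n, i and drug are only for illustration, and the exclusion logic is left out):

```sas
data _null_;
  length string1 text $80 drug $40;
  string1 = 'fluorouracil - irinotecan - ziv-aflibercept';
  /* only the space-surrounded hyphens become the delimiter */
  text = tranwrd(string1, ' - ', '#');
  /* '#' is the only delimiter, so hyphens and spaces inside a name are kept */
  n = countw(text, '#');
  do i = 1 to n;
    drug = scan(text, i, '#');
    put i= drug=;
  end;
run;
```

Because '#' is the only delimiter passed to COUNTW and SCAN, 'ziv-aflibercept' and 'paclitaxel albumin bound' each stay in one piece.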