I have a list of drugs from toxicology reports (table=HAVE). Each row has the toxicology drug result for a give person (id). The columns include standalone drugs, as well as a concatenated parent/metabolite string that I created. For example, id-001 includes diazepam, nordiazepam, and diazepam/nordiazpam.
id | drug1 | drug2 | drug3 | drug4 | drug5 | drug6 | drug7 |
001 | diazepam/nordiazepam | diazepam | morphine | nordiazepam | |||
002 | sertraline/norsertraline | methadone | norsertraline | pseudoephedrine | sertraline | ||
003 | caffeine | lidocaine | opiates | ||||
004 | thc/carboxy-thc | citalopram/desmethylcitalopram | caffeine | carboxy-thc | citalopram | desmethylcitalopram | thc |
005 | thc/carboxy-thc | carboxy-thc | thc | ||||
006 | caffeine | carboxy-thc | phenethylamine |
I want to delete the standalone drugs in the columns when I already have the parent/metabolite included for the row (table=WANT). Such that:
id | drug1 | drug2 | drug3 | drug4 | drug5 | drug6 | drug7 |
001 | diazepam/nordiazepam | morphine | |||||
002 | sertraline/norsertraline | methadone | pseudoephedrine | ||||
003 | caffeine | lidocaine | opiates | ||||
004 | thc/carboxy-thc | citalopram/desmethylcitalopram | caffeine | ||||
005 | thc/carboxy-thc | ||||||
006 | caffeine | carboxy-thc | phenethylamine |
I have thousands of rows, and dozens of parent/metabolite combinations (i.e., thc/carboxy-thc, sertraline/norsertraline/diazepam/nordizepam, fentanyl/norfentanyl), so I need SAS to be able to search across columns and delete the standalone drug strings when a parent/metabolite combination exists for each row.
Thanks!
There are some questions you need to consider about how complete your example might be.
The first is are all of your compounds exactly two compounds? If you have 3, 4 or more then you should say so.
Second, and possibly more critical, how consistent is the spelling? I have done a very small amount of working with drug names and found problems in that regard. Spelling includes case: "morphine" is not the same as "Morphine" for comparisons and need to know if code needs to adjust for case or not.
Third, are your compounds ALWAYS indicated by a /? If you have other characters then separating the individual compounds requires knowledge of that use.
This works for the first rows of your example. Note the first data step is to have something to code with. HINT HINT HINT.
data have; infile datalines truncover; informat id $5. drug1-drug7 $40.; input id drug1 drug2 drug3 drug4 drug5 drug6 drug7; datalines; 001 diazepam/nordiazepam diazepam morphine nordiazepam 002 sertraline/norsertraline methadone norsertraline pseudoephedrine sertraline 003 caffeine lidocaine opiates 004 thc/carboxy-thc citalopram/desmethylcitalopram caffeine carboxy-thc citalopram desmethylcitalopram thc 005 thc/carboxy-thc carboxy-thc thc 006 caffeine carboxy-thc phenethylamine ; data want; set have; array d (*) drug1-drug7; array c (7); /* helper array of indicators of COMPOUND values*/ do i=1 to dim(d); c[i] = (findc(d[i],'/')>0); end; /* temp variables to hold pieces of the compound*/ length temp1 temp2 $ 40; /* see if any of the array elements are compounds*/ comp= whichn(1,of c(*)); if comp>0 then do until (comp=0); /*get the individual components of the compounds*/ temp1=scan(d[comp],1,'/'); temp2=scan(d[comp],2,'/'); do i=1 to dim(d); if d[i]=temp1 then call missing(d[i]); if d[i]=temp2 then call missing(d[i]); end; /* remove the indicator for the processed compound from the helper array*/ c[comp]=0; /* see if there is another compound*/ comp= whichn(1,of c(*)); end; /* remove temporary variables*/ drop i temp1 temp2 c1-c7 comp; run;
Arrays are a basically tool when you need to do something similar to a bunch of variables. They are shorthand for addressing individual variables by using an array name and a numeric index for the position of the variable of the array.
This code uses a function WHICHN that searches a list of variables, in this case the array C, for presence of the numeric value 1 to know which are the compounds. We then find the first compound, get the pieces, then search through the drug array to find the pieces with the value and use Call missing to remove the value.
After the first compound is addressed we remove the index from the helper array and see if there is another. The Whichn, and the companion for character values Whichc, returns 0 when the target value is not found. So setting the C value to 0 removes it from the list.
Note: This does not do anything to move variables to the left to fill in the missing values.
That is a basic simple task with an array.
have you tried anything yet? also, please post the data in a data step.
that's okay. your present an interesting programming problem. I'm glad people on here found a solution that worked. happy programming!
There are some questions you need to consider about how complete your example might be.
The first is are all of your compounds exactly two compounds? If you have 3, 4 or more then you should say so.
Second, and possibly more critical, how consistent is the spelling? I have done a very small amount of working with drug names and found problems in that regard. Spelling includes case: "morphine" is not the same as "Morphine" for comparisons and need to know if code needs to adjust for case or not.
Third, are your compounds ALWAYS indicated by a /? If you have other characters then separating the individual compounds requires knowledge of that use.
This works for the first rows of your example. Note the first data step is to have something to code with. HINT HINT HINT.
data have; infile datalines truncover; informat id $5. drug1-drug7 $40.; input id drug1 drug2 drug3 drug4 drug5 drug6 drug7; datalines; 001 diazepam/nordiazepam diazepam morphine nordiazepam 002 sertraline/norsertraline methadone norsertraline pseudoephedrine sertraline 003 caffeine lidocaine opiates 004 thc/carboxy-thc citalopram/desmethylcitalopram caffeine carboxy-thc citalopram desmethylcitalopram thc 005 thc/carboxy-thc carboxy-thc thc 006 caffeine carboxy-thc phenethylamine ; data want; set have; array d (*) drug1-drug7; array c (7); /* helper array of indicators of COMPOUND values*/ do i=1 to dim(d); c[i] = (findc(d[i],'/')>0); end; /* temp variables to hold pieces of the compound*/ length temp1 temp2 $ 40; /* see if any of the array elements are compounds*/ comp= whichn(1,of c(*)); if comp>0 then do until (comp=0); /*get the individual components of the compounds*/ temp1=scan(d[comp],1,'/'); temp2=scan(d[comp],2,'/'); do i=1 to dim(d); if d[i]=temp1 then call missing(d[i]); if d[i]=temp2 then call missing(d[i]); end; /* remove the indicator for the processed compound from the helper array*/ c[comp]=0; /* see if there is another compound*/ comp= whichn(1,of c(*)); end; /* remove temporary variables*/ drop i temp1 temp2 c1-c7 comp; run;
Arrays are a basically tool when you need to do something similar to a bunch of variables. They are shorthand for addressing individual variables by using an array name and a numeric index for the position of the variable of the array.
This code uses a function WHICHN that searches a list of variables, in this case the array C, for presence of the numeric value 1 to know which are the compounds. We then find the first compound, get the pieces, then search through the drug array to find the pieces with the value and use Call missing to remove the value.
After the first compound is addressed we remove the index from the helper array and see if there is another. The Whichn, and the companion for character values Whichc, returns 0 when the target value is not found. So setting the C value to 0 removes it from the list.
Note: This does not do anything to move variables to the left to fill in the missing values.
That is a basic simple task with an array.
There are some circumstances of 3 and 4 compounds, so they are not always only 2. I made some tweaks to your code which account for that and they seemed to work! Everything worked! This is so great, thank you.
data alldrugs06;
set alldrugs05;
array d (*) drug1-drug18;
array c (18);
/* helper array of indicators of COMPOUND values*/
do i=1 to dim(d);
c[i] = (findc(d[i],'/')>0);
end;
/* temp variables to hold pieces of the compound*/
length temp1 temp2 temp3 temp4 $ 40;
/* see if any of the array elements are compounds*/
comp= whichn(1,of c(*));
if comp>0 then do until (comp=0);
/*get the individual components of the compounds*/
temp1=scan(d[comp],1,'/');
temp2=scan(d[comp],2,'/');
temp3=scan(d[comp],3,'/');
temp4=scan(d[comp],4,'/');
do i=1 to dim(d);
if d[i]=temp1 then call missing(d[i]);
if d[i]=temp2 then call missing(d[i]);
if d[i]=temp3 then call missing(d[i]);
if d[i]=temp4 then call missing(d[i]);
end;
/* remove the indicator for the processed compound from the helper array*/
c[comp]=0;
/* see if there is another compound*/
comp= whichn(1,of c(*));
end;
/* remove temporary variables*/
drop i temp1 temp2 temp3 temp4 c1-c18 comp;
run;
Spelling and case is not an issue as those have been addressed and standardized in previous QA work. And the compound are always indicated with a '/'.
I see now the importance of including a data step when asking for help. I am still new to the community board but will be sure to include in future questions 🙂
Much thanks!
-M
@mtr91 wrote:
There are some circumstances of 3 and 4 compounds, so they are not always only 2. I made some tweaks to your code which account for that and they seemed to work! Everything worked! This is so great, thank you.
Very good. Your adjustments are exactly what I was thinking about when I asked about longer compounds.
Hint: posting code, or log text, in a text box opened on the forum using the </> icon will preserve simple text formatting. The forum main message windows will do some interesting things with spaces and such and the text box will make reading indented code easier as well as setting it apart from questions/ comments. With log text the text box preserves the format of the text diagnostic characters that SAS often provides for errors and is easier to answer questions with that detail.
Are the parent/metabolite strings always the leftmost entries in each row?
data have;
infile datalines truncover;
informat id $5. drug1-drug7 $40.;
input id drug1 drug2 drug3 drug4 drug5 drug6 drug7;
datalines;
001 diazepam/nordiazepam diazepam morphine nordiazepam
002 sertraline/norsertraline methadone norsertraline pseudoephedrine sertraline
003 caffeine lidocaine opiates
004 thc/carboxy-thc citalopram/desmethylcitalopram caffeine carboxy-thc citalopram desmethylcitalopram thc
005 thc/carboxy-thc carboxy-thc thc
006 caffeine carboxy-thc phenethylamine
;
data want;
set have;
array x{*} $ drug:;
do i=1 to dim(x);
if not findc(x{i},'/') then do;
do j=1 to dim(x);
if i ne j then do;
if find(x{j},x{i},'it') then call missing(x{i});
end;
end;
end;
end;
drop i j;
run;
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.
Find more tutorials on the SAS Users YouTube channel.