BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
mtr91
Obsidian | Level 7

I have a list of drugs from toxicology reports (table=HAVE). Each row has the toxicology drug result for a give person (id). The columns include standalone drugs, as well as a concatenated parent/metabolite string that I created. For example, id-001 includes diazepam, nordiazepam, and diazepam/nordiazpam. 

iddrug1drug2drug3drug4drug5drug6drug7
001diazepam/nordiazepamdiazepammorphinenordiazepam   
002sertraline/norsertralinemethadonenorsertralinepseudoephedrinesertraline  
003caffeinelidocaineopiates    
004thc/carboxy-thccitalopram/desmethylcitalopramcaffeinecarboxy-thccitalopramdesmethylcitalopramthc
005thc/carboxy-thccarboxy-thcthc    
006caffeinecarboxy-thcphenethylamine    

 

I want to delete the standalone drugs in the columns when I already have the parent/metabolite included for the row (table=WANT). Such that:

iddrug1drug2drug3drug4drug5drug6drug7
001diazepam/nordiazepammorphine     
002sertraline/norsertralinemethadonepseudoephedrine    
003caffeinelidocaineopiates    
004thc/carboxy-thccitalopram/desmethylcitalopramcaffeine    
005thc/carboxy-thc      
006caffeinecarboxy-thcphenethylamine    

 

I have thousands of rows, and dozens of parent/metabolite combinations (i.e., thc/carboxy-thc, sertraline/norsertraline/diazepam/nordizepam, fentanyl/norfentanyl), so I need SAS to be able to search across columns and delete the standalone drug strings when a parent/metabolite combination exists for each row.

 

Thanks!

1 ACCEPTED SOLUTION

Accepted Solutions
ballardw
Super User

There are some questions you need to consider about how complete your example might be.

The first is are all of your compounds exactly two compounds? If you have 3, 4 or more then you should say so.

Second, and possibly more critical, how consistent is the spelling? I have done a very small amount of working with drug names and found problems in that regard. Spelling includes case: "morphine" is not the same as "Morphine" for comparisons and need to know if code needs to adjust for case or not.

Third, are your compounds ALWAYS indicated by a /? If you have other characters then separating the individual compounds requires knowledge of that use.

 

 

This works for the first rows of your example. Note the first data step is to have something to code with. HINT HINT HINT.

data have;
   infile datalines truncover;
   informat id $5. drug1-drug7 $40.;
   input id	drug1	drug2	drug3	drug4	drug5	drug6	drug7;
datalines;
001	diazepam/nordiazepam	diazepam	morphine	nordiazepam	 	 	 
002	sertraline/norsertraline	methadone	norsertraline	pseudoephedrine	sertraline	 	 
003	caffeine	lidocaine	opiates	 	 	 	 
004	thc/carboxy-thc	citalopram/desmethylcitalopram	caffeine	carboxy-thc	citalopram	desmethylcitalopram	thc
005	thc/carboxy-thc	carboxy-thc	thc	 	 	 	 
006	caffeine	carboxy-thc	phenethylamine	 
;



data want;
   set have;
   array d (*) drug1-drug7;
   array c (7);
   /* helper array of indicators of COMPOUND values*/
   do i=1 to dim(d);
      c[i] = (findc(d[i],'/')>0);
   end;
   /* temp variables to hold pieces of the compound*/
   length temp1 temp2 $ 40;
   /* see if any of the array elements are compounds*/
   comp= whichn(1,of c(*));

   if comp>0 then do until (comp=0);
      /*get the individual components of the compounds*/
      temp1=scan(d[comp],1,'/');
      temp2=scan(d[comp],2,'/');
      do i=1 to dim(d);
         if d[i]=temp1 then call missing(d[i]);
         if d[i]=temp2 then call missing(d[i]);
      end;
      /* remove the indicator for the processed compound from the helper array*/
      c[comp]=0;
      /* see if there is another compound*/
      comp= whichn(1,of c(*));
   end;
   /* remove temporary variables*/
   drop i temp1 temp2 c1-c7 comp;
run;

Arrays are a basically tool when you need to do something similar to a bunch of variables. They are shorthand for addressing individual variables by using an array name and a numeric index for the position of the variable of the array.

This code uses a function WHICHN that searches a list of variables, in this case the array C, for presence of the numeric value 1 to know which are the compounds. We then find the first compound, get the pieces, then search through the drug array to find the pieces with the value and use Call missing to remove the value.

After the first compound is addressed we remove the index from the helper array and see if there is another. The Whichn, and the companion for character values Whichc, returns 0 when the target value is not found. So setting the C value to 0 removes it from the list.

 

Note: This does not do anything to move variables to the left to fill in the missing values.

That is a basic simple task with an array.

View solution in original post

9 REPLIES 9
tarheel13
Rhodochrosite | Level 12

have you tried anything yet? also, please post the data in a data step.

mtr91
Obsidian | Level 7
Sorry! I see now that the data step is important to include when asking for help 🙂
data have;
infile datalines truncover;
informat id $5. drug1-drug7 $40.;
input id drug1 drug2 drug3 drug4 drug5 drug6 drug7;
datalines;
001 diazepam/nordiazepam diazepam morphine nordiazepam
002 sertraline/norsertraline methadone norsertraline pseudoephedrine sertraline
003 caffeine lidocaine opiates
004 thc/carboxy-thc citalopram/desmethylcitalopram caffeine carboxy-thc citalopram desmethylcitalopram thc
005 thc/carboxy-thc carboxy-thc thc
006 caffeine carboxy-thc phenethylamine
;

I hadn't tried anything yet. I couldn't wrap my head around where to begin. I figured I needed some sort of array, but have little experience using them. It looks like we found a solution though.

Thanks!
tarheel13
Rhodochrosite | Level 12

that's okay. your present an interesting programming problem. I'm glad people on here found a solution that worked. happy programming!

ballardw
Super User

There are some questions you need to consider about how complete your example might be.

The first is are all of your compounds exactly two compounds? If you have 3, 4 or more then you should say so.

Second, and possibly more critical, how consistent is the spelling? I have done a very small amount of working with drug names and found problems in that regard. Spelling includes case: "morphine" is not the same as "Morphine" for comparisons and need to know if code needs to adjust for case or not.

Third, are your compounds ALWAYS indicated by a /? If you have other characters then separating the individual compounds requires knowledge of that use.

 

 

This works for the first rows of your example. Note the first data step is to have something to code with. HINT HINT HINT.

data have;
   infile datalines truncover;
   informat id $5. drug1-drug7 $40.;
   input id	drug1	drug2	drug3	drug4	drug5	drug6	drug7;
datalines;
001	diazepam/nordiazepam	diazepam	morphine	nordiazepam	 	 	 
002	sertraline/norsertraline	methadone	norsertraline	pseudoephedrine	sertraline	 	 
003	caffeine	lidocaine	opiates	 	 	 	 
004	thc/carboxy-thc	citalopram/desmethylcitalopram	caffeine	carboxy-thc	citalopram	desmethylcitalopram	thc
005	thc/carboxy-thc	carboxy-thc	thc	 	 	 	 
006	caffeine	carboxy-thc	phenethylamine	 
;



data want;
   set have;
   array d (*) drug1-drug7;
   array c (7);
   /* helper array of indicators of COMPOUND values*/
   do i=1 to dim(d);
      c[i] = (findc(d[i],'/')>0);
   end;
   /* temp variables to hold pieces of the compound*/
   length temp1 temp2 $ 40;
   /* see if any of the array elements are compounds*/
   comp= whichn(1,of c(*));

   if comp>0 then do until (comp=0);
      /*get the individual components of the compounds*/
      temp1=scan(d[comp],1,'/');
      temp2=scan(d[comp],2,'/');
      do i=1 to dim(d);
         if d[i]=temp1 then call missing(d[i]);
         if d[i]=temp2 then call missing(d[i]);
      end;
      /* remove the indicator for the processed compound from the helper array*/
      c[comp]=0;
      /* see if there is another compound*/
      comp= whichn(1,of c(*));
   end;
   /* remove temporary variables*/
   drop i temp1 temp2 c1-c7 comp;
run;

Arrays are a basically tool when you need to do something similar to a bunch of variables. They are shorthand for addressing individual variables by using an array name and a numeric index for the position of the variable of the array.

This code uses a function WHICHN that searches a list of variables, in this case the array C, for presence of the numeric value 1 to know which are the compounds. We then find the first compound, get the pieces, then search through the drug array to find the pieces with the value and use Call missing to remove the value.

After the first compound is addressed we remove the index from the helper array and see if there is another. The Whichn, and the companion for character values Whichc, returns 0 when the target value is not found. So setting the C value to 0 removes it from the list.

 

Note: This does not do anything to move variables to the left to fill in the missing values.

That is a basic simple task with an array.

mtr91
Obsidian | Level 7

There are some circumstances of 3 and 4 compounds, so they are not always only 2. I made some tweaks to your code which account for that and they seemed to work! Everything worked! This is so great, thank you.

 

data alldrugs06;
set alldrugs05;
array d (*) drug1-drug18;
array c (18);
/* helper array of indicators of COMPOUND values*/
do i=1 to dim(d);
c[i] = (findc(d[i],'/')>0);
end;
/* temp variables to hold pieces of the compound*/
length temp1 temp2 temp3 temp4 $ 40;
/* see if any of the array elements are compounds*/
comp= whichn(1,of c(*));

if comp>0 then do until (comp=0);
/*get the individual components of the compounds*/
temp1=scan(d[comp],1,'/');
temp2=scan(d[comp],2,'/');
temp3=scan(d[comp],3,'/');
temp4=scan(d[comp],4,'/');
do i=1 to dim(d);
if d[i]=temp1 then call missing(d[i]);
if d[i]=temp2 then call missing(d[i]);
if d[i]=temp3 then call missing(d[i]);
if d[i]=temp4 then call missing(d[i]);
end;
/* remove the indicator for the processed compound from the helper array*/
c[comp]=0;
/* see if there is another compound*/
comp= whichn(1,of c(*));
end;
/* remove temporary variables*/
drop i temp1 temp2 temp3 temp4 c1-c18 comp;
run;

 

Spelling and case is not an issue as those have been addressed and standardized in previous QA work. And the compound are always indicated with a '/'.

I see now the importance of including a data step when asking for help. I am still new to the community board but will be sure to include in future questions 🙂

 

Much thanks!

-M

ballardw
Super User

@mtr91 wrote:

There are some circumstances of 3 and 4 compounds, so they are not always only 2. I made some tweaks to your code which account for that and they seemed to work! Everything worked! This is so great, thank you.

 

 


Very good. Your adjustments are exactly what I was thinking about when I asked about longer compounds.

Hint: posting code, or log text, in a text box opened on the forum using the </> icon will preserve simple text formatting. The forum main message windows will do some interesting things with spaces and such and the text box will make reading indented code easier as well as setting it apart from questions/ comments. With log text the text box preserves the format of the text diagnostic characters that SAS often provides for errors and is easier to answer questions with that detail.

 

mkeintz
PROC Star

Are the parent/metabolite strings always the leftmost entries in each row?

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
mtr91
Obsidian | Level 7
Yes, they are. Thanks!
Ksharp
Super User
data have;
   infile datalines truncover;
   informat id $5. drug1-drug7 $40.;
   input id drug1 drug2 drug3 drug4 drug5 drug6 drug7;
datalines;
001 diazepam/nordiazepam diazepam morphine nordiazepam      
002 sertraline/norsertraline methadone norsertraline pseudoephedrine sertraline    
003 caffeine lidocaine opiates        
004 thc/carboxy-thc citalopram/desmethylcitalopram caffeine carboxy-thc citalopram desmethylcitalopram thc
005 thc/carboxy-thc carboxy-thc thc        
006 caffeine carboxy-thc phenethylamine  
;


data want;
 set have;
 array x{*} $ drug:;
 do i=1 to dim(x);
   if not findc(x{i},'/') then do;
     do j=1 to dim(x);
       if i ne j then do;
         if find(x{j},x{i},'it') then call missing(x{i});
    end;
  end;
   end;
 end;
drop i j;
run;

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

Mastering the WHERE Clause in PROC SQL

SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 9 replies
  • 1902 views
  • 9 likes
  • 5 in conversation