Solved: How to conditionally delete data in columns

mtr91 · Posted 06-13-2022 07:39 PM

I have a list of drugs from toxicology reports (table=HAVE). Each row has the toxicology drug result for a give person (id). The columns include standalone drugs, as well as a concatenated parent/metabolite string that I created. For example, id-001 includes diazepam, nordiazepam, and diazepam/nordiazpam.

id	drug1	drug2	drug3	drug4	drug5	drug6	drug7
001	diazepam/nordiazepam	diazepam	morphine	nordiazepam
002	sertraline/norsertraline	methadone	norsertraline	pseudoephedrine	sertraline
003	caffeine	lidocaine	opiates
004	thc/carboxy-thc	citalopram/desmethylcitalopram	caffeine	carboxy-thc	citalopram	desmethylcitalopram	thc
005	thc/carboxy-thc	carboxy-thc	thc
006	caffeine	carboxy-thc	phenethylamine

I want to delete the standalone drugs in the columns when I already have the parent/metabolite included for the row (table=WANT). Such that:

id	drug1	drug2	drug3	drug4	drug5	drug6	drug7
001	diazepam/nordiazepam	morphine
002	sertraline/norsertraline	methadone	pseudoephedrine
003	caffeine	lidocaine	opiates
004	thc/carboxy-thc	citalopram/desmethylcitalopram	caffeine
005	thc/carboxy-thc
006	caffeine	carboxy-thc	phenethylamine

I have thousands of rows, and dozens of parent/metabolite combinations (i.e., thc/carboxy-thc, sertraline/norsertraline/diazepam/nordizepam, fentanyl/norfentanyl), so I need SAS to be able to search across columns and delete the standalone drug strings when a parent/metabolite combination exists for each row.

Thanks!

ballardw · Posted 06-13-2022 10:22 PM

There are some questions you need to consider about how complete your example might be.

The first is are all of your compounds exactly two compounds? If you have 3, 4 or more then you should say so.

Second, and possibly more critical, how consistent is the spelling? I have done a very small amount of working with drug names and found problems in that regard. Spelling includes case: "morphine" is not the same as "Morphine" for comparisons and need to know if code needs to adjust for case or not.

Third, are your compounds ALWAYS indicated by a /? If you have other characters then separating the individual compounds requires knowledge of that use.

This works for the first rows of your example. Note the first data step is to have something to code with. HINT HINT HINT.

data have;
   infile datalines truncover;
   informat id $5. drug1-drug7 $40.;
   input id	drug1	drug2	drug3	drug4	drug5	drug6	drug7;
datalines;
001	diazepam/nordiazepam	diazepam	morphine	nordiazepam	 	 	 
002	sertraline/norsertraline	methadone	norsertraline	pseudoephedrine	sertraline	 	 
003	caffeine	lidocaine	opiates	 	 	 	 
004	thc/carboxy-thc	citalopram/desmethylcitalopram	caffeine	carboxy-thc	citalopram	desmethylcitalopram	thc
005	thc/carboxy-thc	carboxy-thc	thc	 	 	 	 
006	caffeine	carboxy-thc	phenethylamine	 
;



data want;
   set have;
   array d (*) drug1-drug7;
   array c (7);
   /* helper array of indicators of COMPOUND values*/
   do i=1 to dim(d);
      c[i] = (findc(d[i],'/')>0);
   end;
   /* temp variables to hold pieces of the compound*/
   length temp1 temp2 $ 40;
   /* see if any of the array elements are compounds*/
   comp= whichn(1,of c(*));

   if comp>0 then do until (comp=0);
      /*get the individual components of the compounds*/
      temp1=scan(d[comp],1,'/');
      temp2=scan(d[comp],2,'/');
      do i=1 to dim(d);
         if d[i]=temp1 then call missing(d[i]);
         if d[i]=temp2 then call missing(d[i]);
      end;
      /* remove the indicator for the processed compound from the helper array*/
      c[comp]=0;
      /* see if there is another compound*/
      comp= whichn(1,of c(*));
   end;
   /* remove temporary variables*/
   drop i temp1 temp2 c1-c7 comp;
run;

Arrays are a basically tool when you need to do something similar to a bunch of variables. They are shorthand for addressing individual variables by using an array name and a numeric index for the position of the variable of the array.

This code uses a function WHICHN that searches a list of variables, in this case the array C, for presence of the numeric value 1 to know which are the compounds. We then find the first compound, get the pieces, then search through the drug array to find the pieces with the value and use Call missing to remove the value.

After the first compound is addressed we remove the index from the helper array and see if there is another. The Whichn, and the companion for character values Whichc, returns 0 when the target value is not found. So setting the C value to 0 removes it from the list.

Note: This does not do anything to move variables to the left to fill in the missing values.

That is a basic simple task with an array.

View solution in original post

tarheel13 · Posted 06-13-2022 08:55 PM

have you tried anything yet? also, please post the data in a data step.

mtr91 · Posted 06-14-2022 12:53 PM

Sorry! I see now that the data step is important to include when asking for help 🙂
data have;
infile datalines truncover;
informat id $5. drug1-drug7 $40.;
input id drug1 drug2 drug3 drug4 drug5 drug6 drug7;
datalines;
001 diazepam/nordiazepam diazepam morphine nordiazepam
002 sertraline/norsertraline methadone norsertraline pseudoephedrine sertraline
003 caffeine lidocaine opiates
004 thc/carboxy-thc citalopram/desmethylcitalopram caffeine carboxy-thc citalopram desmethylcitalopram thc
005 thc/carboxy-thc carboxy-thc thc
006 caffeine carboxy-thc phenethylamine
;

I hadn't tried anything yet. I couldn't wrap my head around where to begin. I figured I needed some sort of array, but have little experience using them. It looks like we found a solution though.

Thanks!

tarheel13 · Posted 06-14-2022 01:50 PM

that's okay. your present an interesting programming problem. I'm glad people on here found a solution that worked. happy programming!

ballardw · Posted 06-13-2022 10:22 PM

There are some questions you need to consider about how complete your example might be.

The first is are all of your compounds exactly two compounds? If you have 3, 4 or more then you should say so.

Second, and possibly more critical, how consistent is the spelling? I have done a very small amount of working with drug names and found problems in that regard. Spelling includes case: "morphine" is not the same as "Morphine" for comparisons and need to know if code needs to adjust for case or not.

Third, are your compounds ALWAYS indicated by a /? If you have other characters then separating the individual compounds requires knowledge of that use.

This works for the first rows of your example. Note the first data step is to have something to code with. HINT HINT HINT.

data have;
   infile datalines truncover;
   informat id $5. drug1-drug7 $40.;
   input id	drug1	drug2	drug3	drug4	drug5	drug6	drug7;
datalines;
001	diazepam/nordiazepam	diazepam	morphine	nordiazepam	 	 	 
002	sertraline/norsertraline	methadone	norsertraline	pseudoephedrine	sertraline	 	 
003	caffeine	lidocaine	opiates	 	 	 	 
004	thc/carboxy-thc	citalopram/desmethylcitalopram	caffeine	carboxy-thc	citalopram	desmethylcitalopram	thc
005	thc/carboxy-thc	carboxy-thc	thc	 	 	 	 
006	caffeine	carboxy-thc	phenethylamine	 
;



data want;
   set have;
   array d (*) drug1-drug7;
   array c (7);
   /* helper array of indicators of COMPOUND values*/
   do i=1 to dim(d);
      c[i] = (findc(d[i],'/')>0);
   end;
   /* temp variables to hold pieces of the compound*/
   length temp1 temp2 $ 40;
   /* see if any of the array elements are compounds*/
   comp= whichn(1,of c(*));

   if comp>0 then do until (comp=0);
      /*get the individual components of the compounds*/
      temp1=scan(d[comp],1,'/');
      temp2=scan(d[comp],2,'/');
      do i=1 to dim(d);
         if d[i]=temp1 then call missing(d[i]);
         if d[i]=temp2 then call missing(d[i]);
      end;
      /* remove the indicator for the processed compound from the helper array*/
      c[comp]=0;
      /* see if there is another compound*/
      comp= whichn(1,of c(*));
   end;
   /* remove temporary variables*/
   drop i temp1 temp2 c1-c7 comp;
run;

Arrays are a basically tool when you need to do something similar to a bunch of variables. They are shorthand for addressing individual variables by using an array name and a numeric index for the position of the variable of the array.

This code uses a function WHICHN that searches a list of variables, in this case the array C, for presence of the numeric value 1 to know which are the compounds. We then find the first compound, get the pieces, then search through the drug array to find the pieces with the value and use Call missing to remove the value.

After the first compound is addressed we remove the index from the helper array and see if there is another. The Whichn, and the companion for character values Whichc, returns 0 when the target value is not found. So setting the C value to 0 removes it from the list.

Note: This does not do anything to move variables to the left to fill in the missing values.

That is a basic simple task with an array.

mtr91 · Posted 06-14-2022 12:57 PM

There are some circumstances of 3 and 4 compounds, so they are not always only 2. I made some tweaks to your code which account for that and they seemed to work! Everything worked! This is so great, thank you.

data alldrugs06;
set alldrugs05;
array d (*) drug1-drug18;
array c (18);
/* helper array of indicators of COMPOUND values*/
do i=1 to dim(d);
c[i] = (findc(d[i],'/')>0);
end;
/* temp variables to hold pieces of the compound*/
length temp1 temp2 temp3 temp4 $ 40;
/* see if any of the array elements are compounds*/
comp= whichn(1,of c(*));

if comp>0 then do until (comp=0);
/*get the individual components of the compounds*/
temp1=scan(d[comp],1,'/');
temp2=scan(d[comp],2,'/');
temp3=scan(d[comp],3,'/');
temp4=scan(d[comp],4,'/');
do i=1 to dim(d);
if d[i]=temp1 then call missing(d[i]);
if d[i]=temp2 then call missing(d[i]);
if d[i]=temp3 then call missing(d[i]);
if d[i]=temp4 then call missing(d[i]);
end;
/* remove the indicator for the processed compound from the helper array*/
c[comp]=0;
/* see if there is another compound*/
comp= whichn(1,of c(*));
end;
/* remove temporary variables*/
drop i temp1 temp2 temp3 temp4 c1-c18 comp;
run;

Spelling and case is not an issue as those have been addressed and standardized in previous QA work. And the compound are always indicated with a '/'.

I see now the importance of including a data step when asking for help. I am still new to the community board but will be sure to include in future questions 🙂

Much thanks!

-M

ballardw · Posted 06-14-2022 01:18 PM

@mtr91 wrote:

There are some circumstances of 3 and 4 compounds, so they are not always only 2. I made some tweaks to your code which account for that and they seemed to work! Everything worked! This is so great, thank you.

Very good. Your adjustments are exactly what I was thinking about when I asked about longer compounds.

Hint: posting code, or log text, in a text box opened on the forum using the </> icon will preserve simple text formatting. The forum main message windows will do some interesting things with spaces and such and the text box will make reading indented code easier as well as setting it apart from questions/ comments. With log text the text box preserves the format of the text diagnostic characters that SAS often provides for errors and is easier to answer questions with that detail.

mkeintz · Posted 06-13-2022 10:35 PM

Are the parent/metabolite strings always the leftmost entries in each row?

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

mtr91 · Posted 06-14-2022 12:54 PM

Yes, they are. Thanks!

Ksharp · Posted 06-14-2022 07:52 AM

data have;
   infile datalines truncover;
   informat id $5. drug1-drug7 $40.;
   input id drug1 drug2 drug3 drug4 drug5 drug6 drug7;
datalines;
001 diazepam/nordiazepam diazepam morphine nordiazepam      
002 sertraline/norsertraline methadone norsertraline pseudoephedrine sertraline    
003 caffeine lidocaine opiates        
004 thc/carboxy-thc citalopram/desmethylcitalopram caffeine carboxy-thc citalopram desmethylcitalopram thc
005 thc/carboxy-thc carboxy-thc thc        
006 caffeine carboxy-thc phenethylamine  
;


data want;
 set have;
 array x{*} $ drug:;
 do i=1 to dim(x);
   if not findc(x{i},'/') then do;
     do j=1 to dim(x);
       if i ne j then do;
         if find(x{j},x{i},'it') then call missing(x{i});
    end;
  end;
   end;
 end;
drop i j;
run;

How to conditionally delete data in columns

Re: How to conditionally delete data in columns

Re: How to conditionally delete data in columns

Re: How to conditionally delete data in columns

Re: How to conditionally delete data in columns

Re: How to conditionally delete data in columns

Re: How to conditionally delete data in columns

Re: How to conditionally delete data in columns

Re: How to conditionally delete data in columns

Re: How to conditionally delete data in columns

Re: How to conditionally delete data in columns

Catch up on SAS Innovate 2026