Creating a new variable based on repeated common values in two columns

ihtishamsultan · Posted 11-03-2021 10:53 PM

Hello, I am working on a dataset that has 1000's of drugs.

I created a sample dataset and have attached it.

I am trying to create a variable that identifies if a combination of DrugA and DrugB is repeated in the attached dataset. For instance row 1, 11 and 14 have the same combination of drugs A1 and A2 (Also repeated combinations are seen in- row 7 and 13; row 5 and 12; 2 and 15).

I know it can be done for two possible combinations (A1 & A2; B2 & B3) through:

data test;
input Sno DrugA $ DrugB $;
datalines;
1 A1 A2
2 B2 B3
3 C3 C4
4 A2 A4
5 A5 A18
6 A7 A17
7 A9 A81
8 K7 L9
9 L9 L19
10 L8 L3
11 A2 A1
12 A18 A5
13 A81 A9
14 A1 A2
15 B2 B3;


data test1;
set test;
If DrugA= "A1" then C1=1;
else if DrugA="A2" then C1=1;
if DrugB= "A1" then C1=1;
else if DrugB="A2" then C1=1;
If DrugA= "B2" then C1=2; 
else if DrugA="B3" then C1=2; 
if DrugB= "B2" then C1=2; 
else if DrugB="B3" then C1=2;
run;

But I have 1000's of possible combinations in the real dataset and would like to know if there is a shortcut to do this?

Any help is appreciated.

ballardw · Posted 11-03-2021 11:20 PM

You example code doesn't really explain anything. It doesn't actually look at any combination of anything. It is testing piece wise. If DrugB = "A1" you get the exact same result without any consideration of what DrugA might be.

data example;
   input druga $ drugb $;
   If DrugA= "A1" then C1=1;
   else if DrugA="A2" then C1=1;
   if DrugB= "A1" then C1=1;
   else if DrugB="A2" then C1=1;
datalines;
abc A1
.   A1
pdq A1
B   C
;

Many users here don't want to download Excel files because of virus potential, others have such things blocked by security software. Also if you give us Excel we have to create a SAS data set and due to the non-existent constraints on Excel data cells the result we end up with may not have variables of the same type (numeric or character) and even values.

Patrick · Posted 11-04-2021 02:32 AM

Below one way to go. I've created a key with a distinct value for any combination of drugs. I've used md5() to create this key as this allows to define the length of the variable without having to know the lengths of the variables where you store the drug names.

data have;
  input Sno DrugA $ DrugB $;
datalines;
1 A1 A2
2 B2 B3
3 C3 C4
4 A2 A4
5 A5 A18
6 A7 A17
7 A9 A81
8 K7 L9
9 L9 L19
10 L8 L3
11 A2 A1
12 A18 A5
13 A81 A9
14 A1 A2
15 B2 B3
;

data want(drop=_:);
  set have;
  _DrugA=upcase(compress(DrugA));
  _DrugB=upcase(compress(DrugB));
  call sortc(_DrugA,_DrugB);
  hash_key=put(md5(catx('|',_DrugA,_DrugB)),hex32.);
run;

proc sql;
  select 
    sno,
    DrugA,
    DrugB,
    hash_key,
    count(*) as n_same_drugs
  from want
  group by hash_key
  order by sno
  ;
quit;

Creating a new variable based on repeated common values in two columns

Re: Creating a new variable based on repeated common values in two columns

Re: Creating a new variable based on repeated common values in two columns

Creating a new variable based on repeated common values in two columns

Re: Creating a new variable based on repeated common values in two columns

Re: Creating a new variable based on repeated common values in two columns

The 2025 SAS Hackathon has begun!

SAS Training: Just a Click Away