BookmarkSubscribeRSS Feed
ihtishamsultan
Obsidian | Level 7

Hello, I am working on a dataset that has 1000's of drugs.

 

I created a sample dataset and have attached it.

 

I am trying to create a variable that identifies if a combination of DrugA and DrugB is repeated in the attached dataset. For instance row 1, 11 and 14 have the same combination of drugs A1 and A2 (Also repeated combinations are seen in- row 7 and 13; row 5 and 12; 2 and 15).

 

I know it can be done for two possible combinations (A1 & A2; B2 & B3) through:

data test;
input Sno DrugA $ DrugB $;
datalines;
1 A1 A2
2 B2 B3
3 C3 C4
4 A2 A4
5 A5 A18
6 A7 A17
7 A9 A81
8 K7 L9
9 L9 L19
10 L8 L3
11 A2 A1
12 A18 A5
13 A81 A9
14 A1 A2
15 B2 B3;


data test1;
set test;
If DrugA= "A1" then C1=1;
else if DrugA="A2" then C1=1;
if DrugB= "A1" then C1=1;
else if DrugB="A2" then C1=1;
If DrugA= "B2" then C1=2;
else if DrugA="B3" then C1=2;
if DrugB= "B2" then C1=2;
else if DrugB="B3" then C1=2; run;

But I have 1000's of possible combinations in the real dataset and would like to know if there is a shortcut to do this?

 

Any help is appreciated. 

2 REPLIES 2
ballardw
Super User

You example code doesn't really explain anything. It doesn't actually look at any combination of anything. It is testing piece wise. If DrugB = "A1" you get the exact same result without any consideration of what DrugA might be.

data example;
   input druga $ drugb $;
   If DrugA= "A1" then C1=1;
   else if DrugA="A2" then C1=1;
   if DrugB= "A1" then C1=1;
   else if DrugB="A2" then C1=1;
datalines;
abc A1
.   A1
pdq A1
B   C
;

 

Many users here don't want to download Excel files because of virus potential, others have such things blocked by security software. Also if you give us Excel we have to create a SAS data set and due to the non-existent constraints on Excel data cells the result we end up with may not have variables of the same type (numeric or character) and even values.

Patrick
Opal | Level 21

Below one way to go. I've created a key with a distinct value for any combination of drugs. I've used md5() to create this key as this allows to define the length of the variable without having to know the lengths of the variables where you store the drug names.

data have;
  input Sno DrugA $ DrugB $;
datalines;
1 A1 A2
2 B2 B3
3 C3 C4
4 A2 A4
5 A5 A18
6 A7 A17
7 A9 A81
8 K7 L9
9 L9 L19
10 L8 L3
11 A2 A1
12 A18 A5
13 A81 A9
14 A1 A2
15 B2 B3
;

data want(drop=_:);
  set have;
  _DrugA=upcase(compress(DrugA));
  _DrugB=upcase(compress(DrugB));
  call sortc(_DrugA,_DrugB);
  hash_key=put(md5(catx('|',_DrugA,_DrugB)),hex32.);
run;

proc sql;
  select 
    sno,
    DrugA,
    DrugB,
    hash_key,
    count(*) as n_same_drugs
  from want
  group by hash_key
  order by sno
  ;
quit;

 

Patrick_0-1636007601910.png

 

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 2 replies
  • 1539 views
  • 1 like
  • 3 in conversation