Hello, I am working on a dataset that has 1000's of drugs.
I created a sample dataset and have attached it.
I am trying to create a variable that identifies if a combination of DrugA and DrugB is repeated in the attached dataset. For instance row 1, 11 and 14 have the same combination of drugs A1 and A2 (Also repeated combinations are seen in- row 7 and 13; row 5 and 12; 2 and 15).
I know it can be done for two possible combinations (A1 & A2; B2 & B3) through:
data test; input Sno DrugA $ DrugB $; datalines; 1 A1 A2 2 B2 B3 3 C3 C4 4 A2 A4 5 A5 A18 6 A7 A17 7 A9 A81 8 K7 L9 9 L9 L19 10 L8 L3 11 A2 A1 12 A18 A5 13 A81 A9 14 A1 A2 15 B2 B3; data test1; set test; If DrugA= "A1" then C1=1; else if DrugA="A2" then C1=1; if DrugB= "A1" then C1=1; else if DrugB="A2" then C1=1;
If DrugA= "B2" then C1=2;
else if DrugA="B3" then C1=2;
if DrugB= "B2" then C1=2;
else if DrugB="B3" then C1=2; run;
But I have 1000's of possible combinations in the real dataset and would like to know if there is a shortcut to do this?
Any help is appreciated.
You example code doesn't really explain anything. It doesn't actually look at any combination of anything. It is testing piece wise. If DrugB = "A1" you get the exact same result without any consideration of what DrugA might be.
data example; input druga $ drugb $; If DrugA= "A1" then C1=1; else if DrugA="A2" then C1=1; if DrugB= "A1" then C1=1; else if DrugB="A2" then C1=1; datalines; abc A1 . A1 pdq A1 B C ;
Many users here don't want to download Excel files because of virus potential, others have such things blocked by security software. Also if you give us Excel we have to create a SAS data set and due to the non-existent constraints on Excel data cells the result we end up with may not have variables of the same type (numeric or character) and even values.
Below one way to go. I've created a key with a distinct value for any combination of drugs. I've used md5() to create this key as this allows to define the length of the variable without having to know the lengths of the variables where you store the drug names.
data have;
input Sno DrugA $ DrugB $;
datalines;
1 A1 A2
2 B2 B3
3 C3 C4
4 A2 A4
5 A5 A18
6 A7 A17
7 A9 A81
8 K7 L9
9 L9 L19
10 L8 L3
11 A2 A1
12 A18 A5
13 A81 A9
14 A1 A2
15 B2 B3
;
data want(drop=_:);
set have;
_DrugA=upcase(compress(DrugA));
_DrugB=upcase(compress(DrugB));
call sortc(_DrugA,_DrugB);
hash_key=put(md5(catx('|',_DrugA,_DrugB)),hex32.);
run;
proc sql;
select
sno,
DrugA,
DrugB,
hash_key,
count(*) as n_same_drugs
from want
group by hash_key
order by sno
;
quit;
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.