I would like to match a maximum of 10 pharmaceuticals to each patient based on the symptoms addressed by the drugs and the symptoms recorded for each patient.
However, there is not a common variable to which to merge on and merging all pharmaceuticals to all patients and using the index function to find suitable drugs is not practical as it would result in a dataset with the number of records equivalent to 200 million drugs x 25 million patients (or 5,000 trillion records).
* THE DATASET PHARMACEUTICALS IN REALITY HAS 200 MILLION RECORDS;
DATA PHARMACEUTICALS;INPUT DRUG_ID SYMPTOM_WORD1 $ SYMPTOM_WORD2 $ SYMPTOM_WORD3 $;
CARDS;
1022 BP ARRYTHMIA
1023 ANGINA
1024 BP ANGINA PALP
;
RUN;
* THE DATASET PATIENTS IN REALITY HAS 25 MILLION RECORDS;
* SYMPTOM_WORDS ARE CONCATENATED TOGETHER AND SEPARTED BY COMMAS;
DATA PATIENTS;INPUT PATIENT_ID AILMENT_WORDS $;
CARDS;
245762 DIAHREA,NASEAU,ABDOMINAL CRAMPS
238761 MIGRAINE,ANGINA
239978 BP,HEADACHE,COUGH,THROAT PAIN
Ideally, if there was a common variable I would merge the two datasets and execute an INDEX function to find drugs that match patient symptoms and set a counter to limit these matches to 10 per patient, like this:
IF INDEX(AILMENT_WORDS, SEARCH_WORD1) GE 1 THEN EVENTMATCH+1;
IF INDEX(AILMENT_WORDS, SEARCH_WORD2) GE 1 THEN EVENTMATCH+1;
IF INDEX(AILMENT_WORDS, SEARCH_WORD3) GE 1 THEN EVENTMATCH+1;
I would set a loop counter using first.patient_ID to ensure no more than 10 drug matches per patient. However, as I cannot perform the merge “BY PATIENT_ID” as it is not common to both datasets I therefore cannot use first.patient_ID.
Any ideas?
A conventional approach seems the most scalable:
DATA PHARMACEUTICALS;
infile datalines missover;
INPUT DRUG_ID (SYMPTOM_WORD1 SYMPTOM_WORD2 SYMPTOM_WORD3) (:$16.);
datalines;
1022 BP ARRYTHMIA
1023 ANGINA
1024 BP ANGINA PALP
;
data pharm;
set pharmaceuticals;
array symptom_{*} SYMPTOM_WORD:;
do i = 1 to dim(symptom_);
symptom = symptom_{i};
if not missing(symptom) then output;
end;
keep DRUG_ID symptom;
run;
DATA PATIENTS;
INPUT PATIENT_ID AILMENT_WORDS &:$40.;
datalines;
245762 DIAHREA,NASEAU,ABDOMINAL CRAMPS
238761 MIGRAINE,ANGINA
239978 BP,HEADACHE,COUGH,THROAT PAIN
;
data pat;
set PATIENTS;
length ailment $16;
do i = 1 to countw(AILMENT_WORDS, ",");
ailment = scan(AILMENT_WORDS, i, ",");
output;
end;
keep PATIENT_ID ailment;
run;
proc sql;
create table drugs as
select PATIENT_ID, DRUG_ID
from pat as a inner join
pharm as b on b.symptom = a.ailment
order by PATIENT_ID;
quit;
data drugs10;
set drugs; by PATIENT_ID;
if first.PATIENT_ID then count = 0;
count + 1;
if count <= 10;
drop count;
run;
PG
Do you want different lists of 10 drugs for each instance of BP or the same for all BP (and so on for each ailment)?
Hi Ballardw,
I'm just looking for 10 match drugs for any given ailment. Therefore, it is possible that some ailments will not be addressed for given patients.
So the end result would be something along the lines of:
User1 drug1
User1 drug2
user1 drug3
...
user1 drug10
I know that sounds strange, but it is a theoretical rather than a practical exercise.
Thank you for replying!
I'm looking for a list of a maximum of 10 drugs per patient. Accordingly some ailments will not be addressed, but all patients will be addressed.
Ian,
You have ailment-words from a Patient and have symptom lists by drug_id. How would the OUTPUT look-like? Should there be the Drug_Id or Symptom_word for the matching Ailment_word? A simple OUTPUT data set with the related PHARMA and PATIENT data sets will easily explain what you want, instead of thousand words of explanation.
It sounds like you need Hash Table. Hope you have enough Memory to handle it .
A conventional approach seems the most scalable:
DATA PHARMACEUTICALS;
infile datalines missover;
INPUT DRUG_ID (SYMPTOM_WORD1 SYMPTOM_WORD2 SYMPTOM_WORD3) (:$16.);
datalines;
1022 BP ARRYTHMIA
1023 ANGINA
1024 BP ANGINA PALP
;
data pharm;
set pharmaceuticals;
array symptom_{*} SYMPTOM_WORD:;
do i = 1 to dim(symptom_);
symptom = symptom_{i};
if not missing(symptom) then output;
end;
keep DRUG_ID symptom;
run;
DATA PATIENTS;
INPUT PATIENT_ID AILMENT_WORDS &:$40.;
datalines;
245762 DIAHREA,NASEAU,ABDOMINAL CRAMPS
238761 MIGRAINE,ANGINA
239978 BP,HEADACHE,COUGH,THROAT PAIN
;
data pat;
set PATIENTS;
length ailment $16;
do i = 1 to countw(AILMENT_WORDS, ",");
ailment = scan(AILMENT_WORDS, i, ",");
output;
end;
keep PATIENT_ID ailment;
run;
proc sql;
create table drugs as
select PATIENT_ID, DRUG_ID
from pat as a inner join
pharm as b on b.symptom = a.ailment
order by PATIENT_ID;
quit;
data drugs10;
set drugs; by PATIENT_ID;
if first.PATIENT_ID then count = 0;
count + 1;
if count <= 10;
drop count;
run;
PG
Thank you PG Stats for your partial solution - much appreciated!!
Ian
Each step works fine until we get to the PROC SQLs. You can see below that at line 44 we end up with zero obs.
Here is the log file from the SQL steps:
39 proc sql;
40 create table drugs as
41 select PATIENT_ID, DRUG_ID
42 from pat as a inner join
43 pharm as b on b.symptom = a.ailment
44 order by PATIENT_ID;
NOTE: Table WORK.DRUGS created, with 0 rows and 2 columns.
45 quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 0.31 seconds
cpu time 0.06 seconds
46
47 data drugs10;
48 set drugs; by PATIENT_ID;
49 if first.PATIENT_ID then count = 0;
50 count + 1;
51 if count <= 10;
52 drop count;
53 run;
NOTE: There were 0 observations read from the data set WORK.DRUGS.
NOTE: The data set WORK.DRUGS10 has 0 observations and 2 variables.
NOTE: DATA statement used (Total process time):
real time 0.03 seconds
cpu time 0.01 seconds
When I run the code that I posted, I get 4 rows from the SQL step. Note that the join in the SQL step is case-sensitive, it requires exact matches between symptom and ailment words.
PG
Note also that it might be a good idea to add the DISTINCT option to the query for patients with many ailments related to the same drug :
proc sql;
create table drugs as
select DISTINCT PATIENT_ID, DRUG_ID
from pat as a inner join
pharm as b on b.symptom = a.ailment
order by PATIENT_ID;
quit;
PG
Hello PG Stats!
You are amazing!! Thank you once again.
I entered your code "exactly as is" (I copy and pasted it into SAS) at part 6 of this conversation and I still get 0 obs.
Sorry to bother you, but do you have any further ideas?
All the best to you,
Ian
PG STATS had some good suggestions, but they didn't completely work out.
Anybody else have any great ideas to solve this one?
Thank you to all!
Ian
Ian,
PGStats solution gives the following OUTPUT. Do you want a different output? Please tell what you now get and what you want. It will help others to help you.
PATIENT_
Obs ID DRUG_ID
1 238761 1023
2 238761 1024
3 239978 1022
4 239978 1024
Hello datasp,
My comment (#7 above shows my LOG) says it all. When I copy and paste the code as is and run it in SAS I end up with Zero OBS in the SQL steps. I'm not sure why.... the output you show is exactly what I need, but it just isn't happening for me.
Thank you for replying,
Ian
Ian,
I suspect your reading the input data sets is not OK. Better use Proc Print to see the values. Proc SQL is OK.
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.
