BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Ian_K
Calcite | Level 5

I would like to match a maximum of 10 pharmaceuticals to each patient based on the symptoms addressed by the drugs and the symptoms recorded for each patient.

However, there is not a common variable to which to merge on and merging all pharmaceuticals to all patients and using the index function to find suitable drugs is not practical as it would result in a dataset with the number of records equivalent to 200 million drugs x 25 million patients (or 5,000 trillion records).

* THE DATASET PHARMACEUTICALS IN REALITY HAS 200 MILLION RECORDS;

DATA PHARMACEUTICALS;INPUT DRUG_ID SYMPTOM_WORD1 $ SYMPTOM_WORD2 $ SYMPTOM_WORD3 $;

CARDS;

1022 BP ARRYTHMIA

1023 ANGINA

1024 BP ANGINA PALP

;

RUN;

* THE DATASET PATIENTS IN REALITY HAS 25 MILLION RECORDS;

* SYMPTOM_WORDS ARE CONCATENATED TOGETHER AND SEPARTED BY COMMAS;

DATA PATIENTS;INPUT PATIENT_ID AILMENT_WORDS $;

CARDS;

245762 DIAHREA,NASEAU,ABDOMINAL CRAMPS

238761 MIGRAINE,ANGINA

239978 BP,HEADACHE,COUGH,THROAT PAIN

Ideally, if there was a common variable I would merge the two datasets and execute an INDEX function to find drugs that match patient symptoms and set a counter to limit these matches to 10 per patient, like this:

IF INDEX(AILMENT_WORDS, SEARCH_WORD1) GE 1 THEN EVENTMATCH+1;

IF INDEX(AILMENT_WORDS, SEARCH_WORD2) GE 1 THEN EVENTMATCH+1;

IF INDEX(AILMENT_WORDS, SEARCH_WORD3) GE 1 THEN EVENTMATCH+1;

I would set a loop counter using first.patient_ID to ensure no more than 10 drug matches per patient.  However, as I cannot perform the merge “BY PATIENT_ID” as it is not common to both datasets I therefore cannot use first.patient_ID.

Any ideas?

1 ACCEPTED SOLUTION

Accepted Solutions
PGStats
Opal | Level 21

A conventional approach seems the most scalable:

DATA PHARMACEUTICALS;

infile datalines missover;

INPUT DRUG_ID (SYMPTOM_WORD1 SYMPTOM_WORD2 SYMPTOM_WORD3) (:$16.);

datalines;

1022 BP ARRYTHMIA

1023 ANGINA

1024 BP ANGINA PALP

;

data pharm;

set pharmaceuticals;

array symptom_{*} SYMPTOM_WORD:;

do i = 1 to dim(symptom_);

    symptom = symptom_{i};

    if not missing(symptom) then output;

    end;

keep DRUG_ID symptom;

run;

DATA PATIENTS;

INPUT PATIENT_ID AILMENT_WORDS &:$40.;

datalines;

245762 DIAHREA,NASEAU,ABDOMINAL CRAMPS

238761 MIGRAINE,ANGINA

239978 BP,HEADACHE,COUGH,THROAT PAIN

;

data pat;

set PATIENTS;

length ailment $16;

do i = 1 to countw(AILMENT_WORDS, ",");

    ailment = scan(AILMENT_WORDS, i, ",");

    output;

    end;

keep PATIENT_ID ailment;

run;

proc sql;

create table drugs as

select PATIENT_ID, DRUG_ID

from pat as a inner join

    pharm as b on b.symptom = a.ailment

order by PATIENT_ID;

quit;

data drugs10;

set drugs; by PATIENT_ID;

if first.PATIENT_ID then count = 0;

count + 1;

if count <= 10;

drop count;

run;

PG

PG

View solution in original post

18 REPLIES 18
ballardw
Super User

Do you want different lists of 10 drugs for each instance of BP or the same for all BP (and so on for each ailment)?

Ian_K
Calcite | Level 5

Hi Ballardw,

I'm just looking for 10 match drugs for any given ailment.  Therefore, it is possible that some ailments will not be addressed for given patients.

So the end result would be something along the lines of:

User1 drug1

User1 drug2

user1 drug3

...

user1 drug10

I know that sounds strange, but it is a theoretical rather than a practical exercise.

Thank you for replying!

Ian_K
Calcite | Level 5

I'm looking for a list of a maximum of 10 drugs per patient.  Accordingly some ailments will not be addressed, but all patients will be addressed.

KachiM
Rhodochrosite | Level 12

Ian,

You have ailment-words from a Patient and have symptom lists by drug_id. How would the OUTPUT look-like? Should there be the Drug_Id or Symptom_word for the matching Ailment_word? A simple OUTPUT data set with the related PHARMA and PATIENT data sets will easily explain what you want, instead of thousand words of explanation.

Ksharp
Super User

It sounds like you need Hash Table. Hope you have enough Memory to handle it .

PGStats
Opal | Level 21

A conventional approach seems the most scalable:

DATA PHARMACEUTICALS;

infile datalines missover;

INPUT DRUG_ID (SYMPTOM_WORD1 SYMPTOM_WORD2 SYMPTOM_WORD3) (:$16.);

datalines;

1022 BP ARRYTHMIA

1023 ANGINA

1024 BP ANGINA PALP

;

data pharm;

set pharmaceuticals;

array symptom_{*} SYMPTOM_WORD:;

do i = 1 to dim(symptom_);

    symptom = symptom_{i};

    if not missing(symptom) then output;

    end;

keep DRUG_ID symptom;

run;

DATA PATIENTS;

INPUT PATIENT_ID AILMENT_WORDS &:$40.;

datalines;

245762 DIAHREA,NASEAU,ABDOMINAL CRAMPS

238761 MIGRAINE,ANGINA

239978 BP,HEADACHE,COUGH,THROAT PAIN

;

data pat;

set PATIENTS;

length ailment $16;

do i = 1 to countw(AILMENT_WORDS, ",");

    ailment = scan(AILMENT_WORDS, i, ",");

    output;

    end;

keep PATIENT_ID ailment;

run;

proc sql;

create table drugs as

select PATIENT_ID, DRUG_ID

from pat as a inner join

    pharm as b on b.symptom = a.ailment

order by PATIENT_ID;

quit;

data drugs10;

set drugs; by PATIENT_ID;

if first.PATIENT_ID then count = 0;

count + 1;

if count <= 10;

drop count;

run;

PG

PG
Ian_K
Calcite | Level 5

Thank you PG Stats for your partial solution - much appreciated!!

Ian

Each step works fine until we get to the PROC SQLs.  You can see below that at line 44 we end up with zero obs. 

Here is the log file from the SQL steps:

39   proc sql;

40   create table drugs as

41   select PATIENT_ID, DRUG_ID

42   from pat as a inner join

43       pharm as b on b.symptom = a.ailment

44   order by PATIENT_ID;

NOTE: Table WORK.DRUGS created, with 0 rows and 2 columns.

45   quit;

NOTE: PROCEDURE SQL used (Total process time):

      real time           0.31 seconds

      cpu time            0.06 seconds

46

47   data drugs10;

48   set drugs; by PATIENT_ID;

49   if first.PATIENT_ID then count = 0;

50   count + 1;

51   if count <= 10;

52   drop count;

53   run;

NOTE: There were 0 observations read from the data set WORK.DRUGS.

NOTE: The data set WORK.DRUGS10 has 0 observations and 2 variables.

NOTE: DATA statement used (Total process time):

      real time           0.03 seconds

      cpu time            0.01 seconds

PGStats
Opal | Level 21

When I run the code that I posted, I get 4 rows from the SQL step. Note that the join in the SQL step is case-sensitive, it requires exact matches between symptom and ailment words.

PG

PG
PGStats
Opal | Level 21

Note also that it might be a good idea to add the DISTINCT option to the query for patients with many ailments related to the same drug :

proc sql;

create table drugs as

select DISTINCT PATIENT_ID, DRUG_ID

from pat as a inner join

    pharm as b on b.symptom = a.ailment

order by PATIENT_ID;

quit;


PG

PG
Ian_K
Calcite | Level 5

Hello PG Stats!

You are amazing!!  Thank you once again.

I entered your code "exactly as is" (I copy and pasted it into SAS) at part 6 of this conversation and I still get 0 obs.

Sorry to bother you, but do you have any further ideas?

All the best to you,

Ian

Ian_K
Calcite | Level 5

PG STATS had some good suggestions, but they didn't completely work out.

Anybody else have any great ideas to solve this one?

Thank you to all!

Ian

KachiM
Rhodochrosite | Level 12

Ian,

PGStats solution gives the following OUTPUT. Do you want a different output? Please tell what you now get and what you want. It will help others to help you.

       PATIENT_

Obs       ID       DRUG_ID

1      238761       1023

2      238761       1024

3      239978       1022

4      239978       1024

Ian_K
Calcite | Level 5

Hello datasp,

My comment (#7 above shows my LOG) says it all.  When I copy and paste the code as is and run it in SAS I end up with Zero OBS in the SQL steps.  I'm not sure why.... the output you show is exactly what I need, but it just isn't happening for me.

Thank you for replying,

Ian

KachiM
Rhodochrosite | Level 12

Ian,

I suspect your reading the input data sets is not OK. Better use Proc Print to see the values. Proc SQL is OK.

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 18 replies
  • 3897 views
  • 12 likes
  • 5 in conversation