Solved: Re: Conditionally match merging records from two datasets without a co...

Ian_K · Posted 06-29-2015 02:32 PM

I would like to match a maximum of 10 pharmaceuticals to each patient based on the symptoms addressed by the drugs and the symptoms recorded for each patient.

However, there is not a common variable to which to merge on and merging all pharmaceuticals to all patients and using the index function to find suitable drugs is not practical as it would result in a dataset with the number of records equivalent to 200 million drugs x 25 million patients (or 5,000 trillion records).

* THE DATASET PHARMACEUTICALS IN REALITY HAS 200 MILLION RECORDS;

DATA PHARMACEUTICALS;INPUT DRUG_ID SYMPTOM_WORD1 $ SYMPTOM_WORD2 $ SYMPTOM_WORD3 $;

CARDS;

1022 BP ARRYTHMIA

1023 ANGINA

1024 BP ANGINA PALP

;

RUN;

* THE DATASET PATIENTS IN REALITY HAS 25 MILLION RECORDS;

* SYMPTOM_WORDS ARE CONCATENATED TOGETHER AND SEPARTED BY COMMAS;

DATA PATIENTS;INPUT PATIENT_ID AILMENT_WORDS $;

CARDS;

245762 DIAHREA,NASEAU,ABDOMINAL CRAMPS

238761 MIGRAINE,ANGINA

239978 BP,HEADACHE,COUGH,THROAT PAIN

Ideally, if there was a common variable I would merge the two datasets and execute an INDEX function to find drugs that match patient symptoms and set a counter to limit these matches to 10 per patient, like this:

IF INDEX(AILMENT_WORDS, SEARCH_WORD1) GE 1 THEN EVENTMATCH+1;

IF INDEX(AILMENT_WORDS, SEARCH_WORD2) GE 1 THEN EVENTMATCH+1;

IF INDEX(AILMENT_WORDS, SEARCH_WORD3) GE 1 THEN EVENTMATCH+1;

I would set a loop counter using first.patient_ID to ensure no more than 10 drug matches per patient. However, as I cannot perform the merge “BY PATIENT_ID” as it is not common to both datasets I therefore cannot use first.patient_ID.

Any ideas?

PGStats · Posted 06-30-2015 02:05 PM

A conventional approach seems the most scalable:

DATA PHARMACEUTICALS;

infile datalines missover;

INPUT DRUG_ID (SYMPTOM_WORD1 SYMPTOM_WORD2 SYMPTOM_WORD3) (:$16.);

datalines;

1022 BP ARRYTHMIA

1023 ANGINA

1024 BP ANGINA PALP

;

data pharm;

set pharmaceuticals;

array symptom_{*} SYMPTOM_WORD:;

do i = 1 to dim(symptom_);

symptom = symptom_{i};

if not missing(symptom) then output;

end;

keep DRUG_ID symptom;

run;

DATA PATIENTS;

INPUT PATIENT_ID AILMENT_WORDS &:$40.;

datalines;

245762 DIAHREA,NASEAU,ABDOMINAL CRAMPS

238761 MIGRAINE,ANGINA

239978 BP,HEADACHE,COUGH,THROAT PAIN

;

data pat;

set PATIENTS;

length ailment $16;

do i = 1 to countw(AILMENT_WORDS, ",");

ailment = scan(AILMENT_WORDS, i, ",");

output;

end;

keep PATIENT_ID ailment;

run;

proc sql;

create table drugs as

select PATIENT_ID, DRUG_ID

from pat as a inner join

pharm as b on b.symptom = a.ailment

order by PATIENT_ID;

quit;

data drugs10;

set drugs; by PATIENT_ID;

if first.PATIENT_ID then count = 0;

count + 1;

if count <= 10;

drop count;

run;

PG

View solution in original post

ballardw · Posted 06-29-2015 03:43 PM

Do you want different lists of 10 drugs for each instance of BP or the same for all BP (and so on for each ailment)?

Ian_K · Posted 06-29-2015 04:58 PM

Hi Ballardw,

I'm just looking for 10 match drugs for any given ailment. Therefore, it is possible that some ailments will not be addressed for given patients.

So the end result would be something along the lines of:

User1 drug1

User1 drug2

user1 drug3

...

user1 drug10

I know that sounds strange, but it is a theoretical rather than a practical exercise.

Thank you for replying!

Ian_K · Posted 06-29-2015 10:44 PM

I'm looking for a list of a maximum of 10 drugs per patient. Accordingly some ailments will not be addressed, but all patients will be addressed.

KachiM · Posted 06-30-2015 06:47 AM

Ian,

You have ailment-words from a Patient and have symptom lists by drug_id. How would the OUTPUT look-like? Should there be the Drug_Id or Symptom_word for the matching Ailment_word? A simple OUTPUT data set with the related PHARMA and PATIENT data sets will easily explain what you want, instead of thousand words of explanation.

Ksharp · Posted 06-30-2015 10:34 AM

It sounds like you need Hash Table. Hope you have enough Memory to handle it .

PGStats · Posted 06-30-2015 02:05 PM

A conventional approach seems the most scalable:

DATA PHARMACEUTICALS;

infile datalines missover;

INPUT DRUG_ID (SYMPTOM_WORD1 SYMPTOM_WORD2 SYMPTOM_WORD3) (:$16.);

datalines;

1022 BP ARRYTHMIA

1023 ANGINA

1024 BP ANGINA PALP

;

data pharm;

set pharmaceuticals;

array symptom_{*} SYMPTOM_WORD:;

do i = 1 to dim(symptom_);

symptom = symptom_{i};

if not missing(symptom) then output;

end;

keep DRUG_ID symptom;

run;

DATA PATIENTS;

INPUT PATIENT_ID AILMENT_WORDS &:$40.;

datalines;

245762 DIAHREA,NASEAU,ABDOMINAL CRAMPS

238761 MIGRAINE,ANGINA

239978 BP,HEADACHE,COUGH,THROAT PAIN

;

data pat;

set PATIENTS;

length ailment $16;

do i = 1 to countw(AILMENT_WORDS, ",");

ailment = scan(AILMENT_WORDS, i, ",");

output;

end;

keep PATIENT_ID ailment;

run;

proc sql;

create table drugs as

select PATIENT_ID, DRUG_ID

from pat as a inner join

pharm as b on b.symptom = a.ailment

order by PATIENT_ID;

quit;

data drugs10;

set drugs; by PATIENT_ID;

if first.PATIENT_ID then count = 0;

count + 1;

if count <= 10;

drop count;

run;

PG

Ian_K · Posted 06-30-2015 02:27 PM

Thank you PG Stats for your partial solution - much appreciated!!

Ian

Each step works fine until we get to the PROC SQLs. You can see below that at line 44 we end up with zero obs.

Here is the log file from the SQL steps:

39 proc sql;

40 create table drugs as

41 select PATIENT_ID, DRUG_ID

42 from pat as a inner join

43 pharm as b on b.symptom = a.ailment

44 order by PATIENT_ID;

NOTE: Table WORK.DRUGS created, with 0 rows and 2 columns.

45 quit;

NOTE: PROCEDURE SQL used (Total process time):

real time 0.31 seconds

cpu time 0.06 seconds

46

47 data drugs10;

48 set drugs; by PATIENT_ID;

49 if first.PATIENT_ID then count = 0;

50 count + 1;

51 if count <= 10;

52 drop count;

53 run;

NOTE: There were 0 observations read from the data set WORK.DRUGS.

NOTE: The data set WORK.DRUGS10 has 0 observations and 2 variables.

NOTE: DATA statement used (Total process time):

real time 0.03 seconds

cpu time 0.01 seconds

PGStats · Posted 06-30-2015 03:42 PM

When I run the code that I posted, I get 4 rows from the SQL step. Note that the join in the SQL step is case-sensitive, it requires exact matches between symptom and ailment words.

PG

PGStats · Posted 06-30-2015 03:49 PM

Note also that it might be a good idea to add the DISTINCT option to the query for patients with many ailments related to the same drug :

proc sql;

create table drugs as

select DISTINCT PATIENT_ID, DRUG_ID

from pat as a inner join

pharm as b on b.symptom = a.ailment

order by PATIENT_ID;

quit;

PG

Ian_K · Posted 07-01-2015 02:45 PM

Hello PG Stats!

You are amazing!! Thank you once again.

I entered your code "exactly as is" (I copy and pasted it into SAS) at part 6 of this conversation and I still get 0 obs.

Sorry to bother you, but do you have any further ideas?

All the best to you,

Ian

Ian_K · Posted 07-02-2015 01:20 PM

PG STATS had some good suggestions, but they didn't completely work out.

Anybody else have any great ideas to solve this one?

Thank you to all!

Ian

KachiM · Posted 07-02-2015 01:31 PM

Ian,

PGStats solution gives the following OUTPUT. Do you want a different output? Please tell what you now get and what you want. It will help others to help you.

PATIENT_

Obs ID DRUG_ID

1 238761 1023

2 238761 1024

3 239978 1022

4 239978 1024

Ian_K · Posted 07-02-2015 01:37 PM

Hello datasp,

My comment (#7 above shows my LOG) says it all. When I copy and paste the code as is and run it in SAS I end up with Zero OBS in the SQL steps. I'm not sure why.... the output you show is exactly what I need, but it just isn't happening for me.

Thank you for replying,

Ian

KachiM · Posted 07-02-2015 02:06 PM

Ian,

I suspect your reading the input data sets is not OK. Better use Proc Print to see the values. Proc SQL is OK.

Conditionally match merging records from two datasets without a common variable?

Re: Conditionally match merging records from two datasets without a common variable?

Re: Conditionally match merging records from two datasets without a common variable?

Re: Conditionally match merging records from two datasets without a common variable?

Re: Conditionally match merging records from two datasets without a common variable?

Re: Conditionally match merging records from two datasets without a common variable?

Re: Conditionally match merging records from two datasets without a common variable?

Re: Conditionally match merging records from two datasets without a common variable?

Re: Conditionally match merging records from two datasets without a common variable?

Re: Conditionally match merging records from two datasets without a common variable?

Re: Conditionally match merging records from two datasets without a common variable?

Re: Conditionally match merging records from two datasets without a common variable?

Re: Conditionally match merging records from two datasets without a common variable?

Re: Conditionally match merging records from two datasets without a common variable?

Re: Conditionally match merging records from two datasets without a common variable?

Re: Conditionally match merging records from two datasets without a common variable?

Classroom Training Available!