Solved: Re: Combining datasets one to many where many dataset is wide

Lefty · Posted 03-15-2022 01:30 PM

Hi, I am trying to combine two datasets using Base SAS 9.4.
The first "master" dataset is laid out like this:

DxCode	Disease1	Disease2	Disease3	…	Disease17
B180	0	0	1	…	0
B181	0	0	1	…	0
B188	0	0	1	…	0
etc

I want to combine it with a "patient" data file that looks like this:

PatientID	Date	DxCode1	DxCode2	…	DxCode12
1	1/1/2017	A180	B180		D43
1	1/31/2017	C059	C060		I26
1	5/1/2017	D10	C061		D33
1	5/1/2017	E123	U32		S23
2	5/1/2017	I27	S44		P55
2	5/1/2017	S43	I28		B188

I want to include only rows of data from the "patient" file where at least one of the values of the variables DX1-DX12 is found in the "master" file and I want to know which disease that DX1-DX12 corresponds to. Ultimately, I would like the resulting data to look like this (and it could have many rows per patient if they have multiple diseases on different dates):

PatientID	Date	Disease1	Disease2	Disease3	…	Disease17
1	1/1/2017	0	0	1	…	0
2	5/1/2017	0	0	1	…	0

I don't care which "dx" field the disease appeared in or which particular dx code the patient had, just whether or not a dx code within each disease was diagnosed on a particular date. I was thinking I'd use proc sql, like this:

proc sql;

create table want as

select a.*, b.patientID, b.date

from master as a

inner join patient as b

on a.dxcode=b.dxcode1 or b.dxcode2 or b.dxcode3

or b.dxcode4 or b.dxcode5 or b.dxcode6 or b.dxcode7

or b.dxcode8 or b.dxcode9 or b.dxcode10 or b.dxcode11 or b.dxcode12;

But that is taking forever (my patient file is very large!) and I'm getting the error about "The execution of this query involves performing one or more Cartesian product joins that cannot be optimized"
I'm hoping there's a more efficient way to code this. Thank you so much in advance!

Tom · Posted 03-15-2022 02:13 PM

As long as your CODES to Disease flags is small enough to fit in memory this looks like a good use for HASH() object.

First let's convert your printouts into actual datasets. Let's add at least one more code that matches.

data codes;
  input DxCode :$5. Disease1-Disease4 ;
cards;
B180 0 0 1 0
B181 0 0 1 0
S43  1 0 0 0
;
 

data have;
  input PatientID Date :mmddyy. (DxCode1-DxCode3) (:$5.);
  format date yymmdd10.;
cards;
1 1/1/2017 A180 B180   D43
1 1/31/2017 C059 C060   I26
1 5/1/2017 D10 C061   D33
1 5/1/2017 E123 U32   S23
2 5/1/2017 I27 S44   P55
2 5/1/2017 S43 I28   B188
;

Now let's combine the HAVE with CODES using a hash and write out one record for every match. Having repeating values of DATE for the same ID is confusing, but we can deal with it if you want.

data match / view=match;
  set have codes(obs=0);
  if _n_=1 then do;
    declare hash h(dataset:'codes');
    rc=h.definekey('dxcode');
    rc=h.definedata('disease1','disease2','disease3','disease4');
    rc=h.definedone();
  end;
  array dx dxcode1-dxcode3 ;
  any=0;
  do index=1 to dim(dx);
    if not h.find(key:dx[index]) then do;
      any=1;
      output;
    end;
  end;
  if not any then do;
    array flags disease1-disease4;
    do index=1 to dim(flags);
       flags[index]=0;
    end;
    output;
  end;
  drop rc index dx:;
run;

proc summary data=match;
  by patientid;
  var date any disease: ;
  output out=want max= ;
run;

Results:

       Patient
Obs       ID      _TYPE_    _FREQ_          Date    any    Disease1    Disease2    Disease3    Disease4

 1        1          0         4      2017-05-01     1         0           0           1           0
 2        2          0         2      2017-05-01     1         1           0           0           0

View solution in original post

HB · Posted 03-15-2022 01:40 PM

So you want patientID, date, and disease for each instance where a patient dxcode matches a dxcode in the master table?

Edit:

OP replied

Yes, exactly!

Try this then. With the data in a more normalized format:

data master;
   input dxcode:$4. disease;
datalines;
B180 3
B181 3
B188 3
S43 4
P55 17
;
run;

data patient;
	input id mydate:mmddyy10. dxcode:$4.;
datalines;
1 1/1/2017 A180	
1 1/1/2017 B180
1 1/1/2017 D43
1 1/31/2017 C059
1 1/31/2017 C060
1 1/31/2017 I26
1 5/1/2017 D10
1 5/1/2017 C061
1 5/1/2017 D33
1 5/1/2017 E123
1 5/1/2017 U32
1 5/1/2017 S23
2 5/1/2017 I27
2 5/1/2017 S44
2 5/1/2017 P55
2 5/1/2017 S43
2 5/1/2017 I28
2 5/1/2017 B188
;
run;

We can pursue the SQL solution;

proc sql;
	select a.id, a.mydate format date9., a.dxcode, b.disease
	from patient a inner join master b
	on a.dxcode = b.dxcode
	order by 1, 2;
quit;

To get

id	mydate	dxcode	disease
1	1-Jan-17	B180	3
2	1-May-17	S43	4
2	1-May-17	P55	17
2	1-May-17	B188	3

Note I inflicted additional disease on patient 2 for testing purposes.

Lefty · Posted 03-15-2022 01:51 PM

Yes, exactly!

Lefty · Posted 03-15-2022 02:14 PM

Thanks for your quick reply!
The issue is, my patient data has 12 diagnosis codes per row. The dataset is 15GB so I'd prefer not to transpose the data from wide to long so I only have 1 diagnosis code. Would it make sense to I modify your sql code like this:
on a.dxcode=b.dxcode1 or a.dxcode=b.dxcode2 etc?

Reeza · Posted 03-15-2022 01:44 PM

For that first table, is there always only one disease that is flagged or can multiple be flagged?

Lefty · Posted 03-15-2022 01:52 PM

Multiple diseases could be flagged and I actually do want to capture that because I want to know which dates the diseases were flagged.

Tom · Posted 03-15-2022 02:13 PM

As long as your CODES to Disease flags is small enough to fit in memory this looks like a good use for HASH() object.

First let's convert your printouts into actual datasets. Let's add at least one more code that matches.

data codes;
  input DxCode :$5. Disease1-Disease4 ;
cards;
B180 0 0 1 0
B181 0 0 1 0
S43  1 0 0 0
;
 

data have;
  input PatientID Date :mmddyy. (DxCode1-DxCode3) (:$5.);
  format date yymmdd10.;
cards;
1 1/1/2017 A180 B180   D43
1 1/31/2017 C059 C060   I26
1 5/1/2017 D10 C061   D33
1 5/1/2017 E123 U32   S23
2 5/1/2017 I27 S44   P55
2 5/1/2017 S43 I28   B188
;

Now let's combine the HAVE with CODES using a hash and write out one record for every match. Having repeating values of DATE for the same ID is confusing, but we can deal with it if you want.

data match / view=match;
  set have codes(obs=0);
  if _n_=1 then do;
    declare hash h(dataset:'codes');
    rc=h.definekey('dxcode');
    rc=h.definedata('disease1','disease2','disease3','disease4');
    rc=h.definedone();
  end;
  array dx dxcode1-dxcode3 ;
  any=0;
  do index=1 to dim(dx);
    if not h.find(key:dx[index]) then do;
      any=1;
      output;
    end;
  end;
  if not any then do;
    array flags disease1-disease4;
    do index=1 to dim(flags);
       flags[index]=0;
    end;
    output;
  end;
  drop rc index dx:;
run;

proc summary data=match;
  by patientid;
  var date any disease: ;
  output out=want max= ;
run;

Results:

       Patient
Obs       ID      _TYPE_    _FREQ_          Date    any    Disease1    Disease2    Disease3    Disease4

 1        1          0         4      2017-05-01     1         0           0           1           0
 2        2          0         2      2017-05-01     1         1           0           0           0

Lefty · Posted 03-15-2022 02:40 PM

Holy cow that is so much easier and faster than the sql way! Thank you thank you- I was not familiar with hash but it is amazing!

Classroom Training Available!