BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Lefty
Obsidian | Level 7

Hi, I am trying to combine two datasets using Base SAS 9.4.
The first "master" dataset is laid out like this:

DxCode Disease1 Disease2 Disease3 Disease17
B180 0 0 1 0
B181 0 0 1 0
B188 0 0 1 0
etc          

 

I want to combine it with a "patient" data file that looks like this:

PatientID Date DxCode1 DxCode2 DxCode12
1 1/1/2017 A180 B180   D43
1 1/31/2017 C059 C060   I26
1 5/1/2017 D10 C061   D33
1 5/1/2017 E123 U32   S23
2 5/1/2017 I27 S44   P55
2 5/1/2017 S43 I28   B188

 

I want to include only rows of data from the "patient" file where at least one of the values of the variables DX1-DX12 is found in the "master" file and I want to know which disease that DX1-DX12 corresponds to. Ultimately, I would like the resulting data to look like this (and it could have many rows per patient if they have multiple diseases on different dates):

PatientID Date Disease1 Disease2 Disease3 Disease17
1 1/1/2017 0 0 1 0
2 5/1/2017 0 0 1 0

I don't care which "dx" field the disease appeared in or which particular dx code the patient had, just whether or not a dx code within each disease was diagnosed on a particular date. I was thinking I'd use proc sql, like this:

proc sql;

create table want as

select a.*, b.patientID, b.date

from master as a

inner join patient as b

on a.dxcode=b.dxcode1 or b.dxcode2 or b.dxcode3

or b.dxcode4 or b.dxcode5 or b.dxcode6 or b.dxcode7

or b.dxcode8 or b.dxcode9 or b.dxcode10 or b.dxcode11 or b.dxcode12;

 

But that is taking forever (my patient file is very large!) and I'm getting the error about "The execution of this query involves performing one or more Cartesian product joins that cannot be optimized"
I'm hoping there's a more efficient way to code this. Thank you so much in advance!

 

1 ACCEPTED SOLUTION

Accepted Solutions
Tom
Super User Tom
Super User

As long as your CODES to Disease flags is small enough to fit in memory this looks like a good use for HASH() object.

First let's convert your printouts into actual datasets. Let's add at least one more code that matches.

data codes;
  input DxCode :$5. Disease1-Disease4 ;
cards;
B180 0 0 1 0
B181 0 0 1 0
S43  1 0 0 0
;
 

data have;
  input PatientID Date :mmddyy. (DxCode1-DxCode3) (:$5.);
  format date yymmdd10.;
cards;
1 1/1/2017 A180 B180   D43
1 1/31/2017 C059 C060   I26
1 5/1/2017 D10 C061   D33
1 5/1/2017 E123 U32   S23
2 5/1/2017 I27 S44   P55
2 5/1/2017 S43 I28   B188
;

Now let's combine the HAVE with CODES using a hash and write out one record for every match.  Having repeating values of DATE for the same ID is confusing, but we can deal with it if you want.

data match / view=match;
  set have codes(obs=0);
  if _n_=1 then do;
    declare hash h(dataset:'codes');
    rc=h.definekey('dxcode');
    rc=h.definedata('disease1','disease2','disease3','disease4');
    rc=h.definedone();
  end;
  array dx dxcode1-dxcode3 ;
  any=0;
  do index=1 to dim(dx);
    if not h.find(key:dx[index]) then do;
      any=1;
      output;
    end;
  end;
  if not any then do;
    array flags disease1-disease4;
    do index=1 to dim(flags);
       flags[index]=0;
    end;
    output;
  end;
  drop rc index dx:;
run;

proc summary data=match;
  by patientid;
  var date any disease: ;
  output out=want max= ;
run;

Results:

       Patient
Obs       ID      _TYPE_    _FREQ_          Date    any    Disease1    Disease2    Disease3    Disease4

 1        1          0         4      2017-05-01     1         0           0           1           0
 2        2          0         2      2017-05-01     1         1           0           0           0

View solution in original post

7 REPLIES 7
HB
Barite | Level 11 HB
Barite | Level 11

So you want patientID, date, and disease for each instance where a patient dxcode matches a dxcode in the master table?

 

Edit:

 

OP replied

Yes, exactly!

 

Try this then.  With the data in a more normalized format:

 

data master;
   input dxcode:$4. disease;
datalines;
B180 3
B181 3
B188 3
S43 4
P55 17
;
run;

data patient;
	input id mydate:mmddyy10. dxcode:$4.;
datalines;
1 1/1/2017 A180	
1 1/1/2017 B180
1 1/1/2017 D43
1 1/31/2017 C059
1 1/31/2017 C060
1 1/31/2017 I26
1 5/1/2017 D10
1 5/1/2017 C061
1 5/1/2017 D33
1 5/1/2017 E123
1 5/1/2017 U32
1 5/1/2017 S23
2 5/1/2017 I27
2 5/1/2017 S44
2 5/1/2017 P55
2 5/1/2017 S43
2 5/1/2017 I28
2 5/1/2017 B188
;
run;

We can pursue the SQL solution;

 

proc sql;
	select a.id, a.mydate format date9., a.dxcode, b.disease
	from patient a inner join master b
	on a.dxcode = b.dxcode
	order by 1, 2;
quit;

 

To get 

 

id mydate dxcode disease
1 1-Jan-17 B180 3
2 1-May-17 S43 4
2 1-May-17 P55 17
2 1-May-17 B188 3

 

 

Note I inflicted additional disease on patient 2 for testing purposes.  

 

 

 

Lefty
Obsidian | Level 7

Yes, exactly!

Lefty
Obsidian | Level 7
Thanks for your quick reply!
The issue is, my patient data has 12 diagnosis codes per row. The dataset is 15GB so I'd prefer not to transpose the data from wide to long so I only have 1 diagnosis code. Would it make sense to I modify your sql code like this:
on a.dxcode=b.dxcode1 or a.dxcode=b.dxcode2 etc?
Reeza
Super User
For that first table, is there always only one disease that is flagged or can multiple be flagged?
Lefty
Obsidian | Level 7

Multiple diseases could be flagged and I actually do want to capture that because I want to know which dates the diseases were flagged.

Tom
Super User Tom
Super User

As long as your CODES to Disease flags is small enough to fit in memory this looks like a good use for HASH() object.

First let's convert your printouts into actual datasets. Let's add at least one more code that matches.

data codes;
  input DxCode :$5. Disease1-Disease4 ;
cards;
B180 0 0 1 0
B181 0 0 1 0
S43  1 0 0 0
;
 

data have;
  input PatientID Date :mmddyy. (DxCode1-DxCode3) (:$5.);
  format date yymmdd10.;
cards;
1 1/1/2017 A180 B180   D43
1 1/31/2017 C059 C060   I26
1 5/1/2017 D10 C061   D33
1 5/1/2017 E123 U32   S23
2 5/1/2017 I27 S44   P55
2 5/1/2017 S43 I28   B188
;

Now let's combine the HAVE with CODES using a hash and write out one record for every match.  Having repeating values of DATE for the same ID is confusing, but we can deal with it if you want.

data match / view=match;
  set have codes(obs=0);
  if _n_=1 then do;
    declare hash h(dataset:'codes');
    rc=h.definekey('dxcode');
    rc=h.definedata('disease1','disease2','disease3','disease4');
    rc=h.definedone();
  end;
  array dx dxcode1-dxcode3 ;
  any=0;
  do index=1 to dim(dx);
    if not h.find(key:dx[index]) then do;
      any=1;
      output;
    end;
  end;
  if not any then do;
    array flags disease1-disease4;
    do index=1 to dim(flags);
       flags[index]=0;
    end;
    output;
  end;
  drop rc index dx:;
run;

proc summary data=match;
  by patientid;
  var date any disease: ;
  output out=want max= ;
run;

Results:

       Patient
Obs       ID      _TYPE_    _FREQ_          Date    any    Disease1    Disease2    Disease3    Disease4

 1        1          0         4      2017-05-01     1         0           0           1           0
 2        2          0         2      2017-05-01     1         1           0           0           0

Lefty
Obsidian | Level 7
Holy cow that is so much easier and faster than the sql way! Thank you thank you- I was not familiar with hash but it is amazing!

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 7 replies
  • 470 views
  • 0 likes
  • 4 in conversation