Hi, I am trying to combine two datasets using Base SAS 9.4.
The first "master" dataset is laid out like this:
DxCode | Disease1 | Disease2 | Disease3 | … | Disease17 |
B180 | 0 | 0 | 1 | … | 0 |
B181 | 0 | 0 | 1 | … | 0 |
B188 | 0 | 0 | 1 | … | 0 |
etc |
I want to combine it with a "patient" data file that looks like this:
PatientID | Date | DxCode1 | DxCode2 | … | DxCode12 |
1 | 1/1/2017 | A180 | B180 | D43 | |
1 | 1/31/2017 | C059 | C060 | I26 | |
1 | 5/1/2017 | D10 | C061 | D33 | |
1 | 5/1/2017 | E123 | U32 | S23 | |
2 | 5/1/2017 | I27 | S44 | P55 | |
2 | 5/1/2017 | S43 | I28 | B188 |
I want to include only rows of data from the "patient" file where at least one of the values of the variables DX1-DX12 is found in the "master" file and I want to know which disease that DX1-DX12 corresponds to. Ultimately, I would like the resulting data to look like this (and it could have many rows per patient if they have multiple diseases on different dates):
PatientID | Date | Disease1 | Disease2 | Disease3 | … | Disease17 |
1 | 1/1/2017 | 0 | 0 | 1 | … | 0 |
2 | 5/1/2017 | 0 | 0 | 1 | … | 0 |
I don't care which "dx" field the disease appeared in or which particular dx code the patient had, just whether or not a dx code within each disease was diagnosed on a particular date. I was thinking I'd use proc sql, like this:
proc sql;
create table want as
select a.*, b.patientID, b.date
from master as a
inner join patient as b
on a.dxcode=b.dxcode1 or b.dxcode2 or b.dxcode3
or b.dxcode4 or b.dxcode5 or b.dxcode6 or b.dxcode7
or b.dxcode8 or b.dxcode9 or b.dxcode10 or b.dxcode11 or b.dxcode12;
But that is taking forever (my patient file is very large!) and I'm getting the error about "The execution of this query involves performing one or more Cartesian product joins that cannot be optimized"
I'm hoping there's a more efficient way to code this. Thank you so much in advance!
As long as your CODES to Disease flags is small enough to fit in memory this looks like a good use for HASH() object.
First let's convert your printouts into actual datasets. Let's add at least one more code that matches.
data codes;
input DxCode :$5. Disease1-Disease4 ;
cards;
B180 0 0 1 0
B181 0 0 1 0
S43 1 0 0 0
;
data have;
input PatientID Date :mmddyy. (DxCode1-DxCode3) (:$5.);
format date yymmdd10.;
cards;
1 1/1/2017 A180 B180 D43
1 1/31/2017 C059 C060 I26
1 5/1/2017 D10 C061 D33
1 5/1/2017 E123 U32 S23
2 5/1/2017 I27 S44 P55
2 5/1/2017 S43 I28 B188
;
Now let's combine the HAVE with CODES using a hash and write out one record for every match. Having repeating values of DATE for the same ID is confusing, but we can deal with it if you want.
data match / view=match;
set have codes(obs=0);
if _n_=1 then do;
declare hash h(dataset:'codes');
rc=h.definekey('dxcode');
rc=h.definedata('disease1','disease2','disease3','disease4');
rc=h.definedone();
end;
array dx dxcode1-dxcode3 ;
any=0;
do index=1 to dim(dx);
if not h.find(key:dx[index]) then do;
any=1;
output;
end;
end;
if not any then do;
array flags disease1-disease4;
do index=1 to dim(flags);
flags[index]=0;
end;
output;
end;
drop rc index dx:;
run;
proc summary data=match;
by patientid;
var date any disease: ;
output out=want max= ;
run;
Results:
Patient Obs ID _TYPE_ _FREQ_ Date any Disease1 Disease2 Disease3 Disease4 1 1 0 4 2017-05-01 1 0 0 1 0 2 2 0 2 2017-05-01 1 1 0 0 0
So you want patientID, date, and disease for each instance where a patient dxcode matches a dxcode in the master table?
Edit:
OP replied
Yes, exactly!
Try this then. With the data in a more normalized format:
data master;
input dxcode:$4. disease;
datalines;
B180 3
B181 3
B188 3
S43 4
P55 17
;
run;
data patient;
input id mydate:mmddyy10. dxcode:$4.;
datalines;
1 1/1/2017 A180
1 1/1/2017 B180
1 1/1/2017 D43
1 1/31/2017 C059
1 1/31/2017 C060
1 1/31/2017 I26
1 5/1/2017 D10
1 5/1/2017 C061
1 5/1/2017 D33
1 5/1/2017 E123
1 5/1/2017 U32
1 5/1/2017 S23
2 5/1/2017 I27
2 5/1/2017 S44
2 5/1/2017 P55
2 5/1/2017 S43
2 5/1/2017 I28
2 5/1/2017 B188
;
run;
We can pursue the SQL solution;
proc sql;
select a.id, a.mydate format date9., a.dxcode, b.disease
from patient a inner join master b
on a.dxcode = b.dxcode
order by 1, 2;
quit;
To get
id | mydate | dxcode | disease |
1 | 1-Jan-17 | B180 | 3 |
2 | 1-May-17 | S43 | 4 |
2 | 1-May-17 | P55 | 17 |
2 | 1-May-17 | B188 | 3 |
Note I inflicted additional disease on patient 2 for testing purposes.
Yes, exactly!
Multiple diseases could be flagged and I actually do want to capture that because I want to know which dates the diseases were flagged.
As long as your CODES to Disease flags is small enough to fit in memory this looks like a good use for HASH() object.
First let's convert your printouts into actual datasets. Let's add at least one more code that matches.
data codes;
input DxCode :$5. Disease1-Disease4 ;
cards;
B180 0 0 1 0
B181 0 0 1 0
S43 1 0 0 0
;
data have;
input PatientID Date :mmddyy. (DxCode1-DxCode3) (:$5.);
format date yymmdd10.;
cards;
1 1/1/2017 A180 B180 D43
1 1/31/2017 C059 C060 I26
1 5/1/2017 D10 C061 D33
1 5/1/2017 E123 U32 S23
2 5/1/2017 I27 S44 P55
2 5/1/2017 S43 I28 B188
;
Now let's combine the HAVE with CODES using a hash and write out one record for every match. Having repeating values of DATE for the same ID is confusing, but we can deal with it if you want.
data match / view=match;
set have codes(obs=0);
if _n_=1 then do;
declare hash h(dataset:'codes');
rc=h.definekey('dxcode');
rc=h.definedata('disease1','disease2','disease3','disease4');
rc=h.definedone();
end;
array dx dxcode1-dxcode3 ;
any=0;
do index=1 to dim(dx);
if not h.find(key:dx[index]) then do;
any=1;
output;
end;
end;
if not any then do;
array flags disease1-disease4;
do index=1 to dim(flags);
flags[index]=0;
end;
output;
end;
drop rc index dx:;
run;
proc summary data=match;
by patientid;
var date any disease: ;
output out=want max= ;
run;
Results:
Patient Obs ID _TYPE_ _FREQ_ Date any Disease1 Disease2 Disease3 Disease4 1 1 0 4 2017-05-01 1 0 0 1 0 2 2 0 2 2017-05-01 1 1 0 0 0
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.