I have a large claims databases with four diagnosis codes per claim. The data looks like this:
ID CLAIM_ID DATE DX1 DX2 DX3 DX4
1 100 1/1/2015 7804
1 101 1/1/2015 30921 39021
1 102 2/1/2015 30943 902 01920
1 103 3/1/2015 011
2 104 4/1/2015 4530
2 105 5/1/2015 V9090
3 106 6/1/2015 7039
3 107 6/1/2015 7039
3 108 6/1/2015 E884 0930 1092 0930
3 109 7/1/2015 3094
data have;
infile datalines dlm="," missover;
input id claim_id date $ dx1 $ dx2 $ dx3 $ dx4 $;
datalines;
1,100,1/1/2015,7804,,
1,101,1/1/2015,30921,39021,,
1,102,2/1/2015,30943,902,01920,
1,103,3/1/2015,011,,,
2,104,4/1/2015,4530,,,
2,105,5/1/2015,V9090,,,
3,106,6/1/2015,7039,,,
3,107,6/1/2015,7039,,,
3,108,6/1/2015,E884,0930,1092,0930
3,109,7/1/2015,3094,,,
;
run;
I want to create a dataset that has all of the *unique* diagnosis codes that occurred on claims during the same day for a given individual, as below (I don't care about whether it was DX1, DX2, DX3, or DX4). This would involve using proc transpose somehow to create a series of variables that I might call "DXS1-DXS5" (in this fake example, five variables would be created but it would in reality be way more). I've been playing with the syntax of proc transpose and can't get this work, though.
ID CLAIM_ID DATE DX1 DX2 DX3 DX4 DXS1 DXS2 DXS3 DXS4 DXS5
1 100 1/1/2015 7804 7804 30921 39021
1 101 1/1/2015 30921 39021 7804 30921 39021
1 102 2/1/2015 30943 902 01920 30943 902 01920
1 103 3/1/2015 011 011
2 104 4/1/2015 4530 4530
2 105 5/1/2015 V9090 V9090
3 106 6/1/2015 7039 7039 E884 0930 1092 0930
3 107 6/1/2015 7039 7039 E884 0930 1092 0930
3 108 6/1/2015 E884 0930 1092 0930 7039 E884 0930 1092 0930
3 109 7/1/2015 3094 3094
I'd appreciate suggestions on how to proceed. Thanks.
I would recommend that you first create a data set of all the DXS values, and then merge it back in. Here is one way (assuming DX1-DX4 are character):
data dxs;
set have;
array dx {4};
do i=1 to 4;
if dx{i} > ' ' then do;
diagnosis = dx{i};
output;
end;
end;
keep ID diagnosis date;
run;
proc sort data=dxs nodupkey;
by id date diagnosis;
run;
At this point, you can transpose the unique diagnosis values back into a single observation per ID, with as many DXS variables as needed:
proc transpose data=dxs prefix='dxs' out=dxs_per_id (drop=_name_);
var diagnosis;
by id date;
run;
Then all that is left to do is to merge it back in. Assuming your original data set is sorted:
data want;
merge have dxs_per_id;
by id date;
run;
The code is untested, but should be fine. Write back if it causes any problems.
Good luck.
You need to calculate the daily separately and then merge it back in.
data have;
infile datalines dlm="," truncover;
input id claim_id date $ dx1 $ dx2 $ dx3 $ dx4 $;
datalines;
1,100,1/1/2015,7804,,
1,101,1/1/2015,30921,39021,,
1,102,2/1/2015,30943,902,01920,
1,103,3/1/2015,011,,,
2,104,4/1/2015,4530,,,
2,105,5/1/2015,V9090,,,
3,106,6/1/2015,7039,,,
3,107,6/1/2015,7039,,,
3,108,6/1/2015,E884,0930,1092,0930
3,109,7/1/2015,3094,,,
;
run;
proc transpose data=have out=flipped(rename=col1=dx);
by id claim_id date;
var dx1-dx4;
run;
proc sort data=flipped nodupkey out=flipped_unique (where=(not missing(dx)) drop=claim_id);
by id date dx;
run;
data id;
set flipped_unique;
by id date;
if first.date then count=1; else count+1;
name=cats("dxs", count);
run;
proc transpose data=id out=wide_daily(drop=_name_);
by id date;
id name;
var dx;
run;
proc sql;
create table want as
select a.*, b.*
from have as a
left join wide_daily as b
on a.id=b.id and a.date=b.date;
quit;
I would recommend that you first create a data set of all the DXS values, and then merge it back in. Here is one way (assuming DX1-DX4 are character):
data dxs;
set have;
array dx {4};
do i=1 to 4;
if dx{i} > ' ' then do;
diagnosis = dx{i};
output;
end;
end;
keep ID diagnosis date;
run;
proc sort data=dxs nodupkey;
by id date diagnosis;
run;
At this point, you can transpose the unique diagnosis values back into a single observation per ID, with as many DXS variables as needed:
proc transpose data=dxs prefix='dxs' out=dxs_per_id (drop=_name_);
var diagnosis;
by id date;
run;
Then all that is left to do is to merge it back in. Assuming your original data set is sorted:
data want;
merge have dxs_per_id;
by id date;
run;
The code is untested, but should be fine. Write back if it causes any problems.
Good luck.
Thank you. This works.
One last question - how would I create a global macro variable that has the number of DXS diagnoses in the file?
There are a few ways ... here's one that meshes well with your program:
data _null_;
set dxs_per_id;
array dxs {*} dxs:;
call symputx('n_dxs_vars', dim(dxs));
stop;
run;
I think the best place to put this would be right after PROC TRANSPOSE creates DXS_PER_ID. It would be important to call the variable that you are transposing DIAGNOSIS (or at least any name that does not begin with DXS). The list DXS: is actually all variable names that begin with "DXS".
Thanks so much.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.