Hello, I am working with a large data set involving patient IDs (n=~14,000) and diagnosis codes (~100 per patient). The data is in "long" format sorted by patient ID- please see example below. My goal is to create a variable to indicate whether each patient has a certain set of diagnosis codes.
patient_id diagnosis_code
1 aaa
1 bbb
1 xxx
1 ddd
2 aaa
2 ddd
2 eee
2 zzz
... ...
I am very new to SAS and my initial thought was to transpose the data and then create the variable from the transposed data set. However, this approach did not work. Please see my code below. Any guidance on how to approach this problem would be much appreciated.
proc sort data=have
out=sorted;
by patient_id;
run;
proc transpose data=sorted
out=transposed;
by patient_id;
var diagnosis_code;
run;
data transposed2;
set transposed;
if diagnosis_code in ('aaa','bbb') then vara=1;
else vara=0;
run;
A few suggestions, mostly for readability, to a small extent for speed:
data want;
set have;
by patient_id;
if first.patient_id then do;
vara=0;
varb=0;
varc=0;
end;
retain vara varb varc;
if diagnosis_code in ('aaa', 'bbb') then vara=1;
else if diagnosis_code in ('ccc', 'ddd') then varb=1;
else if diagnosis_code in ('zzz', 'eee') then varc=1;
if last.patient_id;
run;
The small speed improvements: adding ELSE to the list of IF/THEN possibilities to avoid re-checking once a match has been found, and checking for first.patient_id just once per observation instead of for each variable.
Depending on the size of your data, you may or may not be able to measure the speed differences.
You're right. Transposing the data is unnecessary and just makes things difficult. Sketch out what you would like the result to look like, and you will receive multiple suggestions about a viable way to approach this using the original data.
Here's one approach, but it depends on what you really want as the result. After sorting:
data want;
set have;
by patient_id;
if first.patient_id then vara=0;
retain vara;
if diagnosis_code in ('aaa', 'bbb') then vara=1;
if last.patient_id;
run;
Thank you, this produced exactly what was needed. I am actually interested in creating multiple (non-mutually exclusive) variables using the same logic. I would assume that I would use the code below to build on what was provided in the previous post. Would this be the most effective way of doing this?
data want;
set have;
by patient_id;
if first.patient_id then vara=0;
retain vara;
if diagnosis_code in ('aaa', 'bbb') then vara=1;
if first.patient_id then varb=0;
retain varb;
if diagnosis_code in ('ccc', 'ddd') then varb=1;
.... etc.
if last.patient_id;
run;
One possible approach:
data have;
input patient_id diagnosis_code$;
datalines;
1 aaa
1 bbb
1 xxx
1 ddd
2 aaa
2 ddd
2 eee
2 zzz
;
proc sort data=have
out=sorted;
by patient_id;
run;
proc transpose data=sorted
out=transposed(drop=_name_);
by patient_id;
var diagnosis_code;
id diagnosis_code;
run;
data transposed2;
set transposed;
/* Var_ab : patient has both aaa and bbb diagnoses */
var_ab = cmiss(aaa, bbb) = 0;
run;
proc print; run;
patient_ Obs id aaa bbb xxx ddd eee zzz var_ab 1 1 aaa bbb xxx ddd 1 2 2 aaa ddd eee zzz 0
Thank you for the suggestion; however, given the size of my data set, transposing the data seems to take a large amount of processing time. As a result, it appears that using a first.variable and last.variable approach may be a little more efficient in my case.
In terms of creating multiple variables in the same data step using the first.variable and last.variable approach that @Astounding suggested, what is an effective/efficient way of going about this? Would it be the same code mentioned above but with the "if last.patient_id" statement inserted at the end of the last variable? For example, see below. Is there a better way of doing this?
data want;
set have;
by patient_id;
if first.patient_id then vara=0;
retain vara;
if diagnosis_code in ('aaa', 'bbb') then vara=1;
if first.patient_id then varb=0;
retain varb;
if diagnosis_code in ('ccc', 'ddd') then varb=1;
if first.patient_id then varc=0;
retain varc;
if diagnosis_code in ('zzz', 'eee') then varc=1;
.... etc.
if last.patient_id;
run;
A few suggestions, mostly for readability, to a small extent for speed:
data want;
set have;
by patient_id;
if first.patient_id then do;
vara=0;
varb=0;
varc=0;
end;
retain vara varb varc;
if diagnosis_code in ('aaa', 'bbb') then vara=1;
else if diagnosis_code in ('ccc', 'ddd') then varb=1;
else if diagnosis_code in ('zzz', 'eee') then varc=1;
if last.patient_id;
run;
The small speed improvements: adding ELSE to the list of IF/THEN possibilities to avoid re-checking once a match has been found, and checking for first.patient_id just once per observation instead of for each variable.
Depending on the size of your data, you may or may not be able to measure the speed differences.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.