I am working with patient hospitalization data, which includes a series of diagnosis code flags for each hospitalization. In our study design, patients have different exposures during different periods of time (exposure_type). I want to summarize the hospitalizations within each exposure type for each patient by creating an 'ever diagnosed flag' for each of the diagnosis codes in the study. This 'ever diagnosed flag' will indicate if a patient ever had that diagnosis code associated with any of their visits during each exposure period. This is my code so far:
data want;
set have;
by patient_id exposure_type;
array oldflags {20} diag1-diag20;
array overallflg {20} everdiag1-everdiag20;
if first.patient_id and first.exposure_type then do i = 1;
overallflg[i] = 0;
end;
do i = 1 to 20;
overallflg[i] = max(oldflags[i]);
end;
keep patient_id exposure_type everdiag1-everdiag20;
run;
When I run this above code, the max value for each diagnosis flag isn't carrying over. everdiag1-everdiag20 are basically copies of the original flags, and the max value within each unique ID and exposure date is not retained. What am I missing? Thank you for your time.
If I understand you properly (and I'm not 100% sure that I do understand you properly), then this doesn't require arrays at all. This is a job for some procedure that can look at different rows for each patient_id and exposure_type, and find out where the values of 1 exist.
In other words, this is a job for PROC SUMMARY (the SUPERMAN of SAS data aggregation™).
data have;
input patient_id exposure_type hosp_dt :mmddyy10. diag1 diag2 diag3 diag4;
cards;
100100 1 01.01.1995 0 1 0 1
100100 1 06.15.1995 1 0 0 1
100100 2 10.01.1995 1 0 0 0
100100 2 12.15.1995 1 0 1 0
;
proc summary data=have nway;
class patient_id exposure_type;
var diag1-diag4;
output out=want(drop=_:) max=everdiag1-everdiag4;
run;
Now, this output data set named WANT has only one row per patient_id and exposure_type. You could merge it back into the original data set if you want multiple rows per patient_id and exposure_type, as shown in your desired output. If you want to do that, I leave that to you to figure out what the code is.
It would be very helpful if you could show us a portion of the actual data, or even fake data arranged to be in a similar arrangement to the actual data. And then it would also be very helpful if you showed us the incorrect output and showed us what the correct output ought to be.
Certainly!
An example of what the data currently look like is below:
patient_id | exposure_type | hosp_dt | diag1 | diag2 | diag3 | diag4 |
100100 | 1 | 01.01.1995 | 0 | 1 | 0 | 1 |
100100 | 1 | 06.15.1995 | 1 | 0 | 0 | 1 |
100100 | 2 | 10.01.1995 | 1 | 0 | 0 | 0 |
100100 | 2 | 12.15.1995 | 1 | 0 | 1 | 0 |
Right now, my results look like a copy of the original file:
patient_id | exposure_type | everdiag1 | everdiag2 | everdiag3 | everdiag4 |
100100 | 1 | 0 | 1 | 0 | 1 |
100100 | 1 | 1 | 0 | 0 | 1 |
100100 | 2 | 1 | 0 | 0 | 0 |
100100 | 2 | 1 | 0 | 1 | 0 |
For patient 100100, the everdiag for diag1 should be 1 for both exposure types 1 and 2, as at least one hospital visit had that diagnosis associated with it during each exposure type. However, the everdiag flag for diag3 should only be 1 for only exposure type 2, as this diagnosis only appeared during a hospitalization when the patient was in exposure type 2.
See below for how I would like the final dataset to look like. All the everdiag flags should look the same within each unique patient ID and exposure type:
patient_id | exposure_type | everdiag1 | everdiag2 | everdiag3 | everdiag4 |
100100 | 1 | 1 | 1 | 0 | 1 |
100100 | 1 | 1 | 1 | 0 | 1 |
100100 | 2 | 1 | 0 | 1 | 0 |
100100 | 2 | 1 | 0 | 1 | 0 |
If I understand you properly (and I'm not 100% sure that I do understand you properly), then this doesn't require arrays at all. This is a job for some procedure that can look at different rows for each patient_id and exposure_type, and find out where the values of 1 exist.
In other words, this is a job for PROC SUMMARY (the SUPERMAN of SAS data aggregation™).
data have;
input patient_id exposure_type hosp_dt :mmddyy10. diag1 diag2 diag3 diag4;
cards;
100100 1 01.01.1995 0 1 0 1
100100 1 06.15.1995 1 0 0 1
100100 2 10.01.1995 1 0 0 0
100100 2 12.15.1995 1 0 1 0
;
proc summary data=have nway;
class patient_id exposure_type;
var diag1-diag4;
output out=want(drop=_:) max=everdiag1-everdiag4;
run;
Now, this output data set named WANT has only one row per patient_id and exposure_type. You could merge it back into the original data set if you want multiple rows per patient_id and exposure_type, as shown in your desired output. If you want to do that, I leave that to you to figure out what the code is.
I would expect to see comparing if a value you have matches some other values, such as list of actual codes. I don't see any such comparison.
And when you use
do i = 1 to 20; overallflg[i] = max(oldflags[i]); end;
this is exactly the same as
do i = 1 to 20; overallflg[i] = oldflags[i]; end;
because you are asking for the Max of exactly one value, oldflags[i];
If you want to use the largest value of an array, which doesn't make much sense to me in terms of your described problem the code would be
do i = 1 to 20; overallflg[i] = max( of oldflags(*)); end;
but that sets all of the overallflg to the same value. Which is why I don't think that is what you actually want either.
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9.
Early bird rate extended! Save $200 when you sign up by March 31.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.