Hi 🙂
Data have:
ID admission_date discharge_date Diagnosis_code
1 1/1/2000 1/2/2000 A
1 1/2/2000 20/2/2000 B
1 1/2/2000 20/2/2000 C
1 1/2/2000 20/2/2000 D
I want to clean these up so that dates on a "loop" are collapsed (note the same admission date as the previous discharge date in the second row, and exact same dates for the remaining rows, I also want them to be considered as one loop if the next admission date is 1 day after previous discharge date). Each row has it's own unique diagnosis code and these are of interest, depending on what they are.
I want it to look like this:
Data want:
ID min_admission_date max_discharge_date Diag_code_new in_diag_code1 in_diag_code2
1 1/1/2000 20/2/2000 D A C
The code I tried so far gives me what I need except for the last two columns which I'm very interested in getting (in the same row)
The code I tried is this:
data want;
set have;
by ID admission_date;
length min_admission_date max_discharge_date 4.;
retain min_admission_date max_discharge_date diag_code_new;
if first.id then do;
max_discharge_date=discharge_date;
min_admission_date=admission_date;
diag_code_new=diagnosis_code;
end:
else do;
if admission_date> (1+max_discharge_date) then do;
output;
min_admission_date=admission_date;
end;
max_discharge_date=max(discharge_date, max_discharge_date);
diag_code_new=diagnosis_code;
end;
if last.id then do;
output;
end;
format min_admission_date max_discharge_date ddmmyy10.;
run;
Where can I modify to be able to get the last two columns if specific diagnosis codes show up during the loop clean up process?
Thanks!
OK, the results should look something like this then:
Program code is below. I added a SORT which may not be necessary if you can count on the data always being in ID, admission date, and discharge date order. I also slightly revised the SQL that grabs the max number of unique diagnoses for any one ID. Be sure include the new SQL.
The thing we need to do here before we mark this as "solved" is to run some data through that has records that do roll-up/consolidate and some that don't -- a larger sample size than I have now. I'm sure you're doing this already, but I like to be paranoid careful.
Jim
DATA Have;
INPUT ID $
_Admission_Date : DDMMYY10.
_Discharge_Date : DDMMYY10.
_Diagnosis_Code $
;
FORMAT _Admission_Date DDMMYYD10.;
FORMAT _Discharge_Date DDMMYYD10.;
DATALINES;
1 1/1/2000 1/2/2000 A
1 1/2/2000 20/2/2000 B
1 1/2/2000 20/2/2000 C
1 1/10/2010 2/10/2010 A
1 1/11/2010 2/11/2010 D
2 1/2/2002 1/2/2002 B
2 1/2/2005 20/2/2005 C
2 1/10/2010 2/10/2010 A
2 2/10/2010 2/11/2010 D
;
RUN;
**------------------------------------------------------------------------------**;
** Get the Max nbr of distinct codes for any one ID. **;
PROC SQL NOPRINT;
*PROC SQL;
SELECT MAX( Nbr_Of_Codes )
INTO : Nbr_Of_Codes
FROM (
SELECT ID
,COUNT(DISTINCT _Diagnosis_Code)
AS Nbr_Of_Codes
FROM Have
GROUP BY ID
)
;
QUIT;
%LET Nbr_Of_Codes = %SYSFUNC(COMPRESS(&Nbr_Of_Codes));
%PUT NOTE: &=Nbr_Of_Codes;
**------------------------------------------------------------------------------**;
PROC SORT DATA=Have;
BY ID _Admission_Date _Discharge_Date;
RUN;
**------------------------------------------------------------------------------**;
DATA Want;
DROP _:;
SET Have;
BY ID _Admission_Date _Discharge_Date;
LENGTH Min_Admission_Date Max_Discharge_Date 4;
LENGTH In_Diag_Code1 - In_Diag_Code&Nbr_Of_Codes $8;
ARRAY Diag [*] In_Diag_Code1 - In_Diag_Code&Nbr_Of_Codes;
retain Min_Admission_Date
Max_Discharge_Date
In_Diag_Code1 - In_Diag_Code&Nbr_Of_Codes
;
IF first.id then
do;
Max_Discharge_Date = _Discharge_Date;
Min_Admission_Date = _Admission_Date;
end;
else
do;
IF _Admission_Date > (1 + Max_Discharge_Date) then
do;
OUTPUT;
Min_Admission_Date = _Admission_Date;
CALL MISSING(of In_Diag_Code1 - In_Diag_Code&Nbr_Of_Codes);
end;
Max_Discharge_Date = MAX(_Discharge_Date, Max_Discharge_Date);
end;
IF _Diagnosis_Code NOT IN Diag THEN
DO;
DO _i = 1 TO &Nbr_Of_Codes;
IF MISSING(Diag[_i]) THEN
DO;
Diag[_i] = _Diagnosis_Code;
_i = &Nbr_Of_Codes;
END;
END;
END;
IF last.id then
do;
OUTPUT;
CALL MISSING(Min_Admission_Date, Max_Discharge_Date);
CALL MISSING(of In_Diag_Code1 - In_Diag_Code&Nbr_Of_Codes);
end;
format Min_Admission_Date Max_Discharge_Date DDMMYYD10.;
RUN;
I took your code and modified it a bit to hopefully give you what you want. I also added a little SQL routine to figure out the maximum number of distinct codes for any one ID -- a number I use to build and populate a retained array. I wasn't completely clear on the distinction between Diag_Code_New and the In_Diag_Codes, so I just put everything into In_Diag_Codes. Take a look at the code and the output and adjust to suit. I also wasn't clear on why diagnosis B was not included, so I (for now) just included all diagnoses. If you need to limit the diagnoses to just certain diagnoses, that's easy enough to do. I also added more data just so I had a better idea as to whether or not my code was doing what it was supposed to do.
Regards,
Jim
Here's the code and below that the results.
DATA Have;
INPUT ID $
admission_date : DDMMYY10.
discharge_date : DDMMYY10.
_Diagnosis_Code $
;
FORMAT Admission_Date DDMMYYD10.;
FORMAT Discharge_Date DDMMYYD10.;
DATALINES;
1 1/1/2000 1/2/2000 A
1 1/2/2000 20/2/2000 B
1 1/2/2000 20/2/2000 C
1 1/2/2000 20/2/2000 D
2 2/2/2020 2/2/2020 E
2 2/2/2020 2/2/2020 F
2 2/2/2020 2/2/2020 G
2 2/2/2020 2/2/2020 H
2 2/2/2020 2/2/2020 I
;
RUN;
**------------------------------------------------------------------------------**;
** Get the Max nbr of distinct codes for any one ID. **;
PROC SQL NOPRINT;
SELECT MAX( Nbr_Of_Codes )
INTO : Nbr_Of_Codes
FROM (
SELECT DISTINCT COUNT(_Diagnosis_Code)
AS Nbr_Of_Codes
FROM Have
GROUP BY ID)
;
QUIT;
%LET Nbr_Of_Codes = %SYSFUNC(COMPRESS(&Nbr_Of_Codes));
%PUT NOTE: &=Nbr_Of_Codes;
**------------------------------------------------------------------------------**;
DATA Want;
DROP _:;
SET Have;
BY ID admission_date;
length min_admission_date max_discharge_date 4;
LENGTH In_Diag_Code1 - In_Diag_Code&Nbr_Of_Codes $8;
ARRAY Diag [*] In_Diag_Code1 - In_Diag_Code&Nbr_Of_Codes;
retain min_admission_date
max_discharge_date
In_Diag_Code1 - In_Diag_Code&Nbr_Of_Codes
;
if _Diagnosis_Code NOT IN Diag THEN
DO;
DO _i = 1 TO &Nbr_Of_Codes;
IF MISSING(Diag[_i]) THEN
DO;
Diag[_i] = _Diagnosis_Code;
_i = &Nbr_Of_Codes;
END;
END;
END;
if first.id then
do;
max_discharge_date = discharge_date;
min_admission_date = admission_date;
end;
else
do;
if admission_date > (1 + max_discharge_date) then
do;
output;
min_admission_date = admission_date;
end;
max_discharge_date = max(discharge_date, max_discharge_date);
end;
if last.id then
do;
output;
CALL MISSING(min_admission_date, max_discharge_date);
CALL MISSING(of In_Diag_Code1 - In_Diag_Code&Nbr_Of_Codes);
end;
format min_admission_date max_discharge_date DDMMYYD10.;
run;
Results:
Hi @jimbarbour, I posted a reply earlier that seems to have been removed.
Many thanks for the code! Saved my life indeed.
I however get code columns that make it difficult to find which of them are specific to the admission. So to follow up on my data have, after coding, I get the following:
ID min_admission_date max_discharge_date In_diag_code1.....................in_diag_code7 in_diag_code8 in_diag_code9 in_diag_code10 in_diag_code11
1 Previous admission dates Code1 Code2 Code Code Code Code
1 Previous admission dates Code1 Code2 Code Code Code Code Code7
1 Previous admission dates Code1 Code2 Code Code Code Code Code7 A
1 1/1/2000 20/2/2000 Code1 Code2 Code Code Code Code Code7 A B C D
I'm interested in having rows (as per your output image), where each admission (after being cleaned up) lists all the codes that were specific to dates in that admission only, whereas now I can see codes from previous admissions of the same ID.
Where can I modify in the code?
Hi, @JJ_211,
I wasn't quite clear on exactly what a "loop" meant or which records should be collapsed and consolidated. I did know how to keep the diagnostic codes, a retained array, so I introduced the array and just used ID as the break point to clear the array and re-start accumulation.
This code:
CALL MISSING(of In_Diag_Code1 - In_Diag_Code&Nbr_Of_Codes);
clears the array.
Copy that line into whatever point(s) in the code determine(s) the break point in record consolidation, and the accumulation of diagnoses will start afresh.
Jim
Hi @jimbarbour
That makes sense. I have tried putting that call (missing) code in different places where I thought it should start afresh but with no luck. I am of course doing something wrong. Could you advise on where you would put it, if what you want is described as the following:
For each admission row in the original have dataset, there was a diagnosis code, and there are several admission rows for the same ID, where different diagnosis codes are registered. What I meant with loop is those admission and discharge dates which I end up collapsing (when a new min_admission_date and max_discharge_date are created). What I want is that when these admission rows are collapsed, all the diagnoses codes that were registered there appear as in_diag_code1... N (depending on how many).
So if one ID has 4 rows which were collapsed and end up appearing as one row in my want dataset, then the 4 diagnosis codes should appear in 4 columns.
Then the SAME ID would have another 3 rows which were also collapsed and appear as another row in my want dataset, then for that row I want the 3 diagnosis codes in 3 columns next to it (which are then not necessarily the same as the 4 rows above it).
As such, the same ID ends up having several rows in the want dataset which resulted after being collapsed, and their diagnosis codes are specific to each row (so the diagnosis codes that were part of all admissions which resulted in the one collapsed row in the want dataset).
I hope you can guide me, this is the final step I need for magic to finally happen 🙂
Janet
Janet, is there a way you can provide me with admission dates for a single ID that would not be within one day of the following admission date? I think that's the data that would really let me know if I've done the programming correctly. Ideally, they would be in the below format.
I could mock something up, but then we're testing my understanding, a concept most fraught with peril.
Jim
DATA Have;
INPUT ID $
admission_date : DDMMYY10.
discharge_date : DDMMYY10.
_Diagnosis_Code $
;
FORMAT Admission_Date DDMMYYD10.;
FORMAT Discharge_Date DDMMYYD10.;
DATALINES;
1 1/1/2000 1/2/2000 A
1 1/2/2000 20/2/2000 B
1 1/2/2000 20/2/2000 C
1 1/2/2000 20/2/2000 D
2 2/2/2020 2/2/2020 E
2 2/2/2020 2/2/2020 F
2 2/2/2020 2/2/2020 G
2 2/2/2020 2/2/2020 H
2 2/2/2020 2/2/2020 I
;
RUN;
Hi @jimbarbour ,
VERY thankful! Here I try to approximate the data structure I actually have, with easier to look at dates:
DATA Have;
INPUT ID $
admission_date : DDMMYY10.
discharge_date : DDMMYY10.
_Diagnosis_Code $
;
FORMAT Admission_Date DDMMYYD10.;
FORMAT Discharge_Date DDMMYYD10.;
DATALINES;
1 1/1/2000 1/2/2000 A
1 1/2/2000 20/2/2000 B
1 1/2/2000 20/2/2000 C
1 1/10/2010 2/10/2010 A
1 1/11/2010 2/11/2010 D
2 1/2/2002 1/2/2002 B
2 1/2/2005 20/2/2005 C
2 1/10/2010 2/10/2010 A
2 2/10/2010 2/11/2010 D
;
RUN;
If it ends up looking like this, then it is magical:
ID min_admission_date max_discharge_date In_diag_code1 In_diag_code2 In_diag_code3
1 1/1/2000 20/2/20000 A B C
1 1/10/2010 2/10/2010 A
1 1/11/2010 2/11/2010 D
2 1/2/2002 1/2/2002 B
2 1/2/2005 20/2/2005 C
2 1/10/2010 2/11/2010 A D
OK, the results should look something like this then:
Program code is below. I added a SORT which may not be necessary if you can count on the data always being in ID, admission date, and discharge date order. I also slightly revised the SQL that grabs the max number of unique diagnoses for any one ID. Be sure include the new SQL.
The thing we need to do here before we mark this as "solved" is to run some data through that has records that do roll-up/consolidate and some that don't -- a larger sample size than I have now. I'm sure you're doing this already, but I like to be paranoid careful.
Jim
DATA Have;
INPUT ID $
_Admission_Date : DDMMYY10.
_Discharge_Date : DDMMYY10.
_Diagnosis_Code $
;
FORMAT _Admission_Date DDMMYYD10.;
FORMAT _Discharge_Date DDMMYYD10.;
DATALINES;
1 1/1/2000 1/2/2000 A
1 1/2/2000 20/2/2000 B
1 1/2/2000 20/2/2000 C
1 1/10/2010 2/10/2010 A
1 1/11/2010 2/11/2010 D
2 1/2/2002 1/2/2002 B
2 1/2/2005 20/2/2005 C
2 1/10/2010 2/10/2010 A
2 2/10/2010 2/11/2010 D
;
RUN;
**------------------------------------------------------------------------------**;
** Get the Max nbr of distinct codes for any one ID. **;
PROC SQL NOPRINT;
*PROC SQL;
SELECT MAX( Nbr_Of_Codes )
INTO : Nbr_Of_Codes
FROM (
SELECT ID
,COUNT(DISTINCT _Diagnosis_Code)
AS Nbr_Of_Codes
FROM Have
GROUP BY ID
)
;
QUIT;
%LET Nbr_Of_Codes = %SYSFUNC(COMPRESS(&Nbr_Of_Codes));
%PUT NOTE: &=Nbr_Of_Codes;
**------------------------------------------------------------------------------**;
PROC SORT DATA=Have;
BY ID _Admission_Date _Discharge_Date;
RUN;
**------------------------------------------------------------------------------**;
DATA Want;
DROP _:;
SET Have;
BY ID _Admission_Date _Discharge_Date;
LENGTH Min_Admission_Date Max_Discharge_Date 4;
LENGTH In_Diag_Code1 - In_Diag_Code&Nbr_Of_Codes $8;
ARRAY Diag [*] In_Diag_Code1 - In_Diag_Code&Nbr_Of_Codes;
retain Min_Admission_Date
Max_Discharge_Date
In_Diag_Code1 - In_Diag_Code&Nbr_Of_Codes
;
IF first.id then
do;
Max_Discharge_Date = _Discharge_Date;
Min_Admission_Date = _Admission_Date;
end;
else
do;
IF _Admission_Date > (1 + Max_Discharge_Date) then
do;
OUTPUT;
Min_Admission_Date = _Admission_Date;
CALL MISSING(of In_Diag_Code1 - In_Diag_Code&Nbr_Of_Codes);
end;
Max_Discharge_Date = MAX(_Discharge_Date, Max_Discharge_Date);
end;
IF _Diagnosis_Code NOT IN Diag THEN
DO;
DO _i = 1 TO &Nbr_Of_Codes;
IF MISSING(Diag[_i]) THEN
DO;
Diag[_i] = _Diagnosis_Code;
_i = &Nbr_Of_Codes;
END;
END;
END;
IF last.id then
do;
OUTPUT;
CALL MISSING(Min_Admission_Date, Max_Discharge_Date);
CALL MISSING(of In_Diag_Code1 - In_Diag_Code&Nbr_Of_Codes);
end;
format Min_Admission_Date Max_Discharge_Date DDMMYYD10.;
RUN;
@jimbarbour Hi Jim!
I ran the code now on my some 30 million row dataset, and I looked at a few examples of before and after this code was applied. It looks like it IS doing all it's supposed to do, both for data that do roll ups and that don't (one rows and such), for those who had one row per ID and those who had multiple. Also looked at different combinations of codes and repetition of codes, but this DOES IT ALL 🙂 🙂
Works perfectly, you are definitely a life saver!
Many, many, MANY thanks to you!
Janet
You're welcome. Glad it worked out.
That retained array trick is a handy one to keep in your "back pocket" for situations like the one you're describing. Notice also the use of IN with an array reference (where I check to see if the diagnosis is already in the array). I don't have to check each position in the array. IN checks them all in a single reference.
Jim
You're welcome, Janet.
By the way:
The IN operator will tell you if a particular value is in an array without having to index one-by-one through the array.
The WHICHC function will tell you which position in the array (or list of variables) holds a particular value. Say you have the highly realistic list of ailments as in the below SAS code. WHICHC will find which ordinal position a given value is in. Given the dataset produced by the first Data step, the WHICHC function in the second Data step would give us the following results:
There are but three values in our dataset, but there are four rows in the output. Why? Well, I told WHICHC to look for a value related to but not present in our dataset, and, of course, it's not there. WHICHC returned a zero. This is a nice feature in many ways. For one, WHICHC can be used, just like the IN operator, to tell whether a value is present at all. For example:
IF WHICHC(Ailments[*]) THEN
DO;
...Code to be executed only if a value exists somewhere within the array...
END;
If a given value is not present in the array, WHICHC returns 0 which evaluates as False, and the code in the DO - END is skipped. Conversely, if the value is present, WHICHC returns some positive number, any of which evaluates as True, and the code in the DO - END is executed.
Lastly, WHICHC can be used even if there is no array reference by listing the individual variables, in this case, Ailment1 - Ailment&Nbr_Of_Codes. See the third Data step. The output from the second and third Data steps is identical. Which is the better method (array vs. list of variables?). If I'm going to use an array elsewhere anyway, then I'd probably use an array with the WHICHC. Otherwise, I might just go with the list of variables. I don't think there's a significant difference in terms of performance, and I have tested variable lists vs array references up to 500 million records without seeing a material performance gain or loss.
Nota bene: The word "of" is very important in functions. If you code Ailment1 - Ailment&Nbr_Of_Codes without the "of," then SAS will assume that you intend to do subtraction. Let's just say your results might be a wee tad different if you leave the "of" out.
Jim
%LET Nbr_Of_Codes = 3;
DATA Text_Ailments;
Ailment1 = 'Sick';
Ailment2 = 'Sicker';
Ailment3 = 'Sickest';
RUN;
DATA Coded_Ailments;
DROP Ail:;
SET Text_Ailments;
ARRAY Ailments [&Nbr_Of_Codes] $ Ailment1 - Ailment&Nbr_Of_Codes;
Code = WHICHC('Sick', of Ailments[*]);
OUTPUT;
Code = WHICHC('Sicker', of Ailments[*]);
OUTPUT;
Code = WHICHC('Sickest', of Ailments[*]);
OUTPUT;
Code = WHICHC('Sickly', of Ailments[*]);
OUTPUT;
RUN;
DATA Coded_Ailments2;
DROP Ail:;
SET Text_Ailments;
Code = WHICHC('Sick', of Ailment1 - Ailment&Nbr_Of_Codes);
OUTPUT;
Code = WHICHC('Sicker', of Ailment1 - Ailment&Nbr_Of_Codes);
OUTPUT;
Code = WHICHC('Sickest', of Ailment1 - Ailment&Nbr_Of_Codes);
OUTPUT;
Code = WHICHC('Sickly', of Ailment1 - Ailment&Nbr_Of_Codes);
OUTPUT;
RUN;
Hi Jim,
Brilliant! I note the "of, have made some mistakes with leaving that one out.
What if, say I have a HUGE list of "ailments", and I want to keep variables in the array only if they include any of the ailments in that huge list I have? Does WHICHC still work?
To relate- in the earlier code, we get as a result a number of columns per row which list the diagnosis codes (after collapse): in_diag_code1 - in_diag_code&nbr_of codes (array). I have a list of specific codes, and want to see whether any of the codes in the array are also in the list of the specific codes I'm interested in.. WHICHC would work?
Haven't gotten to that step yet, but nice to follow up on the array and look up, and hopefully avoid mistakes and desperation when I get on it 🙂
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.