Solved: Several row observations to be collapsed (works) and marked for a spec...

JJ_211 · Posted 10-08-2020 04:31 PM

Hi 🙂

Data have:

ID admission_date discharge_date Diagnosis_code

1 1/1/2000 1/2/2000 A

1 1/2/2000 20/2/2000 B

1 1/2/2000 20/2/2000 C

1 1/2/2000 20/2/2000 D

I want to clean these up so that dates on a "loop" are collapsed (note the same admission date as the previous discharge date in the second row, and exact same dates for the remaining rows, I also want them to be considered as one loop if the next admission date is 1 day after previous discharge date). Each row has it's own unique diagnosis code and these are of interest, depending on what they are.

I want it to look like this:

Data want:

ID min_admission_date max_discharge_date Diag_code_new in_diag_code1 in_diag_code2

1 1/1/2000 20/2/2000 D A C

The code I tried so far gives me what I need except for the last two columns which I'm very interested in getting (in the same row)

The code I tried is this:

data want;

set have;

by ID admission_date;

length min_admission_date max_discharge_date 4.;

retain min_admission_date max_discharge_date diag_code_new;

if first.id then do;

max_discharge_date=discharge_date;

min_admission_date=admission_date;

diag_code_new=diagnosis_code;

end:

else do;

if admission_date> (1+max_discharge_date) then do;

output;

min_admission_date=admission_date;

end;

max_discharge_date=max(discharge_date, max_discharge_date);

diag_code_new=diagnosis_code;

end;

if last.id then do;

output;

end;

format min_admission_date max_discharge_date ddmmyy10.;

run;

Where can I modify to be able to get the last two columns if specific diagnosis codes show up during the loop clean up process?

Thanks!

jimbarbour · Posted 10-09-2020 06:11 PM

OK, the results should look something like this then:

Program code is below. I added a SORT which may not be necessary if you can count on the data always being in ID, admission date, and discharge date order. I also slightly revised the SQL that grabs the max number of unique diagnoses for any one ID. Be sure include the new SQL.

The thing we need to do here before we mark this as "solved" is to run some data through that has records that do roll-up/consolidate and some that don't -- a larger sample size than I have now. I'm sure you're doing this already, but I like to be ~~paranoid~~ careful.

Jim

DATA	Have;
	INPUT	ID				$
			_Admission_Date	:	DDMMYY10.
			_Discharge_Date	:	DDMMYY10.
			_Diagnosis_Code	$
			;

	FORMAT	_Admission_Date	DDMMYYD10.;
	FORMAT	_Discharge_Date	DDMMYYD10.;

DATALINES;
1       1/1/2000              1/2/2000                 A
1       1/2/2000              20/2/2000                B
1       1/2/2000              20/2/2000                C
1       1/10/2010             2/10/2010                A
1       1/11/2010             2/11/2010                D
2       1/2/2002              1/2/2002                 B
2       1/2/2005              20/2/2005                C
2       1/10/2010             2/10/2010                A
2       2/10/2010             2/11/2010                D
;
RUN;

**------------------------------------------------------------------------------**;

**	Get the Max nbr of distinct codes for any one ID.	**;
PROC	SQL	NOPRINT;
*PROC	SQL;
	SELECT	MAX(	Nbr_Of_Codes	)
		INTO	:	Nbr_Of_Codes
		FROM	(
			SELECT	ID
					,COUNT(DISTINCT	_Diagnosis_Code)
				AS		Nbr_Of_Codes
				FROM	Have
				GROUP	BY	ID
				)
				;
QUIT;

%LET	Nbr_Of_Codes	=	%SYSFUNC(COMPRESS(&Nbr_Of_Codes));
%PUT	NOTE:  &=Nbr_Of_Codes;

**------------------------------------------------------------------------------**;

PROC	SORT	DATA=Have;
	BY	ID	_Admission_Date	_Discharge_Date;
RUN;

**------------------------------------------------------------------------------**;

DATA	Want;
	DROP	_:;

	SET	Have; 
		BY	ID	_Admission_Date	_Discharge_Date;

	LENGTH	Min_Admission_Date	Max_Discharge_Date 			4;
	LENGTH	In_Diag_Code1	-	In_Diag_Code&Nbr_Of_Codes	$8;

	ARRAY	Diag	[*]	In_Diag_Code1	-	In_Diag_Code&Nbr_Of_Codes;

	retain	Min_Admission_Date 
			Max_Discharge_Date 
			In_Diag_Code1	-	In_Diag_Code&Nbr_Of_Codes
			;

	IF first.id then 
		do;
			Max_Discharge_Date				=	_Discharge_Date;
			Min_Admission_Date				=	_Admission_Date;
		end;
	else
		do;
			IF	_Admission_Date				> 	(1 + Max_Discharge_Date) then 
				do;
					OUTPUT;
					Min_Admission_Date		=	_Admission_Date;
					CALL	MISSING(of In_Diag_Code1	-	In_Diag_Code&Nbr_Of_Codes);
				end;
			Max_Discharge_Date 				= 	MAX(_Discharge_Date, Max_Discharge_Date);
		end;

	IF	_Diagnosis_Code					NOT	IN	Diag	THEN
		DO;
			DO	_i							=	1	TO	&Nbr_Of_Codes;
				IF	MISSING(Diag[_i])					THEN
					DO;
						Diag[_i]			=	_Diagnosis_Code;
						_i					=	&Nbr_Of_Codes;
					END;
			END;
		END;

	IF	last.id then 
		do;
			OUTPUT;
			CALL	MISSING(Min_Admission_Date, Max_Discharge_Date);
			CALL	MISSING(of In_Diag_Code1	-	In_Diag_Code&Nbr_Of_Codes);
		end;

	format Min_Admission_Date Max_Discharge_Date DDMMYYD10.;
RUN;

View solution in original post

jimbarbour · Posted 10-08-2020 11:13 PM

@JJ_211,

I took your code and modified it a bit to hopefully give you what you want. I also added a little SQL routine to figure out the maximum number of distinct codes for any one ID -- a number I use to build and populate a retained array. I wasn't completely clear on the distinction between Diag_Code_New and the In_Diag_Codes, so I just put everything into In_Diag_Codes. Take a look at the code and the output and adjust to suit. I also wasn't clear on why diagnosis B was not included, so I (for now) just included all diagnoses. If you need to limit the diagnoses to just certain diagnoses, that's easy enough to do. I also added more data just so I had a better idea as to whether or not my code was doing what it was supposed to do.

Regards,

Jim

Here's the code and below that the results.

DATA	Have;
	INPUT	ID				$
			admission_date	:	DDMMYY10.
			discharge_date	:	DDMMYY10.
			_Diagnosis_Code	$
			;

	FORMAT	Admission_Date	DDMMYYD10.;
	FORMAT	Discharge_Date	DDMMYYD10.;

DATALINES;
1       1/1/2000              1/2/2000                 A
1       1/2/2000              20/2/2000                B
1       1/2/2000              20/2/2000                C
1       1/2/2000              20/2/2000                D
2       2/2/2020              2/2/2020                 E
2       2/2/2020              2/2/2020                 F
2       2/2/2020              2/2/2020                 G
2       2/2/2020              2/2/2020                 H
2       2/2/2020              2/2/2020                 I
;
RUN;

**------------------------------------------------------------------------------**;

**	Get the Max nbr of distinct codes for any one ID.	**;
PROC	SQL	NOPRINT;
	SELECT	MAX(	Nbr_Of_Codes	)
		INTO	:	Nbr_Of_Codes
		FROM	(
			SELECT	DISTINCT	COUNT(_Diagnosis_Code)
				AS		Nbr_Of_Codes
				FROM	Have
				GROUP	BY	ID)
				;
QUIT;

%LET	Nbr_Of_Codes	=	%SYSFUNC(COMPRESS(&Nbr_Of_Codes));
%PUT	NOTE:  &=Nbr_Of_Codes;

**------------------------------------------------------------------------------**;

DATA	Want;
	DROP	_:;

	SET	Have; 
		BY	ID	admission_date;

	length	min_admission_date max_discharge_date 			4;
	LENGTH	In_Diag_Code1	-	In_Diag_Code&Nbr_Of_Codes	$8;

	ARRAY	Diag	[*]	In_Diag_Code1	-	In_Diag_Code&Nbr_Of_Codes;

	retain	min_admission_date 
			max_discharge_date 
			In_Diag_Code1	-	In_Diag_Code&Nbr_Of_Codes
			;

	if	_Diagnosis_Code					NOT	IN	Diag	THEN
		DO;
			DO	_i							=	1	TO	&Nbr_Of_Codes;
				IF	MISSING(Diag[_i])					THEN
					DO;
						Diag[_i]			=	_Diagnosis_Code;
						_i					=	&Nbr_Of_Codes;
					END;
			END;
		END;

	if first.id then 
		do;
			max_discharge_date				=	discharge_date;
			min_admission_date				=	admission_date;
		end;
	else
		do;
			if	admission_date				> 	(1 + max_discharge_date) then 
				do;
					output;
					min_admission_date		=	admission_date;
				end;
			max_discharge_date 				= 	max(discharge_date, max_discharge_date);
		end;

	if	last.id then 
		do;
			output;
			CALL	MISSING(min_admission_date, max_discharge_date);
			CALL	MISSING(of In_Diag_Code1	-	In_Diag_Code&Nbr_Of_Codes);
		end;

	format min_admission_date max_discharge_date DDMMYYD10.;
run;

Results:

JJ_211 · Posted 10-09-2020 06:07 AM

Hi @jimbarbour, I posted a reply earlier that seems to have been removed.

Many thanks for the code! Saved my life indeed.

I however get code columns that make it difficult to find which of them are specific to the admission. So to follow up on my data have, after coding, I get the following:

ID min_admission_date max_discharge_date In_diag_code1.....................in_diag_code7 in_diag_code8 in_diag_code9 in_diag_code10 in_diag_code11

1 Previous admission dates Code1 Code2 Code Code Code Code

1 Previous admission dates Code1 Code2 Code Code Code Code Code7

1 Previous admission dates Code1 Code2 Code Code Code Code Code7 A

1 1/1/2000 20/2/2000 Code1 Code2 Code Code Code Code Code7 A B C D

I'm interested in having rows (as per your output image), where each admission (after being cleaned up) lists all the codes that were specific to dates in that admission only, whereas now I can see codes from previous admissions of the same ID.

Where can I modify in the code?

jimbarbour · Posted 10-09-2020 09:54 AM

Hi, @JJ_211,

I wasn't quite clear on exactly what a "loop" meant or which records should be collapsed and consolidated. I did know how to keep the diagnostic codes, a retained array, so I introduced the array and just used ID as the break point to clear the array and re-start accumulation.

This code:

			CALL	MISSING(of In_Diag_Code1	-	In_Diag_Code&Nbr_Of_Codes);

clears the array.

Copy that line into whatever point(s) in the code determine(s) the break point in record consolidation, and the accumulation of diagnoses will start afresh.

Jim

JJ_211 · Posted 10-09-2020 03:10 PM

Hi @jimbarbour

That makes sense. I have tried putting that call (missing) code in different places where I thought it should start afresh but with no luck. I am of course doing something wrong. Could you advise on where you would put it, if what you want is described as the following:

For each admission row in the original have dataset, there was a diagnosis code, and there are several admission rows for the same ID, where different diagnosis codes are registered. What I meant with loop is those admission and discharge dates which I end up collapsing (when a new min_admission_date and max_discharge_date are created). What I want is that when these admission rows are collapsed, all the diagnoses codes that were registered there appear as in_diag_code1... N (depending on how many).

So if one ID has 4 rows which were collapsed and end up appearing as one row in my want dataset, then the 4 diagnosis codes should appear in 4 columns.

Then the SAME ID would have another 3 rows which were also collapsed and appear as another row in my want dataset, then for that row I want the 3 diagnosis codes in 3 columns next to it (which are then not necessarily the same as the 4 rows above it).

As such, the same ID ends up having several rows in the want dataset which resulted after being collapsed, and their diagnosis codes are specific to each row (so the diagnosis codes that were part of all admissions which resulted in the one collapsed row in the want dataset).

I hope you can guide me, this is the final step I need for magic to finally happen 🙂

Janet

jimbarbour · Posted 10-09-2020 04:54 PM

@JJ_211

Janet, is there a way you can provide me with admission dates for a single ID that would not be within one day of the following admission date? I think that's the data that would really let me know if I've done the programming correctly. Ideally, they would be in the below format.

I could mock something up, but then we're testing my understanding, a concept most fraught with peril.

Jim

DATA	Have;
	INPUT	ID				$
			admission_date	:	DDMMYY10.
			discharge_date	:	DDMMYY10.
			_Diagnosis_Code	$
			;

	FORMAT	Admission_Date	DDMMYYD10.;
	FORMAT	Discharge_Date	DDMMYYD10.;

DATALINES;
1       1/1/2000              1/2/2000                 A
1       1/2/2000              20/2/2000                B
1       1/2/2000              20/2/2000                C
1       1/2/2000              20/2/2000                D
2       2/2/2020              2/2/2020                 E
2       2/2/2020              2/2/2020                 F
2       2/2/2020              2/2/2020                 G
2       2/2/2020              2/2/2020                 H
2       2/2/2020              2/2/2020                 I
;
RUN;

JJ_211 · Posted 10-09-2020 05:25 PM

Hi @jimbarbour ,

VERY thankful! Here I try to approximate the data structure I actually have, with easier to look at dates:

DATA	Have;
	INPUT	ID				$
			admission_date	:	DDMMYY10.
			discharge_date	:	DDMMYY10.
			_Diagnosis_Code	$
			;

	FORMAT	Admission_Date	DDMMYYD10.;
	FORMAT	Discharge_Date	DDMMYYD10.;

DATALINES;
1       1/1/2000              1/2/2000                 A
1       1/2/2000              20/2/2000                B
1       1/2/2000              20/2/2000                C
1       1/10/2010             2/10/2010                A
1       1/11/2010             2/11/2010                D
2       1/2/2002              1/2/2002                 B
2       1/2/2005              20/2/2005                C
2       1/10/2010             2/10/2010                A
2       2/10/2010             2/11/2010                D
;
RUN;

If it ends up looking like this, then it is magical:

ID min_admission_date max_discharge_date In_diag_code1 In_diag_code2 In_diag_code3

1 1/1/2000 20/2/20000 A B C

1 1/10/2010 2/10/2010 A

1 1/11/2010 2/11/2010 D

2 1/2/2002 1/2/2002 B

2 1/2/2005 20/2/2005 C

2 1/10/2010 2/11/2010 A D

jimbarbour · Posted 10-09-2020 06:11 PM

OK, the results should look something like this then:

Program code is below. I added a SORT which may not be necessary if you can count on the data always being in ID, admission date, and discharge date order. I also slightly revised the SQL that grabs the max number of unique diagnoses for any one ID. Be sure include the new SQL.

The thing we need to do here before we mark this as "solved" is to run some data through that has records that do roll-up/consolidate and some that don't -- a larger sample size than I have now. I'm sure you're doing this already, but I like to be ~~paranoid~~ careful.

Jim

DATA	Have;
	INPUT	ID				$
			_Admission_Date	:	DDMMYY10.
			_Discharge_Date	:	DDMMYY10.
			_Diagnosis_Code	$
			;

	FORMAT	_Admission_Date	DDMMYYD10.;
	FORMAT	_Discharge_Date	DDMMYYD10.;

DATALINES;
1       1/1/2000              1/2/2000                 A
1       1/2/2000              20/2/2000                B
1       1/2/2000              20/2/2000                C
1       1/10/2010             2/10/2010                A
1       1/11/2010             2/11/2010                D
2       1/2/2002              1/2/2002                 B
2       1/2/2005              20/2/2005                C
2       1/10/2010             2/10/2010                A
2       2/10/2010             2/11/2010                D
;
RUN;

**------------------------------------------------------------------------------**;

**	Get the Max nbr of distinct codes for any one ID.	**;
PROC	SQL	NOPRINT;
*PROC	SQL;
	SELECT	MAX(	Nbr_Of_Codes	)
		INTO	:	Nbr_Of_Codes
		FROM	(
			SELECT	ID
					,COUNT(DISTINCT	_Diagnosis_Code)
				AS		Nbr_Of_Codes
				FROM	Have
				GROUP	BY	ID
				)
				;
QUIT;

%LET	Nbr_Of_Codes	=	%SYSFUNC(COMPRESS(&Nbr_Of_Codes));
%PUT	NOTE:  &=Nbr_Of_Codes;

**------------------------------------------------------------------------------**;

PROC	SORT	DATA=Have;
	BY	ID	_Admission_Date	_Discharge_Date;
RUN;

**------------------------------------------------------------------------------**;

DATA	Want;
	DROP	_:;

	SET	Have; 
		BY	ID	_Admission_Date	_Discharge_Date;

	LENGTH	Min_Admission_Date	Max_Discharge_Date 			4;
	LENGTH	In_Diag_Code1	-	In_Diag_Code&Nbr_Of_Codes	$8;

	ARRAY	Diag	[*]	In_Diag_Code1	-	In_Diag_Code&Nbr_Of_Codes;

	retain	Min_Admission_Date 
			Max_Discharge_Date 
			In_Diag_Code1	-	In_Diag_Code&Nbr_Of_Codes
			;

	IF first.id then 
		do;
			Max_Discharge_Date				=	_Discharge_Date;
			Min_Admission_Date				=	_Admission_Date;
		end;
	else
		do;
			IF	_Admission_Date				> 	(1 + Max_Discharge_Date) then 
				do;
					OUTPUT;
					Min_Admission_Date		=	_Admission_Date;
					CALL	MISSING(of In_Diag_Code1	-	In_Diag_Code&Nbr_Of_Codes);
				end;
			Max_Discharge_Date 				= 	MAX(_Discharge_Date, Max_Discharge_Date);
		end;

	IF	_Diagnosis_Code					NOT	IN	Diag	THEN
		DO;
			DO	_i							=	1	TO	&Nbr_Of_Codes;
				IF	MISSING(Diag[_i])					THEN
					DO;
						Diag[_i]			=	_Diagnosis_Code;
						_i					=	&Nbr_Of_Codes;
					END;
			END;
		END;

	IF	last.id then 
		do;
			OUTPUT;
			CALL	MISSING(Min_Admission_Date, Max_Discharge_Date);
			CALL	MISSING(of In_Diag_Code1	-	In_Diag_Code&Nbr_Of_Codes);
		end;

	format Min_Admission_Date Max_Discharge_Date DDMMYYD10.;
RUN;

JJ_211 · Posted 10-09-2020 06:31 PM

@jimbarbour Hi Jim!

I ran the code now on my some 30 million row dataset, and I looked at a few examples of before and after this code was applied. It looks like it IS doing all it's supposed to do, both for data that do roll ups and that don't (one rows and such), for those who had one row per ID and those who had multiple. Also looked at different combinations of codes and repetition of codes, but this DOES IT ALL 🙂 🙂

Works perfectly, you are definitely a life saver!

Many, many, MANY thanks to you!

Janet

jimbarbour · Posted 10-09-2020 06:49 PM

You're welcome. Glad it worked out.

That retained array trick is a handy one to keep in your "back pocket" for situations like the one you're describing. Notice also the use of IN with an array reference (where I check to see if the diagnosis is already in the array). I don't have to check each position in the array. IN checks them all in a single reference.

Jim

JJ_211 · Posted 10-12-2020 11:55 AM

Very useful tip for what I'm about to do next, looking up specific diagnoses in the array (ends up being hundreds of codes for the nbr_of_codes).

Many thanks again!!
Janet

jimbarbour · Posted 10-12-2020 02:29 PM

@JJ_211

You're welcome, Janet.

By the way:

The IN operator will tell you if a particular value is in an array without having to index one-by-one through the array.

The WHICHC function will tell you which position in the array (or list of variables) holds a particular value. Say you have the highly realistic list of ailments as in the below SAS code. WHICHC will find which ordinal position a given value is in. Given the dataset produced by the first Data step, the WHICHC function in the second Data step would give us the following results:

There are but three values in our dataset, but there are four rows in the output. Why? Well, I told WHICHC to look for a value related to but not present in our dataset, and, of course, it's not there. WHICHC returned a zero. This is a nice feature in many ways. For one, WHICHC can be used, just like the IN operator, to tell whether a value is present at all. For example:

IF WHICHC(Ailments[*]) THEN 
    DO;
          ...Code to be executed only if a value exists somewhere within the array...
    END;

If a given value is not present in the array, WHICHC returns 0 which evaluates as False, and the code in the DO - END is skipped. Conversely, if the value is present, WHICHC returns some positive number, any of which evaluates as True, and the code in the DO - END is executed.

Lastly, WHICHC can be used even if there is no array reference by listing the individual variables, in this case, Ailment1 - Ailment&Nbr_Of_Codes. See the third Data step. The output from the second and third Data steps is identical. Which is the better method (array vs. list of variables?). If I'm going to use an array elsewhere anyway, then I'd probably use an array with the WHICHC. Otherwise, I might just go with the list of variables. I don't think there's a significant difference in terms of performance, and I have tested variable lists vs array references up to 500 million records without seeing a material performance gain or loss.

Nota bene: The word "of" is very important in functions. If you code Ailment1 - Ailment&Nbr_Of_Codes without the "of," then SAS will assume that you intend to do subtraction. Let's just say your results might be a wee tad different if you leave the "of" out.

Jim

%LET	Nbr_Of_Codes	=	3;

DATA	Text_Ailments;
	Ailment1	=	'Sick';
	Ailment2	=	'Sicker';
	Ailment3	=	'Sickest';
RUN;

DATA	Coded_Ailments;
	DROP	Ail:;
	SET	Text_Ailments;
	ARRAY	Ailments	[&Nbr_Of_Codes]	$	Ailment1 - Ailment&Nbr_Of_Codes;

	Code	=	WHICHC('Sick', of Ailments[*]);
	OUTPUT;

	Code	=	WHICHC('Sicker', of Ailments[*]);
	OUTPUT;

	Code	=	WHICHC('Sickest', of Ailments[*]);
	OUTPUT;

	Code	=	WHICHC('Sickly', of Ailments[*]);
	OUTPUT;
RUN;

DATA	Coded_Ailments2;
	DROP	Ail:;
	SET	Text_Ailments;
	
	Code	=	WHICHC('Sick', of Ailment1 - Ailment&Nbr_Of_Codes);
	OUTPUT;

	Code	=	WHICHC('Sicker', of Ailment1 - Ailment&Nbr_Of_Codes);
	OUTPUT;

	Code	=	WHICHC('Sickest', of Ailment1 - Ailment&Nbr_Of_Codes);
	OUTPUT;

	Code	=	WHICHC('Sickly', of Ailment1 - Ailment&Nbr_Of_Codes);
	OUTPUT;
RUN;

JJ_211 · Posted 10-14-2020 08:19 AM

@jimbarbour

Hi Jim,

Brilliant! I note the "of, have made some mistakes with leaving that one out.

What if, say I have a HUGE list of "ailments", and I want to keep variables in the array only if they include any of the ailments in that huge list I have? Does WHICHC still work?

To relate- in the earlier code, we get as a result a number of columns per row which list the diagnosis codes (after collapse): in_diag_code1 - in_diag_code&nbr_of codes (array). I have a list of specific codes, and want to see whether any of the codes in the array are also in the list of the specific codes I'm interested in.. WHICHC would work?

Haven't gotten to that step yet, but nice to follow up on the array and look up, and hopefully avoid mistakes and desperation when I get on it 🙂

Several row observations to be collapsed (works) and marked for a specific variable (needed)

Re: Several row observations to be collapsed (works) and marked for a specific variable (needed)

Re: Several row observations to be collapsed (works) and marked for a specific variable (needed)

Re: Several row observations to be collapsed (works) and marked for a specific variable (needed)

Re: Several row observations to be collapsed (works) and marked for a specific variable (needed)

Re: Several row observations to be collapsed (works) and marked for a specific variable (needed)

Re: Several row observations to be collapsed (works) and marked for a specific variable (needed)

Re: Several row observations to be collapsed (works) and marked for a specific variable (needed)

Re: Several row observations to be collapsed (works) and marked for a specific variable (needed)

Re: Several row observations to be collapsed (works) and marked for a specific variable (needed)

Re: Several row observations to be collapsed (works) and marked for a specific variable (needed)

Re: Several row observations to be collapsed (works) and marked for a specific variable (needed)

Re: Several row observations to be collapsed (works) and marked for a specific variable (needed)

Re: Several row observations to be collapsed (works) and marked for a specific variable (needed)

SAS Innovate 2025: Call for Content

Classroom Training Available!