Re: Keeping highest value per ID for all values

GAL1986 · Posted 05-08-2019 04:54 PM

Hi Everyone,

Thanks in advance for the help! I have a dataset with 827,304 observations and over 170 yes(1)/no(0) variables and I am trying to get one line per baby, per visit type, for each of the 170 different diagnoses.

This is a sample of what I have

Baby_ID	Visit_ID	Type of visit	Dx_date	DX_Jaundice	DX_Zoster	DX_Sepsis	DX_Hepatitis	DX_Well_baby
123	1	Delivery	4/20/2019	1	0	0	0	0
123	1	Delivery	4/20/2019	0	0	1	0	0
123	3	Ever	5/8/2019	0	0	0	1	0
123	2	Postnatal	4/25/2019	0	1	0	0	0

And this is what I want the data to look like

Baby_ID	Visit_ID	Type of visit	Dx_date	DX_Jaundice	DX_Zoster	DX_Sepsis	DX_Hepatitis	DX_Well_baby
123	1	Delivery	4/20/2019	1	0	1	0	0
123	3	Ever	5/8/2019	0	0	0	1	0
123	2	Postnatal	4/25/2019	0	1	0	0	0

I would usually transpose the variables of interest to get one line per ID, but with over 170 variables, that doesn't seem feasible.

I think I am going down the wrong track here, but I have tried to use RETAIN
data Test_retain;

set test;
by baby_id;
retain highest;
if first.baby_id then highest=.;
highest=max(highest,DX_Jaundice);
if last.baby_ID then output;
run;

and some SQL code

Proc sql;
Create table Max as
Select baby_ID, Type_of_visit, DX_date, visit_ID, max(DX_Baby_Hear_Screen_Fail) as DX_Baby_Hear_Screen_Fail,
max(DX_Baby_Single_Live) as DX_Baby_Single_Live,
max(DX_baby_Hep_Vaccination) as DX_baby_Hep_Vaccination
From test2
Group by visit_ID ;
Quit;

If someone could point me in the right direction it would be greatly appreciated!

Reeza · Posted 05-08-2019 05:36 PM

Use PROC MEANS instead. You're looking for the max of the diagnosis variables, grouped by date.
You can use shortcut variable lists to refer to the diagnosis variables.

https://blogs.sas.com/content/iml/2018/05/29/6-easy-ways-to-specify-a-list-of-variables-in-sas.html

ballardw · Posted 05-08-2019 06:29 PM

A bit more code oriented approach to @Reeza's suggestion:

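proc summary data=test nway;
   /* one output row per baby, visit, and visit type */
   class baby_id visit_id type_of_visit;
   /* dx_lastdxvariable is a placeholder for the last diagnosis variable in the data set */
   var dx_jaundice -- dx_lastdxvariable;
   /* max= with no names keeps the original variable names for the maxima */
   output out=want (drop=_type_ _freq_) max=;
run;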

The above assumes that all of the dx variables you want are adjacent (sequential variable number order as reported by Proc contents) in the data set as implied by your post. The double dash -- above is used to indicate that property.
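
As a minimal sketch of that approach, the step below rebuilds the four sample rows from the original post and summarizes them. The variable name type_of_visit, the informat for dx_date, and the output name want are assumptions made for illustration; dx_well_baby stands in for the last diagnosis variable.

```sas
/* Sample rows from the original post; type_of_visit is an assumed name */
data test;
   infile datalines dlm=',' dsd;
   input baby_id visit_id type_of_visit :$10. dx_date :mmddyy10.
         dx_jaundice dx_zoster dx_sepsis dx_hepatitis dx_well_baby;
   format dx_date mmddyy10.;
   datalines;
123,1,Delivery,4/20/2019,1,0,0,0,0
123,1,Delivery,4/20/2019,0,0,1,0,0
123,3,Ever,5/8/2019,0,0,0,1,0
123,2,Postnatal,4/25/2019,0,1,0,0,0
;
run;

/* Maximum of every diagnosis flag within each baby/visit/type combination */
proc summary data=test nway;
   class baby_id visit_id type_of_visit;
   var dx_jaundice -- dx_well_baby;
   output out=want (drop=_type_ _freq_) max=;
run;

proc print data=want noobs;
run;
```

With these rows the two Delivery records should collapse into one line carrying both the jaundice and sepsis flags, which matches the layout requested in the question apart from dx_date, which this step does not carry over.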

If your dx_date variable had a name that did not start with dx_, you could have used the dx_: prefix list to get all variables whose names start with dx_.
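
One way to try the prefix list without changing the stored data is to rename the date on input, sketched below. The small data set test_small, the new name visit_date, and the output name want_prefix are all hypothetical, and the sketch assumes the RENAME= data set option is applied before the dx_: list is resolved.

```sas
/* Hypothetical two-diagnosis data set, used only for illustration */
data test_small;
   infile datalines dlm=',' dsd;
   input baby_id visit_id type_of_visit :$10. dx_date :mmddyy10.
         dx_jaundice dx_sepsis;
   format dx_date mmddyy10.;
   datalines;
123,1,Delivery,4/20/2019,1,0
123,1,Delivery,4/20/2019,0,1
;
run;

/* Rename dx_date on input so that dx_: matches only the diagnosis flags */
proc summary data=test_small (rename=(dx_date=visit_date)) nway;
   class baby_id visit_id type_of_visit;
   var dx_: ;
   output out=want_prefix (drop=_type_ _freq_) max=;
run;
```

The prefix list avoids depending on the column order of the data set, which the double-dash range does require.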

The code above does not carry dx_date into the output. Assuming no visit_id crosses a date boundary, that might not be an issue. If you need the date for later processing, you could add the variable to the CLASS statement, but if two or more dates are associated with a single visit_id this won't work properly, because each date would be summarized separately.
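
If one date per visit is enough, a possible alternative to putting the date in the CLASS statement is the ID statement, which by default keeps the maximum value of the ID variable within each group (the IDMIN option on the PROC statement keeps the minimum instead). This is a sketch rather than part of the original suggestion; the data set test_dates and the output name want_dates are hypothetical.

```sas
/* Hypothetical visit that spans two dates */
data test_dates;
   infile datalines dlm=',' dsd;
   input baby_id visit_id type_of_visit :$10. dx_date :mmddyy10.
         dx_jaundice dx_sepsis;
   format dx_date mmddyy10.;
   datalines;
123,1,Delivery,4/20/2019,1,0
123,1,Delivery,4/21/2019,0,1
;
run;

proc summary data=test_dates nway;
   class baby_id visit_id type_of_visit;
   id dx_date;   /* keeps the latest dx_date within each class group */
   var dx_jaundice -- dx_sepsis;
   output out=want_dates (drop=_type_ _freq_) max=;
run;
```

Because dx_date is not a class variable here, the visit should still come out as a single row even though it has two dates.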

GAL1986 · Posted 06-28-2019 05:56 PM

So sorry it took me so long to reply- THANK YOU!

This worked perfectly!!!
