When the ID is in the same group, I want to keep the minimum start date and maximum end date. Is there a better way to do this? The retain statement in the code below does not get me the desired results but it might be my conditional logic that isn't getting me what I want...
data have;
input
ID$ group$ flag$ start_date:DATE9. end_date:DATE9. ;
format start_date MMDDYY10. end_date MMDDYY10.;
datalines;
22 f 1 17NOV2022 04JAN2023
19 e 1 01JUL2022 29JUL2022
19 e 2 21NOV2022 04APR2023
03 f 1 01JUL2022 07JUL2022
33 r 1 01JUL2022 12JUL2022
34 t 1 01JUL2022 23MAR2023
46 f 1 22NOV2022 19MAR2023
76 e 1 26AUG2022 15MAY2023
81 a 1 01OCT2022 10APR2023
69 b 1 30DEC2022 19JAN2023
70 c 1 01JUL2022 01NOV2022
70 c 2 11NOV2022 10JAN2023
70 c 3 01FEB2023 30APR2023
70 e 1 01JUL2022 21AUG2022
72 e 1 09MAR2023 30JUN2023
;;;
run;
proc sort data=have;
by ID group start_date end_date;
run;
data want;
set have;
by ID group start_date end_date;
retain first_start_date last_end_date;
if Id=lag(id) and group=lag(group) then do;
first_start_date=first.start_date;
last_end_date=last.end_date;
end;
format last_end_date mmddyy10. first_start_date mmddyy10.;
run;
proc print data=want noobs;
run;
This is what I am trying to do:
One way:
proc sort data=have; by ID group start_date end_date; run; proc summary data=have; by id group; var start_date end_date; output out=want (drop=_type_ _freq_) min(start_date)= max(end_date)=; run;
Dates are numeric. So the earliest date is the smallest numeric value or MIN, latest is MAX.
If any of the dates are missing this would return the min or max of the non-missing values.
One way:
proc sort data=have; by ID group start_date end_date; run; proc summary data=have; by id group; var start_date end_date; output out=want (drop=_type_ _freq_) min(start_date)= max(end_date)=; run;
Dates are numeric. So the earliest date is the smallest numeric value or MIN, latest is MAX.
If any of the dates are missing this would return the min or max of the non-missing values.
This worked, thank you @ballardw !
I was going to suggest a DATA step after the proc SORT, but I like @ballardw 's response better, because it uses a PROC specifically intended for this type of task.
But there is no need to pre-sort the data to use PROC SUMMARY in this case. Instead of a
BY ID GROUP;
statement, use a
CLASS ID GROUP;
statement. It will track each id/group combination no matter what the original data order, yet nevertheless will output the data sorted by ID/GROUP.
data have;
input
ID$ group$ flag start_date:DATE9. end_date:DATE9. ;
format start_date MMDDYY10. end_date MMDDYY10.;
datalines;
22 f 1 17NOV2022 04JAN2023
19 e 1 01JUL2022 29JUL2022
19 e 2 21NOV2022 04APR2023
03 f 1 01JUL2022 07JUL2022
33 r 1 01JUL2022 12JUL2022
34 t 1 01JUL2022 23MAR2023
46 f 1 22NOV2022 19MAR2023
76 e 1 26AUG2022 15MAY2023
81 a 1 01OCT2022 10APR2023
69 b 1 30DEC2022 19JAN2023
70 c 1 01JUL2022 01NOV2022
70 c 2 11NOV2022 10JAN2023
70 c 3 01FEB2023 30APR2023
70 e 1 01JUL2022 21AUG2022
72 e 1 09MAR2023 30JUN2023
run;
proc summary data=have noprint nway;
class id group;
var start_date end_date ;
output out=want (drop=_:) min(start_date)= max(end_date)= ;
run;
@mkeintz wrote:
I was going to suggest a DATA step after the proc SORT, but I like @ballardw 's response better, because it uses a PROC specifically intended for this type of task.
And it is also a good idea to use PROC SUMMARY because it is a very good tool that everyone should learn; it will be useful in a bazillion future situations.
Do you just want the min/max per group (so one observation per group)? From the output it looks like that is what you want, but your example data step does not have anything to enforce that.
Why are you not taking advantage of the FIRST.GROUP and LAST.GROUP boolean flags to detect when a new group starts or ends? Why are you assigning the FIRST.START_DATE and LAST.END_DATE boolean flags to your new variables and then formatting them as DATE values? If you treat a 0/1 boolean value as if it was a DATE value then 0 is 01JAN1960 and 1 is 02JAN1960 .
data want;
set have;
by ID group ;
retain first_start_date last_end_date;
format last_end_date first_start_date mmddyy10.;
if first.group then call missing(first_start_date,last_end_date);
first_start_date = min(first_start_date,start_date);
last_end_date = max(last_end_date,end_date);
if last.group;
keep id group first_start_date last_end_date;
run;
Don’t miss the livestream kicking off May 7. It’s free. It’s easy. And it’s the best seat in the house.
Join us virtually with our complimentary SAS Innovate Digital Pass. Watch live or on-demand in multiple languages, with translations available to help you get the most out of every session.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.