BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
Whitlea
Obsidian | Level 7

When the ID is in the same group, I want to keep the minimum start date and maximum end date.  Is there a better way to do this? The retain statement in the code below does not get me the desired results but it might be my conditional logic that isn't getting me what I want...

data have;
input
ID$	group$ flag$ start_date:DATE9.	end_date:DATE9.	 ;
format start_date MMDDYY10. end_date MMDDYY10.;
datalines;
22 	f	1 17NOV2022 04JAN2023 
19	e 	1 01JUL2022 29JUL2022 
19	e 	2 21NOV2022 04APR2023 
03	f  	1 01JUL2022 07JUL2022 
33	r	1 01JUL2022 12JUL2022 
34	t 	1 01JUL2022 23MAR2023 
46	f	1 22NOV2022 19MAR2023 
76	e	1 26AUG2022 15MAY2023 
81	a	1 01OCT2022 10APR2023 
69 	b	1 30DEC2022 19JAN2023 
70 	c	1 01JUL2022 01NOV2022 
70 	c	2 11NOV2022 10JAN2023 
70  c	3 01FEB2023 30APR2023 
70 	e	1 01JUL2022 21AUG2022 
72 	e	1 09MAR2023 30JUN2023 

;;;
run;

proc sort data=have;
by ID group start_date end_date;
run;

data want;
set have;
by ID group start_date end_date;

retain first_start_date  last_end_date;
if Id=lag(id) and group=lag(group) then do;
first_start_date=first.start_date;
last_end_date=last.end_date;

end;
format last_end_date mmddyy10. first_start_date mmddyy10.;
run; 

proc print data=want noobs;
run; 

This is what I am trying to do:

Capture.PNG

1 ACCEPTED SOLUTION

Accepted Solutions
ballardw
Super User

One way:

proc sort data=have;
by ID group start_date end_date;
run;

proc summary data=have;
   by id group;
   var start_date end_date;
   output out=want (drop=_type_ _freq_) min(start_date)= max(end_date)=;
run;

Dates are numeric. So the earliest date is the smallest numeric value or MIN, latest is MAX.

If any of the dates are missing this would return the min or max of the non-missing values.

View solution in original post

5 REPLIES 5
ballardw
Super User

One way:

proc sort data=have;
by ID group start_date end_date;
run;

proc summary data=have;
   by id group;
   var start_date end_date;
   output out=want (drop=_type_ _freq_) min(start_date)= max(end_date)=;
run;

Dates are numeric. So the earliest date is the smallest numeric value or MIN, latest is MAX.

If any of the dates are missing this would return the min or max of the non-missing values.

mkeintz
PROC Star

I was going to suggest a DATA step after the proc SORT, but I like @ballardw 's response better, because it uses a PROC specifically intended for this type of task.

 

But there is no need to pre-sort the data to use PROC SUMMARY in this case.  Instead of a   

BY ID GROUP;

statement, use a 

 

  CLASS ID GROUP;

statement.  It will track each id/group combination no matter what the original data order, yet nevertheless will output the data sorted by ID/GROUP.

 

data have;
input
ID$	group$ flag start_date:DATE9.	end_date:DATE9.	 ;
format start_date MMDDYY10. end_date MMDDYY10.;
datalines;
22 	f	1 17NOV2022 04JAN2023 
19	e 	1 01JUL2022 29JUL2022 
19	e 	2 21NOV2022 04APR2023 
03	f  	1 01JUL2022 07JUL2022 
33	r	1 01JUL2022 12JUL2022 
34	t 	1 01JUL2022 23MAR2023 
46	f	1 22NOV2022 19MAR2023 
76	e	1 26AUG2022 15MAY2023 
81	a	1 01OCT2022 10APR2023 
69 	b	1 30DEC2022 19JAN2023 
70 	c	1 01JUL2022 01NOV2022 
70 	c	2 11NOV2022 10JAN2023 
70  c	3 01FEB2023 30APR2023 
70 	e	1 01JUL2022 21AUG2022 
72 	e	1 09MAR2023 30JUN2023 
run;

proc summary data=have noprint nway;
  class id group;
  var start_date end_date ;
  output out=want (drop=_:) min(start_date)= max(end_date)= ;
run;

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
PaigeMiller
Diamond | Level 26

@mkeintz wrote:

I was going to suggest a DATA step after the proc SORT, but I like @ballardw 's response better, because it uses a PROC specifically intended for this type of task.


And it is also a good idea to use PROC SUMMARY because it is a very good tool that everyone should learn; it will be useful in a bazillion future situations.

--
Paige Miller
Tom
Super User Tom
Super User

Do you just want the min/max per group (so one observation per group)?  From the output it looks like that is what you want, but your example data step does not have anything to enforce that.

 

Why are you not taking advantage of the FIRST.GROUP and LAST.GROUP boolean flags to detect when a new group starts or ends? Why are you assigning the FIRST.START_DATE and LAST.END_DATE boolean flags to your new variables and then formatting them as DATE values?  If you treat a 0/1 boolean value as if it was a DATE value then 0 is 01JAN1960 and 1 is 02JAN1960 .

data want;
  set have;
  by ID group ;
  retain first_start_date  last_end_date;
  format last_end_date first_start_date mmddyy10.;
  if first.group then call missing(first_start_date,last_end_date);
  first_start_date = min(first_start_date,start_date);
  last_end_date = max(last_end_date,end_date); 
  if last.group;
  keep id group first_start_date last_end_date;
run; 

 

sas-innovate-white.png

Our biggest data and AI event of the year.

Don’t miss the livestream kicking off May 7. It’s free. It’s easy. And it’s the best seat in the house.

Join us virtually with our complimentary SAS Innovate Digital Pass. Watch live or on-demand in multiple languages, with translations available to help you get the most out of every session.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 5 replies
  • 1340 views
  • 3 likes
  • 5 in conversation