How to summarize dataset by Min.and Max date value by Group

Accepted Solution Solved
Reply
Occasional Contributor
Posts: 5
Accepted Solution

How to summarize dataset by Min.and Max date value by Group

Hi, I want to track changes of a person's status and find their minimum start and maximum end dates for each change. The raw datatable below is a sample where you can see for person A the minimum start and maximum end date for their Main status are 01Jan15 to 01Feb16. So for Row 1 and 2 there's no change in status and since EndDt for row 1 is 06July15 and StartDt for row 2 is 07July15 there's no break in the period, and the two rows should be aggregated into one row start on 01Jan15 and end on 01Feb16.

 

The 3rd row should be kept because there's a break in the period even though status stays in Main, StartDt is 01Oct16 and EndDt is 31Dec16.

 

The 4th and 5th row should be aggregated into one row that reads status in Co and StartDt is 01Jan17 and EndDt is 01Sep17.

 

The 6th row should be kept unchanged start from 30Oct17 to 05Nov17.

The desired reduced datatable is copied below the raw datatable

 

Raw datatable

RowIDStatusStartDtEndDt
1AMain01-Jan-1506-Jul-15
2AMain07-Jul-1501-Feb-16
3AMain01-Oct-1631-Dec-16
4ACo01-Jan-1707-Jul-17
5ACo08-Jul-1701-Sep-17
6ACo30-Oct-1705-Nov-17

Desired reduced datatable

IDStatusMinStartDtMaxEndDt
AMain01-Jan-1501-Feb-16
AMain01-Oct-1631-Dec-16
ACo01-Jan-1701-Sep-17
ACo30-Oct-17

05-Nov-17

This problem has been bugging me for days, I appreciate any suggestions I could get. Also I am new to this, so could you please give more details in your answers?

 

Many Thanks!

Michelle


Accepted Solutions
Solution
‎11-30-2017 02:42 PM
Super User
Posts: 10,313

Re: How to summarize dataset by Min.and Max date value by Group

data have;
infile cards expandtabs truncover;
input Row	ID $	Status $	StartDt : date9.	EndDt : date9.;
format StartDt EndDt date9.;
cards;
1	A	Main	01-Jan-15	06-Jul-15
2	A	Main	07-Jul-15	01-Feb-16
3	A	Main	01-Oct-16	31-Dec-16
4	A	Co	01-Jan-17	07-Jul-17
5	A	Co	08-Jul-17	01-Sep-17
6	A	Co	30-Oct-17	05-Nov-17
;
run;
data temp;
 set have;
 by id status notsorted;
 if first.status or startdt-lag(enddt) ne 1 then group+1;
run;
data want;
 set temp;
 by group;
 retain start .;
 if first.group then start=startdt;
 if last.group;
 format start date9.;
 drop row startdt;
run;

View solution in original post


All Replies
Trusted Advisor
Posts: 1,683

Re: How to summarize dataset by Min.and Max date value by Group

[ Edited ]

1) sort dataset by is status startDT.

2) use retain and lag function to save previous obs values of startDT endDT.

3) on first.status keep the retained startDT value into a new variable.

4) on last.status override the current startDT value with the new retained startDT and do output.

    else (any obs neither first nor last) - check is startDT > retained endDT +1.

         if positive do:

              -  output

              - override retained startDT with the current startDT.

 

cretae your code and come back if you have any issue.

Solution
‎11-30-2017 02:42 PM
Super User
Posts: 10,313

Re: How to summarize dataset by Min.and Max date value by Group

data have;
infile cards expandtabs truncover;
input Row	ID $	Status $	StartDt : date9.	EndDt : date9.;
format StartDt EndDt date9.;
cards;
1	A	Main	01-Jan-15	06-Jul-15
2	A	Main	07-Jul-15	01-Feb-16
3	A	Main	01-Oct-16	31-Dec-16
4	A	Co	01-Jan-17	07-Jul-17
5	A	Co	08-Jul-17	01-Sep-17
6	A	Co	30-Oct-17	05-Nov-17
;
run;
data temp;
 set have;
 by id status notsorted;
 if first.status or startdt-lag(enddt) ne 1 then group+1;
run;
data want;
 set temp;
 by group;
 retain start .;
 if first.group then start=startdt;
 if last.group;
 format start date9.;
 drop row startdt;
run;
Occasional Contributor
Posts: 5

Re: How to summarize dataset by Min.and Max date value by Group

I understand this now. Thankyou!
Respected Advisor
Posts: 2,161

Re: How to summarize dataset by Min.and Max date value by Group

[ Edited ]

If you create a flag to indicate where the breaks in time occur in your data set, then PROC SUMMARY will do the rest.

 

Something like (untested code)

 

data want;
    set have;
    by id status;
    prev_enddt=lag(enddt);
    if first.status or prev_enddt^=(startdt-1) then flag+1;
run;

proc summary data = want nway;
    class id status flag;
    var startdt enddt;
    output out=want2 min(startdt)=min_startdt max(enddt)=max_enddt;
run;

 

 

--
Paige Miller
☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 4 replies
  • 221 views
  • 1 like
  • 4 in conversation