Solved: How to process by process (first. last.) with in multiple groups -- ne...

sas_student1 · Posted 09-08-2016 10:07 PM

Hello,

I have a SAS query that has been giving me trouble for quite some time (I am using SAS 9.4). I hope that the SAS community user groups can help.

I have a data set that contains ID, Location, start date, end date and the difference between the first end date and the next end date. For the first record of each ID the date_diff is missing beacause the date_diff is calcualated uniquley for each ID. The date set looks something like this:

ID	LOCATION	start_date	end_date	date_diff
1	Nevada	12/3/2013	12/3/2013	.
1	Nevada	12/16/2013	12/16/2013	13
1	Nevada	12/24/2013	12/24/2013	8
1	Nevada	1/2/2014	1/2/2014	9
1	Nevada	1/20/2014	1/20/2014	18
1	Nevada	6/3/2014	6/3/2014	134
1	Nevada	7/22/2014	7/22/2014	49
2	Arkansas	10/14/2014	10/14/2014	.
2	Arkansas	10/28/2014	10/28/2014	14
2	Arkansas	12/9/2014	12/9/2014	42
2	Arkansas	12/16/2014	12/16/2014	7
3	Arkansas	12/17/2014	12/17/2014	.
4	Seattle	11/21/2013	11/27/2013	.
4	Seattle	11/21/2013	12/5/2013	8
4	Seattle	11/21/2013	12/10/2013	5
4	Seattle	11/21/2013	12/11/2013	1
4	Seattle	11/21/2013	12/14/2013	3
4	Seattle	3/21/2014	4/21/2014	128
4	Seattle	3/21/2014	4/23/2014	2
5	New York City	1/15/2014	1/15/2014	.
5	Toranto	2/10/2014	2/10/2014	26
5	Toranto	2/24/2014	2/24/2014	14
5	Toranto	2/25/2014	2/25/2014	1
5	Nevada	5/1/2014	5/1/2014	65
5	Los Angeles	5/28/2014	5/28/2014	27

What I would like to do is group by ID and then Location the first and last dates, however if the date difference is more than 30 then I would like that record to be in another group, also as I group I would like to have a counter that would let me know how many rows have been collapsed to create one group. So I would like my final table to look like this:

ID	LOCATION	start_dt1	end_dt1	Count
1	Nevada	12/3/2013	1/20/2014	5
1	Nevada	6/3/2014	6/3/2014	1
1	Nevada	7/22/2014	7/22/2014	1
2	Arkansas	10/14/2014	10/28/2014	2
2	Arkansas	12/9/2014	12/16/2014	2
3	Arkansas	12/17/2014	12/17/2014	1
4	Seattle	11/21/2013	12/14/2014	5
4	Seattle	3/21/2014	4/23/2014	2
5	New York City	1/15/2014	1/15/2014	1
5	Toranto	2/10/2014	2/25/2014	3
5	Nevada	5/1/2014	5/1/2014	1
5	Los Angeles	5/28/2014	5/28/2014	1

I am able to do first. last. processing where I get some of the above table but I cannot seem to be able to break out the groups if the date_diff is more than 30.

If any one can assist in this problem I would greatly appreciate it!

Thank you and anxiously awaiting a reply!

PGStats · Posted 09-08-2016 10:54 PM

I think this captures it:

data want;
count = 0;
do until(last.location);
    set have; by id location notsorted;
    if date_diff > 30 or first.location then do;
        if count > 0 then output;
        count = 0;
        start_dt1 = start_date;
        end;
    end_dt1 = end_date;
    count + 1;
    end;
output;
keep id location start_dt1 end_dt1 count;
format start_dt1 end_dt1 yymmdd10.;
run;

PG

View solution in original post

Reeza · Posted 09-08-2016 10:34 PM

Are you using an explicit output? That will be the key. The logic is relatively simple:

1. Create a new set of variables to hold output. Retain to next line - look at RETAIN statement.

2. At each line check your requirements.

2a. If Same ID and diff < 30 days then update the information in variables

2b. Else OUTPUT;

2c. Now reset all variables and these are retained to next line

Go back to step 1.

PGStats · Posted 09-08-2016 10:54 PM

I think this captures it:

data want;
count = 0;
do until(last.location);
    set have; by id location notsorted;
    if date_diff > 30 or first.location then do;
        if count > 0 then output;
        count = 0;
        start_dt1 = start_date;
        end;
    end_dt1 = end_date;
    count + 1;
    end;
output;
keep id location start_dt1 end_dt1 count;
format start_dt1 end_dt1 yymmdd10.;
run;

PG

sas_student1 · Posted 09-09-2016 12:04 AM

PGStats This is fantastic! excatly what I was looking for.

I am going to run this on my bigger dataset, (about 11 mil) I hope that there are no other anomolies that will come up!

To further my learning, is it possible to explain the concept behind this method?

what is the purpose of having the count and the "do until" before the set statement? does it matter?

how does this simple line of code get the complicated output I needed.

I was doing a retain statment with many first. last. processing as mentioned by Reeza but I did not get to my output as easily as PGstats code shows.

If you are able and willing and I would appreciate it if you can explain what the code is doing in the background.

Greatly Appreciate it ! and I greatly appreciate your quick response.

PGStats · Posted 09-22-2016 05:33 PM

Using the code block

DO UNTIL(last.location);
SET ....; BY .... location;
....
END;

is a way to bypass the implicit looping of the data step within a data group (location here). The advantage exploited here is that variables (count, start_dt1, end_dt1) keep their values from one iteration to the next, i.e. within the data group, but are reset to missing by the data step at the start of each new group.

This coding trick is often called DOW. You will find many examples on this Forum and elsewhere.

PG

sas_student1 · Posted 10-24-2016 09:01 PM

Thank you so much for your reply!

This is was very useful and helpful!

Ksharp · Posted 09-09-2016 01:27 AM

data have;
infile cards dlm='09'x  truncover;
input ID	LOCATION : $20.	start_date : mmddyy10.	end_date : mmddyy10. date_diff;
format start_date end_date mmddyy10.;
cards;
1	Nevada	12/3/2013	12/3/2013	.
1	Nevada	12/16/2013	12/16/2013	13
1	Nevada	12/24/2013	12/24/2013	8
1	Nevada	1/2/2014	1/2/2014	9
1	Nevada	1/20/2014	1/20/2014	18
1	Nevada	6/3/2014	6/3/2014	134
1	Nevada	7/22/2014	7/22/2014	49
2	Arkansas	10/14/2014	10/14/2014	.
2	Arkansas	10/28/2014	10/28/2014	14
2	Arkansas	12/9/2014	12/9/2014	42
2	Arkansas	12/16/2014	12/16/2014	7
3	Arkansas	12/17/2014	12/17/2014	.
4	Seattle	11/21/2013	11/27/2013	.
4	Seattle	11/21/2013	12/5/2013	8
4	Seattle	11/21/2013	12/10/2013	5
4	Seattle	11/21/2013	12/11/2013	1
4	Seattle	11/21/2013	12/14/2013	3
4	Seattle	3/21/2014	4/21/2014	128
4	Seattle	3/21/2014	4/23/2014	2
5	New York City	1/15/2014	1/15/2014	.
5	Toranto	2/10/2014	2/10/2014	26
5	Toranto	2/24/2014	2/24/2014	14
5	Toranto	2/25/2014	2/25/2014	1
5	Nevada	5/1/2014	5/1/2014	65
5	Los Angeles	5/28/2014	5/28/2014	27
;
run;
data have;
 set have;
 by id location notsorted;
 if first.location then do;sum=0;group+1;end;
 if sum gt 30 then do;
  group+1;
  sum=0;
 end;
 sum+date_diff;
 drop sum;
run;
data want;
 set have;
 by group;
 retain start;
 if first.group then do;start=start_date;n=0;end;
 n+1;
 if last.group then do;end=end_date;output;end;
 format start end mmddyy10.;
 drop group start_date end_date date_diff; 
run;

How to process by process (first. last.) with in multiple groups -- need help!

Re: How to process by process (first. last.) with in multiple groups -- need help!

Re: How to process by process (first. last.) with in multiple groups -- need help!

Re: How to process by process (first. last.) with in multiple groups -- need help!

Re: How to process by process (first. last.) with in multiple groups -- need help!

Re: How to process by process (first. last.) with in multiple groups -- need help!

Re: How to process by process (first. last.) with in multiple groups -- need help!

Re: How to process by process (first. last.) with in multiple groups -- need help!

SAS Innovate 2025: Call for Content

Classroom Training Available!