Hello,
I have a SAS query that has been giving me trouble for quite some time (I am using SAS 9.4). I hope that the SAS community user groups can help.
I have a data set that contains ID, Location, start date, end date and the difference between the first end date and the next end date. For the first record of each ID the date_diff is missing beacause the date_diff is calcualated uniquley for each ID. The date set looks something like this:
ID | LOCATION | start_date | end_date | date_diff |
1 | Nevada | 12/3/2013 | 12/3/2013 | . |
1 | Nevada | 12/16/2013 | 12/16/2013 | 13 |
1 | Nevada | 12/24/2013 | 12/24/2013 | 8 |
1 | Nevada | 1/2/2014 | 1/2/2014 | 9 |
1 | Nevada | 1/20/2014 | 1/20/2014 | 18 |
1 | Nevada | 6/3/2014 | 6/3/2014 | 134 |
1 | Nevada | 7/22/2014 | 7/22/2014 | 49 |
2 | Arkansas | 10/14/2014 | 10/14/2014 | . |
2 | Arkansas | 10/28/2014 | 10/28/2014 | 14 |
2 | Arkansas | 12/9/2014 | 12/9/2014 | 42 |
2 | Arkansas | 12/16/2014 | 12/16/2014 | 7 |
3 | Arkansas | 12/17/2014 | 12/17/2014 | . |
4 | Seattle | 11/21/2013 | 11/27/2013 | . |
4 | Seattle | 11/21/2013 | 12/5/2013 | 8 |
4 | Seattle | 11/21/2013 | 12/10/2013 | 5 |
4 | Seattle | 11/21/2013 | 12/11/2013 | 1 |
4 | Seattle | 11/21/2013 | 12/14/2013 | 3 |
4 | Seattle | 3/21/2014 | 4/21/2014 | 128 |
4 | Seattle | 3/21/2014 | 4/23/2014 | 2 |
5 | New York City | 1/15/2014 | 1/15/2014 | . |
5 | Toranto | 2/10/2014 | 2/10/2014 | 26 |
5 | Toranto | 2/24/2014 | 2/24/2014 | 14 |
5 | Toranto | 2/25/2014 | 2/25/2014 | 1 |
5 | Nevada | 5/1/2014 | 5/1/2014 | 65 |
5 | Los Angeles | 5/28/2014 | 5/28/2014 | 27 |
What I would like to do is group by ID and then Location the first and last dates, however if the date difference is more than 30 then I would like that record to be in another group, also as I group I would like to have a counter that would let me know how many rows have been collapsed to create one group. So I would like my final table to look like this:
ID | LOCATION | start_dt1 | end_dt1 | Count |
1 | Nevada | 12/3/2013 | 1/20/2014 | 5 |
1 | Nevada | 6/3/2014 | 6/3/2014 | 1 |
1 | Nevada | 7/22/2014 | 7/22/2014 | 1 |
2 | Arkansas | 10/14/2014 | 10/28/2014 | 2 |
2 | Arkansas | 12/9/2014 | 12/16/2014 | 2 |
3 | Arkansas | 12/17/2014 | 12/17/2014 | 1 |
4 | Seattle | 11/21/2013 | 12/14/2014 | 5 |
4 | Seattle | 3/21/2014 | 4/23/2014 | 2 |
5 | New York City | 1/15/2014 | 1/15/2014 | 1 |
5 | Toranto | 2/10/2014 | 2/25/2014 | 3 |
5 | Nevada | 5/1/2014 | 5/1/2014 | 1 |
5 | Los Angeles | 5/28/2014 | 5/28/2014 | 1 |
I am able to do first. last. processing where I get some of the above table but I cannot seem to be able to break out the groups if the date_diff is more than 30.
If any one can assist in this problem I would greatly appreciate it!
Thank you and anxiously awaiting a reply!
I think this captures it:
data want;
count = 0;
do until(last.location);
set have; by id location notsorted;
if date_diff > 30 or first.location then do;
if count > 0 then output;
count = 0;
start_dt1 = start_date;
end;
end_dt1 = end_date;
count + 1;
end;
output;
keep id location start_dt1 end_dt1 count;
format start_dt1 end_dt1 yymmdd10.;
run;
Are you using an explicit output? That will be the key. The logic is relatively simple:
1. Create a new set of variables to hold output. Retain to next line - look at RETAIN statement.
2. At each line check your requirements.
2a. If Same ID and diff < 30 days then update the information in variables
2b. Else OUTPUT;
2c. Now reset all variables and these are retained to next line
Go back to step 1.
I think this captures it:
data want;
count = 0;
do until(last.location);
set have; by id location notsorted;
if date_diff > 30 or first.location then do;
if count > 0 then output;
count = 0;
start_dt1 = start_date;
end;
end_dt1 = end_date;
count + 1;
end;
output;
keep id location start_dt1 end_dt1 count;
format start_dt1 end_dt1 yymmdd10.;
run;
PGStats This is fantastic! excatly what I was looking for.
I am going to run this on my bigger dataset, (about 11 mil) I hope that there are no other anomolies that will come up!
To further my learning, is it possible to explain the concept behind this method?
what is the purpose of having the count and the "do until" before the set statement? does it matter?
how does this simple line of code get the complicated output I needed.
I was doing a retain statment with many first. last. processing as mentioned by Reeza but I did not get to my output as easily as PGstats code shows.
If you are able and willing and I would appreciate it if you can explain what the code is doing in the background.
Greatly Appreciate it ! and I greatly appreciate your quick response.
Using the code block
DO UNTIL(last.location);
SET ....; BY .... location;
....
END;
is a way to bypass the implicit looping of the data step within a data group (location here). The advantage exploited here is that variables (count, start_dt1, end_dt1) keep their values from one iteration to the next, i.e. within the data group, but are reset to missing by the data step at the start of each new group.
This coding trick is often called DOW. You will find many examples on this Forum and elsewhere.
Thank you so much for your reply!
This is was very useful and helpful!
data have; infile cards dlm='09'x truncover; input ID LOCATION : $20. start_date : mmddyy10. end_date : mmddyy10. date_diff; format start_date end_date mmddyy10.; cards; 1 Nevada 12/3/2013 12/3/2013 . 1 Nevada 12/16/2013 12/16/2013 13 1 Nevada 12/24/2013 12/24/2013 8 1 Nevada 1/2/2014 1/2/2014 9 1 Nevada 1/20/2014 1/20/2014 18 1 Nevada 6/3/2014 6/3/2014 134 1 Nevada 7/22/2014 7/22/2014 49 2 Arkansas 10/14/2014 10/14/2014 . 2 Arkansas 10/28/2014 10/28/2014 14 2 Arkansas 12/9/2014 12/9/2014 42 2 Arkansas 12/16/2014 12/16/2014 7 3 Arkansas 12/17/2014 12/17/2014 . 4 Seattle 11/21/2013 11/27/2013 . 4 Seattle 11/21/2013 12/5/2013 8 4 Seattle 11/21/2013 12/10/2013 5 4 Seattle 11/21/2013 12/11/2013 1 4 Seattle 11/21/2013 12/14/2013 3 4 Seattle 3/21/2014 4/21/2014 128 4 Seattle 3/21/2014 4/23/2014 2 5 New York City 1/15/2014 1/15/2014 . 5 Toranto 2/10/2014 2/10/2014 26 5 Toranto 2/24/2014 2/24/2014 14 5 Toranto 2/25/2014 2/25/2014 1 5 Nevada 5/1/2014 5/1/2014 65 5 Los Angeles 5/28/2014 5/28/2014 27 ; run; data have; set have; by id location notsorted; if first.location then do;sum=0;group+1;end; if sum gt 30 then do; group+1; sum=0; end; sum+date_diff; drop sum; run; data want; set have; by group; retain start; if first.group then do;start=start_date;n=0;end; n+1; if last.group then do;end=end_date;output;end; format start end mmddyy10.; drop group start_date end_date date_diff; run;
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.