BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
sas_student1
Quartz | Level 8

Hello,

 

I have a SAS query that has been giving me trouble for quite some time (I am using SAS 9.4). I hope that the SAS community user groups can help.

 

I have a data set that contains ID, Location, start date, end date and the difference between the first end date and the next end date. For the first record of each ID the date_diff is missing beacause the date_diff is calcualated uniquley for each ID. The date set looks something like this:

 

IDLOCATIONstart_dateend_datedate_diff
1Nevada12/3/201312/3/2013.
1Nevada12/16/201312/16/201313
1Nevada12/24/201312/24/20138
1Nevada1/2/20141/2/20149
1Nevada1/20/20141/20/201418
1Nevada6/3/20146/3/2014134
1Nevada7/22/20147/22/201449
2Arkansas10/14/201410/14/2014.
2Arkansas10/28/201410/28/201414
2Arkansas12/9/201412/9/201442
2Arkansas12/16/201412/16/20147
3Arkansas12/17/201412/17/2014.
4Seattle11/21/201311/27/2013.
4Seattle11/21/201312/5/20138
4Seattle11/21/201312/10/20135
4Seattle11/21/201312/11/20131
4Seattle11/21/201312/14/20133
4Seattle3/21/20144/21/2014128
4Seattle3/21/20144/23/20142
5New York City1/15/20141/15/2014.
5Toranto2/10/20142/10/201426
5Toranto2/24/20142/24/201414
5Toranto2/25/20142/25/20141
5Nevada5/1/20145/1/201465
5Los Angeles5/28/20145/28/201427

 

What I would like to do is group by ID and then Location the first and last dates, however if the date difference is more than 30 then I would like that record to be in another group, also as I group I would like to have a counter that would let me know how many rows have been collapsed to create one group. So I would like my final table to look like this:

 

 

IDLOCATIONstart_dt1end_dt1Count
1Nevada12/3/20131/20/20145
1Nevada6/3/20146/3/20141
1Nevada7/22/20147/22/20141
2Arkansas10/14/201410/28/20142
2Arkansas12/9/201412/16/20142
3Arkansas12/17/201412/17/20141
4Seattle11/21/201312/14/20145
4Seattle3/21/20144/23/20142
5New York City1/15/20141/15/20141
5Toranto2/10/20142/25/20143
5Nevada5/1/20145/1/20141
5Los Angeles5/28/20145/28/20141

 

I am able to do first. last. processing where I get some of the above table but I cannot seem to be able to break out the groups if the date_diff is more than 30.

 

If any one can assist in this problem I would greatly appreciate it!

 

Thank you and anxiously awaiting a reply!

1 ACCEPTED SOLUTION

Accepted Solutions
PGStats
Opal | Level 21

I think this captures it:

 

data want;
count = 0;
do until(last.location);
    set have; by id location notsorted;
    if date_diff > 30 or first.location then do;
        if count > 0 then output;
        count = 0;
        start_dt1 = start_date;
        end;
    end_dt1 = end_date;
    count + 1;
    end;
output;
keep id location start_dt1 end_dt1 count;
format start_dt1 end_dt1 yymmdd10.;
run;
PG

View solution in original post

6 REPLIES 6
Reeza
Super User

Are you using an explicit output? That will be the key. The logic is relatively simple:

 

1. Create a new set of variables to hold output. Retain to next line - look at RETAIN statement. 

2. At each line check your requirements.

2a. If Same ID and diff < 30 days then update the information in variables

2b. Else OUTPUT; 

2c. Now reset all variables and these are retained to next line

Go back to step 1. 

 

 

PGStats
Opal | Level 21

I think this captures it:

 

data want;
count = 0;
do until(last.location);
    set have; by id location notsorted;
    if date_diff > 30 or first.location then do;
        if count > 0 then output;
        count = 0;
        start_dt1 = start_date;
        end;
    end_dt1 = end_date;
    count + 1;
    end;
output;
keep id location start_dt1 end_dt1 count;
format start_dt1 end_dt1 yymmdd10.;
run;
PG
sas_student1
Quartz | Level 8

PGStats This is fantastic! excatly what I was looking for.

 

I am going to run this on my bigger dataset, (about 11 mil) I hope that there are no other anomolies that will come up!

 

To further my learning, is it possible to explain the concept behind this method?

what is the purpose of having the count and the "do until" before the set statement? does it matter?

how does this simple line of code get the complicated output I needed.

I was doing a retain statment with many first. last. processing as mentioned by Reeza but I did not get to my output as easily as PGstats code shows.

 

If you are able and willing and I would appreciate it if you can explain what the code is doing in the background.

Greatly Appreciate it ! and I greatly appreciate your quick response.

 

 

PGStats
Opal | Level 21

Using the code block

 

DO UNTIL(last.location);
SET ....; BY .... location;
....
END;

 

is a way to bypass the implicit looping of the data step within a data group (location here). The advantage exploited here is that variables (count, start_dt1, end_dt1) keep their values from one iteration to the next, i.e. within the data group, but are reset to missing by the data step at the start of each new group.

 

This coding trick is often called DOW. You will find many examples on this Forum and elsewhere.

PG
sas_student1
Quartz | Level 8

Thank you so much for your reply!

This is was very useful and helpful!

Ksharp
Super User
data have;
infile cards dlm='09'x  truncover;
input ID	LOCATION : $20.	start_date : mmddyy10.	end_date : mmddyy10. date_diff;
format start_date end_date mmddyy10.;
cards;
1	Nevada	12/3/2013	12/3/2013	.
1	Nevada	12/16/2013	12/16/2013	13
1	Nevada	12/24/2013	12/24/2013	8
1	Nevada	1/2/2014	1/2/2014	9
1	Nevada	1/20/2014	1/20/2014	18
1	Nevada	6/3/2014	6/3/2014	134
1	Nevada	7/22/2014	7/22/2014	49
2	Arkansas	10/14/2014	10/14/2014	.
2	Arkansas	10/28/2014	10/28/2014	14
2	Arkansas	12/9/2014	12/9/2014	42
2	Arkansas	12/16/2014	12/16/2014	7
3	Arkansas	12/17/2014	12/17/2014	.
4	Seattle	11/21/2013	11/27/2013	.
4	Seattle	11/21/2013	12/5/2013	8
4	Seattle	11/21/2013	12/10/2013	5
4	Seattle	11/21/2013	12/11/2013	1
4	Seattle	11/21/2013	12/14/2013	3
4	Seattle	3/21/2014	4/21/2014	128
4	Seattle	3/21/2014	4/23/2014	2
5	New York City	1/15/2014	1/15/2014	.
5	Toranto	2/10/2014	2/10/2014	26
5	Toranto	2/24/2014	2/24/2014	14
5	Toranto	2/25/2014	2/25/2014	1
5	Nevada	5/1/2014	5/1/2014	65
5	Los Angeles	5/28/2014	5/28/2014	27
;
run;
data have;
 set have;
 by id location notsorted;
 if first.location then do;sum=0;group+1;end;
 if sum gt 30 then do;
  group+1;
  sum=0;
 end;
 sum+date_diff;
 drop sum;
run;
data want;
 set have;
 by group;
 retain start;
 if first.group then do;start=start_date;n=0;end;
 n+1;
 if last.group then do;end=end_date;output;end;
 format start end mmddyy10.;
 drop group start_date end_date date_diff; 
run;
 




sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 6 replies
  • 1593 views
  • 3 likes
  • 4 in conversation