05-23-2016 02:21 PM
I have a sample dataset shown below. Each ID has different dates. I want to choose the observations within 15day period from the oldest date for each ID. For the observations with only one date per ID should appear in the output. For example in the below dataset I have the dates sorted by ID and date.
I need the output as shown below.
In the above output dataset, For the first ID, the third observation is eliminated as the date difference between the oldest date and the third observation date was above 15 days (i.e 23 day gap). For the sixth ID, the date difference was above 15days so both the observations got deleted. Second and Fourth ID were not changed as it had only one date per ID.
Please guide me in coding this. Thank you in advance.
05-23-2016 03:51 PM - edited 05-23-2016 03:59 PM
Neither 1/2/2013 nor 1/3/2013 are within 15 days of 1/25/2013. So is your rule actually within 15 days or something else?
Note that 2/6/2014 and 2/8/2014 are also not within 15 of 3/8/2014 and 4/4/2013 is not within 15 of 6/5/2013.
Also for Id 1 and 3 you do not include the maximum valued date but for Id 2 you do. Is the rule to do that if there is only a single value?
05-23-2016 04:28 PM
05-23-2016 06:38 PM
Not all of your dates are dd/mm/yyyy
25 is not a valid month.
So the first step is going to make sure your data is what you think it is.
This may give a starting point. The two SQL steps could be combined but make sure they are working as intended. The first step gets the latest date for each ID. Then match to the dates within the 15 day window. Exercise for the interested reader to find the Ids with only one record and append to the result.
proc sql; create table work.start as select id, date from have group by id having date=Max(date); quit; proc sql; create table work.most as select b.id, b.date from work.start as a left join have as b on a.id = b.id and abs(a.date-b.date) lt 15; quit;
05-23-2016 09:11 PM
Sorry for the confusion Ballardw. That was my mistake. The date is actually in dd/mm/yyyy format.The first step should give the oldest date (first date) for each ID. I used min instead of max. The next sept should find the dates which are within the 15days from the oldest date. Unfortunately, the second code you suggested does not give me the actual results I wanted.
The output should satisty two conditions:
1. If there is only one date per each ID then keep in the output. (As in second and fourth ID)
2. If there are two or more dates per each ID the difference between the oldest date (i.e first date) and the other dates should not exceed 15 days.
For clear understanding, the same dataset is formatted as below.
So, the dataset should output only the conditions which are satified. The output should look like below.
05-23-2016 09:28 PM
data have; infile cards expandtabs truncover; input Date : mmddyy10. ID; format Date mmddyy10.; cards; 1/2/2013 1 1/3/2013 1 1/25/2013 1 4/25/2014 2 2/6/2014 3 2/8/2014 3 3/8/2014 3 5/7/2013 4 4/4/2013 5 4/4/2013 5 6/5/2013 5 5/9/2014 6 8/16/2014 6 ; run; data temp; set have; by id; retain temp; if first.id then do; temp=date;group=0;end; dif=date-temp; if dif gt 15 then group+1; drop temp dif; run; data want; set temp; by id group; if (first.id and last.id) or (not (first.group and last.group)); run;
05-24-2016 09:01 AM
Thanks for the solution. The code is working for the dataset I provided. But when tried with a bigger dataset this is not working. I think if we create a separate variable which shows the difference between the dates of first.id and last.id would work.
The rule is
Difference between the first observation of first ID and the other observations of the first ID should not exceed 15 days.