Hi, I have a question related to date interval. I have a dataset in which each subject has multiple observations (different dates). I was asked to identify subjects who have at least two records with their date between 1 and 2 years. The sample data is shown below:
data WORK.SAMPLE; infile datalines dsd truncover; input id 1. date:YYMMDD10.; format date YYMMDD10.; datalines; 1 2012-07-24 1 2015-04-09 1 2018-03-15 1 2020-01-16 2 2019-12-02 2 2020-04-27 2 2020-05-18 3 2017-10-31 3 2018-01-09 3 2018-12-10 3 2019-04-08 3 2019-09-11 3 2020-04-10 3 2020-04-10 3 2020-12-24 ;;;;
For example, there are four dates for id 1, its third and fourth observations meet the requirement; id 2 doesn't meet the requirement because the largest time range is less than 1 year; id 3 meets the requirement (2017-10-31 to 2018-12-10, etc.). My question is, how can I identify two records that meet requirement from multiple records efficiently? The only way that I came up with is to compare all records one by one (e.g. 1st date vs 2nd date, 1st date vs 3rd date, 2nd date vs 3rd date), but I don't think it's the best way to do it.
Thanks!
Normally, I advise against wide data sets, preferring long data sets, but it seems to me the easiest method here is to create a wide data set using PROC TRANSPOSE, and then use ARRAYs to do all possible comparisons within an ID. This code really doesn't adjust for leap years, if that's important then you can use the INTCK function.
proc transpose data=sample prefix=date out=sample_t;
var date;
by id;
run;
data want;
set sample_t;
array d date:;
count=0;
do i=1 to dim(d)-1;
do j=(i+1) to dim(d);
if 365<=abs(d(i)-d(j))<=730 then count=count+1;
end;
end;
drop j i _name_;
run;
Adding, I'm sure its possible without a transpose, using DOW loops (as explained by @PeterClemmensen) or hash objects, but this seems pretty simple to me.
Is date range always computed from the first date for an ID? Or is date range computed from the difference between the date on a given row and the date on the row before it? Or something else?
Hi, the date range may be computed from any two records for a single subject. They are not necessarily compared with the earliest date or compared with the adjacent records. That's what I was most concerned about.
Normally, I advise against wide data sets, preferring long data sets, but it seems to me the easiest method here is to create a wide data set using PROC TRANSPOSE, and then use ARRAYs to do all possible comparisons within an ID. This code really doesn't adjust for leap years, if that's important then you can use the INTCK function.
proc transpose data=sample prefix=date out=sample_t;
var date;
by id;
run;
data want;
set sample_t;
array d date:;
count=0;
do i=1 to dim(d)-1;
do j=(i+1) to dim(d);
if 365<=abs(d(i)-d(j))<=730 then count=count+1;
end;
end;
drop j i _name_;
run;
Adding, I'm sure its possible without a transpose, using DOW loops (as explained by @PeterClemmensen) or hash objects, but this seems pretty simple to me.
Thank you so much for the sample solution! It worked for me when the dataset is small. However, my dataset is large and there are a few extreme values for dates (~1400 dates), which makes the number of variables so large and creates many missing values for dates in other subjects. I can exclude the subjects with extreme dates first, and run the loop separately. I wonder if there is any methods to make this process faster. Anyway, many thanks to your solution!
Found a solution to skip missing values and improve the speed. Just add a LEAVE statement when we meet a missing value.
data want;
set sample_t;
array d date:;
count=0;
do i=1 to dim(d)-1;
do j=(i+1) to dim(d);
if 365<=abs(d(i)-d(j))<=730 then count=count+1;
if d(j) = . then leave;
end;
if d(i) = . then leave;
end;
drop j i _name_;
run;
Wouldn't this be even faster?
data want;
set sample_t;
array d date:;
count=0;
do i=1 to dim(d)-1;
if d(i) = . then leave;
do j=(i+1) to dim(d);
if d(j) = . then leave;
if 365<=abs(d(i)-d(j))<=730 then count=count+1;
end;
end;
drop j i _name_;
run;
Good news: We've extended SAS Hackathon registration until Sept. 12, so you still have time to be part of our biggest event yet – our five-year anniversary!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.