BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Chaupak
Obsidian | Level 7

Hi, I have a question related to date interval. I have a dataset in which each subject has multiple observations (different dates). I was asked to identify subjects who have at least two records with their date between 1 and 2 years. The sample data is shown below:

 

data WORK.SAMPLE;
  infile datalines dsd truncover;
  input id 1. date:YYMMDD10.;
  format date YYMMDD10.;
datalines;
1 2012-07-24
1 2015-04-09
1 2018-03-15
1 2020-01-16
2 2019-12-02
2 2020-04-27
2 2020-05-18
3 2017-10-31
3 2018-01-09
3 2018-12-10
3 2019-04-08
3 2019-09-11
3 2020-04-10
3 2020-04-10
3 2020-12-24
;;;;

For example, there are four dates for id 1, its third and fourth observations meet the requirement; id 2 doesn't meet the requirement because the largest time range is less than 1 year; id 3 meets the requirement (2017-10-31 to 2018-12-10, etc.). My question is, how can I identify two records that meet requirement from multiple records efficiently? The only way that I came up with is to compare all records one by one (e.g. 1st date vs 2nd date, 1st date vs 3rd date, 2nd date vs 3rd date), but I don't think it's the best way to do it

 

Thanks!

1 ACCEPTED SOLUTION

Accepted Solutions
PaigeMiller
Diamond | Level 26

Normally, I advise against wide data sets, preferring long data sets, but it seems to me the easiest method here is to create a wide data set using PROC TRANSPOSE, and then use ARRAYs to do all possible comparisons within an ID. This code really doesn't adjust for leap years, if that's important then you can use the INTCK function.

 

proc transpose data=sample prefix=date out=sample_t;
    var date;
    by id;
run;

data want;
    set sample_t;
    array d date:;
    count=0;
    do i=1 to dim(d)-1;
        do j=(i+1) to dim(d);
            if 365<=abs(d(i)-d(j))<=730 then count=count+1;
        end;
    end;
    drop j i _name_;
run;

Adding, I'm sure its possible without a transpose, using DOW loops (as explained by @PeterClemmensen) or hash objects, but this seems pretty simple to me.

--
Paige Miller

View solution in original post

6 REPLIES 6
PaigeMiller
Diamond | Level 26

Is date range always computed from the first date for an ID? Or is date range computed from the difference between the date on a given row and the date on the row before it? Or something else?

--
Paige Miller
Chaupak
Obsidian | Level 7

Hi, the date range may be computed from any two records for a single subject. They are not necessarily compared with the earliest date or compared with the adjacent records. That's what I was most concerned about. 

PaigeMiller
Diamond | Level 26

Normally, I advise against wide data sets, preferring long data sets, but it seems to me the easiest method here is to create a wide data set using PROC TRANSPOSE, and then use ARRAYs to do all possible comparisons within an ID. This code really doesn't adjust for leap years, if that's important then you can use the INTCK function.

 

proc transpose data=sample prefix=date out=sample_t;
    var date;
    by id;
run;

data want;
    set sample_t;
    array d date:;
    count=0;
    do i=1 to dim(d)-1;
        do j=(i+1) to dim(d);
            if 365<=abs(d(i)-d(j))<=730 then count=count+1;
        end;
    end;
    drop j i _name_;
run;

Adding, I'm sure its possible without a transpose, using DOW loops (as explained by @PeterClemmensen) or hash objects, but this seems pretty simple to me.

--
Paige Miller
Chaupak
Obsidian | Level 7

Thank you so much for the sample solution! It worked for me when the dataset is small. However, my dataset is large and there are a few extreme values for dates (~1400 dates), which makes the number of variables so large and creates many missing values for dates in other subjects. I can exclude the subjects with extreme dates first, and run the loop separately. I wonder if there is any methods to make this process faster. Anyway, many thanks to your solution!

Chaupak
Obsidian | Level 7

Found a solution to skip missing values and improve the speed. Just add a LEAVE statement when we meet a missing value.

data want;
    set sample_t;
    array d date:;
    count=0;
    do i=1 to dim(d)-1;
        do j=(i+1) to dim(d);
            if 365<=abs(d(i)-d(j))<=730 then count=count+1;
            if d(j) = . then leave;
        end;
        if d(i) = . then leave;
    end;
    drop j i _name_;
run;
PaigeMiller
Diamond | Level 26

Wouldn't this be even faster?

 

data want;
    set sample_t;
    array d date:;
    count=0;
    do i=1 to dim(d)-1;
        if d(i) = . then leave;
        do j=(i+1) to dim(d);
            if d(j) = . then leave;
            if 365<=abs(d(i)-d(j))<=730 then count=count+1;
        end;
    end;
    drop j i _name_;
run;
--
Paige Miller

hackathon24-white-horiz.png

2025 SAS Hackathon: There is still time!

Good news: We've extended SAS Hackathon registration until Sept. 12, so you still have time to be part of our biggest event yet – our five-year anniversary!

Register Now

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 6 replies
  • 1808 views
  • 2 likes
  • 2 in conversation