Solved: Identify a specific date range among multiple observations of date

Chaupak · Posted 11-15-2021 01:40 PM

Hi, I have a question related to date interval. I have a dataset in which each subject has multiple observations (different dates). I was asked to identify subjects who have at least two records with their date between 1 and 2 years. The sample data is shown below:

data WORK.SAMPLE;
  infile datalines dsd truncover;
  input id 1. date:YYMMDD10.;
  format date YYMMDD10.;
datalines;
1 2012-07-24
1 2015-04-09
1 2018-03-15
1 2020-01-16
2 2019-12-02
2 2020-04-27
2 2020-05-18
3 2017-10-31
3 2018-01-09
3 2018-12-10
3 2019-04-08
3 2019-09-11
3 2020-04-10
3 2020-04-10
3 2020-12-24
;;;;

For example, there are four dates for id 1, its third and fourth observations meet the requirement; id 2 doesn't meet the requirement because the largest time range is less than 1 year; id 3 meets the requirement (2017-10-31 to 2018-12-10, etc.). My question is, how can I identify two records that meet requirement from multiple records efficiently? The only way that I came up with is to compare all records one by one (e.g. 1st date vs 2nd date, 1st date vs 3rd date, 2nd date vs 3rd date), but I don't think it's the best way to do it.

Thanks!

PaigeMiller · Posted 11-15-2021 03:11 PM

Normally, I advise against wide data sets, preferring long data sets, but it seems to me the easiest method here is to create a wide data set using PROC TRANSPOSE, and then use ARRAYs to do all possible comparisons within an ID. This code really doesn't adjust for leap years, if that's important then you can use the INTCK function.

proc transpose data=sample prefix=date out=sample_t;
    var date;
    by id;
run;

data want;
    set sample_t;
    array d date:;
    count=0;
    do i=1 to dim(d)-1;
        do j=(i+1) to dim(d);
            if 365<=abs(d(i)-d(j))<=730 then count=count+1;
        end;
    end;
    drop j i _name_;
run;

Adding, I'm sure its possible without a transpose, using DOW loops (as explained by @PeterClemmensen) or hash objects, but this seems pretty simple to me.

--
Paige Miller

View solution in original post

PaigeMiller · Posted 11-15-2021 02:06 PM

Is date range always computed from the first date for an ID? Or is date range computed from the difference between the date on a given row and the date on the row before it? Or something else?

--
Paige Miller

Chaupak · Posted 11-15-2021 02:11 PM

Hi, the date range may be computed from any two records for a single subject. They are not necessarily compared with the earliest date or compared with the adjacent records. That's what I was most concerned about.

PaigeMiller · Posted 11-15-2021 03:11 PM

Normally, I advise against wide data sets, preferring long data sets, but it seems to me the easiest method here is to create a wide data set using PROC TRANSPOSE, and then use ARRAYs to do all possible comparisons within an ID. This code really doesn't adjust for leap years, if that's important then you can use the INTCK function.

proc transpose data=sample prefix=date out=sample_t;
    var date;
    by id;
run;

data want;
    set sample_t;
    array d date:;
    count=0;
    do i=1 to dim(d)-1;
        do j=(i+1) to dim(d);
            if 365<=abs(d(i)-d(j))<=730 then count=count+1;
        end;
    end;
    drop j i _name_;
run;

Adding, I'm sure its possible without a transpose, using DOW loops (as explained by @PeterClemmensen) or hash objects, but this seems pretty simple to me.

--
Paige Miller

Chaupak · Posted 11-15-2021 04:08 PM

Thank you so much for the sample solution! It worked for me when the dataset is small. However, my dataset is large and there are a few extreme values for dates (~1400 dates), which makes the number of variables so large and creates many missing values for dates in other subjects. I can exclude the subjects with extreme dates first, and run the loop separately. I wonder if there is any methods to make this process faster. Anyway, many thanks to your solution!

Chaupak · Posted 11-15-2021 04:27 PM

Found a solution to skip missing values and improve the speed. Just add a LEAVE statement when we meet a missing value.

data want;
    set sample_t;
    array d date:;
    count=0;
    do i=1 to dim(d)-1;
        do j=(i+1) to dim(d);
            if 365<=abs(d(i)-d(j))<=730 then count=count+1;
            if d(j) = . then leave;
        end;
        if d(i) = . then leave;
    end;
    drop j i _name_;
run;

PaigeMiller · Posted 11-15-2021 05:20 PM

Wouldn't this be even faster?

data want;
    set sample_t;
    array d date:;
    count=0;
    do i=1 to dim(d)-1;
        if d(i) = . then leave;
        do j=(i+1) to dim(d);
            if d(j) = . then leave;
            if 365<=abs(d(i)-d(j))<=730 then count=count+1;
        end;
    end;
    drop j i _name_;
run;

--
Paige Miller

Identify a specific date range among multiple observations of date

Re: Identify a specific date range among multiple observations of date

Re: Identify a specific date range among multiple observations of date

Re: Identify a specific date range among multiple observations of date

Re: Identify a specific date range among multiple observations of date

Re: Identify a specific date range among multiple observations of date

Re: Identify a specific date range among multiple observations of date

Re: Identify a specific date range among multiple observations of date

Registration is open

SAS Training: Just a Click Away