Solved: How to compute gap within specific time range for data with gaps and o...

SAS-questioner · Posted 10-11-2023 05:28 PM

I have a data set like below:

data have;
input ID (begindate	enddate)(:mmddyy10.) (begindate2	enddate2)(:mmddyy10.);
format begindate enddate begindate2	enddate2 mmddyy10.;
datalines;
1 3/4/2020     6/5/2020         5/30/2020    4/28/2021
1 5/16/2020    2/21/2021        5/30/2020    4/28/2021
1 7/18/2020    10/24/2020       5/30/2020    4/28/2021
1 12/30/2020   9/17/2022        5/30/2020    4/28/2021
2 6/20/2020    7/20/2020        7/12/2020    8/1/2021
2 9/1/2020     11/13/2020       7/12/2020    8/1/2021
2 1/15/2021    7/21/2021        7/12/2020    8/1/2021
3 5/20/2020    10/21/2020       3/10/2020    3/12/2021
3 8/10/2020    6/1/2021         3/10/2020    3/12/2021
;

The purpose is to compute the total days of gap of begindate and enddate within the range of begindate2 and enddate2.

For example, for ID two, based on the begindate and enddate several gap noticed: 7/20/2020 to 9/1/2020, and 11/13/2020 to 1/15/2021. And we also need to take the range of begindate2 and enddate2 into consideration, therefore, another gap noticed: from 7/21/2021 to 8/1/2021. For ID three, one gap also noticed: from 3/10/2020(begindate2) to 5/20/2020(begindate).

There is one thing I probably need to mention, some dates are not in the order. If you look at the second and third row of ID 1, the second row ends at 2/21/2021, and the third row begins at 7/18/2020.

Not only I want to identify those gaps by taking account of the range of begindate2 and enddate2, but also I need to compute the total days of those gaps. Can anyone help me with it? Thank you!

mkeintz · Posted 10-12-2023 01:11 PM

Since you want total gaps days, it seems you want only one observation per ID, with values for the ID, BEGINDATE2, ENDDATE2, and new variables TOTAL_DAYS_IN_RANGE, TOTAL_GAP_DAYS, and N_GAPS.

Of course, this assumes that each ID has a constant pair of BEGINDATE2 and ENDDATE2 for all observations.

The technique is to create a history array, indexed by date, ranging from begindate2 through enddate2. The array will have dummy variables, with a value of 1 for days covering all pairs of begindate/enddate within the begdate2/enddate2 ~~history~~ range, and zero for all other dates within the begdate2/enddate2 range (i.e. 0 for all gap dates). And dates in the array but outside of begindate2/enddate2, are all set to missing.

data have;
  input ID (begindate	enddate)(:mmddyy10.) (begindate2	enddate2)(:mmddyy10.);
  format begindate enddate begindate2	enddate2 mmddyy10.;
datalines;
1 3/4/2020     6/5/2020         5/30/2020    4/28/2021
1 5/16/2020    2/21/2021        5/30/2020    4/28/2021
1 7/18/2020    10/24/2020       5/30/2020    4/28/2021
1 12/30/2020   9/17/2022        5/30/2020    4/28/2021
2 6/20/2020    7/20/2020        7/12/2020    8/1/2021
2 9/1/2020     11/13/2020       7/12/2020    8/1/2021
2 1/15/2021    7/21/2021        7/12/2020    8/1/2021
3 5/20/2020    10/21/2020       3/10/2020    3/12/2021
3 8/10/2020    6/1/2021         3/10/2020    3/12/2021
run;

%let beg_study=01jan2020;
%let end_study=01jan2023;
%let study_length=%sysfunc(intck(day,"&beg_study"d,"&end_study"d));
%put _user_;

data want (keep=id begindate2 enddate2  total_: n_gaps);
  set have;
  by id ;
  array history {%sysevalf("01jan2020"d):%sysevalf("31dec2022"d)} _temporary_;

  if first.id then do;
    call missing(of history{*});
    do d=begindate2 to enddate2;
      history{d}=0;
    end;
  end;
  do d=max(begindate,begindate2) to min(enddate,enddate2);
    history{d}=1;
  end;

  if last.id;
  length strng $&study_length ;
  total_days_in_range = enddate2+1-begindate2;
  total_gap_days = total_days_in_range- sum(of history{*});

  strng=compress(cats(of history{*}),'.');    *Remove .'s and left justify*;
  n_gaps=countw(trim(strng),'1');             *Count gaps, but ignore trailing blanks*;
run;

At the end of each ID, just subtract the sum of 1's for (total nongap days) from the number of days in the begindate2/enddate2 range.

The n_gaps is calculated via this approach.

Make a concatenated string of .'s, 0',s and 1.s from the history array: cats(of history{*})
Remove all the .'s, and left justify: compress(cats(of history{*}),'.')
Trim the trailing blanks: trim(strng)
Count the number of words (where 1's are viewed as word separators.:
countw(trim(strng),'1')

In this case, a "word" is a string of zeroes, bounded on each side by a one, or by the beginning or end of the string.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

ballardw · Posted 10-11-2023 06:43 PM

I think that you need to walk through actually calculating the value of the desired result for at least one case explaining where each number comes from. I get confused because of the constant use of "gap" and I can't tell which context it is used in your example. Also, since you don't show the result of any of the calculations in the example I have no idea what the result should be. I am also not sure exactly what

You also need to describe how the "out of order" you describe is to be handled.

What actual role in this process does the variable CATE play? I don't see it mentioned in the description anywhere. If not needed then drop it.

One suspects that the first step is identifying any gap completely ignoring your begindate2/enddate2 variables since they apparently do not change for any of the Id values. If they do, you need to provide examples.

Overlaps/continuous dates are relatively frequent questions but each has its own flavor.

SAS-questioner · Posted 10-11-2023 08:15 PM

Thank you for the reply. I will walk through the whole process. First of all, the begindate2/enddate2 do not change for any of the ID values because it's the standard time range that I am interested in. Begindate/enddate are the time range for a product that each person owned, they might own it for different time range. What I want to do is to find out, during the begindate2/enddate2, are there any time gaps for Begindate/enddate.

I will take ID 2 as example to demonstrate the process:

1, we use begindate2/enddate2 as the standard time range, which is from 7/12/2020 to 8/1/2021.

2, the first ID 2's Begindate/enddate are 6/20/2020 to 7/20/2020, 6/20/2020 is before 7/12/2020, and 7/20/2020 is before 8/1/2021, so there is no gap.

3, the second ID 2's Begindate/enddate are 9/1/2020 to 11/13/2020, compare to the first ID 2's Begindate/enddate, we notice that the enddate is 7/20/2020, but the begindate is 9/1/2020, so we compute the gap of days are (9/1/2020-7/20/2020) = 41.

4, the third ID 2's Begindate/enddate are 1/15/2021 to 7/21/2021, we noticed two time gap: the first gap is (1/15/2021-11/13/2020)= 53, and the second gap is (8/1/2021[enddate2]-7/21/2021)=9.

5. the total time gap within the range 7/12/2020 to 8/1/2021 are 41+53+9=103.

That's pretty much what I thought the process might be. And the reason that I mention the time order is because I notice that some enddate are later than the next begindate, I used to try the lag(enddate) , because the enddate are later than the next begindate, I don't think that statement works for this kind of situation?

Yeah, the CATE $ doesn't matter here, so I have removed it in the post. Thank you for taking the time!

mkeintz · Posted 10-12-2023 01:11 PM

Since you want total gaps days, it seems you want only one observation per ID, with values for the ID, BEGINDATE2, ENDDATE2, and new variables TOTAL_DAYS_IN_RANGE, TOTAL_GAP_DAYS, and N_GAPS.

Of course, this assumes that each ID has a constant pair of BEGINDATE2 and ENDDATE2 for all observations.

The technique is to create a history array, indexed by date, ranging from begindate2 through enddate2. The array will have dummy variables, with a value of 1 for days covering all pairs of begindate/enddate within the begdate2/enddate2 ~~history~~ range, and zero for all other dates within the begdate2/enddate2 range (i.e. 0 for all gap dates). And dates in the array but outside of begindate2/enddate2, are all set to missing.

data have;
  input ID (begindate	enddate)(:mmddyy10.) (begindate2	enddate2)(:mmddyy10.);
  format begindate enddate begindate2	enddate2 mmddyy10.;
datalines;
1 3/4/2020     6/5/2020         5/30/2020    4/28/2021
1 5/16/2020    2/21/2021        5/30/2020    4/28/2021
1 7/18/2020    10/24/2020       5/30/2020    4/28/2021
1 12/30/2020   9/17/2022        5/30/2020    4/28/2021
2 6/20/2020    7/20/2020        7/12/2020    8/1/2021
2 9/1/2020     11/13/2020       7/12/2020    8/1/2021
2 1/15/2021    7/21/2021        7/12/2020    8/1/2021
3 5/20/2020    10/21/2020       3/10/2020    3/12/2021
3 8/10/2020    6/1/2021         3/10/2020    3/12/2021
run;

%let beg_study=01jan2020;
%let end_study=01jan2023;
%let study_length=%sysfunc(intck(day,"&beg_study"d,"&end_study"d));
%put _user_;

data want (keep=id begindate2 enddate2  total_: n_gaps);
  set have;
  by id ;
  array history {%sysevalf("01jan2020"d):%sysevalf("31dec2022"d)} _temporary_;

  if first.id then do;
    call missing(of history{*});
    do d=begindate2 to enddate2;
      history{d}=0;
    end;
  end;
  do d=max(begindate,begindate2) to min(enddate,enddate2);
    history{d}=1;
  end;

  if last.id;
  length strng $&study_length ;
  total_days_in_range = enddate2+1-begindate2;
  total_gap_days = total_days_in_range- sum(of history{*});

  strng=compress(cats(of history{*}),'.');    *Remove .'s and left justify*;
  n_gaps=countw(trim(strng),'1');             *Count gaps, but ignore trailing blanks*;
run;

At the end of each ID, just subtract the sum of 1's for (total nongap days) from the number of days in the begindate2/enddate2 range.

The n_gaps is calculated via this approach.

Make a concatenated string of .'s, 0',s and 1.s from the history array: cats(of history{*})
Remove all the .'s, and left justify: compress(cats(of history{*}),'.')
Trim the trailing blanks: trim(strng)
Count the number of words (where 1's are viewed as word separators.:
countw(trim(strng),'1')

In this case, a "word" is a string of zeroes, bounded on each side by a one, or by the beginning or end of the string.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

SAS-questioner · Posted 10-13-2023 10:45 AM

You are a genius! It took me a while to understand your code. I couldn't create this code by myself ever! How did you come up this idea? Do you know any sources to improve my SAS code skills?

mkeintz · Posted 10-13-2023 11:50 PM

@SAS-questioner wrote:
... stuff deleted ...
How did you come up this idea? ...

Forget the functions for generating N_GAPS. The tool of primary interest here (the "idea") is the HISTORY array. The idea of using an array indexed by dates is an old one, but (to my knowledge) is not a technique taught in SAS programming classes. (Only so much can fit in the syllabus).

There are two important aspects of this array:

It is declared as _TEMPORARY_, so
- It does not create variables that will be in the new data set.
- Its values are automatically RETAINED from obs to obs, so you can build the array over a sequence of observations.
The array has lower and upper bounds corresponding to the dates 01jan2020 and 31dec2022.

Regarding the lower/upper bounds point: Usually you can declare an array like

array history {1096} _temporary_;

which has elements history{1} through history{1096}.

But you can also declare an array like this:

array history {21915:23010} _temporary_ ;

which also has 1096 elements, but the leftmost element is history{21915} (not history{1}) and the rightmost is history{23010} instead of history{1096}.

And why did I choose those values as the bounds? Because they are the SAS date values for 01jan2020 and 31dec2022 (which I took as the likely earliest and latest dates in your study).

Unfortunately, SAS doesn't accept this intuitively obvious syntax

array history {"01jan2020"d:"31dec2022"d} _temporary_ ;

So, because I didn't want to manually determine the actual date values 21915 and 23010, I used the macro function %sysevalf to do it for me. But the macro function is just icing on the cake.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

How to compute gap within specific time range for data with gaps and overlaps?

Re: How to compute gap within specific time range for data with gaps and overlaps?

Re: How to compute gap within specific time range for data with gaps and overlaps?

Re: How to compute gap within specific time range for data with gaps and overlaps?

Re: How to compute gap within specific time range for data with gaps and overlaps?

Re: How to compute gap within specific time range for data with gaps and overlaps?

Re: How to compute gap within specific time range for data with gaps and overlaps?

Classroom Training Available!