I have a data set like below:
data have; input ID (begindate enddate)(:mmddyy10.) (begindate2 enddate2)(:mmddyy10.); format begindate enddate begindate2 enddate2 mmddyy10.; datalines; 1 3/4/2020 6/5/2020 5/30/2020 4/28/2021 1 5/16/2020 2/21/2021 5/30/2020 4/28/2021 1 7/18/2020 10/24/2020 5/30/2020 4/28/2021 1 12/30/2020 9/17/2022 5/30/2020 4/28/2021 2 6/20/2020 7/20/2020 7/12/2020 8/1/2021 2 9/1/2020 11/13/2020 7/12/2020 8/1/2021 2 1/15/2021 7/21/2021 7/12/2020 8/1/2021 3 5/20/2020 10/21/2020 3/10/2020 3/12/2021 3 8/10/2020 6/1/2021 3/10/2020 3/12/2021 ;
The purpose is to compute the total days of gap of begindate
and enddate
within the range of begindate2
and enddate2
.
For example, for ID two, based on the begindate
and enddate
several gap noticed: 7/20/2020 to 9/1/2020, and 11/13/2020 to 1/15/2021. And we also need to take the range of begindate2
and enddate2
into consideration, therefore, another gap noticed: from 7/21/2021 to 8/1/2021. For ID three, one gap also noticed: from 3/10/2020(begindate2
) to 5/20/2020(begindate
).
There is one thing I probably need to mention, some dates are not in the order. If you look at the second and third row of ID 1, the second row ends at 2/21/2021, and the third row begins at 7/18/2020.
Not only I want to identify those gaps by taking account of the range of begindate2
and enddate2
, but also I need to compute the total days of those gaps. Can anyone help me with it? Thank you!
Since you want total gaps days, it seems you want only one observation per ID, with values for the ID, BEGINDATE2, ENDDATE2, and new variables TOTAL_DAYS_IN_RANGE, TOTAL_GAP_DAYS, and N_GAPS.
Of course, this assumes that each ID has a constant pair of BEGINDATE2 and ENDDATE2 for all observations.
The technique is to create a history array, indexed by date, ranging from begindate2 through enddate2. The array will have dummy variables, with a value of 1 for days covering all pairs of begindate/enddate within the begdate2/enddate2 history range, and zero for all other dates within the begdate2/enddate2 range (i.e. 0 for all gap dates). And dates in the array but outside of begindate2/enddate2, are all set to missing.
data have;
input ID (begindate enddate)(:mmddyy10.) (begindate2 enddate2)(:mmddyy10.);
format begindate enddate begindate2 enddate2 mmddyy10.;
datalines;
1 3/4/2020 6/5/2020 5/30/2020 4/28/2021
1 5/16/2020 2/21/2021 5/30/2020 4/28/2021
1 7/18/2020 10/24/2020 5/30/2020 4/28/2021
1 12/30/2020 9/17/2022 5/30/2020 4/28/2021
2 6/20/2020 7/20/2020 7/12/2020 8/1/2021
2 9/1/2020 11/13/2020 7/12/2020 8/1/2021
2 1/15/2021 7/21/2021 7/12/2020 8/1/2021
3 5/20/2020 10/21/2020 3/10/2020 3/12/2021
3 8/10/2020 6/1/2021 3/10/2020 3/12/2021
run;
%let beg_study=01jan2020;
%let end_study=01jan2023;
%let study_length=%sysfunc(intck(day,"&beg_study"d,"&end_study"d));
%put _user_;
data want (keep=id begindate2 enddate2 total_: n_gaps);
set have;
by id ;
array history {%sysevalf("01jan2020"d):%sysevalf("31dec2022"d)} _temporary_;
if first.id then do;
call missing(of history{*});
do d=begindate2 to enddate2;
history{d}=0;
end;
end;
do d=max(begindate,begindate2) to min(enddate,enddate2);
history{d}=1;
end;
if last.id;
length strng $&study_length ;
total_days_in_range = enddate2+1-begindate2;
total_gap_days = total_days_in_range- sum(of history{*});
strng=compress(cats(of history{*}),'.'); *Remove .'s and left justify*;
n_gaps=countw(trim(strng),'1'); *Count gaps, but ignore trailing blanks*;
run;
At the end of each ID, just subtract the sum of 1's for (total nongap days) from the number of days in the begindate2/enddate2 range.
The n_gaps is calculated via this approach.
In this case, a "word" is a string of zeroes, bounded on each side by a one, or by the beginning or end of the string.
I think that you need to walk through actually calculating the value of the desired result for at least one case explaining where each number comes from. I get confused because of the constant use of "gap" and I can't tell which context it is used in your example. Also, since you don't show the result of any of the calculations in the example I have no idea what the result should be. I am also not sure exactly what
You also need to describe how the "out of order" you describe is to be handled.
What actual role in this process does the variable CATE play? I don't see it mentioned in the description anywhere. If not needed then drop it.
One suspects that the first step is identifying any gap completely ignoring your begindate2/enddate2 variables since they apparently do not change for any of the Id values. If they do, you need to provide examples.
Overlaps/continuous dates are relatively frequent questions but each has its own flavor.
Thank you for the reply. I will walk through the whole process. First of all, the begindate2/enddate2 do not change for any of the ID values because it's the standard time range that I am interested in. Begindate/enddate are the time range for a product that each person owned, they might own it for different time range. What I want to do is to find out, during the begindate2/enddate2, are there any time gaps for Begindate/enddate.
I will take ID 2 as example to demonstrate the process:
1, we use begindate2/enddate2 as the standard time range, which is from 7/12/2020 to 8/1/2021.
2, the first ID 2's Begindate/enddate are 6/20/2020 to 7/20/2020, 6/20/2020 is before 7/12/2020, and 7/20/2020 is before 8/1/2021, so there is no gap.
3, the second ID 2's Begindate/enddate are 9/1/2020 to 11/13/2020, compare to the first ID 2's Begindate/enddate, we notice that the enddate is 7/20/2020, but the begindate is 9/1/2020, so we compute the gap of days are (9/1/2020-7/20/2020) = 41.
4, the third ID 2's Begindate/enddate are 1/15/2021 to 7/21/2021, we noticed two time gap: the first gap is (1/15/2021-11/13/2020)= 53, and the second gap is (8/1/2021[enddate2]-7/21/2021)=9.
5. the total time gap within the range 7/12/2020 to 8/1/2021 are 41+53+9=103.
That's pretty much what I thought the process might be. And the reason that I mention the time order is because I notice that some enddate are later than the next begindate, I used to try the lag(enddate)
, because the enddate are later than the next begindate, I don't think that statement works for this kind of situation?
Yeah, the CATE $ doesn't matter here, so I have removed it in the post. Thank you for taking the time!
Since you want total gaps days, it seems you want only one observation per ID, with values for the ID, BEGINDATE2, ENDDATE2, and new variables TOTAL_DAYS_IN_RANGE, TOTAL_GAP_DAYS, and N_GAPS.
Of course, this assumes that each ID has a constant pair of BEGINDATE2 and ENDDATE2 for all observations.
The technique is to create a history array, indexed by date, ranging from begindate2 through enddate2. The array will have dummy variables, with a value of 1 for days covering all pairs of begindate/enddate within the begdate2/enddate2 history range, and zero for all other dates within the begdate2/enddate2 range (i.e. 0 for all gap dates). And dates in the array but outside of begindate2/enddate2, are all set to missing.
data have;
input ID (begindate enddate)(:mmddyy10.) (begindate2 enddate2)(:mmddyy10.);
format begindate enddate begindate2 enddate2 mmddyy10.;
datalines;
1 3/4/2020 6/5/2020 5/30/2020 4/28/2021
1 5/16/2020 2/21/2021 5/30/2020 4/28/2021
1 7/18/2020 10/24/2020 5/30/2020 4/28/2021
1 12/30/2020 9/17/2022 5/30/2020 4/28/2021
2 6/20/2020 7/20/2020 7/12/2020 8/1/2021
2 9/1/2020 11/13/2020 7/12/2020 8/1/2021
2 1/15/2021 7/21/2021 7/12/2020 8/1/2021
3 5/20/2020 10/21/2020 3/10/2020 3/12/2021
3 8/10/2020 6/1/2021 3/10/2020 3/12/2021
run;
%let beg_study=01jan2020;
%let end_study=01jan2023;
%let study_length=%sysfunc(intck(day,"&beg_study"d,"&end_study"d));
%put _user_;
data want (keep=id begindate2 enddate2 total_: n_gaps);
set have;
by id ;
array history {%sysevalf("01jan2020"d):%sysevalf("31dec2022"d)} _temporary_;
if first.id then do;
call missing(of history{*});
do d=begindate2 to enddate2;
history{d}=0;
end;
end;
do d=max(begindate,begindate2) to min(enddate,enddate2);
history{d}=1;
end;
if last.id;
length strng $&study_length ;
total_days_in_range = enddate2+1-begindate2;
total_gap_days = total_days_in_range- sum(of history{*});
strng=compress(cats(of history{*}),'.'); *Remove .'s and left justify*;
n_gaps=countw(trim(strng),'1'); *Count gaps, but ignore trailing blanks*;
run;
At the end of each ID, just subtract the sum of 1's for (total nongap days) from the number of days in the begindate2/enddate2 range.
The n_gaps is calculated via this approach.
In this case, a "word" is a string of zeroes, bounded on each side by a one, or by the beginning or end of the string.
@SAS-questioner wrote:
... stuff deleted ...
How did you come up this idea? ...
Forget the functions for generating N_GAPS. The tool of primary interest here (the "idea") is the HISTORY array. The idea of using an array indexed by dates is an old one, but (to my knowledge) is not a technique taught in SAS programming classes. (Only so much can fit in the syllabus).
There are two important aspects of this array:
Regarding the lower/upper bounds point: Usually you can declare an array like
array history {1096} _temporary_;
which has elements history{1} through history{1096}.
But you can also declare an array like this:
array history {21915:23010} _temporary_ ;
which also has 1096 elements, but the leftmost element is history{21915} (not history{1}) and the rightmost is history{23010} instead of history{1096}.
And why did I choose those values as the bounds? Because they are the SAS date values for 01jan2020 and 31dec2022 (which I took as the likely earliest and latest dates in your study).
Unfortunately, SAS doesn't accept this intuitively obvious syntax
array history {"01jan2020"d:"31dec2022"d} _temporary_ ;
So, because I didn't want to manually determine the actual date values 21915 and 23010, I used the macro function %sysevalf to do it for me. But the macro function is just icing on the cake.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.