Solved: Re: Comparing dates and creating variable

newsas007 · Posted 02-12-2021 08:52 PM

Hi all,

I am trying to compare the start and end dates with the entered date variable and create a new variable for status. Each ID will have only one entered date.

Rule 1 – If missing Start and End by ID then Status = Unknown

Rule 2 – if not missing start and end; If start or End dates are available then compare the closest date that matches ‘Entered’ variable date and output.

For example, for ID #1 only 1 ‘Start’ date is available, Comparing this date with ‘Entered’ date i.e. Entered Date > Start Date and the corresponding ‘LOTS’ = 2, therefore output Status = After 2.

Similarly, for ID# 13, Only one available END date and this end date corresponds to LOTS =2, therefore output LOTS = After 2.

Next for IDs with multiple complete start and end dates choose the closest end date compared to Entered Date. For example, ID#2 has 3 start and end dates and the closest end date for the Entered date happened to be in LOTS=3, therefore status = 3.

For ID#3, the Entered date > the closest start date (8/14/2016) therefore Status = After 2

Finally, For any LOTS > 4, the status will always be After 4. For example ID#20 and 21, the LOTS are greater than 4 so therefore status will be After 4

Data Want:

ID	Entered	Start	End	LOTS	Status
1	12/13/2016	9/21/2016		2	After 2
2	12/14/2016	6/14/2016	11/14/2016	3	After 3
3	1/2/2017	8/14/2016		2	After 2
4	1/25/2017	9/14/2016	12/15/2016	2	After 2
5	3/23/2017	11/14/2016	2/14/2017	0	After 0
6	4/30/2017	1/10/2017	4/14/2017	2	After 2
7	7/17/2017			.	Unknown
8	8/24/2017			.	Unknown
9	1/31/2018			.	Unknown
10	2/11/2018	10/12/2017	12/21/2017	-1	After -1
11	4/22/2018	9/4/2017	1/15/2018	-1	After -1
12	6/11/2018	3/17/2018	5/15/2018	3	After 3
13	6/7/2018		2/1/2018	2	After 2
14	7/30/2018	9/14/2017	4/30/2018	1	After 1
15	8/19/2018			.	Unknown
16	8/21/2018	6/14/2018	6/15/2018	0	After 0
17	9/10/2018	4/8/2018	5/31/2018	0	After 0
18	12/12/2018	7/15/2018	11/14/2018	4	After 4
19	12/16/2018	9/14/2018	11/14/2018	1	After 1
20	1/14/2019	11/15/2015	1/25/2017	7	After 4
21	3/20/2019	8/14/2017	1/14/2018	5	After 4
22	4/23/2019	4/7/2019	4/8/2019	0	After 0
23	7/10/2019			.	Unknown

Data Have:

ID	Entered	Start	End	LOTS
1	12/13/2016			1
1	12/13/2016	9/21/2016		2
2	12/14/2016	10/28/2015	12/28/2015	1
2	12/14/2016	2/14/2016	6/13/2016	2
2	12/14/2016	6/14/2016	11/14/2016	3
3	1/2/2017	10/15/2011	8/14/2015	0
3	1/2/2017	10/23/2015	8/13/2016	1
3	1/2/2017	8/14/2016		2
4	1/25/2017	4/14/2016	9/13/2016	1
4	1/25/2017	9/14/2016	12/15/2016	2
5	3/23/2017	11/14/2016	2/14/2017	0
6	4/30/2017	3/24/2016	12/18/2016	1
6	4/30/2017	1/10/2017	4/14/2017	2
7	7/17/2017			.
8	8/24/2017			.
9	1/31/2018			.
10	2/11/2018	10/12/2017	12/21/2017	-1
11	4/22/2018	9/4/2017	1/15/2018	-1
12	6/11/2018	4/14/2017	8/14/2017	1
12	6/11/2018	10/15/2017	2/14/2018	2
12	6/11/2018	3/17/2018	5/15/2018	3
13	6/7/2018			0
13	6/7/2018			1
13	6/7/2018		2/1/2018	2
14	7/30/2018	9/14/2017	4/30/2018	1
15	8/19/2018			.
16	8/21/2018	6/14/2018	6/15/2018	0
17	9/10/2018	4/8/2018	5/31/2018	0
18	12/12/2018	7/15/2018	11/14/2018	4
19	12/16/2018	3/27/2015	9/13/2018	0
19	12/16/2018	9/14/2018	11/14/2018	1
20	1/14/2019	11/15/2015	1/25/2017	7
21	3/20/2019	8/14/2017	1/14/2018	5
21	3/20/2019			2
22	4/23/2019	4/7/2019	4/8/2019	0
23	7/10/2019			.

mkeintz · Posted 02-13-2021 03:46 PM

~~You still haven't answered my question about what you want when the ENTERED DATE is equi-distant from the closest START and closest END.~~ Editted note: when equi-distant choose the END date an LOTS value.

But in ignorance of that objective, below is code that I believe you can use for the objectives as I understand them.

The strategy is to check each incoming START and END, keeping progressive track of the closest date value (CLOSEST_DATE) and its distance (CLOSEST_DIST) from ENTERED. And also keep track of the observation number (variable PTR) containing that date.

Then, at the end of each ID, re-reread the observation indicated by PTR, as the sole output record for that ID. Here's the code:

data have;
  input id (entered start end) (:mmddyy8.)  lots;
  format entered start end mmddyy10.;
datalines;
1       12/13/2016            .          .              1
1       12/13/2016      9/21/2016        .              2
2       12/14/2016      10/28/2015      12/28/2015      1
2       12/14/2016      2/14/2016       6/13/2016       2
2       12/14/2016      6/14/2016       11/14/2016      3
3       1/2/2017        10/15/2011      8/14/2015       0
3       1/2/2017        10/23/2015      8/13/2016       1
3       1/2/2017        8/14/2016        .              2
4       1/25/2017       4/14/2016       9/13/2016       1
4       1/25/2017       9/14/2016       12/15/2016      2
5       3/23/2017       11/14/2016      2/14/2017       0
6       4/30/2017       3/24/2016       12/18/2016      1
6       4/30/2017       1/10/2017       4/14/2017       2
7       7/17/2017            .           .              .
8       8/24/2017            .           .              .
9       1/31/2018            .           .              .
10      2/11/2018       10/12/2017      12/21/2017      -1
11      4/22/2018       9/4/2017        1/15/2018       -1
12      6/11/2018       4/14/2017       8/14/2017       1
12      6/11/2018       10/15/2017      2/14/2018       2
12      6/11/2018       3/17/2018       5/15/2018       3
13      6/7/2018             .           .              0
13      6/7/2018             .           .              1
13      6/7/2018             .           2/1/2018       2
14      7/30/2018       9/14/2017       4/30/2018       1
15      8/19/2018            .           .              .
16      8/21/2018       6/14/2018       6/15/2018       0
17      9/10/2018       4/8/2018        5/31/2018       0
18      12/12/2018      7/15/2018       11/14/2018      4
19      12/16/2018      3/27/2015       9/13/2018       0
19      12/16/2018      9/14/2018       11/14/2018      1
20      1/14/2019       11/15/2015      1/25/2017       7
21      3/20/2019       8/14/2017       1/14/2018       5
21      3/20/2019           .            .              2
22      4/23/2019       4/7/2019        4/8/2019        0
23      7/10/2019           .            .              .
run;

data want (drop=closest_: date);
  set have;
  by id;
  if first.id then PTR=_N_;   /** Initialize the pointer for each ID **/

  retain closest_distance closest_date;
  if first.id then call missing(of closest_:);

  do date=start,end;
    if date=. then continue;  ** Skip this iteration if date is missing **;
    if min(abs(entered-date),closest_distance) = abs(entered-date) then do;
      closest_date=date;
      closest_distance=abs(entered-date);
      ptr=_n_;
    end;
  end;

  if last.id;
  set have point=ptr;   ** Re-read the designated observation **;
  length status $9;
  if nmiss(start,end)=2 then status='Unknown'; else
  if lots>4 then status='After 4'; else       /** Revised **/
  status=catx(' ','After',lots);
run;

BTW, your dataset WANT has 2 observations (edited addition: for ID 19), which I presume is in error. This code produces 1 obs per id.

Comments on changed code:

After your response to the first version of this not, I modified the code above. I added the statement

 if first.id then PTR=_N_;   /** Initialize the pointer for each ID **/

which prevents carrying forward PTR values when an ID generates an UNKNOWN status (no minimum values found). And also added the line:

if lots>4 then status='After 4'; else

Also, if the closest END and closest START are equi-distant from ENTERED, this will chose the last qualifying date encountered.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

mkeintz · Posted 02-12-2021 11:41 PM

What if you have:

ID	Entered	Start	End	Lots
25	10/15/2020			0
25	10/15/2020	9/15/2020		1
25	10/15/2020		11/15/2020	2

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

newsas007 · Posted 02-13-2021 12:53 AM

The entered date (10/15/2020) is closest/nearest to end date ( 11/15/2020); so therefore it is after 2

mkeintz · Posted 02-13-2021 10:59 AM

And what if the entered date is exactly half-way between START and END?

What if you have:

ID	Entered	Start	End	Lots
25	10/15/2020			0
25	10/15/2020	9/15/2020		1
25	10/15/2020		11/14/2020	2

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

newsas007 · Posted 02-13-2021 11:18 AM

This scenario where there is a end date greater than entered date doesn't happen. Basically, row #3 will not exist.

ballardw · Posted 02-13-2021 11:29 AM

@newsas007 wrote:
This scenario where there is a end date greater than entered date doesn't happen. Basically, row #3 will not exist.

I'll say that I hope that is true.

In the data I deal with I have people that "end" a program before they "start" because of data entry errors.

And people with medical tests on dates that will not occur until several years in the future or several years before the testing program began.

If you have confirmed your statement, good for you. If you believe that without testing it you might be surprised. "Shouldn't" unfortunately does not mean "doesn't" in real data entry.

newsas007 · Posted 02-13-2021 11:36 AM

@ballardw, That is an excellent point. Just in case this does happen, then it should be after 1.

mkeintz · Posted 02-13-2021 03:46 PM

~~You still haven't answered my question about what you want when the ENTERED DATE is equi-distant from the closest START and closest END.~~ Editted note: when equi-distant choose the END date an LOTS value.

But in ignorance of that objective, below is code that I believe you can use for the objectives as I understand them.

The strategy is to check each incoming START and END, keeping progressive track of the closest date value (CLOSEST_DATE) and its distance (CLOSEST_DIST) from ENTERED. And also keep track of the observation number (variable PTR) containing that date.

Then, at the end of each ID, re-reread the observation indicated by PTR, as the sole output record for that ID. Here's the code:

data have;
  input id (entered start end) (:mmddyy8.)  lots;
  format entered start end mmddyy10.;
datalines;
1       12/13/2016            .          .              1
1       12/13/2016      9/21/2016        .              2
2       12/14/2016      10/28/2015      12/28/2015      1
2       12/14/2016      2/14/2016       6/13/2016       2
2       12/14/2016      6/14/2016       11/14/2016      3
3       1/2/2017        10/15/2011      8/14/2015       0
3       1/2/2017        10/23/2015      8/13/2016       1
3       1/2/2017        8/14/2016        .              2
4       1/25/2017       4/14/2016       9/13/2016       1
4       1/25/2017       9/14/2016       12/15/2016      2
5       3/23/2017       11/14/2016      2/14/2017       0
6       4/30/2017       3/24/2016       12/18/2016      1
6       4/30/2017       1/10/2017       4/14/2017       2
7       7/17/2017            .           .              .
8       8/24/2017            .           .              .
9       1/31/2018            .           .              .
10      2/11/2018       10/12/2017      12/21/2017      -1
11      4/22/2018       9/4/2017        1/15/2018       -1
12      6/11/2018       4/14/2017       8/14/2017       1
12      6/11/2018       10/15/2017      2/14/2018       2
12      6/11/2018       3/17/2018       5/15/2018       3
13      6/7/2018             .           .              0
13      6/7/2018             .           .              1
13      6/7/2018             .           2/1/2018       2
14      7/30/2018       9/14/2017       4/30/2018       1
15      8/19/2018            .           .              .
16      8/21/2018       6/14/2018       6/15/2018       0
17      9/10/2018       4/8/2018        5/31/2018       0
18      12/12/2018      7/15/2018       11/14/2018      4
19      12/16/2018      3/27/2015       9/13/2018       0
19      12/16/2018      9/14/2018       11/14/2018      1
20      1/14/2019       11/15/2015      1/25/2017       7
21      3/20/2019       8/14/2017       1/14/2018       5
21      3/20/2019           .            .              2
22      4/23/2019       4/7/2019        4/8/2019        0
23      7/10/2019           .            .              .
run;

data want (drop=closest_: date);
  set have;
  by id;
  if first.id then PTR=_N_;   /** Initialize the pointer for each ID **/

  retain closest_distance closest_date;
  if first.id then call missing(of closest_:);

  do date=start,end;
    if date=. then continue;  ** Skip this iteration if date is missing **;
    if min(abs(entered-date),closest_distance) = abs(entered-date) then do;
      closest_date=date;
      closest_distance=abs(entered-date);
      ptr=_n_;
    end;
  end;

  if last.id;
  set have point=ptr;   ** Re-read the designated observation **;
  length status $9;
  if nmiss(start,end)=2 then status='Unknown'; else
  if lots>4 then status='After 4'; else       /** Revised **/
  status=catx(' ','After',lots);
run;

BTW, your dataset WANT has 2 observations (edited addition: for ID 19), which I presume is in error. This code produces 1 obs per id.

Comments on changed code:

After your response to the first version of this not, I modified the code above. I added the statement

 if first.id then PTR=_N_;   /** Initialize the pointer for each ID **/

which prevents carrying forward PTR values when an ID generates an UNKNOWN status (no minimum values found). And also added the line:

if lots>4 then status='After 4'; else

Also, if the closest END and closest START are equi-distant from ENTERED, this will chose the last qualifying date encountered.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

newsas007 · Posted 02-13-2021 05:17 PM

Hi @mkeintz,
Sorry, when ENTERED DATE is equi-distant from the closest START and closest END - then it gets assigned to the end or corresponding END LOTS.
Also, this program works for the most part but the 'Unkowns' are not listed. ID#6 repeats several times as well.
Yes, i would like to have just all Unique IDs in the final data.
Thank you so much,

newsas007 · Posted 02-15-2021 12:30 PM

This works great! Thank you

Classroom Training Available!