BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
newsas007
Quartz | Level 8

Hi all,

I am trying to compare the start and end dates with the entered date variable and create a new variable for status. Each ID will have only one entered date.

Rule 1 – If missing Start and End by ID then Status = Unknown

 

Rule 2 – if not missing start and end; If start or End dates are available then compare the closest date that matches ‘Entered’ variable date and output.

 

For example, for ID #1 only 1 ‘Start’ date is available, Comparing this date with ‘Entered’ date i.e. Entered Date > Start Date and the corresponding ‘LOTS’ = 2, therefore output Status = After 2.

 

Similarly, for ID# 13, Only one available END date and this end date corresponds to LOTS =2, therefore output LOTS = After 2.

 

Next for IDs with multiple complete start and end dates choose the closest end date compared to Entered Date. For example, ID#2 has 3 start and end dates and the closest end date for the Entered date happened to be in LOTS=3, therefore status = 3.

 

For ID#3, the Entered date > the closest start date (8/14/2016) therefore Status = After 2

 

Finally, For any LOTS > 4, the status will always be After 4. For example ID#20 and 21, the LOTS are greater than 4 so therefore status will be After 4

 

Data Want:

ID Entered Start End LOTS Status
1 12/13/2016 9/21/2016   2 After 2
2 12/14/2016 6/14/2016 11/14/2016 3 After 3
3 1/2/2017 8/14/2016   2 After 2
4 1/25/2017 9/14/2016 12/15/2016 2 After 2
5 3/23/2017 11/14/2016 2/14/2017 0 After 0
6 4/30/2017 1/10/2017 4/14/2017 2 After 2
7 7/17/2017     . Unknown
8 8/24/2017     . Unknown
9 1/31/2018     . Unknown
10 2/11/2018 10/12/2017 12/21/2017 -1 After -1
11 4/22/2018 9/4/2017 1/15/2018 -1 After -1
12 6/11/2018 3/17/2018 5/15/2018 3 After 3
13 6/7/2018   2/1/2018 2 After 2
14 7/30/2018 9/14/2017 4/30/2018 1 After 1
15 8/19/2018     . Unknown
16 8/21/2018 6/14/2018 6/15/2018 0 After 0
17 9/10/2018 4/8/2018 5/31/2018 0 After 0
18 12/12/2018 7/15/2018 11/14/2018 4 After 4
19 12/16/2018 9/14/2018 11/14/2018 1 After 1
20 1/14/2019 11/15/2015 1/25/2017 7 After 4
21 3/20/2019 8/14/2017 1/14/2018 5 After 4
22 4/23/2019 4/7/2019 4/8/2019 0 After 0
23 7/10/2019     . Unknown

 

Data Have:

ID Entered Start End LOTS
1 12/13/2016     1
1 12/13/2016 9/21/2016   2
2 12/14/2016 10/28/2015 12/28/2015 1
2 12/14/2016 2/14/2016 6/13/2016 2
2 12/14/2016 6/14/2016 11/14/2016 3
3 1/2/2017 10/15/2011 8/14/2015 0
3 1/2/2017 10/23/2015 8/13/2016 1
3 1/2/2017 8/14/2016   2
4 1/25/2017 4/14/2016 9/13/2016 1
4 1/25/2017 9/14/2016 12/15/2016 2
5 3/23/2017 11/14/2016 2/14/2017 0
6 4/30/2017 3/24/2016 12/18/2016 1
6 4/30/2017 1/10/2017 4/14/2017 2
7 7/17/2017     .
8 8/24/2017     .
9 1/31/2018     .
10 2/11/2018 10/12/2017 12/21/2017 -1
11 4/22/2018 9/4/2017 1/15/2018 -1
12 6/11/2018 4/14/2017 8/14/2017 1
12 6/11/2018 10/15/2017 2/14/2018 2
12 6/11/2018 3/17/2018 5/15/2018 3
13 6/7/2018     0
13 6/7/2018     1
13 6/7/2018   2/1/2018 2
14 7/30/2018 9/14/2017 4/30/2018 1
15 8/19/2018     .
16 8/21/2018 6/14/2018 6/15/2018 0
17 9/10/2018 4/8/2018 5/31/2018 0
18 12/12/2018 7/15/2018 11/14/2018 4
19 12/16/2018 3/27/2015 9/13/2018 0
19 12/16/2018 9/14/2018 11/14/2018 1
20 1/14/2019 11/15/2015 1/25/2017 7
21 3/20/2019 8/14/2017 1/14/2018 5
21 3/20/2019     2
22 4/23/2019 4/7/2019 4/8/2019 0
23 7/10/2019     .
1 ACCEPTED SOLUTION

Accepted Solutions
mkeintz
PROC Star

You still haven't answered my question about what you want when the ENTERED DATE is equi-distant from the closest START and closest END.  Editted note: when equi-distant choose the END date an LOTS value.

 

But in ignorance of that objective, below is code that I believe you can use for the objectives as I understand them.

 

The strategy is to check each incoming START and END, keeping progressive track of the closest date value (CLOSEST_DATE) and its distance (CLOSEST_DIST) from ENTERED.  And also keep track of the observation number (variable PTR) containing that date.

 

Then, at the end of each ID, re-reread the observation indicated by PTR, as the sole output record for that ID.  Here's the code:

 

data have;
  input id (entered start end) (:mmddyy8.)  lots;
  format entered start end mmddyy10.;
datalines;
1       12/13/2016            .          .              1
1       12/13/2016      9/21/2016        .              2
2       12/14/2016      10/28/2015      12/28/2015      1
2       12/14/2016      2/14/2016       6/13/2016       2
2       12/14/2016      6/14/2016       11/14/2016      3
3       1/2/2017        10/15/2011      8/14/2015       0
3       1/2/2017        10/23/2015      8/13/2016       1
3       1/2/2017        8/14/2016        .              2
4       1/25/2017       4/14/2016       9/13/2016       1
4       1/25/2017       9/14/2016       12/15/2016      2
5       3/23/2017       11/14/2016      2/14/2017       0
6       4/30/2017       3/24/2016       12/18/2016      1
6       4/30/2017       1/10/2017       4/14/2017       2
7       7/17/2017            .           .              .
8       8/24/2017            .           .              .
9       1/31/2018            .           .              .
10      2/11/2018       10/12/2017      12/21/2017      -1
11      4/22/2018       9/4/2017        1/15/2018       -1
12      6/11/2018       4/14/2017       8/14/2017       1
12      6/11/2018       10/15/2017      2/14/2018       2
12      6/11/2018       3/17/2018       5/15/2018       3
13      6/7/2018             .           .              0
13      6/7/2018             .           .              1
13      6/7/2018             .           2/1/2018       2
14      7/30/2018       9/14/2017       4/30/2018       1
15      8/19/2018            .           .              .
16      8/21/2018       6/14/2018       6/15/2018       0
17      9/10/2018       4/8/2018        5/31/2018       0
18      12/12/2018      7/15/2018       11/14/2018      4
19      12/16/2018      3/27/2015       9/13/2018       0
19      12/16/2018      9/14/2018       11/14/2018      1
20      1/14/2019       11/15/2015      1/25/2017       7
21      3/20/2019       8/14/2017       1/14/2018       5
21      3/20/2019           .            .              2
22      4/23/2019       4/7/2019        4/8/2019        0
23      7/10/2019           .            .              .
run;

data want (drop=closest_: date);
  set have;
  by id;
  if first.id then PTR=_N_;   /** Initialize the pointer for each ID **/

  retain closest_distance closest_date;
  if first.id then call missing(of closest_:);

  do date=start,end;
    if date=. then continue;  ** Skip this iteration if date is missing **;
    if min(abs(entered-date),closest_distance) = abs(entered-date) then do;
      closest_date=date;
      closest_distance=abs(entered-date);
      ptr=_n_;
    end;
  end;

  if last.id;
  set have point=ptr;   ** Re-read the designated observation **;
  length status $9;
  if nmiss(start,end)=2 then status='Unknown'; else
  if lots>4 then status='After 4'; else       /** Revised **/
  status=catx(' ','After',lots);
run;

BTW, your dataset WANT has 2 observations (edited addition: for ID 19), which I presume is in error.  This code produces 1 obs per id.  

 

Comments on changed code:

After your response to the first version of this not, I modified the code above.  I added the statement

 if first.id then PTR=_N_;   /** Initialize the pointer for each ID **/

which prevents carrying forward PTR values when an ID generates an UNKNOWN status (no minimum values found).  And also added the line: 

if lots>4 then status='After 4'; else

Also, if the closest END and closest START are equi-distant from ENTERED, this will chose the last qualifying date encountered.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

9 REPLIES 9
mkeintz
PROC Star

What if you have:

 

ID Entered Start End Lots
25 10/15/2020     0
25 10/15/2020   9/15/2020   1
25 10/15/2020  

11/15/2020

2
--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
newsas007
Quartz | Level 8
The entered date (10/15/2020) is closest/nearest to end date ( 11/15/2020); so therefore it is after 2
mkeintz
PROC Star

And what if the entered date is exactly half-way between START and END?

 

What if you have:

 

ID Entered Start End Lots
25 10/15/2020     0
25 10/15/2020   9/15/2020   1
25 10/15/2020  

11/14/2020

2

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
newsas007
Quartz | Level 8
This scenario where there is a end date greater than entered date doesn't happen. Basically, row #3 will not exist.
ballardw
Super User

@newsas007 wrote:
This scenario where there is a end date greater than entered date doesn't happen. Basically, row #3 will not exist.

I'll say that I hope that is true.

In the data I deal with I have people that "end" a program before they "start" because of data entry errors.

And people with medical tests on dates that will not occur until several years in the future or several years before the testing program began.

 

If you have confirmed your statement, good for you. If you believe that without testing it you might be surprised. "Shouldn't" unfortunately does not mean "doesn't" in real data entry.

newsas007
Quartz | Level 8
@ballardw, That is an excellent point. Just in case this does happen, then it should be after 1.
mkeintz
PROC Star

You still haven't answered my question about what you want when the ENTERED DATE is equi-distant from the closest START and closest END.  Editted note: when equi-distant choose the END date an LOTS value.

 

But in ignorance of that objective, below is code that I believe you can use for the objectives as I understand them.

 

The strategy is to check each incoming START and END, keeping progressive track of the closest date value (CLOSEST_DATE) and its distance (CLOSEST_DIST) from ENTERED.  And also keep track of the observation number (variable PTR) containing that date.

 

Then, at the end of each ID, re-reread the observation indicated by PTR, as the sole output record for that ID.  Here's the code:

 

data have;
  input id (entered start end) (:mmddyy8.)  lots;
  format entered start end mmddyy10.;
datalines;
1       12/13/2016            .          .              1
1       12/13/2016      9/21/2016        .              2
2       12/14/2016      10/28/2015      12/28/2015      1
2       12/14/2016      2/14/2016       6/13/2016       2
2       12/14/2016      6/14/2016       11/14/2016      3
3       1/2/2017        10/15/2011      8/14/2015       0
3       1/2/2017        10/23/2015      8/13/2016       1
3       1/2/2017        8/14/2016        .              2
4       1/25/2017       4/14/2016       9/13/2016       1
4       1/25/2017       9/14/2016       12/15/2016      2
5       3/23/2017       11/14/2016      2/14/2017       0
6       4/30/2017       3/24/2016       12/18/2016      1
6       4/30/2017       1/10/2017       4/14/2017       2
7       7/17/2017            .           .              .
8       8/24/2017            .           .              .
9       1/31/2018            .           .              .
10      2/11/2018       10/12/2017      12/21/2017      -1
11      4/22/2018       9/4/2017        1/15/2018       -1
12      6/11/2018       4/14/2017       8/14/2017       1
12      6/11/2018       10/15/2017      2/14/2018       2
12      6/11/2018       3/17/2018       5/15/2018       3
13      6/7/2018             .           .              0
13      6/7/2018             .           .              1
13      6/7/2018             .           2/1/2018       2
14      7/30/2018       9/14/2017       4/30/2018       1
15      8/19/2018            .           .              .
16      8/21/2018       6/14/2018       6/15/2018       0
17      9/10/2018       4/8/2018        5/31/2018       0
18      12/12/2018      7/15/2018       11/14/2018      4
19      12/16/2018      3/27/2015       9/13/2018       0
19      12/16/2018      9/14/2018       11/14/2018      1
20      1/14/2019       11/15/2015      1/25/2017       7
21      3/20/2019       8/14/2017       1/14/2018       5
21      3/20/2019           .            .              2
22      4/23/2019       4/7/2019        4/8/2019        0
23      7/10/2019           .            .              .
run;

data want (drop=closest_: date);
  set have;
  by id;
  if first.id then PTR=_N_;   /** Initialize the pointer for each ID **/

  retain closest_distance closest_date;
  if first.id then call missing(of closest_:);

  do date=start,end;
    if date=. then continue;  ** Skip this iteration if date is missing **;
    if min(abs(entered-date),closest_distance) = abs(entered-date) then do;
      closest_date=date;
      closest_distance=abs(entered-date);
      ptr=_n_;
    end;
  end;

  if last.id;
  set have point=ptr;   ** Re-read the designated observation **;
  length status $9;
  if nmiss(start,end)=2 then status='Unknown'; else
  if lots>4 then status='After 4'; else       /** Revised **/
  status=catx(' ','After',lots);
run;

BTW, your dataset WANT has 2 observations (edited addition: for ID 19), which I presume is in error.  This code produces 1 obs per id.  

 

Comments on changed code:

After your response to the first version of this not, I modified the code above.  I added the statement

 if first.id then PTR=_N_;   /** Initialize the pointer for each ID **/

which prevents carrying forward PTR values when an ID generates an UNKNOWN status (no minimum values found).  And also added the line: 

if lots>4 then status='After 4'; else

Also, if the closest END and closest START are equi-distant from ENTERED, this will chose the last qualifying date encountered.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
newsas007
Quartz | Level 8
Hi @mkeintz,
Sorry, when ENTERED DATE is equi-distant from the closest START and closest END - then it gets assigned to the end or corresponding END LOTS.
Also, this program works for the most part but the 'Unkowns' are not listed. ID#6 repeats several times as well.
Yes, i would like to have just all Unique IDs in the final data.
Thank you so much,
newsas007
Quartz | Level 8
This works great! Thank you

SAS Innovate 2025: Register Now

Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 9 replies
  • 2481 views
  • 1 like
  • 3 in conversation