Solved: Create a unique ID for overlapping records with a twist

HR_coder123 · Posted 06-06-2023 09:59 AM

This is a tricky one, that I have been having issues getting the desired result.

Below I have employee data where there are start dates, termination dates, 2 different review record types, and employee ids.

In the data below, I need to create a unique Id to match each B review with only one A review. The matching criteria are the following:
1) The B review must fall on or between the A review dates (same employee as well).
2) When the B review matches more than one A review, then the B review must match to the A review with the latest termination date. So if it matches 2 or more A reviews, I need it to match the one with the latest
termination date (see my example). This is where I'm running into problems.
3) The A reviews do not always have to match a B review. What is important is that the B reviews are correctly mapped to an A review. In my example, the unmatached A reviews have a null value o the unique Id flag.
4) A unique id flag to show the correct mapping of the B review back to the A review that fulfills criteria 1 and 2 above.

data have;
input employee_id rvw_type $ s_date :yymmdd10. t_Date :yymmdd10.;
format s_date yymmddd10. t_date yymmdd10.;
datalines;
1 A 2019-04-23 2019-04-25
1 A 2019-04-23 2019-04-27
1 A 2019-11-20 2019-11-24
1 A 2019-11-20 2019-11-26
1 B 2019-04-25 2019-04-25
1 B 2019-04-23 2019-04-23
1 B 2019-11-22 2019-11-22
2 A 2019-01-05 2019-01-06
2 A 2019-01-05 2019-01-08
2 B 2019-01-05 2019-01-05
2 B 2019-01-05 2019-01-06
;
run;

data want;
input _employee_id Rvw_type $ s_date :yymmdd10. t_Date :yymmdd10. Unique_match_flag;
format s_date yymmddd10. t_date yymmdd10.;
datalines;
1 A 2019-04-23 2019-04-25 .
1 A 2019-04-23 2019-04-27 1
1 A 2019-11-20 2019-11-24 .
1 A 2019-11-20 2019-11-26 2
1 B 2019-04-25 2019-04-25 1
1 B 2019-04-23 2019-04-23 1
1 B 2019-11-22 2019-11-22 2
2 A 2019-01-05 2019-01-06 .
2 A 2019-01-05 2019-01-08 3
2 B 2019-01-05 2019-01-05 3
2 B 2019-01-05 2019-01-06 3
;
run;

Tom · Posted 06-07-2023 12:28 PM

First thing I would do is add a unique identifier to the source data just to make it easier to see if it is working.

Let's call it REVIEW.

data have;
  input employee_id rvw_type $ s_date  :yymmdd10. t_Date :yymmdd10.;
  format s_date yymmddd10. t_date yymmdd10.;
  review+1;
datalines;
1 A 2019-04-23 2019-04-25
1 A 2019-04-23 2019-04-27
1 A 2019-11-20 2019-11-24
1 A 2019-11-20 2019-11-26
1 B 2019-04-25 2019-04-25
1 B 2019-04-23 2019-04-23
1 B 2019-11-22 2019-11-22
2 A 2019-01-05 2019-01-06
2 A 2019-01-05 2019-01-08
2 B 2019-01-05 2019-01-05
2 B 2019-01-05 2019-01-06
;

Then match them on whether the dates for B or are between the dates for A.

Order them by your selection criteria.

proc sql;
create table match as
select a.employee_id
     , a.review as a_review
     , a.s_date as a_s_date
     , a.t_Date as a_t_Date
     , b.review as b_review
     , b.s_date as b_s_date
     , b.t_Date as b_t_Date
from (select * from have where Rvw_type='A') a
inner join (select * from have where Rvw_type='B') b
on a.employee_id = b.employee_id
  and b.s_date between a.s_date and a.t_date
  and b.t_date between a.s_date and a.t_date
order by b.review,a.t_date 
;

Now you can limit it to the latest one.

data want;
 set match;
 by b_review ;
 if last.b_review ;
run;

Result

       employee_
Obs        id       a_review      a_s_date      a_t_Date    b_review      b_s_date      b_t_Date

 1         1            2       2019-04-23    2019-04-27        5       2019-04-25    2019-04-25
 2         1            2       2019-04-23    2019-04-27        6       2019-04-23    2019-04-23
 3         1            4       2019-11-20    2019-11-26        7       2019-11-22    2019-11-22
 4         2            9       2019-01-05    2019-01-08       10       2019-01-05    2019-01-05
 5         2            9       2019-01-05    2019-01-08       11       2019-01-05    2019-01-06

View solution in original post

Seadrago · Posted 06-06-2023 01:43 PM

I think the tricky part is the multiple rows of RVW_TYPE

in this example for EMPLOYEE_ID=1 there are 4 rows for RVW_TYPE="A" and 3 rows for RVW_TYPE="B".

Which Review Type B Date are you trying to match? the Start date or Termination Date?

More Importantly which row in review type B would you choose? Earliest and Latest depends on which variable so latest Start date? or latest End date?

You can run the below code to look at the 2 datasets you are trying to match up. In merges you usually want primary keys with no repeat rows. So think about what you are trying to match. Since the A data seems to be the master data in your example, think about how to make the B data into single row per subject.

data ds_a ds_b;

set have;

if rvw_type="A" then output ds_a;

if rvw_type="B" then output ds_b;

run;

HR_coder123 · Posted 06-06-2023 03:28 PM

There will be multiple B reviews that can match with a A review. However, sometimes the B reviews map to multiple A reviews. I need them to map in a specific fashion.

What I need though is for the B reviews to match with a A review where the following occurs:

#1) The B reviews start and termination date are within the start and termination date of the A review.

#2) The B review is paired with the A review that has the latest termination date. There can be multiple B reviews mapping to one A review. I just need to be sure that they map to the A review with the "latest" termination date.

I think I have a roundabout solution to this. I will post if if I get it.

Seadrago · Posted 06-07-2023 11:09 AM

So I think you should try to fulfill the #2 requirement first.

data ds_a ds_b;

set have;

if rvw_type="A" then output ds_a;

if rvw_type="B" then output ds_b;

run;

proc sort data=ds_a;

by employee_id t_date;

run;

/*Below dataset will give you one row per ID for the latest or last termination date*/

data ds_a1;

set ds_a;

by employee_id t_date;

if last.empoyee_id;

run;

/* One you have dataset with single row per subject it becomes easier to merge and match*/

data ds_b1;

set ds_b;

/*rename the date variables in b for clarity*/

s_date_b=s_date;

t_date_b=t_date;

drop s_date t_date;

run;

proc sort data=ds_b1; by employee_id;

data match;

merge ds_a1 ds_b1;

by employee_id;

/*fulfill first condition here b date need to be within time window of a date*/

if s_date_b>=s_date and t_date_b<=t_date then unique_match_flag=1;

run;

Tom · Posted 06-07-2023 12:28 PM

First thing I would do is add a unique identifier to the source data just to make it easier to see if it is working.

Let's call it REVIEW.

data have;
  input employee_id rvw_type $ s_date  :yymmdd10. t_Date :yymmdd10.;
  format s_date yymmddd10. t_date yymmdd10.;
  review+1;
datalines;
1 A 2019-04-23 2019-04-25
1 A 2019-04-23 2019-04-27
1 A 2019-11-20 2019-11-24
1 A 2019-11-20 2019-11-26
1 B 2019-04-25 2019-04-25
1 B 2019-04-23 2019-04-23
1 B 2019-11-22 2019-11-22
2 A 2019-01-05 2019-01-06
2 A 2019-01-05 2019-01-08
2 B 2019-01-05 2019-01-05
2 B 2019-01-05 2019-01-06
;

Then match them on whether the dates for B or are between the dates for A.

Order them by your selection criteria.

proc sql;
create table match as
select a.employee_id
     , a.review as a_review
     , a.s_date as a_s_date
     , a.t_Date as a_t_Date
     , b.review as b_review
     , b.s_date as b_s_date
     , b.t_Date as b_t_Date
from (select * from have where Rvw_type='A') a
inner join (select * from have where Rvw_type='B') b
on a.employee_id = b.employee_id
  and b.s_date between a.s_date and a.t_date
  and b.t_date between a.s_date and a.t_date
order by b.review,a.t_date 
;

Now you can limit it to the latest one.

data want;
 set match;
 by b_review ;
 if last.b_review ;
run;

Result

       employee_
Obs        id       a_review      a_s_date      a_t_Date    b_review      b_s_date      b_t_Date

 1         1            2       2019-04-23    2019-04-27        5       2019-04-25    2019-04-25
 2         1            2       2019-04-23    2019-04-27        6       2019-04-23    2019-04-23
 3         1            4       2019-11-20    2019-11-26        7       2019-11-22    2019-11-22
 4         2            9       2019-01-05    2019-01-08       10       2019-01-05    2019-01-05
 5         2            9       2019-01-05    2019-01-08       11       2019-01-05    2019-01-06

HR_coder123 · Posted 06-07-2023 05:11 PM

I ended up taking a very similar route except using 2 unique IDs for both A and B, but essentially ended up at the same place with the data step at the end as you have proposed (using the LAST statement to get rid of the duplicates from the cartesian product).

The other way I did this (very similar as well), was to just merge the A_termination dates directly onto the B records, and then use the LAST statement to get rid of the cartesian product, and finally create a variable that concatenated the employee Id and last termination date into one.

Thanks again for your help everyone!

FreelanceReinh · Posted 06-07-2023 03:35 PM

Hello @HR_coder123,

Another option is to create a sorted version of dataset HAVE (call it HAVE_S) so that the A-review observation with the latest termination date (among the matching observations) is the first that will be found if the other criteria regarding the dates are checked in a DATA step (reading HAVE_S with a SET statement). Of course, the actual values of the unique IDs (see variable UID below) depend on the sort order.

proc sort data=have out=have_s;
by employee_id rvw_type descending t_date s_date;
run;

data want(drop=_:);
if _n_=1 then do;
  dcl hash h(multidata:'y');
  h.definekey('rvw_type');
  h.definedata('rvw_type','s_date','t_date','uid');
  h.definedone();
  dcl hiter hi('h');
end;
do until(last.employee_id);
  set have_s;
  by employee_id;
  if rvw_type='A' then h.add();
  else if rvw_type='B' then do;
    _r=rvw_type;
    _s=s_date;
    _t=t_date;
    _u=.;
    h.reset_dup();
    do while(h.do_over(key:'A')=0);
      if s_date<=_s & _t<=t_date then do;
        if uid=. then do;
          _c+1;
          uid=_c;
          h.replacedup();
        end;
        _u=uid;
        leave;
      end;
    end;
    h.add(key:_r,data:_r,data:_s,data:_t,data:_u);
  end;
end;
do while(hi.next()=0);
  output;
end;
h.clear();
run;

Create a unique ID for overlapping records with a twist

Re: Create a unique ID for overlapping records with a twist

Re: Create a unique ID for overlapping records with a twist

Re: Create a unique ID for overlapping records with a twist

Re: Create a unique ID for overlapping records with a twist

Re: Create a unique ID for overlapping records with a twist

Re: Create a unique ID for overlapping records with a twist

Re: Create a unique ID for overlapping records with a twist

Registration is open

SAS Training: Just a Click Away