This is a tricky one, that I have been having issues getting the desired result.
Below I have employee data where there are start dates, termination dates, 2 different review record types, and employee ids.
In the data below, I need to create a unique Id to match each B review with only one A review. The matching criteria are the following:
1) The B review must fall on or between the A review dates (same employee as well).
2) When the B review matches more than one A review, then the B review must match to the A review with the latest termination date. So if it matches 2 or more A reviews, I need it to match the one with the latest
termination date (see my example). This is where I'm running into problems.
3) The A reviews do not always have to match a B review. What is important is that the B reviews are correctly mapped to an A review. In my example, the unmatached A reviews have a null value o the unique Id flag.
4) A unique id flag to show the correct mapping of the B review back to the A review that fulfills criteria 1 and 2 above.
data have;
input employee_id rvw_type $ s_date :yymmdd10. t_Date :yymmdd10.;
format s_date yymmddd10. t_date yymmdd10.;
datalines;
1 A 2019-04-23 2019-04-25
1 A 2019-04-23 2019-04-27
1 A 2019-11-20 2019-11-24
1 A 2019-11-20 2019-11-26
1 B 2019-04-25 2019-04-25
1 B 2019-04-23 2019-04-23
1 B 2019-11-22 2019-11-22
2 A 2019-01-05 2019-01-06
2 A 2019-01-05 2019-01-08
2 B 2019-01-05 2019-01-05
2 B 2019-01-05 2019-01-06
;
run;
data want;
input _employee_id Rvw_type $ s_date :yymmdd10. t_Date :yymmdd10. Unique_match_flag;
format s_date yymmddd10. t_date yymmdd10.;
datalines;
1 A 2019-04-23 2019-04-25 .
1 A 2019-04-23 2019-04-27 1
1 A 2019-11-20 2019-11-24 .
1 A 2019-11-20 2019-11-26 2
1 B 2019-04-25 2019-04-25 1
1 B 2019-04-23 2019-04-23 1
1 B 2019-11-22 2019-11-22 2
2 A 2019-01-05 2019-01-06 .
2 A 2019-01-05 2019-01-08 3
2 B 2019-01-05 2019-01-05 3
2 B 2019-01-05 2019-01-06 3
;
run;
First thing I would do is add a unique identifier to the source data just to make it easier to see if it is working.
Let's call it REVIEW.
data have;
input employee_id rvw_type $ s_date :yymmdd10. t_Date :yymmdd10.;
format s_date yymmddd10. t_date yymmdd10.;
review+1;
datalines;
1 A 2019-04-23 2019-04-25
1 A 2019-04-23 2019-04-27
1 A 2019-11-20 2019-11-24
1 A 2019-11-20 2019-11-26
1 B 2019-04-25 2019-04-25
1 B 2019-04-23 2019-04-23
1 B 2019-11-22 2019-11-22
2 A 2019-01-05 2019-01-06
2 A 2019-01-05 2019-01-08
2 B 2019-01-05 2019-01-05
2 B 2019-01-05 2019-01-06
;
Then match them on whether the dates for B or are between the dates for A.
Order them by your selection criteria.
proc sql;
create table match as
select a.employee_id
, a.review as a_review
, a.s_date as a_s_date
, a.t_Date as a_t_Date
, b.review as b_review
, b.s_date as b_s_date
, b.t_Date as b_t_Date
from (select * from have where Rvw_type='A') a
inner join (select * from have where Rvw_type='B') b
on a.employee_id = b.employee_id
and b.s_date between a.s_date and a.t_date
and b.t_date between a.s_date and a.t_date
order by b.review,a.t_date
;
Now you can limit it to the latest one.
data want;
set match;
by b_review ;
if last.b_review ;
run;
Result
employee_ Obs id a_review a_s_date a_t_Date b_review b_s_date b_t_Date 1 1 2 2019-04-23 2019-04-27 5 2019-04-25 2019-04-25 2 1 2 2019-04-23 2019-04-27 6 2019-04-23 2019-04-23 3 1 4 2019-11-20 2019-11-26 7 2019-11-22 2019-11-22 4 2 9 2019-01-05 2019-01-08 10 2019-01-05 2019-01-05 5 2 9 2019-01-05 2019-01-08 11 2019-01-05 2019-01-06
I think the tricky part is the multiple rows of RVW_TYPE
in this example for EMPLOYEE_ID=1 there are 4 rows for RVW_TYPE="A" and 3 rows for RVW_TYPE="B".
Which Review Type B Date are you trying to match? the Start date or Termination Date?
More Importantly which row in review type B would you choose? Earliest and Latest depends on which variable so latest Start date? or latest End date?
You can run the below code to look at the 2 datasets you are trying to match up. In merges you usually want primary keys with no repeat rows. So think about what you are trying to match. Since the A data seems to be the master data in your example, think about how to make the B data into single row per subject.
data ds_a ds_b;
set have;
if rvw_type="A" then output ds_a;
if rvw_type="B" then output ds_b;
run;
There will be multiple B reviews that can match with a A review. However, sometimes the B reviews map to multiple A reviews. I need them to map in a specific fashion.
What I need though is for the B reviews to match with a A review where the following occurs:
#1) The B reviews start and termination date are within the start and termination date of the A review.
#2) The B review is paired with the A review that has the latest termination date. There can be multiple B reviews mapping to one A review. I just need to be sure that they map to the A review with the "latest" termination date.
I think I have a roundabout solution to this. I will post if if I get it.
So I think you should try to fulfill the #2 requirement first.
data ds_a ds_b;
set have;
if rvw_type="A" then output ds_a;
if rvw_type="B" then output ds_b;
run;
proc sort data=ds_a;
by employee_id t_date;
run;
/*Below dataset will give you one row per ID for the latest or last termination date*/
data ds_a1;
set ds_a;
by employee_id t_date;
if last.empoyee_id;
run;
/* One you have dataset with single row per subject it becomes easier to merge and match*/
data ds_b1;
set ds_b;
/*rename the date variables in b for clarity*/
s_date_b=s_date;
t_date_b=t_date;
drop s_date t_date;
run;
proc sort data=ds_b1; by employee_id;
data match;
merge ds_a1 ds_b1;
by employee_id;
/*fulfill first condition here b date need to be within time window of a date*/
if s_date_b>=s_date and t_date_b<=t_date then unique_match_flag=1;
run;
First thing I would do is add a unique identifier to the source data just to make it easier to see if it is working.
Let's call it REVIEW.
data have;
input employee_id rvw_type $ s_date :yymmdd10. t_Date :yymmdd10.;
format s_date yymmddd10. t_date yymmdd10.;
review+1;
datalines;
1 A 2019-04-23 2019-04-25
1 A 2019-04-23 2019-04-27
1 A 2019-11-20 2019-11-24
1 A 2019-11-20 2019-11-26
1 B 2019-04-25 2019-04-25
1 B 2019-04-23 2019-04-23
1 B 2019-11-22 2019-11-22
2 A 2019-01-05 2019-01-06
2 A 2019-01-05 2019-01-08
2 B 2019-01-05 2019-01-05
2 B 2019-01-05 2019-01-06
;
Then match them on whether the dates for B or are between the dates for A.
Order them by your selection criteria.
proc sql;
create table match as
select a.employee_id
, a.review as a_review
, a.s_date as a_s_date
, a.t_Date as a_t_Date
, b.review as b_review
, b.s_date as b_s_date
, b.t_Date as b_t_Date
from (select * from have where Rvw_type='A') a
inner join (select * from have where Rvw_type='B') b
on a.employee_id = b.employee_id
and b.s_date between a.s_date and a.t_date
and b.t_date between a.s_date and a.t_date
order by b.review,a.t_date
;
Now you can limit it to the latest one.
data want;
set match;
by b_review ;
if last.b_review ;
run;
Result
employee_ Obs id a_review a_s_date a_t_Date b_review b_s_date b_t_Date 1 1 2 2019-04-23 2019-04-27 5 2019-04-25 2019-04-25 2 1 2 2019-04-23 2019-04-27 6 2019-04-23 2019-04-23 3 1 4 2019-11-20 2019-11-26 7 2019-11-22 2019-11-22 4 2 9 2019-01-05 2019-01-08 10 2019-01-05 2019-01-05 5 2 9 2019-01-05 2019-01-08 11 2019-01-05 2019-01-06
I ended up taking a very similar route except using 2 unique IDs for both A and B, but essentially ended up at the same place with the data step at the end as you have proposed (using the LAST statement to get rid of the duplicates from the cartesian product).
The other way I did this (very similar as well), was to just merge the A_termination dates directly onto the B records, and then use the LAST statement to get rid of the cartesian product, and finally create a variable that concatenated the employee Id and last termination date into one.
Thanks again for your help everyone!
Hello @HR_coder123,
Another option is to create a sorted version of dataset HAVE (call it HAVE_S) so that the A-review observation with the latest termination date (among the matching observations) is the first that will be found if the other criteria regarding the dates are checked in a DATA step (reading HAVE_S with a SET statement). Of course, the actual values of the unique IDs (see variable UID below) depend on the sort order.
proc sort data=have out=have_s;
by employee_id rvw_type descending t_date s_date;
run;
data want(drop=_:);
if _n_=1 then do;
dcl hash h(multidata:'y');
h.definekey('rvw_type');
h.definedata('rvw_type','s_date','t_date','uid');
h.definedone();
dcl hiter hi('h');
end;
do until(last.employee_id);
set have_s;
by employee_id;
if rvw_type='A' then h.add();
else if rvw_type='B' then do;
_r=rvw_type;
_s=s_date;
_t=t_date;
_u=.;
h.reset_dup();
do while(h.do_over(key:'A')=0);
if s_date<=_s & _t<=t_date then do;
if uid=. then do;
_c+1;
uid=_c;
h.replacedup();
end;
_u=uid;
leave;
end;
end;
h.add(key:_r,data:_r,data:_s,data:_t,data:_u);
end;
end;
do while(hi.next()=0);
output;
end;
h.clear();
run;
Save $250 on SAS Innovate and get a free advance copy of the new SAS For Dummies book! Use the code "SASforDummies" to register. Don't miss out, May 6-9, in Orlando, Florida.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.