Hello All,
I am trying to merge two datasets. I have
Data A:
id_a Date_a Cycle_a
1 2015 2016
1 2016 2016
1 2017 2018
1 2018 2018
2 2015 2016
2 2016 2016
2 2017 2018
3 2013 2014
Data B:
id_b Date_b
1 2015
1 2016
1 2017
2 2016
2 2017
3 2018
I want
id_a Date_a Cycle_a id_b Date_b
1 2015 2016 1 2015
1 2016 2016 1 2016
1 2017 2018 1 2017
1 2018 2018 1 2017
2 2015 2016 2 2016
2 2016 2016 2 2016
2 2017 2018 2 2017
I highlighted the problematic dates in red. I am trying to merge using two possibilities of the date condition along with the id. So I was using the following code:
proc sql; create table want as select * from Data_a as a Join Data_b as b on (a.Cycle_a=b.Date_b or (a.Cycle_a-1)=b.Date_b) and a.id_a=b.id_b; quit;
However, using this code produce duplicates and it is not accurate to just drop duplicates as I may drop good matches.
I would like the program to perform this part first (a.Cycle_a=b.Date_b) and if there is no match it goes the second part ((a.Cycle_a-1)=b.Date_b). If it does not match then drop the observation (e.g. the observation with id =3 in my example above).
Thanks!
Here is another approach, using a double merge if you'd like to try playing with the code. It's a larger number of steps, but each step is fairly simple, and we may get better results this way.
I'm completely confused as to what exactly the requirements are for this code, so I will turn the code over to you, and you should then adjust the code as necessary.
Jim
Data A;
INPUT ID_A Date_A Cycle_A;
Cycle_A_1 = Cycle_A - 1;
DATALINES;
1 2015 2016
1 2016 2016
1 2017 2018
1 2018 2018
2 2015 2016
2 2016 2016
2 2017 2018
3 2013 2014
;
RUN;
**------------------------------------------------------------------------------**;
PROC SORT DATA=A
Out =A1;
BY ID_A Cycle_A Date_A;
RUN;
**------------------------------------------------------------------------------**;
PROC SORT DATA=A
Out =A2;
BY ID_A Cycle_A_1 Date_A;
RUN;
**------------------------------------------------------------------------------**;
Data B;
INPUT ID_B Date_B;
DATALINES;
1 2015
1 2016
1 2017
2 2016
2 2017
3 2018
;
RUN;
**------------------------------------------------------------------------------**;
PROC SORT DATA=B;
BY ID_B Date_B;
RUN;
**------------------------------------------------------------------------------**;
DATA A1_B;
DROP _:;
FORMAT ID Date_A Cycle_A Date_B Cycle_A_1;
LENGTH Match $10;
MERGE A1 (IN=A RENAME=(ID_A=ID Cycle_A=Date_B))
B (IN=B RENAME=(ID_B=ID));
BY ID Date_B;
IF A AND B;
Cycle_A = Date_B;
Match = 'A B Direct';
RUN;
**------------------------------------------------------------------------------**;
DATA A2_B;
DROP _:;
FORMAT ID Date_A Cycle_A Date_B Cycle_A_1;
LENGTH Match $10;
MERGE A2 (IN=A RENAME=(ID_A=ID Cycle_A_1=Date_B))
B (IN=B RENAME=(ID_B=ID));
BY ID Date_B;
IF A AND B;
Cycle_A_1 = Date_B;
Match = 'A-1 B';
RUN;
**------------------------------------------------------------------------------**;
PROC SQL NOPRINT;
CREATE TABLE A_B_Combined AS
SELECT * FROM A1_B
UNION
SELECT * FROM A2_B
;
QUIT;
**------------------------------------------------------------------------------**;
** Pre-Sort. **;
**------------------------------------------------------------------------------**;
PROC SORT DATA=A_B_Combined;
BY _ALL_;
RUN;
**------------------------------------------------------------------------------**;
** Sort de-dup. **;
**------------------------------------------------------------------------------**;
PROC SORT DATA=A_B_Combined NODUPKEY
OUT =A_B_Final;
BY ID Date_A Cycle_A Date_B;
RUN;
Hmmm. Well, this is a bit tricky. I have a program that produces the results you want (see bottom of this post). I switched from a Join to a Merge because there's more detail level control with a Merge. The program works on this set of data, but this is a tricky enough problem that I'd like you to really test the code on more data before you use it on anything that you'd be presenting to clients, employers, etc. I think I've got it, but I might have missed something. Testing with more data is strongly recommended.
Let me know if you come up with any testing abnormalities, and I'll take a look.
Jim
Here are the results.
And here is the program. I'm using a modified version of the Hop macro by @LeonidBatkhan to read ahead in the SAS dataset in order to make sure I'm getting all the matches.
Data A;
INPUT ID_A Date_A Cycle_A;
_Cycle_A_1 = Cycle_A - 1;
DATALINES;
1 2015 2016
1 2016 2016
1 2017 2018
1 2018 2018
2 2015 2016
2 2016 2016
2 2017 2018
3 2013 2014
;
RUN;
**------------------------------------------------------------------------------**;
PROC SORT DATA=A;
BY ID_A Cycle_A;
RUN;
**------------------------------------------------------------------------------**;
Data B;
INPUT ID_B Date_B;
DATALINES;
1 2015
1 2016
1 2017
2 2016
2 2017
3 2018
;
RUN;
**------------------------------------------------------------------------------**;
PROC SORT DATA=B;
BY ID_B Date_B;
RUN;
**------------------------------------------------------------------------------**;
** Macro Get_Var_By_Pointer does a read based on a pointer and can read ahead **;
** to the next record (or records) or read previous. Based on the Hop macro **;
** by Leonid Batkhan. **;
**------------------------------------------------------------------------------**;
%MACRO Get_Var_By_Pointer(d, x, y, j);
_p_ = _n_ + &j;
IF (1 <= _p_ <= _o_) THEN
DO;
_Found = 1;
SET &d(KEEP=&x RENAME=(&x=&y)) POINT=_p_ NOBS=_o_;
END;
ELSE
DO;
_Found = 0;
END;
%MEND Get_Var_By_Pointer;
**------------------------------------------------------------------------------**;
DATA A_B;
DROP _:;
FORMAT ID Date_A Cycle_A Date_B _Cycle_A_1;
MERGE A (IN=A RENAME=(ID_A=ID))
B (IN=B RENAME=(ID_B=ID));
BY ID;
IF A AND B;
IF Cycle_A = Date_B THEN
OUTPUT;
ELSE
IF _Cycle_A_1 = Date_B THEN
OUTPUT;
%Get_Var_By_Pointer(A, ID_A, _ID, +1);
IF _Found AND
ID = _ID THEN
DO;
%Get_Var_By_Pointer(A, Cycle_A, Cycle_A, +1);
%Get_Var_By_Pointer(A, Date_A, Date_A, +1);
IF _Found THEN
IF Date_B = Cycle_A THEN
OUTPUT;
END;
DELETE;
RUN;
**------------------------------------------------------------------------------**;
** This Sort de-dup is not necessary with the data I have, but just in case. **;
**------------------------------------------------------------------------------**;
PROC SORT DATA=A_B NODUPKEY;
BY _ALL_;
RUN;
The red 2017 for ID1 does not match any value.
Of course! Maybe I should learn to read!
Unless the volumes are large and performance is an issue, I think it is best for ease of maintenance and legibility to split the joins.
proc sql;
create table T1 as
select a.*, b.DATE_B
from A left join B
on ID_A=ID_B and CYCLE_A=DATE_B;
create table T2 as
select a.ID_A ,a.Date_A, a.Cycle_A, coalesce(a.DATE_B,b.DATE_B)as DATE_B
from T1 a left join B
on ID_A=ID_B and CYCLE_A-1=b.DATE_B
where calculated DATE_B ne .;
I made the following adjustments to your code and it seems to work but I need more time to make sure it isn't missing some.
proc sql;
create table T1 as
select a.*, b.DATE_B
from A left join B
on ID_A=ID_B and Date_A=DATE_B;
create table T2 as
select a.*,b.*, coalesce(a.DATE_B,b.DATE_B)as DATE_B
from T1 a left join B
on ID_A=ID_B and CYCLE_A-1=b.DATE_B
where calculated DATE_B ne .;
quit;
You can make T1 a view, or a subquery. It might help performance. Or not.
OK, good. Then I understood correctly.
Did you get a chance to try the code I posted earlier? If so, how did it go?
Jim
Yes I did and I am trying to understand why it is giving me only about 13,700 observations and not all these observations are matched correctly. I am expecting a number of observations somewhere between 65,000 and 68,000. The code posted by @ChrisNZ is giving about 67,000 observations but it does not correctly match on both criteria.
I cannot post the data because of property rights but I will try to write another reply with some more details that might help.
Can you give me some test data of matches that are of the type that my logic is not identifying?
Jim
For this specific id (00029) I have date_B 1999 and 2000 from data B. But as you can see it is not picking this direct match. The one highlighted in yellow should be 1999 and the ones marked with red should be 2001.
Earlier, I was trying the following code to get first the easy (or direct) matches where date_A equals date_B.
proc sql;
create table A_B as
select *
from
A as a
Join B as b
on
a.Date_A=b.Date_B
and a.id_A=b.id_B;
quit;
but then I do not know how to get the ones that does not have direct matches like this. When there is no direct match, then I want to perform the second criteria where I try to find a match within the cycle. Cycles are even numbers (2018, 2016, 2014, 2012, ...etc) each cycle contains two years. For example cycle 2016 contains 2016 and 2015. If Date_A shows 2015 the cycle has to be 2016 and if it shows 2016 the cycle is 2016.
I think I'd have to see the source data in order to understand.
Jim
Thank you, @jimbarbour ! I really appreciate it. I would be happy to share screen through Microsoft Teams if you'd prefer.
first the easy (or direct) matches where date_A equals date_B
You never said anything about matching these 2 fields before.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.