Solved: Re: merge on two possibilities of one conditions

azsham · Posted 10-07-2020 10:12 AM

Hello All,

I am trying to merge two datasets. I have

Data A:

id_a Date_a Cycle_a

1 2015 2016

1 2016 2016

1 2017 2018

1 2018 2018

2 2015 2016

2 2016 2016

2 2017 2018

3 2013 2014

Data B:

id_b Date_b

1 2015

1 2016

1 2017

2 2016

2 2017

3 2018

I want

id_a Date_a Cycle_a id_b Date_b

1 2015 2016 1 2015

1 2016 2016 1 2016

1 2017 2018 1 2017

1 2018 2018 1 2017

2 2015 2016 2 2016

2 2016 2016 2 2016

2 2017 2018 2 2017

I highlighted the problematic dates in red. I am trying to merge using two possibilities of the date condition along with the id. So I was using the following code:

proc sql; create table want as select *
from Data_a as a
Join Data_b as b 
on (a.Cycle_a=b.Date_b or (a.Cycle_a-1)=b.Date_b)
and a.id_a=b.id_b;
quit;

However, using this code produce duplicates and it is not accurate to just drop duplicates as I may drop good matches.

I would like the program to perform this part first (a.Cycle_a=b.Date_b) and if there is no match it goes the second part ((a.Cycle_a-1)=b.Date_b). If it does not match then drop the observation (e.g. the observation with id =3 in my example above).

Thanks!

jimbarbour · Posted 10-08-2020 01:23 AM

@azsham,

Here is another approach, using a double merge if you'd like to try playing with the code. It's a larger number of steps, but each step is fairly simple, and we may get better results this way.

I'm completely confused as to what exactly the requirements are for this code, so I will turn the code over to you, and you should then adjust the code as necessary.

Jim

Data A;
	INPUT	ID_A     Date_A    Cycle_A;

	Cycle_A_1	=	Cycle_A - 1;
DATALINES;
1           2015         2016
1           2016         2016
1           2017         2018
1           2018         2018
2           2015         2016
2           2016         2016
2           2017         2018
3           2013         2014
;
RUN;
 
**------------------------------------------------------------------------------**;

PROC	SORT	DATA=A
				Out	=A1;
	BY	ID_A Cycle_A	Date_A;
RUN;

**------------------------------------------------------------------------------**;

PROC	SORT	DATA=A
				Out	=A2;
	BY	ID_A Cycle_A_1	Date_A;
RUN;

**------------------------------------------------------------------------------**;

Data B;
	INPUT	ID_B     Date_B;

DATALINES;
1           2015
1           2016
1           2017
2           2016
2           2017
3           2018
;
RUN;

**------------------------------------------------------------------------------**;

PROC	SORT	DATA=B;
	BY	ID_B Date_B;
RUN;

**------------------------------------------------------------------------------**;

DATA	A1_B;
	DROP	_:;

	FORMAT	ID Date_A Cycle_A Date_B Cycle_A_1;
	LENGTH	Match	$10;

	MERGE	A1	(IN=A	RENAME=(ID_A=ID Cycle_A=Date_B))
			B	(IN=B 	RENAME=(ID_B=ID));
		BY	ID	Date_B;

	IF	A	AND	B;

	Cycle_A		=	Date_B;
	Match		=	'A B Direct';
RUN;

**------------------------------------------------------------------------------**;

DATA	A2_B;
	DROP	_:;

	FORMAT	ID Date_A Cycle_A Date_B Cycle_A_1;
	LENGTH	Match	$10;

	MERGE	A2	(IN=A	RENAME=(ID_A=ID Cycle_A_1=Date_B))
			B	(IN=B 	RENAME=(ID_B=ID));
		BY	ID	Date_B;

	IF	A	AND	B;

	Cycle_A_1	=	Date_B;
	Match		=	'A-1 B';
RUN;

**------------------------------------------------------------------------------**;

PROC	SQL	NOPRINT;
	CREATE	TABLE	A_B_Combined	AS
		SELECT	*	FROM	A1_B
			UNION
		SELECT	*	FROM	A2_B
		;
QUIT;

**------------------------------------------------------------------------------**;
**	Pre-Sort.																	**;
**------------------------------------------------------------------------------**;
PROC	SORT	DATA=A_B_Combined;
	BY	_ALL_;
RUN;

**------------------------------------------------------------------------------**;
**	Sort de-dup.																**;
**------------------------------------------------------------------------------**;
PROC	SORT	DATA=A_B_Combined	NODUPKEY
				OUT	=A_B_Final;
	BY	ID Date_A Cycle_A Date_B;
RUN;

View solution in original post

jimbarbour · Posted 10-07-2020 05:23 PM

Hmmm. Well, this is a bit tricky. I have a program that produces the results you want (see bottom of this post). I switched from a Join to a Merge because there's more detail level control with a Merge. The program works on this set of data, but this is a tricky enough problem that I'd like you to really test the code on more data before you use it on anything that you'd be presenting to clients, employers, etc. I think I've got it, but I might have missed something. Testing with more data is strongly recommended.

Let me know if you come up with any testing abnormalities, and I'll take a look.

Jim

Here are the results.

And here is the program. I'm using a modified version of the Hop macro by @LeonidBatkhan to read ahead in the SAS dataset in order to make sure I'm getting all the matches.

Data A;
	INPUT	ID_A     Date_A    Cycle_A;

	_Cycle_A_1	=	Cycle_A - 1;
DATALINES;
1           2015         2016
1           2016         2016
1           2017         2018
1           2018         2018
2           2015         2016
2           2016         2016
2           2017         2018
3           2013         2014
;
RUN;
 
**------------------------------------------------------------------------------**;

PROC	SORT	DATA=A;
	BY	ID_A Cycle_A;
RUN;

**------------------------------------------------------------------------------**;

Data B;
	INPUT	ID_B     Date_B;

DATALINES;
1           2015
1           2016
1           2017
2           2016
2           2017
3           2018
;
RUN;

**------------------------------------------------------------------------------**;

PROC	SORT	DATA=B;
	BY	ID_B Date_B;
RUN;

**------------------------------------------------------------------------------**;
**	Macro Get_Var_By_Pointer does a read based on a pointer and can read ahead	**;
**	to the next record (or records) or read previous.  Based on the Hop macro	**;
**	by Leonid Batkhan.															**;
**------------------------------------------------------------------------------**;
%MACRO	Get_Var_By_Pointer(d, x, y, j);
	_p_	=	_n_ + &j;
	IF	(1 <= _p_ <= _o_)	THEN
		DO;
			_Found		=	1;
			SET	&d(KEEP=&x RENAME=(&x=&y)) POINT=_p_ NOBS=_o_;
		END;
	ELSE
		DO;
			_Found		=	0;
		END;
%MEND	Get_Var_By_Pointer;

**------------------------------------------------------------------------------**;

DATA	A_B;
	DROP	_:;

	FORMAT	ID Date_A Cycle_A Date_B _Cycle_A_1;

	MERGE	A	(IN=A RENAME=(ID_A=ID))
			B	(IN=B RENAME=(ID_B=ID));
		BY	ID;

	IF	A	AND	B;

	IF	Cycle_A				=	Date_B			THEN
		OUTPUT;
	ELSE
	IF	_Cycle_A_1			=	Date_B			THEN
		OUTPUT;

	%Get_Var_By_Pointer(A, ID_A, _ID, +1);
	IF	_Found									AND
		ID					=	_ID				THEN
		DO;
			%Get_Var_By_Pointer(A, Cycle_A,		Cycle_A,	+1);
			%Get_Var_By_Pointer(A, Date_A,		Date_A,		+1);
			IF	_Found							THEN
				IF	Date_B	=	Cycle_A			THEN
					OUTPUT;
		END;

	DELETE;
RUN;

**------------------------------------------------------------------------------**;
**	This Sort de-dup is not necessary with the data I have, but just in case.	**;
**------------------------------------------------------------------------------**;
PROC	SORT	DATA=A_B	NODUPKEY;
	BY	_ALL_;
RUN;

ChrisNZ · Posted 10-07-2020 08:52 PM

The red 2017 for ID1 does not match any value.

High-Performance SAS Coding - Third Edition

jimbarbour · Posted 10-07-2020 09:02 PM

@ChrisNZ.

This is a rather odd merge. @azsham has not one but two criteria for a date match:

Date_B = Cycle_A
Date_B = (Cycle_A - 1)

I believe the red 2017 matches based on criteria 2, yes?

Jim

ChrisNZ · Posted 10-07-2020 09:44 PM

Of course! Maybe I should learn to read!

Unless the volumes are large and performance is an issue, I think it is best for ease of maintenance and legibility to split the joins.

proc sql;
  create table T1 as
  select a.*, b.DATE_B 
  from A left join B
  on ID_A=ID_B and CYCLE_A=DATE_B; 

  create table T2 as
  select a.ID_A ,a.Date_A, a.Cycle_A, coalesce(a.DATE_B,b.DATE_B)as DATE_B
  from T1 a left join B
  on ID_A=ID_B and CYCLE_A-1=b.DATE_B
  where calculated DATE_B ne .;

High-Performance SAS Coding - Third Edition

azsham · Posted 10-07-2020 11:07 PM

@ChrisNZ

I made the following adjustments to your code and it seems to work but I need more time to make sure it isn't missing some.

proc sql;
  create table T1 as
  select a.*, b.DATE_B 
  from A left  join B
  on ID_A=ID_B and Date_A=DATE_B; 

  create table T2 as
  select a.*,b.*, coalesce(a.DATE_B,b.DATE_B)as DATE_B
  from T1 a left join B
  on ID_A=ID_B and CYCLE_A-1=b.DATE_B
  where calculated DATE_B ne .; 
quit;

ChrisNZ · Posted 10-07-2020 11:21 PM

You can make T1 a view, or a subquery. It might help performance. Or not.

High-Performance SAS Coding - Third Edition

azsham · Posted 10-07-2020 09:46 PM

Hello All,

That is right @jimbarbour. The red 2017 matches on criteria 2

jimbarbour · Posted 10-07-2020 09:51 PM

@azsham,

OK, good. Then I understood correctly.

Did you get a chance to try the code I posted earlier? If so, how did it go?

Jim

azsham · Posted 10-07-2020 10:02 PM

Yes I did and I am trying to understand why it is giving me only about 13,700 observations and not all these observations are matched correctly. I am expecting a number of observations somewhere between 65,000 and 68,000. The code posted by @ChrisNZ is giving about 67,000 observations but it does not correctly match on both criteria.

I cannot post the data because of property rights but I will try to write another reply with some more details that might help.

jimbarbour · Posted 10-07-2020 10:13 PM

Can you give me some test data of matches that are of the type that my logic is not identifying?

Jim

azsham · Posted 10-07-2020 10:36 PM

For this specific id (00029) I have date_B 1999 and 2000 from data B. But as you can see it is not picking this direct match. The one highlighted in yellow should be 1999 and the ones marked with red should be 2001.

Earlier, I was trying the following code to get first the easy (or direct) matches where date_A equals date_B.

proc sql;
create table A_B as
select *
from
A as a
Join B as b 
on
a.Date_A=b.Date_B
and a.id_A=b.id_B;
quit;

but then I do not know how to get the ones that does not have direct matches like this. When there is no direct match, then I want to perform the second criteria where I try to find a match within the cycle. Cycles are even numbers (2018, 2016, 2014, 2012, ...etc) each cycle contains two years. For example cycle 2016 contains 2016 and 2015. If Date_A shows 2015 the cycle has to be 2016 and if it shows 2016 the cycle is 2016.

jimbarbour · Posted 10-07-2020 10:52 PM

I think I'd have to see the source data in order to understand.

Jim

azsham · Posted 10-07-2020 11:05 PM

Thank you, @jimbarbour ! I really appreciate it. I would be happy to share screen through Microsoft Teams if you'd prefer.

ChrisNZ · Posted 10-07-2020 11:06 PM

first the easy (or direct) matches where date_A equals date_B

You never said anything about matching these 2 fields before.

High-Performance SAS Coding - Third Edition

SAS Innovate 2025: Call for Content

Classroom Training Available!