Obsidian | Level 7

## merge on two possibilities of one conditions

Hello All,

I am trying to merge two datasets. I have

Data A:

id_a     Date_a    Cycle_a

1           2015         2016

1           2016         2016

1           2017         2018

1           2018         2018

2           2015         2016

2           2016         2016

2           2017         2018

3           2013         2014

Data B:

id_b     Date_b

1           2015

1           2016

1           2017

2           2016

2           2017

3           2018

I want

id_a     Date_a    Cycle_a     id_b     Date_b

1           2015         2016        1          2015

1           2016         2016        1          2016

1           2017         2018        1          2017

1           2018         2018        1          2017

2           2015         2016        2          2016

2           2016         2016        2          2016

2           2017         2018        2          2017

I highlighted the problematic dates in red. I am trying to merge using two possibilities of the date condition along with the id. So I was using the following code:

```proc sql; create table want as select *
from Data_a as a
Join Data_b as b
on (a.Cycle_a=b.Date_b or (a.Cycle_a-1)=b.Date_b)
and a.id_a=b.id_b;
quit;```

However, using this code produce duplicates and it is not accurate to just drop duplicates as I may drop good matches.

I would like the program to perform this part first (a.Cycle_a=b.Date_b) and if there is no match it goes the second part ((a.Cycle_a-1)=b.Date_b). If it does not match then drop the observation (e.g. the observation with id =3 in my example above).

Thanks!

1 ACCEPTED SOLUTION

Accepted Solutions
Meteorite | Level 14

## Re: merge on two possibilities of one conditions

Here is another approach, using a double merge if you'd like to try playing with the code.  It's a larger number of steps, but each step is fairly simple, and we may get better results this way.

I'm completely confused as to what exactly the requirements are for this code, so I will turn the code over to you, and you should then adjust the code as necessary.

Jim

``````Data A;
INPUT	ID_A     Date_A    Cycle_A;

Cycle_A_1	=	Cycle_A - 1;
DATALINES;
1           2015         2016
1           2016         2016
1           2017         2018
1           2018         2018
2           2015         2016
2           2016         2016
2           2017         2018
3           2013         2014
;
RUN;

**------------------------------------------------------------------------------**;

PROC	SORT	DATA=A
Out	=A1;
BY	ID_A Cycle_A	Date_A;
RUN;

**------------------------------------------------------------------------------**;

PROC	SORT	DATA=A
Out	=A2;
BY	ID_A Cycle_A_1	Date_A;
RUN;

**------------------------------------------------------------------------------**;

Data B;
INPUT	ID_B     Date_B;

DATALINES;
1           2015
1           2016
1           2017
2           2016
2           2017
3           2018
;
RUN;

**------------------------------------------------------------------------------**;

PROC	SORT	DATA=B;
BY	ID_B Date_B;
RUN;

**------------------------------------------------------------------------------**;

DATA	A1_B;
DROP	_:;

FORMAT	ID Date_A Cycle_A Date_B Cycle_A_1;
LENGTH	Match	\$10;

MERGE	A1	(IN=A	RENAME=(ID_A=ID Cycle_A=Date_B))
B	(IN=B 	RENAME=(ID_B=ID));
BY	ID	Date_B;

IF	A	AND	B;

Cycle_A		=	Date_B;
Match		=	'A B Direct';
RUN;

**------------------------------------------------------------------------------**;

DATA	A2_B;
DROP	_:;

FORMAT	ID Date_A Cycle_A Date_B Cycle_A_1;
LENGTH	Match	\$10;

MERGE	A2	(IN=A	RENAME=(ID_A=ID Cycle_A_1=Date_B))
B	(IN=B 	RENAME=(ID_B=ID));
BY	ID	Date_B;

IF	A	AND	B;

Cycle_A_1	=	Date_B;
Match		=	'A-1 B';
RUN;

**------------------------------------------------------------------------------**;

PROC	SQL	NOPRINT;
CREATE	TABLE	A_B_Combined	AS
SELECT	*	FROM	A1_B
UNION
SELECT	*	FROM	A2_B
;
QUIT;

**------------------------------------------------------------------------------**;
**	Pre-Sort.																	**;
**------------------------------------------------------------------------------**;
PROC	SORT	DATA=A_B_Combined;
BY	_ALL_;
RUN;

**------------------------------------------------------------------------------**;
**	Sort de-dup.																**;
**------------------------------------------------------------------------------**;
PROC	SORT	DATA=A_B_Combined	NODUPKEY
OUT	=A_B_Final;
BY	ID Date_A Cycle_A Date_B;
RUN;
``````
16 REPLIES 16
Meteorite | Level 14

## Re: merge on two possibilities of one conditions

Hmmm.  Well, this is a bit tricky.  I have a program that produces the results you want (see bottom of this post).  I switched from a Join to a Merge because there's more detail level control with a Merge.  The program works on this set of data, but this is a tricky enough problem that I'd like you to really test the code on more data before you use it on anything that you'd be presenting to clients, employers, etc.  I think I've got it, but I might have missed something.  Testing with more data is strongly recommended.

Let me know if you come up with any testing abnormalities, and I'll take a look.

Jim

Here are the results.

And here is the program.  I'm using a modified version of the Hop macro by @LeonidBatkhan to read ahead in the SAS dataset in order to make sure I'm getting all the matches.

``````Data A;
INPUT	ID_A     Date_A    Cycle_A;

_Cycle_A_1	=	Cycle_A - 1;
DATALINES;
1           2015         2016
1           2016         2016
1           2017         2018
1           2018         2018
2           2015         2016
2           2016         2016
2           2017         2018
3           2013         2014
;
RUN;

**------------------------------------------------------------------------------**;

PROC	SORT	DATA=A;
BY	ID_A Cycle_A;
RUN;

**------------------------------------------------------------------------------**;

Data B;
INPUT	ID_B     Date_B;

DATALINES;
1           2015
1           2016
1           2017
2           2016
2           2017
3           2018
;
RUN;

**------------------------------------------------------------------------------**;

PROC	SORT	DATA=B;
BY	ID_B Date_B;
RUN;

**------------------------------------------------------------------------------**;
**	to the next record (or records) or read previous.  Based on the Hop macro	**;
**	by Leonid Batkhan.															**;
**------------------------------------------------------------------------------**;
%MACRO	Get_Var_By_Pointer(d, x, y, j);
_p_	=	_n_ + &j;
IF	(1 <= _p_ <= _o_)	THEN
DO;
_Found		=	1;
SET	&d(KEEP=&x RENAME=(&x=&y)) POINT=_p_ NOBS=_o_;
END;
ELSE
DO;
_Found		=	0;
END;
%MEND	Get_Var_By_Pointer;

**------------------------------------------------------------------------------**;

DATA	A_B;
DROP	_:;

FORMAT	ID Date_A Cycle_A Date_B _Cycle_A_1;

MERGE	A	(IN=A RENAME=(ID_A=ID))
B	(IN=B RENAME=(ID_B=ID));
BY	ID;

IF	A	AND	B;

IF	Cycle_A				=	Date_B			THEN
OUTPUT;
ELSE
IF	_Cycle_A_1			=	Date_B			THEN
OUTPUT;

%Get_Var_By_Pointer(A, ID_A, _ID, +1);
IF	_Found									AND
ID					=	_ID				THEN
DO;
%Get_Var_By_Pointer(A, Cycle_A,		Cycle_A,	+1);
%Get_Var_By_Pointer(A, Date_A,		Date_A,		+1);
IF	_Found							THEN
IF	Date_B	=	Cycle_A			THEN
OUTPUT;
END;

DELETE;
RUN;

**------------------------------------------------------------------------------**;
**	This Sort de-dup is not necessary with the data I have, but just in case.	**;
**------------------------------------------------------------------------------**;
PROC	SORT	DATA=A_B	NODUPKEY;
BY	_ALL_;
RUN;
``````

Tourmaline | Level 20

## Re: merge on two possibilities of one conditions

The red 2017 for ID1 does not match any value.

Meteorite | Level 14

## Re: merge on two possibilities of one conditions

This is a rather odd merge.  @azsham has not one but two criteria for a date match:

1. Date_B = Cycle_A
2. Date_B = (Cycle_A - 1)

I believe the red 2017 matches based on criteria 2, yes?

Jim

Tourmaline | Level 20

## Re: merge on two possibilities of one conditions

Of course! Maybe I should learn to read!

Unless the volumes are large and performance is an issue, I think it is best for ease of maintenance and legibility to split the joins.

``````proc sql;
create table T1 as
select a.*, b.DATE_B
from A left join B
on ID_A=ID_B and CYCLE_A=DATE_B;

create table T2 as
select a.ID_A ,a.Date_A, a.Cycle_A, coalesce(a.DATE_B,b.DATE_B)as DATE_B
from T1 a left join B
on ID_A=ID_B and CYCLE_A-1=b.DATE_B
where calculated DATE_B ne .;

``````

Obsidian | Level 7

## Re: merge on two possibilities of one conditions

I made the following adjustments to your code and it seems to work but I need more time to make sure it isn't missing some.

``````proc sql;
create table T1 as
select a.*, b.DATE_B
from A left  join B
on ID_A=ID_B and Date_A=DATE_B;

create table T2 as
select a.*,b.*, coalesce(a.DATE_B,b.DATE_B)as DATE_B
from T1 a left join B
on ID_A=ID_B and CYCLE_A-1=b.DATE_B
where calculated DATE_B ne .;
quit;``````
Tourmaline | Level 20

## Re: merge on two possibilities of one conditions

You can make T1 a view, or a subquery. It might help performance. Or not.

Obsidian | Level 7

## Re: merge on two possibilities of one conditions

Hello All,

That is right @jimbarbour. The red 2017 matches on criteria 2

Meteorite | Level 14

## Re: merge on two possibilities of one conditions

OK, good.  Then I understood correctly.

Did you get a chance to try the code I posted earlier?  If so, how did it go?

Jim

Obsidian | Level 7

## Re: merge on two possibilities of one conditions

Yes I did and I am trying to understand why it is giving me only about 13,700 observations and not all these observations are matched correctly. I am expecting a number of observations somewhere between 65,000 and 68,000. The code posted by @ChrisNZ is giving about 67,000 observations but it does not correctly match on both criteria.

I cannot post the data because of property rights but I will try to write another reply with some more details that might help.

Meteorite | Level 14

## Re: merge on two possibilities of one conditions

Can you give me some test data of matches that are of the type that my logic is not identifying?

Jim

Obsidian | Level 7

## Re: merge on two possibilities of one conditions

For this specific id (00029) I have date_B 1999 and 2000 from data B. But as you can see it is not picking this direct match. The one highlighted in yellow should be 1999 and the ones marked with red should be 2001.

Earlier, I was trying the following code to get first the easy (or direct) matches where date_A equals date_B.

``````proc sql;
create table A_B as
select *
from
A as a
Join B as b
on
a.Date_A=b.Date_B
and a.id_A=b.id_B;
quit;``````

but then I do not know how to get the ones that does not have direct matches like this. When there is no direct match, then I want to perform the second criteria where I try to find a match within the cycle. Cycles are even numbers (2018, 2016, 2014, 2012, ...etc) each cycle contains two years. For example cycle 2016 contains 2016 and 2015. If Date_A shows 2015 the cycle has to be 2016 and if it shows 2016 the cycle is 2016.

Meteorite | Level 14

## Re: merge on two possibilities of one conditions

I think I'd have to see the source data in order to understand.

Jim

Obsidian | Level 7

## Re: merge on two possibilities of one conditions

Thank you, @jimbarbour ! I really appreciate it. I would be happy to share screen through Microsoft Teams if you'd prefer.

Tourmaline | Level 20

## Re: merge on two possibilities of one conditions

first the easy (or direct) matches where date_A equals date_B

You never said anything about matching these 2 fields before.

Discussion stats
• 16 replies
• 1176 views
• 6 likes
• 3 in conversation