BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
azsham
Obsidian | Level 7

Hello All,

I am trying to merge two datasets. I have

Data A:

id_a     Date_a    Cycle_a

1           2015         2016

1           2016         2016

1           2017         2018

1           2018         2018

2           2015         2016

2           2016         2016

2           2017         2018

3           2013         2014

 

Data B:

id_b     Date_b 

1           2015

1           2016

1           2017    

2           2016    

2           2017    

3           2018    

 

I want

id_a     Date_a    Cycle_a     id_b     Date_b 

1           2015         2016        1          2015

1           2016         2016        1          2016

1           2017         2018        1          2017

1           2018         2018        1          2017

2           2015         2016        2          2016

2           2016         2016        2          2016

2           2017         2018        2          2017

 

I highlighted the problematic dates in red. I am trying to merge using two possibilities of the date condition along with the id. So I was using the following code:

proc sql; create table want as select *
from Data_a as a
Join Data_b as b 
on (a.Cycle_a=b.Date_b or (a.Cycle_a-1)=b.Date_b)
and a.id_a=b.id_b;
quit;

However, using this code produce duplicates and it is not accurate to just drop duplicates as I may drop good matches.

I would like the program to perform this part first (a.Cycle_a=b.Date_b) and if there is no match it goes the second part ((a.Cycle_a-1)=b.Date_b). If it does not match then drop the observation (e.g. the observation with id =3 in my example above).

 

 

Thanks!

1 ACCEPTED SOLUTION

Accepted Solutions
jimbarbour
Meteorite | Level 14

@azsham,

 

Here is another approach, using a double merge if you'd like to try playing with the code.  It's a larger number of steps, but each step is fairly simple, and we may get better results this way.

 

I'm completely confused as to what exactly the requirements are for this code, so I will turn the code over to you, and you should then adjust the code as necessary.

 

Jim

 

Data A;
	INPUT	ID_A     Date_A    Cycle_A;

	Cycle_A_1	=	Cycle_A - 1;
DATALINES;
1           2015         2016
1           2016         2016
1           2017         2018
1           2018         2018
2           2015         2016
2           2016         2016
2           2017         2018
3           2013         2014
;
RUN;
 
**------------------------------------------------------------------------------**;

PROC	SORT	DATA=A
				Out	=A1;
	BY	ID_A Cycle_A	Date_A;
RUN;

**------------------------------------------------------------------------------**;

PROC	SORT	DATA=A
				Out	=A2;
	BY	ID_A Cycle_A_1	Date_A;
RUN;

**------------------------------------------------------------------------------**;

Data B;
	INPUT	ID_B     Date_B;

DATALINES;
1           2015
1           2016
1           2017
2           2016
2           2017
3           2018
;
RUN;

**------------------------------------------------------------------------------**;

PROC	SORT	DATA=B;
	BY	ID_B Date_B;
RUN;

**------------------------------------------------------------------------------**;

DATA	A1_B;
	DROP	_:;

	FORMAT	ID Date_A Cycle_A Date_B Cycle_A_1;
	LENGTH	Match	$10;

	MERGE	A1	(IN=A	RENAME=(ID_A=ID Cycle_A=Date_B))
			B	(IN=B 	RENAME=(ID_B=ID));
		BY	ID	Date_B;

	IF	A	AND	B;

	Cycle_A		=	Date_B;
	Match		=	'A B Direct';
RUN;

**------------------------------------------------------------------------------**;

DATA	A2_B;
	DROP	_:;

	FORMAT	ID Date_A Cycle_A Date_B Cycle_A_1;
	LENGTH	Match	$10;

	MERGE	A2	(IN=A	RENAME=(ID_A=ID Cycle_A_1=Date_B))
			B	(IN=B 	RENAME=(ID_B=ID));
		BY	ID	Date_B;

	IF	A	AND	B;

	Cycle_A_1	=	Date_B;
	Match		=	'A-1 B';
RUN;

**------------------------------------------------------------------------------**;

PROC	SQL	NOPRINT;
	CREATE	TABLE	A_B_Combined	AS
		SELECT	*	FROM	A1_B
			UNION
		SELECT	*	FROM	A2_B
		;
QUIT;

**------------------------------------------------------------------------------**;
**	Pre-Sort.																	**;
**------------------------------------------------------------------------------**;
PROC	SORT	DATA=A_B_Combined;
	BY	_ALL_;
RUN;

**------------------------------------------------------------------------------**;
**	Sort de-dup.																**;
**------------------------------------------------------------------------------**;
PROC	SORT	DATA=A_B_Combined	NODUPKEY
				OUT	=A_B_Final;
	BY	ID Date_A Cycle_A Date_B;
RUN;

View solution in original post

16 REPLIES 16
jimbarbour
Meteorite | Level 14

Hmmm.  Well, this is a bit tricky.  I have a program that produces the results you want (see bottom of this post).  I switched from a Join to a Merge because there's more detail level control with a Merge.  The program works on this set of data, but this is a tricky enough problem that I'd like you to really test the code on more data before you use it on anything that you'd be presenting to clients, employers, etc.  I think I've got it, but I might have missed something.  Testing with more data is strongly recommended.

 

Let me know if you come up with any testing abnormalities, and I'll take a look.

 

Jim

 

Here are the results.

jimbarbour_0-1602105311016.png

 

And here is the program.  I'm using a modified version of the Hop macro by @LeonidBatkhan to read ahead in the SAS dataset in order to make sure I'm getting all the matches.

Data A;
	INPUT	ID_A     Date_A    Cycle_A;

	_Cycle_A_1	=	Cycle_A - 1;
DATALINES;
1           2015         2016
1           2016         2016
1           2017         2018
1           2018         2018
2           2015         2016
2           2016         2016
2           2017         2018
3           2013         2014
;
RUN;
 
**------------------------------------------------------------------------------**;

PROC	SORT	DATA=A;
	BY	ID_A Cycle_A;
RUN;

**------------------------------------------------------------------------------**;

Data B;
	INPUT	ID_B     Date_B;

DATALINES;
1           2015
1           2016
1           2017
2           2016
2           2017
3           2018
;
RUN;

**------------------------------------------------------------------------------**;

PROC	SORT	DATA=B;
	BY	ID_B Date_B;
RUN;

**------------------------------------------------------------------------------**;
**	Macro Get_Var_By_Pointer does a read based on a pointer and can read ahead	**;
**	to the next record (or records) or read previous.  Based on the Hop macro	**;
**	by Leonid Batkhan.															**;
**------------------------------------------------------------------------------**;
%MACRO	Get_Var_By_Pointer(d, x, y, j);
	_p_	=	_n_ + &j;
	IF	(1 <= _p_ <= _o_)	THEN
		DO;
			_Found		=	1;
			SET	&d(KEEP=&x RENAME=(&x=&y)) POINT=_p_ NOBS=_o_;
		END;
	ELSE
		DO;
			_Found		=	0;
		END;
%MEND	Get_Var_By_Pointer;

**------------------------------------------------------------------------------**;

DATA	A_B;
	DROP	_:;

	FORMAT	ID Date_A Cycle_A Date_B _Cycle_A_1;

	MERGE	A	(IN=A RENAME=(ID_A=ID))
			B	(IN=B RENAME=(ID_B=ID));
		BY	ID;

	IF	A	AND	B;

	IF	Cycle_A				=	Date_B			THEN
		OUTPUT;
	ELSE
	IF	_Cycle_A_1			=	Date_B			THEN
		OUTPUT;

	%Get_Var_By_Pointer(A, ID_A, _ID, +1);
	IF	_Found									AND
		ID					=	_ID				THEN
		DO;
			%Get_Var_By_Pointer(A, Cycle_A,		Cycle_A,	+1);
			%Get_Var_By_Pointer(A, Date_A,		Date_A,		+1);
			IF	_Found							THEN
				IF	Date_B	=	Cycle_A			THEN
					OUTPUT;
		END;

	DELETE;
RUN;

**------------------------------------------------------------------------------**;
**	This Sort de-dup is not necessary with the data I have, but just in case.	**;
**------------------------------------------------------------------------------**;
PROC	SORT	DATA=A_B	NODUPKEY;
	BY	_ALL_;
RUN;

 

ChrisNZ
Tourmaline | Level 20

The red 2017 for ID1 does not match any value.

jimbarbour
Meteorite | Level 14

@ChrisNZ.

 

This is a rather odd merge.  @azsham has not one but two criteria for a date match:

  1. Date_B = Cycle_A
  2. Date_B = (Cycle_A - 1)

I believe the red 2017 matches based on criteria 2, yes?  

 

Jim

jimbarbour_0-1602118918921.png

 

ChrisNZ
Tourmaline | Level 20

Of course! Maybe I should learn to read!

Unless the volumes are large and performance is an issue, I think it is best for ease of maintenance and legibility to split the joins.

proc sql;
  create table T1 as
  select a.*, b.DATE_B 
  from A left join B
  on ID_A=ID_B and CYCLE_A=DATE_B; 

  create table T2 as
  select a.ID_A ,a.Date_A, a.Cycle_A, coalesce(a.DATE_B,b.DATE_B)as DATE_B
  from T1 a left join B
  on ID_A=ID_B and CYCLE_A-1=b.DATE_B
  where calculated DATE_B ne .; 

 

azsham
Obsidian | Level 7

@ChrisNZ 

I made the following adjustments to your code and it seems to work but I need more time to make sure it isn't missing some.

 

proc sql;
  create table T1 as
  select a.*, b.DATE_B 
  from A left  join B
  on ID_A=ID_B and Date_A=DATE_B; 

  create table T2 as
  select a.*,b.*, coalesce(a.DATE_B,b.DATE_B)as DATE_B
  from T1 a left join B
  on ID_A=ID_B and CYCLE_A-1=b.DATE_B
  where calculated DATE_B ne .; 
quit;
ChrisNZ
Tourmaline | Level 20

You can make T1 a view, or a subquery. It might help performance. Or not.

azsham
Obsidian | Level 7

Hello All,

 

That is right @jimbarbour. The red 2017 matches on criteria 2

 

 

jimbarbour
Meteorite | Level 14

@azsham,

 

OK, good.  Then I understood correctly.

 

Did you get a chance to try the code I posted earlier?  If so, how did it go?

 

Jim

azsham
Obsidian | Level 7

Yes I did and I am trying to understand why it is giving me only about 13,700 observations and not all these observations are matched correctly. I am expecting a number of observations somewhere between 65,000 and 68,000. The code posted by @ChrisNZ is giving about 67,000 observations but it does not correctly match on both criteria.

 

I cannot post the data because of property rights but I will try to write another reply with some more details that might help.

jimbarbour
Meteorite | Level 14

Can you give me some test data of matches that are of the type that my logic is not identifying?

 

Jim

azsham
Obsidian | Level 7

00029.JPG

 

For this specific id (00029) I have date_B 1999 and 2000 from data B. But as you can see it is not picking this direct match. The one highlighted in yellow should be 1999 and the ones marked with red should be 2001.

 

Earlier, I was trying the following code to get first the easy (or direct) matches where date_A equals date_B.

proc sql;
create table A_B as
select *
from
A as a
Join B as b 
on
a.Date_A=b.Date_B
and a.id_A=b.id_B;
quit;

but then I do not know how to get the ones that does not have direct matches like this. When there is no direct match, then I want to perform the second criteria where I try to find a match within the cycle. Cycles are even numbers (2018, 2016, 2014, 2012, ...etc) each cycle contains two years. For example cycle 2016 contains 2016 and 2015. If Date_A shows 2015 the cycle has to be 2016 and if it shows 2016 the cycle is 2016.

jimbarbour
Meteorite | Level 14

I think I'd have to see the source data in order to understand.

 

Jim

azsham
Obsidian | Level 7

Thank you, @jimbarbour ! I really appreciate it. I would be happy to share screen through Microsoft Teams if you'd prefer.

ChrisNZ
Tourmaline | Level 20

 first the easy (or direct) matches where date_A equals date_B

You never said anything about matching these 2 fields before.

 

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 16. Read more here about why you should contribute and what is in it for you!

Submit your idea!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 16 replies
  • 1587 views
  • 6 likes
  • 3 in conversation