BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
Batman
Quartz | Level 8

I'm trying to compare results from a data step merge and full join in sql.   I get a higher number of output obs using sql and I'm not sure why.    It isn't a problem of duplicates, there is just row for each value of "bene_sk" in each input file.

 

83 /*Method 1: merge with data step*/
84 data tbl1;
85 merge lomajor16 lominor16 upmajor16;
86 by bene_sk;
87 run;

NOTE: There were 29625 observations read from the data set WORK.LOMAJOR16.
NOTE: There were 36241 observations read from the data set WORK.LOMINOR16.
NOTE: There were 396 observations read from the data set WORK.UPMAJOR16.
NOTE: The data set WORK.TBL1 has 60904 observations and 4 variables.
NOTE: DATA statement used (Total process time):
real time 1.39 seconds
cpu time 0.16 seconds

88
89 /*Method 2: merge with sql*/
90 proc sql;
91 create table tbl2 as
92 select coalesce(a.bene_sk, b.bene_sk, c.bene_sk) as bene_sk ,a.lomajor, b.lominor, c.upmajor
93 from lomajor16 as a
94 full join lominor16 as b
95 on a.bene_sk=b.bene_sk
96 full join upmajor16 as c
97 on b.bene_sk=c.bene_sk
98 order by bene_sk
99 ;
NOTE: SAS threaded sort was used.
NOTE: Table WORK.TBL2 created, with 60933 rows and 4 columns.

100 quit;

1 ACCEPTED SOLUTION

Accepted Solutions
Tom
Super User Tom
Super User

Your SQL code looks wrong (it is certainly confusing).  What does it even mean to full join in a chain like that?

 

Let's make some example data.

data one;
  input id name1 $30.;
  s1=1;
cards;
1 all three
2 one only
3 one and two
4 one and two and three
;
data two;
  input id name2 $30.;
  s2=1;
cards;
1 all three
3 one and two
5 two only
;
data three;
  input id name3 $30.;
  s3=1;
cards;
1 all three
4 one and two and three
6 three only
;

So when I merge these

data ds_merge;
  merge one two three;
  by id;
run;

there are 6 observations.

Tom_0-1692982003147.png

If I tell SQL to combine two of them and combine the resulting set with the third 

create table sql_join as 
select * 
  from three natural full join (select * from two natural full join one) x 
  order by 1,2
;

I get the same 6 observations

Tom_1-1692982107158.png

But if I leave it up to the PROC to try to combine three datasets will chained full joins 

create table sql_join2 as 
select * 
  from two natural full join three natural full join one 
  order by 1,2
;

I get something different

Tom_2-1692982195736.png

To get it to work properly without the nesting the subsequent joins have to refer back to all of the previous dataset's key values, using OR condition.

proc sql feedback;
create table sql_join as
select coalesce(one.id,two.id,three.id) as id
  ,name1,s1,name2,s2,name3,s3
  from one 
   full join two 
     on one.id=two.id
   full join three
     on three.id = one.id or three.id=two.id
  order by 1
;
quit;

View solution in original post

4 REPLIES 4
PaigeMiller
Diamond | Level 26

MERGE and FULL JOIN will not produce the same output in all cases, and since we don't have your data, it's hard to be more specific. 

--
Paige Miller
ballardw
Super User

Perhaps looking at a much smaller example will allow tracing what happens. The crux is that the joins are sort of sequential, one join then joined on another.

 

data one;
  input a x;
datalines;
1 1
2 2
3 3
;

data two;
  input a y;
datalines;
1 11
3 33
;

data three;
  input a z;
datalines;
2  222
3  333
;


data merged;
   merge one two three;
   by a;
run;

proc sql;
   create table joined as
 select coalesce(a.a, b.a, c.a) as a ,a.x, b.y, c.z
 from one as a
 full join two as b
 on a.a=b.a
 full join three as c
 on b.a=c.a
 order by a
 ;
quit;

Assuming no duplicate values of the BY variable(s) the Merge will result in the number of observations equal to the number of by variable combinations because of the way the other variables on the merged sets behave with matching By values.

Note that the other variables don't behave the same with JOIN, so you get uncombined versions of the observations.

Batman
Quartz | Level 8

ok, I see the problem.   It looks like I'm getting duplicate bene_sk values whenever there is a match on the first and third data sets.   I've printed the first several rows of the duplicate bene_sk cases below.  I can't print that variable in this forum, but each two rows has same value and they alternate between having a value in lomajor or upmajor.   Is there a place I can insert a statement of "on a.bene_sk=c.bene_sk"?

 

 

84 proc sql;
2 The SAS System 09:42 Friday, August 25, 2023

85 create table d as
86 select *
87 from tbl2
88 group by bene_sk
89 having count(*) gt 1;
NOTE: The query requires remerging summary statistics back with the original data.
NOTE: Table WORK.D created, with 58 rows and 4 columns.

90 quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 0.03 seconds
cpu time 0.01 seconds

91
92 data _null_;
93 set d;
94 put lomajor lominor upmajor;
95 run;

1 . .
. . 1
. . 1
1 . .
. . 1
1 . .
. . 1
1 . .
1 . .
. . 1
. . 1
1 . .
. . 1
1 . .

Tom
Super User Tom
Super User

Your SQL code looks wrong (it is certainly confusing).  What does it even mean to full join in a chain like that?

 

Let's make some example data.

data one;
  input id name1 $30.;
  s1=1;
cards;
1 all three
2 one only
3 one and two
4 one and two and three
;
data two;
  input id name2 $30.;
  s2=1;
cards;
1 all three
3 one and two
5 two only
;
data three;
  input id name3 $30.;
  s3=1;
cards;
1 all three
4 one and two and three
6 three only
;

So when I merge these

data ds_merge;
  merge one two three;
  by id;
run;

there are 6 observations.

Tom_0-1692982003147.png

If I tell SQL to combine two of them and combine the resulting set with the third 

create table sql_join as 
select * 
  from three natural full join (select * from two natural full join one) x 
  order by 1,2
;

I get the same 6 observations

Tom_1-1692982107158.png

But if I leave it up to the PROC to try to combine three datasets will chained full joins 

create table sql_join2 as 
select * 
  from two natural full join three natural full join one 
  order by 1,2
;

I get something different

Tom_2-1692982195736.png

To get it to work properly without the nesting the subsequent joins have to refer back to all of the previous dataset's key values, using OR condition.

proc sql feedback;
create table sql_join as
select coalesce(one.id,two.id,three.id) as id
  ,name1,s1,name2,s2,name3,s3
  from one 
   full join two 
     on one.id=two.id
   full join three
     on three.id = one.id or three.id=two.id
  order by 1
;
quit;

sas-innovate-2024.png

Don't miss out on SAS Innovate - Register now for the FREE Livestream!

Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 4 replies
  • 493 views
  • 0 likes
  • 4 in conversation