Solved: Merge vs sql full join

Batman · Posted 08-25-2023 11:40 AM

I'm trying to compare results from a data step merge and full join in sql. I get a higher number of output obs using sql and I'm not sure why. It isn't a problem of duplicates, there is just row for each value of "bene_sk" in each input file.

83 /*Method 1: merge with data step*/
84 data tbl1;
85 merge lomajor16 lominor16 upmajor16;
86 by bene_sk;
87 run;

NOTE: There were 29625 observations read from the data set WORK.LOMAJOR16.
NOTE: There were 36241 observations read from the data set WORK.LOMINOR16.
NOTE: There were 396 observations read from the data set WORK.UPMAJOR16.
NOTE: The data set WORK.TBL1 has 60904 observations and 4 variables.
NOTE: DATA statement used (Total process time):
real time 1.39 seconds
cpu time 0.16 seconds

88
89 /*Method 2: merge with sql*/
90 proc sql;
91 create table tbl2 as
92 select coalesce(a.bene_sk, b.bene_sk, c.bene_sk) as bene_sk ,a.lomajor, b.lominor, c.upmajor
93 from lomajor16 as a
94 full join lominor16 as b
95 on a.bene_sk=b.bene_sk
96 full join upmajor16 as c
97 on b.bene_sk=c.bene_sk
98 order by bene_sk
99 ;
NOTE: SAS threaded sort was used.
NOTE: Table WORK.TBL2 created, with 60933 rows and 4 columns.

100 quit;

Tom · Posted 08-25-2023 12:50 PM

Your SQL code looks wrong (it is certainly confusing). What does it even mean to full join in a chain like that?

Let's make some example data.

data one;
  input id name1 $30.;
  s1=1;
cards;
1 all three
2 one only
3 one and two
4 one and two and three
;
data two;
  input id name2 $30.;
  s2=1;
cards;
1 all three
3 one and two
5 two only
;
data three;
  input id name3 $30.;
  s3=1;
cards;
1 all three
4 one and two and three
6 three only
;

So when I merge these

data ds_merge;
  merge one two three;
  by id;
run;

there are 6 observations.

If I tell SQL to combine two of them and combine the resulting set with the third

create table sql_join as 
select * 
  from three natural full join (select * from two natural full join one) x 
  order by 1,2
;

I get the same 6 observations

But if I leave it up to the PROC to try to combine three datasets will chained full joins

create table sql_join2 as 
select * 
  from two natural full join three natural full join one 
  order by 1,2
;

I get something different

To get it to work properly without the nesting the subsequent joins have to refer back to all of the previous dataset's key values, using OR condition.

proc sql feedback;
create table sql_join as
select coalesce(one.id,two.id,three.id) as id
  ,name1,s1,name2,s2,name3,s3
  from one 
   full join two 
     on one.id=two.id
   full join three
     on three.id = one.id or three.id=two.id
  order by 1
;
quit;

View solution in original post

PaigeMiller · Posted 08-25-2023 11:45 AM

MERGE and FULL JOIN will not produce the same output in all cases, and since we don't have your data, it's hard to be more specific.

--
Paige Miller

ballardw · Posted 08-25-2023 11:56 AM

Perhaps looking at a much smaller example will allow tracing what happens. The crux is that the joins are sort of sequential, one join then joined on another.

data one;
  input a x;
datalines;
1 1
2 2
3 3
;

data two;
  input a y;
datalines;
1 11
3 33
;

data three;
  input a z;
datalines;
2  222
3  333
;


data merged;
   merge one two three;
   by a;
run;

proc sql;
   create table joined as
 select coalesce(a.a, b.a, c.a) as a ,a.x, b.y, c.z
 from one as a
 full join two as b
 on a.a=b.a
 full join three as c
 on b.a=c.a
 order by a
 ;
quit;

Assuming no duplicate values of the BY variable(s) the Merge will result in the number of observations equal to the number of by variable combinations because of the way the other variables on the merged sets behave with matching By values.

Note that the other variables don't behave the same with JOIN, so you get uncombined versions of the observations.

Batman · Posted 08-25-2023 12:34 PM

ok, I see the problem. It looks like I'm getting duplicate bene_sk values whenever there is a match on the first and third data sets. I've printed the first several rows of the duplicate bene_sk cases below. I can't print that variable in this forum, but each two rows has same value and they alternate between having a value in lomajor or upmajor. Is there a place I can insert a statement of "on a.bene_sk=c.bene_sk"?

84 proc sql;
2 The SAS System 09:42 Friday, August 25, 2023

85 create table d as
86 select *
87 from tbl2
88 group by bene_sk
89 having count(*) gt 1;
NOTE: The query requires remerging summary statistics back with the original data.
NOTE: Table WORK.D created, with 58 rows and 4 columns.

90 quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 0.03 seconds
cpu time 0.01 seconds

91
92 data _null_;
93 set d;
94 put lomajor lominor upmajor;
95 run;

1 . .
. . 1
. . 1
1 . .
. . 1
1 . .
. . 1
1 . .
1 . .
. . 1
. . 1
1 . .
. . 1
1 . .

Tom · Posted 08-25-2023 12:50 PM

Your SQL code looks wrong (it is certainly confusing). What does it even mean to full join in a chain like that?

Let's make some example data.

data one;
  input id name1 $30.;
  s1=1;
cards;
1 all three
2 one only
3 one and two
4 one and two and three
;
data two;
  input id name2 $30.;
  s2=1;
cards;
1 all three
3 one and two
5 two only
;
data three;
  input id name3 $30.;
  s3=1;
cards;
1 all three
4 one and two and three
6 three only
;

So when I merge these

data ds_merge;
  merge one two three;
  by id;
run;

there are 6 observations.

If I tell SQL to combine two of them and combine the resulting set with the third

create table sql_join as 
select * 
  from three natural full join (select * from two natural full join one) x 
  order by 1,2
;

I get the same 6 observations

But if I leave it up to the PROC to try to combine three datasets will chained full joins

create table sql_join2 as 
select * 
  from two natural full join three natural full join one 
  order by 1,2
;

I get something different

To get it to work properly without the nesting the subsequent joins have to refer back to all of the previous dataset's key values, using OR condition.

proc sql feedback;
create table sql_join as
select coalesce(one.id,two.id,three.id) as id
  ,name1,s1,name2,s2,name3,s3
  from one 
   full join two 
     on one.id=two.id
   full join three
     on three.id = one.id or three.id=two.id
  order by 1
;
quit;

Merge vs sql full join

Re: Merge vs sql full join

Re: Merge vs sql full join

Re: Merge vs sql full join

Re: Merge vs sql full join

Re: Merge vs sql full join

Merge vs sql full join

Re: Merge vs sql full join

Re: Merge vs sql full join

Re: Merge vs sql full join

Re: Merge vs sql full join

Re: Merge vs sql full join

Click image to register for webinar

Classroom Training Available!