I'm trying to compare results from a data step merge and full join in sql. I get a higher number of output obs using sql and I'm not sure why. It isn't a problem of duplicates, there is just row for each value of "bene_sk" in each input file.
83 /*Method 1: merge with data step*/
84 data tbl1;
85 merge lomajor16 lominor16 upmajor16;
86 by bene_sk;
87 run;
NOTE: There were 29625 observations read from the data set WORK.LOMAJOR16.
NOTE: There were 36241 observations read from the data set WORK.LOMINOR16.
NOTE: There were 396 observations read from the data set WORK.UPMAJOR16.
NOTE: The data set WORK.TBL1 has 60904 observations and 4 variables.
NOTE: DATA statement used (Total process time):
real time 1.39 seconds
cpu time 0.16 seconds
88
89 /*Method 2: merge with sql*/
90 proc sql;
91 create table tbl2 as
92 select coalesce(a.bene_sk, b.bene_sk, c.bene_sk) as bene_sk ,a.lomajor, b.lominor, c.upmajor
93 from lomajor16 as a
94 full join lominor16 as b
95 on a.bene_sk=b.bene_sk
96 full join upmajor16 as c
97 on b.bene_sk=c.bene_sk
98 order by bene_sk
99 ;
NOTE: SAS threaded sort was used.
NOTE: Table WORK.TBL2 created, with 60933 rows and 4 columns.
100 quit;
Your SQL code looks wrong (it is certainly confusing). What does it even mean to full join in a chain like that?
Let's make some example data.
data one;
input id name1 $30.;
s1=1;
cards;
1 all three
2 one only
3 one and two
4 one and two and three
;
data two;
input id name2 $30.;
s2=1;
cards;
1 all three
3 one and two
5 two only
;
data three;
input id name3 $30.;
s3=1;
cards;
1 all three
4 one and two and three
6 three only
;
So when I merge these
data ds_merge;
merge one two three;
by id;
run;
there are 6 observations.
If I tell SQL to combine two of them and combine the resulting set with the third
create table sql_join as
select *
from three natural full join (select * from two natural full join one) x
order by 1,2
;
I get the same 6 observations
But if I leave it up to the PROC to try to combine three datasets will chained full joins
create table sql_join2 as
select *
from two natural full join three natural full join one
order by 1,2
;
I get something different
To get it to work properly without the nesting the subsequent joins have to refer back to all of the previous dataset's key values, using OR condition.
proc sql feedback;
create table sql_join as
select coalesce(one.id,two.id,three.id) as id
,name1,s1,name2,s2,name3,s3
from one
full join two
on one.id=two.id
full join three
on three.id = one.id or three.id=two.id
order by 1
;
quit;
MERGE and FULL JOIN will not produce the same output in all cases, and since we don't have your data, it's hard to be more specific.
Perhaps looking at a much smaller example will allow tracing what happens. The crux is that the joins are sort of sequential, one join then joined on another.
data one; input a x; datalines; 1 1 2 2 3 3 ; data two; input a y; datalines; 1 11 3 33 ; data three; input a z; datalines; 2 222 3 333 ; data merged; merge one two three; by a; run; proc sql; create table joined as select coalesce(a.a, b.a, c.a) as a ,a.x, b.y, c.z from one as a full join two as b on a.a=b.a full join three as c on b.a=c.a order by a ; quit;
Assuming no duplicate values of the BY variable(s) the Merge will result in the number of observations equal to the number of by variable combinations because of the way the other variables on the merged sets behave with matching By values.
Note that the other variables don't behave the same with JOIN, so you get uncombined versions of the observations.
ok, I see the problem. It looks like I'm getting duplicate bene_sk values whenever there is a match on the first and third data sets. I've printed the first several rows of the duplicate bene_sk cases below. I can't print that variable in this forum, but each two rows has same value and they alternate between having a value in lomajor or upmajor. Is there a place I can insert a statement of "on a.bene_sk=c.bene_sk"?
84 proc sql;
2 The SAS System 09:42 Friday, August 25, 2023
85 create table d as
86 select *
87 from tbl2
88 group by bene_sk
89 having count(*) gt 1;
NOTE: The query requires remerging summary statistics back with the original data.
NOTE: Table WORK.D created, with 58 rows and 4 columns.
90 quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 0.03 seconds
cpu time 0.01 seconds
91
92 data _null_;
93 set d;
94 put lomajor lominor upmajor;
95 run;
1 . .
. . 1
. . 1
1 . .
. . 1
1 . .
. . 1
1 . .
1 . .
. . 1
. . 1
1 . .
. . 1
1 . .
Your SQL code looks wrong (it is certainly confusing). What does it even mean to full join in a chain like that?
Let's make some example data.
data one;
input id name1 $30.;
s1=1;
cards;
1 all three
2 one only
3 one and two
4 one and two and three
;
data two;
input id name2 $30.;
s2=1;
cards;
1 all three
3 one and two
5 two only
;
data three;
input id name3 $30.;
s3=1;
cards;
1 all three
4 one and two and three
6 three only
;
So when I merge these
data ds_merge;
merge one two three;
by id;
run;
there are 6 observations.
If I tell SQL to combine two of them and combine the resulting set with the third
create table sql_join as
select *
from three natural full join (select * from two natural full join one) x
order by 1,2
;
I get the same 6 observations
But if I leave it up to the PROC to try to combine three datasets will chained full joins
create table sql_join2 as
select *
from two natural full join three natural full join one
order by 1,2
;
I get something different
To get it to work properly without the nesting the subsequent joins have to refer back to all of the previous dataset's key values, using OR condition.
proc sql feedback;
create table sql_join as
select coalesce(one.id,two.id,three.id) as id
,name1,s1,name2,s2,name3,s3
from one
full join two
on one.id=two.id
full join three
on three.id = one.id or three.id=two.id
order by 1
;
quit;
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.