Hello, I am trying to merge 4 tables and I would like all the final result to contain all of the values from each table, but when I try to do a full outer join it produces way too many results. The data are structured as shown below: table a has a unique id, a zip code, and a name. Table b has a zip code and a city, table c has a zip code and zcta, and table has a zcta and some data.
data a;
input id zip name $;
cards;
1 15217 Bob
2 15217 Sue
3 15213 Jack
4 85001 Ginger
5 91111 Mary
;
run;
data b;
input zip city $;
cards;
15217 Pittsburgh
15213 Pittsburgh
60290 Chicago
10001 New York
85001 Phoenix
;
run;
data c;
input zip zcta;
cards;
15217 15217
15213 15217
51331 51331
60290 60290
10001 10002
85001 85001
;
run;
data d;
input zcta data $;
cards;
15217 v1
51331 v2
60290 v3
10002 v4
85001 v5
95201 v6
;
run;
proc sql;
create table outer as
select a.id, a.name, a.zip,
b.city,
c.zcta,
d.data
from a as a
full outer join
(b as b,
c as c,
d as d)
on a.zip = b.zip
and a.zip = c.zip
and b.zip = c.zip
and c.zcta = d.zcta;
quit;
The result I'd like would be:
input id name $ zip zcta city $ data $ ;
cards;
1 Bob 15217 15217 Pittsburgh v1
2 Sue 15217 15217 Pittsburgh v1
3 Jack 15213 15217 Pittsburgh v1
4 Ginger 85001 85001 Phoenix v5
. . 10001 10002 NewYork v4
. . 60290 60290 Chicago v3
. . 51331 51331 . v2
. . . 95201 . v6
;
run;
Is there a straightforward way to do this? Thank you for your help!
Absolutely fabulous! Thank you so much for your insight.
Since all of these are only adding one variable, I'd say the easiest is to sort them all by Zip and then use a data step merge.
Or possibly create formats for ZCTA and whatever that DATA variable stands form . If the ZIP code is supposed to be the code for the City then the column city is not needed with the ZIPCity function available.
Thank you for these ideas. Actually, the adding of a single variable each dataset was only for simplicity, there are (many) other variables to merge as well. But I do like the suggestion of using a zip2zcta format to create a new variable in a dataset that contains zip rather than merging using the crosswalk table (table c).
Hi,
I included three different variations of solutions because it is unclear why Mary is being excluded. The first version includes Mary, but the latter two do not include Mary. Hope this helps.
hbi
/* this version includes Mary */
/* comment: from your example it is unclear why Mary should be excluded */
PROC SQL;
CREATE TABLE want_version1 AS
SELECT a.id
, a.name
, COALESCE(a.zip, b.zip, c.zip) as zip
, b.city
, COALESCE(c.zcta, d.zcta) as zcta
, d.data
FROM a
FULL OUTER JOIN b ON (b.zip = a.zip)
FULL OUTER JOIN c ON (c.zip = COALESCE(a.zip, b.zip))
FULL OUTER JOIN d ON (d.zcta = c.zcta)
ORDER BY 1, 2, 3, 4, 5, 6;
QUIT;
/* this version excludes Mary */
/* reason for excluding Mary is because: d.zcta is missing */
PROC SQL;
CREATE TABLE want_version2 AS
SELECT a.id
, a.name
, COALESCE(a.zip, b.zip, c.zip) as zip
, b.city
, COALESCE(c.zcta, d.zcta) as zcta
, d.data
FROM a
FULL OUTER JOIN b ON (b.zip = a.zip)
FULL OUTER JOIN c ON (c.zip = COALESCE(a.zip, b.zip))
FULL OUTER JOIN d ON (d.zcta = c.zcta)
WHERE d.zcta IS NOT MISSING
ORDER BY 1, 2, 3, 4, 5, 6;
QUIT;
/* this version excludes Mary */
/* reason for excluding Mary is because: d.data is missing */
PROC SQL;
CREATE TABLE want_version3 AS
SELECT a.id
, a.name
, COALESCE(a.zip, b.zip, c.zip) as zip
, b.city
, COALESCE(c.zcta, d.zcta) as zcta
, d.data
FROM a
FULL OUTER JOIN b ON (b.zip = a.zip)
FULL OUTER JOIN c ON (c.zip = COALESCE(a.zip, b.zip))
FULL OUTER JOIN d ON (d.zcta = c.zcta)
WHERE d.data IS NOT MISSING
ORDER BY 1, 2, 3, 4, 5, 6;
QUIT;
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.