How can the following be done in sql?
data merged_data;
merge address_A (in = A) address_B (in = B);
by streetnum streetname zip;
if A and not B then match = "A only";
if B and not A then match = "B only";
if A and B then match = "match";
run;
There are a number of papers that provide answers to your three questions. E.g., take a look at:
Thanks for the reply.But it does not talk about the logic that is used in the if statements.
page 5 talks about "left join" (in A), "right join" (in B), and "full join" (in A or B). It doesn't explicitly create the "match" variables though.
Using the same example datasets used in the paper, does the following provide a sufficient example? I am not a sql whiz, thus there may be better ways of doing the 3 types of merges:
DATA STATES;
LENGTH STATE $ 2 ZIP $ 5;
INPUT ZIP $ STATE $;
CARDS;
08000 NJ
10000 NY
19000 PA
;
RUN;
DATA CITYS;
LENGTH ZIP $ 5 CITY $ 15;
INPUT ZIP $ CITY $;
CARDS;
10000 NEWYORK
19000 PHILADELPHIA
90000 LOSANGELES
;
RUN;
PROC SQL;
SELECT A.ZIP, A.STATE, B.CITY
FROM STATES A, CITYS B
where A.ZIP=B.ZIP
;
quit;
PROC SQL;
SELECT A.ZIP, A.STATE, B.CITY
FROM STATES A left join CITYS B
on A.ZIP=B.ZIP
having missing(b.zip)
;
quit;
PROC SQL;
SELECT A.ZIP, A.STATE, B.CITY
FROM STATES A right join CITYS B
on A.ZIP=B.ZIP
having missing(a.zip)
;
quit;
Hi ... first, in your data step, I make some changes ...
data merged_data;
merge address_A (in = A) address_B (in = B);
by streetnum streetname zip;
if A and B then match = "match ";
else
if A then match = "A only";
else match = "B only";
run;
As for your question, you can add similar logic to SQL with a CASE statement. Here's
something like your MERGE that uses a SELECT statement since that looks a bit like
a CASE statement in SQL. Then there's an SQL equivalent.
data one;
input id age @@;
datalines;
1 10 2 20 3 30
;
run;
data two;
input id zip @@;
datalines;
1 12202 3 23456 4 99901
;
run;
data merg;
merge one (in=a) two(in=b);
by id;
select;
when (a and b) match = "match";
when (a) match = "aonly";
otherwise match = "bonly";
end;
run;
proc sql;
create table sqel as
select coalesce (one.id, two.id) as id, age, zip,
case when (one.id and two.id) then "match"
when (one.id) then "aonly"
else "bonly"
end as match
from one full join two
on one.id eq two.id;
quit;
title "MERGE";
proc print data=merg noobs;
run;
title "SQL";
proc print data=sqel noobs;
run;
MERGE
id age zip match
1 10 12202 match
2 20 . aonly
3 30 23456 match
4 . 99901 bonly
SQL
id age zip match
1 10 12202 match
2 20 . aonly
3 30 23456 match
4 . 99901 bonly
ps from Howard Schreier's great book ... PROC SQL by Example: Using SQL within SAS
A couple of other techniques:
1. NATURAL joins
2. inline views to add flag columns to each side of the join
PS: Thanks for the plug!
MikeZdeb wrote:
Hi ... first, in your data step, I make some changes ...
data merged_data;
merge address_A (in = A) address_B (in = B);
by streetnum streetname zip;
if A and B then match = "match ";
else
if A then match = "A only";
else match = "B only";
run;
As for your question, you can add similar logic to SQL with a CASE statement. Here's
something like your MERGE that uses a SELECT statement since that looks a bit like
a CASE statement in SQL. Then there's an SQL equivalent.
data one;
input id age @@;
datalines;
1 10 2 20 3 30
;
run;
data two;
input id zip @@;
datalines;
1 12202 3 23456 4 99901
;
run;
data merg;
merge one (in=a) two(in=b);
by id;
select;
when (a and b) match = "match";
when (a) match = "aonly";
otherwise match = "bonly";
end;
run;
proc sql;
create table sqel as
select coalesce (one.id, two.id) as id, age, zip,
case when (one.id and two.id) then "match"
when (one.id) then "aonly"
else "bonly"
end as match
from one full join two
on one.id eq two.id;
quit;
title "MERGE";
proc print data=merg noobs;
run;
title "SQL";
proc print data=sqel noobs;
run;
MERGE
id age zip match
1 10 12202 match
2 20 . aonly
3 30 23456 match
4 . 99901 bonly
SQL
id age zip match
1 10 12202 match
2 20 . aonly
3 30 23456 match
4 . 99901 bonly
ps from Howard Schreier's great book ... PROC SQL by Example: Using SQL within SAS
You should use a full join and use and abuse the "coalesce" function, specially, for any common variable. A solution for your problem would be something like:
proc sql;
SELECT Coalesce(a.streetnum, b.streetnum) AS streetnum,
Coalesce(a.streetname, b.streetname) AS streetname,
Coalesce(a.zip, b.zip) AS zip,
CASE
WHEN NOT( a.streetnum IS NULL
AND a.streetname IS NULL
AND a.zip IS NULL )
AND NOT( b.streetnum IS NULL
AND b.streetname IS NULL
AND b.zip IS NULL ) THEN "match"
WHEN NOT( a.streetnum IS NULL
AND a.streetname IS NULL
AND a.zip IS NULL )
AND( b.streetnum IS NULL
AND b.streetname IS NULL
AND b.zip IS NULL ) THEN "A only"
WHEN( a.streetnum IS NULL
AND a.streetname IS NULL
AND a.zip IS NULL )
AND NOT( b.streetnum IS NULL
AND b.streetname IS NULL
AND b.zip IS NULL ) THEN "B only"
ELSE "match"
END AS match
FROM address_a AS a
FULL JOIN address_b AS b
ON ( a.streetnum = b.streetnum
AND a.streetname = b.streetname
AND a.zip = b.zip );
quit;
There are two main differences in the MERGE and the SQL FULL JOIN.
One is the way to check which dataset has contributed data. The CASE example above is most similar to the IF/THEN logic in your DATA step.
The other is what it does when there are variables with the same name in both datasets. For this you will need to use the COALESCE function. Remember to use for both your BY variables and any other other variables that occur in both datasets.
short answer is YES, but since you have a clearly understandable data step, WHY?
Peter,
It never hurts to know and, moreso, if one has to meet any of the three conditions in a many-to-many merge situation, it will be a lot easier to understand than the gyrations one would have to use to get the right result out of a datastep merge. No?
Art
you are right of course. . . I was hoping the op might explain as Hobbs has, but fearing it was just some more homework!
I am surprised at the quantity of sql needed and that something like this is not sufficient
select
case
when keyA=keyB then 'match'
when keyB is null then 'A file'
else 'B file'
end length=6 as match
from afile as a
full join bfile as b
;
And this is just half of the work because for each common variable you would need to write a new statement, something like,
Coalesce(b.common_var, a.common_var) AS common_var
As an example,
I had to use this solution many times with SAS Data Integration Studio 3.4 because there wasn't a built in "Merge Tranformation" therefore the most common way of merging datasets was using the SQL Join transformation.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.