Hello
I want to merge 2 data sets.
The issue is that there are 2 optional Keys to merge between the data sets(Key1,Key2).
Sometimes Key1 is available and sometimes Key2 is available and sometimes both are available.
What do you think is a better way to merge : way1 or way2?
Maybe there is an alternative better way that you can recommend?
thanks a lot
Data t1;
format date date9.;
input Key1 Key2 wealth date : date9.;
cards;
11111 487777 100000 10AUG2022
22222 987231 200000 10AUG2022
33333 . 300000 10AUG2022
. 897277 400000 10AUG2022
;
Run;
Data t2;
format date date9.;
input Key1 Key2 wealth date : date9.;
cards;
11111 487777 120000 01JAN2022
22222 987231 170000 01JAN2022
33333 . 200000 01JAN2022
. 897277 0 01JAN2022
;
Run;
proc sql;
create table way1 as
select a.Key1,a.Key2, a.wealth_10Aug2022,b.wealth_01JAN2022
from t1(rename=(wealth=wealth_10Aug2022)) as a
left join t2(rename=(wealth=wealth_01JAN2022)) as b
on a.Key1=b.Key1 OR a.Key2=b.Key2
order by a.key1,a.key2
;
quit;
proc sql;
create table way2 as
select a.Key1,a.Key2,a.wealth_10Aug2022,coalesce(b.wealth_01JAN2022,c.wealth_01JAN2022) as wealth_01JAN2022
from t1(rename=(wealth=wealth_10Aug2022)) as a
left join t2(rename=(wealth=wealth_01JAN2022)) as b
on a.Key1=b.Key1
left join t2(rename=(wealth=wealth_01JAN2022)) as c
on a.Key2=c.Key2
order by a.key1,a.key2
;
quit;
Based on the example data you have provided you can use both keys even for the rows where one of the keys is missing. That's because joining a missing key to a missing key is still a match:
proc sql;
create table way1 as
select a.Key1,a.Key2, a.wealth_10Aug2022,b.wealth_01JAN2022
from t1(rename=(wealth=wealth_10Aug2022)) as a
left join t2(rename=(wealth=wealth_01JAN2022)) as b
on a.Key1=b.Key1 and a.Key2=b.Key2
order by a.key1,a.key2
;
quit;
I will change the data in data sets and then your code will not work well.
Data t1;
format date date9.;
input Key1 Key2 wealth date : date9.;
cards;
11111 487777 100000 10AUG2022
22222 987231 200000 10AUG2022
33333 321567 300000 10AUG2022
. 897277 400000 10AUG2022
;
Run;
Data t2;
format date date9.;
input Key1 Key2 wealth date : date9.;
cards;
11111 487777 120000 01JAN2022
22222 987231 170000 01JAN2022
33333 . 200000 01JAN2022
44444 897277 0 01JAN2022
;
Run
That's one of these questions where you need to be very specific and besides of representative sample data best also share the desired result.
What should happen if there is a match on one and on two keys? (look at rows 3 and 5 in below result)
What if all keys are missing?
Data t1;
format date date9.;
input id Key1 Key2 wealth date : date9.;
cards;
1 11111 487777 100000 10AUG2022
2 22222 987231 200000 10AUG2022
3 33333 321567 300000 10AUG2022
4 . 897277 400000 10AUG2022
;
Data t2;
format date date9.;
input id Key1 Key2 wealth date : date9.;
cards;
1 11111 487777 120000 01JAN2022
2 22222 987231 170000 01JAN2022
3 33333 . 200000 01JAN2022
4 44444 897277 0 01JAN2022
;
proc sql;
create table way1 as
select
a.id as a_id,
b.id as b_id,
a.Key1 as a_key1,
a.Key2 as a_key2,
b.Key1 as b_key1,
b.Key2 as b_key2,
a.wealth as wealth_10Aug2022,
b.wealth as wealth_01JAN2022
from t1 as a
left join t2 as b
on
(a.Key1=b.Key1 or n(a.key1,b.key1)=1)
and (a.Key2=b.Key2 or n(a.key2,b.key2)=1)
and n(a.Key1,a.Key2) >0
order by a.id, a.key1,a.key2
;
quit;
I am not sure I understand the join logic you are looking for.
Let's just try something an you can tell use why that is or isn't what you want.
This will test if BOTH keys match (ignoring missing values).
I used a LEFT JOIN so values from T2 that do not match anything are ignored.
data t1;
row+1;
input Key1 Key2 wealth date :date.;
format date date9.;
cards;
11111 487777 100000 10AUG2022
22222 987231 200000 10AUG2022
33333 321567 300000 10AUG2022
. 897277 400000 10AUG2022
;
data t2;
row+1;
input Key1 Key2 wealth date :date.;
format date date9.;
cards;
11111 487777 120000 01JAN2022
22222 987231 170000 01JAN2022
33333 . 200000 01JAN2022
44444 897277 0 01JAN2022
;
proc sql;
create table want as
select
a.row as aRow
, b.row as bRow
, a.Key1 as aKey1
, b.Key1 as bKey1
, a.Key2 as aKey2
, b.Key2 as bKey2
, a.wealth as aWealth
, b.wealth as bWealth
, a.date as aDate
, b.date as bDate
from t1 a
left join t2 b
on (a.key1=b.key1 or missing(a.key1) or missing(b.key1))
and (a.key2=b.key2 or missing(a.key2) or missing(b.key2))
order by 1,2
;
quit;
proc print;
run;
Result
a b Obs aRow bRow aKey1 bKey1 aKey2 bKey2 Wealth Wealth aDate bDate 1 1 1 11111 11111 487777 487777 100000 120000 10AUG2022 01JAN2022 2 2 2 22222 22222 987231 987231 200000 170000 10AUG2022 01JAN2022 3 3 3 33333 33333 321567 . 300000 200000 10AUG2022 01JAN2022 4 4 3 . 33333 897277 . 400000 200000 10AUG2022 01JAN2022 5 4 4 . 44444 897277 897277 400000 0 10AUG2022 01JAN2022
If you want to eliminate that 4th observation in the output where both matches are based on missing values then add an additional condition that excludes such cases. So basically test if either there are two non-missing values of KEY1 or two non-missing values of KEY2.
and (n(a.key1,b.key1)>1 or n(a.key2,b.key2)>1)
It might be harder to eliminate other types of duplicates. For that you probably need to set some type of ordering criteria and then pick the first (or last) match based on the ordering criteria.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.