Re: Merge data sets with alternative 2 keys

Ronein · Posted 08-25-2022 10:47 PM

Hello

I want to merge 2 data sets.

The issue is that there are 2 optional Keys to merge between the data sets(Key1,Key2).

Sometimes Key1 is available and sometimes Key2 is available and sometimes both are available.

What do you think is a better way to merge : way1 or way2?

Maybe there is an alternative better way that you can recommend?

thanks a lot

Data t1;
format date date9.;
input Key1 Key2 wealth date : date9.;
cards;
11111 487777 100000 10AUG2022
22222 987231 200000 10AUG2022
33333 . 300000 10AUG2022
. 897277 400000 10AUG2022
;
Run;


Data t2;
format date date9.;
input Key1 Key2 wealth date : date9.;
cards;
11111 487777 120000 01JAN2022
22222 987231 170000 01JAN2022
33333 . 200000 01JAN2022
. 897277 0 01JAN2022
;
Run;

proc sql;
create table way1 as
select a.Key1,a.Key2, a.wealth_10Aug2022,b.wealth_01JAN2022
from t1(rename=(wealth=wealth_10Aug2022)) as a
left join t2(rename=(wealth=wealth_01JAN2022)) as b
on a.Key1=b.Key1  OR a.Key2=b.Key2
order by a.key1,a.key2
;
quit;


proc sql;
create table way2 as
select a.Key1,a.Key2,a.wealth_10Aug2022,coalesce(b.wealth_01JAN2022,c.wealth_01JAN2022) as wealth_01JAN2022
from t1(rename=(wealth=wealth_10Aug2022)) as a
left join t2(rename=(wealth=wealth_01JAN2022)) as b
on a.Key1=b.Key1   
left join t2(rename=(wealth=wealth_01JAN2022)) as c
on  a.Key2=c.Key2
order by a.key1,a.key2
;
quit;

SASKiwi · Posted 08-26-2022 09:40 PM

Based on the example data you have provided you can use both keys even for the rows where one of the keys is missing. That's because joining a missing key to a missing key is still a match:

proc sql;
create table way1 as
select a.Key1,a.Key2, a.wealth_10Aug2022,b.wealth_01JAN2022
from t1(rename=(wealth=wealth_10Aug2022)) as a
left join t2(rename=(wealth=wealth_01JAN2022)) as b
on a.Key1=b.Key1 and a.Key2=b.Key2
order by a.key1,a.key2
;
quit;

Ronein · Posted 08-26-2022 11:35 PM

I will change the data in data sets and then your code will not work well.

Data t1;
format date date9.;
input Key1 Key2 wealth date : date9.;
cards;
11111 487777 100000 10AUG2022
22222 987231 200000 10AUG2022
33333 321567 300000 10AUG2022
. 897277 400000 10AUG2022
;
Run;


Data t2;
format date date9.;
input Key1 Key2 wealth date : date9.;
cards;
11111 487777 120000 01JAN2022
22222 987231 170000 01JAN2022
33333 . 200000 01JAN2022
44444 897277 0 01JAN2022
;
Run

Patrick · Posted 08-27-2022 07:54 AM

That's one of these questions where you need to be very specific and besides of representative sample data best also share the desired result.

What should happen if there is a match on one and on two keys? (look at rows 3 and 5 in below result)

What if all keys are missing?

Data t1;
  format date date9.;
  input id Key1 Key2 wealth date : date9.;
  cards;
1 11111 487777 100000 10AUG2022
2 22222 987231 200000 10AUG2022
3 33333 321567 300000 10AUG2022
4 . 897277 400000 10AUG2022
;

Data t2;
  format date date9.;
  input id Key1 Key2 wealth date : date9.;
  cards;
1 11111 487777 120000 01JAN2022
2 22222 987231 170000 01JAN2022
3 33333 . 200000 01JAN2022
4 44444 897277 0 01JAN2022
;

proc sql;
  create table way1 as
    select 
      a.id as a_id,
      b.id as b_id,
      a.Key1 as a_key1,
      a.Key2 as a_key2, 
      b.Key1 as b_key1,
      b.Key2 as b_key2, 
      a.wealth as wealth_10Aug2022,
      b.wealth as wealth_01JAN2022
    from t1 as a
      left join t2 as b
        on 
          (a.Key1=b.Key1 or n(a.key1,b.key1)=1) 
          and (a.Key2=b.Key2 or n(a.key2,b.key2)=1) 
          and n(a.Key1,a.Key2) >0
      order by a.id, a.key1,a.key2
  ;
quit;

Ronein · Posted 08-27-2022 08:15 AM

Hello
If all keys are missing then it shouldn't be displayed on the wanted data set.

If there is a match on two keys then it should be displayed.

If there is a match on only one key ) key1 or key2) then it should be displayed

Ronein · Posted 08-27-2022 08:17 AM

What does it mean
n(a.key1,b.key1)=1)

Patrick · Posted 08-27-2022 08:20 AM

https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.2/lefunctionsref/n0h1d8gxwr1ml7n1qdt35v3i6u7r.h...

Ronein · Posted 08-27-2022 08:34 AM

I know this functions but why should you use it in this example ?

Ronein · Posted 08-27-2022 08:39 AM

The number of rows in wanted data set should be equal to number of rows in data set t1. In your code there is increase in number of rows and then it is not the desire result

Tom · Posted 08-27-2022 10:52 AM

I am not sure I understand the join logic you are looking for.

Let's just try something an you can tell use why that is or isn't what you want.

This will test if BOTH keys match (ignoring missing values).

I used a LEFT JOIN so values from T2 that do not match anything are ignored.

data t1;
  row+1;
  input Key1 Key2 wealth date :date.;
  format date date9.;
cards;
11111 487777 100000 10AUG2022
22222 987231 200000 10AUG2022
33333 321567 300000 10AUG2022
. 897277 400000 10AUG2022
;

data t2;
  row+1;
  input Key1 Key2 wealth date :date.;
  format date date9.;
cards;
11111 487777 120000 01JAN2022
22222 987231 170000 01JAN2022
33333 . 200000 01JAN2022
44444 897277 0 01JAN2022
;


proc sql;
create table want as
  select
    a.row as aRow
  , b.row as bRow 
  , a.Key1 as aKey1
  , b.Key1 as bKey1
  , a.Key2 as aKey2
  , b.Key2 as bKey2
  , a.wealth as aWealth
  , b.wealth as bWealth
  , a.date as aDate
  , b.date as bDate
  from t1 a 
  left join t2 b
    on (a.key1=b.key1 or missing(a.key1) or missing(b.key1))
   and (a.key2=b.key2 or missing(a.key2) or missing(b.key2))
  order by 1,2
;
quit;

proc print;
run;

Result

                                                                a         b
Obs    aRow    bRow    aKey1    bKey1     aKey2     bKey2    Wealth    Wealth        aDate        bDate

 1       1       1     11111    11111    487777    487777    100000    120000    10AUG2022    01JAN2022
 2       2       2     22222    22222    987231    987231    200000    170000    10AUG2022    01JAN2022
 3       3       3     33333    33333    321567         .    300000    200000    10AUG2022    01JAN2022
 4       4       3         .    33333    897277         .    400000    200000    10AUG2022    01JAN2022
 5       4       4         .    44444    897277    897277    400000         0    10AUG2022    01JAN2022

If you want to eliminate that 4th observation in the output where both matches are based on missing values then add an additional condition that excludes such cases. So basically test if either there are two non-missing values of KEY1 or two non-missing values of KEY2.

   and (n(a.key1,b.key1)>1 or n(a.key2,b.key2)>1)

It might be harder to eliminate other types of duplicates. For that you probably need to set some type of ordering criteria and then pick the first (or last) match based on the ordering criteria.

Ronein · Posted 08-27-2022 11:32 AM

In my original post I showed way1 and way2 and both are working well

Registration is open

SAS Training: Just a Click Away