Solved: Re: Merge if any of a set of variables matches

jjb123 · Posted 03-25-2019 02:30 PM

I want to merge two data sets with a left join on the id variable and on var1, var2, var3, etc. (var1-var36). However, I don't want to require that they all match. For instance, I want the information to merge whether the only match is a.var2 and b.var7, or ten variables match, or they all match. Furthermore, almost all observations have missing values for some variables, so I want to ensure that missing-to-missing matches are not captured. Any help is appreciated.

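proc sql;
  create table want as
    select  a.*, b.info
      from  have1 as a left join have2 as b
        on  a.id = b.id and
            /* intent only: as written, SAS parses this as
               (a.var1 - 36) = (b.var1 - 36), which compares var1 alone */
            (a.var1-36 = b.var1-36)
      order by id, date;
quit;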

Astounding · Posted 03-25-2019 02:48 PM

Are you saying you want a match if the IDs match and a.var1 is equal to any of the fields b.var1 through b.var36?

jjb123 · Posted 03-25-2019 03:11 PM

Sort of. I want a match if the IDs match and any of a.var1, a.var2, a.var3 (i.e., a.var1-36) is equal to any of b.var1, b.var2, b.var3 (i.e., b.var1-36).

ballardw · Posted 03-25-2019 04:21 PM

Please post example input data from both sets and the result you need.

jjb123 · Posted 03-25-2019 06:45 PM

Assume the first data set is the have dataset (it can be have1 and have2 for simplicity). Assume the second is the desired result. The variables I used are a little different, so I'll update my original code as well.

proc sql;
  create table want as
    select  a.*, b.info as merged_info
      from  have as a left join have as b
        on  a.unique ne b.unique and a.id = b.id and
            /* intent only: any of a.var1-a.var4 equal to any of b.var1-b.var4 */
            (a.var1-4 = b.var1-4)
      order by unique, id;
quit;

mkeintz · Posted 03-25-2019 06:47 PM

@jjb123 wrote

.... Furthermore, there are missing values for some variables for almost all observations, so I want to ensure that missing matches are not captured. Any help is appreciated.

So what do you want to do if the variable is missing in one data set and not missing in the other? Does that constitute a non-match?

jjb123 · Posted 03-25-2019 07:23 PM

I'm not sure I understand your question. A missing value should always be considered a non-match, even when the value it is compared with is also missing.
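For context on why missing values need special handling here: in PROC SQL, a plain equality test treats two missing values as equal, so a condition such as a.var1 = b.var1 on its own would pair missing with missing. A minimal sketch of the difference, using a made-up three-row table (the names demo, x, y, plain_equal and guarded_equal are illustrative and not part of the thread's data):

```sas
/* Illustrative data only: one missing/missing pair, one real match,
   one value paired with a missing. */
data demo;
  input x y;
datalines;
. .
5 5
5 .
;
run;

proc sql;
  select x, y,
         /* plain equality: true for the missing/missing row as well */
         (x = y)                    as plain_equal,
         /* guarded equality: additionally requires a non-missing value */
         (x = y and not missing(x)) as guarded_equal
    from demo;
quit;
```

Only the guarded test rejects the first row, which is the behavior asked for above.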

andreas_lds · Posted 03-25-2019 09:52 PM

If you want (tested) code, provide data in usable form.

jjb123 · Posted 03-25-2019 11:32 PM

Here you go.

data have;
input unique id var1 var2 var3 var4 info;
datalines;
1 1 . . . 1234 54
2 1 32423 3713 1234 328931 26
3 1 3713 1234 123412 3253 82
4 2 4567 . . 12 93
5 2 . . . 1267 102
6 2 12 145 86 92 96
7 3 . 8214 1479 . 123
8 3 . . . . 85
9 3 987 345 7528 93842 146
;

andreas_lds · Posted 03-26-2019 02:47 AM

Thanks, but again incomplete. In your sql code three datasets are mentioned: a, b and want. Which of the three is "have"?

Patrick · Posted 03-26-2019 03:04 AM

@jjb123

If you are using SQL with SAS tables, here is one way to go.

data have;
  input unique id var1 var2 var3 var4 info;
datalines;
1 1 . . . 1234 54
2 1 32423 3713 1234 328931 26
3 1 3713 1234 123412 3253 82
4 2 4567 . . 12 93
5 2 . . . 1267 102
6 2 12 145 86 92 96
7 3 . 8214 1479 . 123
8 3 . . . . 85
9 3 987 345 7528 93842 146
;
run;

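proc sql;
  create table want as
    select  a.*, b.info as merged_info
      from  have as a left join have as b
        /* self-join: pair each row with the other rows that share its id */
        on  a.unique ne b.unique and a.id = b.id and
            (
              /* whichn returns the position of the first listed value equal
                 to its first argument; the '.' fills position 1, so a result
                 > 1 means a.varN equals one of b.var1-b.var4, and a missing
                 a.varN never produces a result above 1 */
              whichn(a.var1,.,b.var1,b.var2,b.var3,b.var4) >1 or
              whichn(a.var2,.,b.var1,b.var2,b.var3,b.var4) >1 or
              whichn(a.var3,.,b.var1,b.var2,b.var3,b.var4) >1 or
              whichn(a.var4,.,b.var1,b.var2,b.var3,b.var4) >1
            )
      order by unique, id
      ;
quit;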
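Typing out 36 WHICHN calls with 36 arguments each gets tedious, so for the original var1-var36 case the condition could be generated instead. The following is only a sketch: the macro name any_match and its parameter n are hypothetical, and it assumes the variables really are named var1 through var&n in both tables and that the join uses the aliases a and b.

```sas
/* Hypothetical helper (not from the thread): emits the OR-chain of
   WHICHN tests for var1..var&n, assuming table aliases a and b. */
%macro any_match(n);
  %local i j blist;
  /* Build the list "b.var1, b.var2, ..., b.var&n" once. */
  %let blist = b.var1;
  %do j = 2 %to &n;
    %let blist = &blist, b.var&j;
  %end;
  /* One WHICHN test per a.var, joined with OR. */
  %do i = 1 %to &n;
    %if &i > 1 %then %do; or %end;
    whichn(a.var&i, ., &blist) > 1
  %end;
%mend any_match;

/* Same join as above, with the condition generated for four variables. */
proc sql;
  create table want as
    select  a.*, b.info as merged_info
      from  have as a left join have as b
        on  a.unique ne b.unique and a.id = b.id and
            (%any_match(4))
      order by unique, id;
quit;
```

For the two-table case in the original question, the same call with %any_match(36) would go into the have1/have2 join. One caveat worth checking against the SAS documentation: WHICHN is described there as truncating non-integer arguments before comparing, so this approach fits integer-valued variables like the ones in the sample data.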

jjb123 · Posted 03-26-2019 05:11 PM

This works like a charm. Thank you.
