I want to merge (using left join) using the id variable in each set and based on var1, var2, var3...etc. (var1-36). However, I don't want to require that they all match. For instance, I want the information to merge whether the only match is a.var2 and b.var7 or if ten match or if they all match. Furthermore, there are missing values for some variables for almost all observations, so I want to ensure that missing matches are not captured. Any help is appreciated.
proc sql;
create table want as
select a.*, b.info
from have1 as a left join have2 as b
on a.id = b.id and
(a.var1-36 = b.var1-36)
order by id, date;
quit;
If using SQL with SAS tables then here one way to go.
data have;
input unique id var1 var2 var3 var4 info;
datalines;
1 1 . . . 1234 54
2 1 32423 3713 1234 328931 26
3 1 3713 1234 123412 3253 82
4 2 4567 . . 12 93
5 2 . . . 1267 102
6 2 12 145 86 92 96
7 3 . 8214 1479 . 123
8 3 . . . . 85
9 3 987 345 7528 93842 146
;
run;
proc sql;
create table want as
select a.*, b.info as merged_info
from have as a left join have as b
on a.unique ne b.unique and a.id = b.id and
(
whichn(a.var1,.,b.var1,b.var2,b.var3,b.var4) >1 or
whichn(a.var2,.,b.var1,b.var2,b.var3,b.var4) >1 or
whichn(a.var3,.,b.var1,b.var2,b.var3,b.var4) >1 or
whichn(a.var4,.,b.var1,b.var2,b.var3,b.var4) >1
)
order by unique, id
;
quit;
Are you saying you want a match of the IDs match and a.var1 is equal to any of the fields b.var1 through b.var36 ?
Sort of. I want a match if the IDs match and any of a.var1, a.var2, a.var3 (i.e., a.var1-36) is equal to any of b.var1, b.var2, b.var3 (i.e., b.var1-36).
Example input data from both sets and the result needed.
Assume the first data set is the have dataset (it can be have1 and have2 for simplicity). Assume the second is the desired result. The variables I used are a little different, so I'll update my original code as well.
proc sql;
create table want as
select a.*, b.info as merged_info
from have as a left join have as b
on a.unique ne b.unique and a.id = b.id and
(a.var1-4 = b.var1-4)
order by unique, id;
quit;
@jjb123 wrote
.... Furthermore, there are missing values for some variables for almost all observations, so I want to ensure that missing matches are not captured. Any help is appreciated.
So what do you want to do if the variable is missing in one data set and not missing in the other? Does that constitute a non-match?
I'm not sure I understand your question. A missing value (which would only be a match if it matches to another missing value) should always be considered a non-match.
Here you go.
data have;
input unique id var1 var2 var3 var4 info;
datalines;
1 1 . . . 1234 54
2 1 32423 3713 1234 328931 26
3 1 3713 1234 123412 3253 82
4 2 4567 . . 12 93
5 2 . . . 1267 102
6 2 12 145 86 92 96
7 3 . 8214 1479 . 123
8 3 . . . . 85
9 3 987 345 7528 93842 146
;
Thanks, but again incomplete. In your sql code three datasets are mentioned: a, b and want. Which of the three is "have"?
If using SQL with SAS tables then here one way to go.
data have;
input unique id var1 var2 var3 var4 info;
datalines;
1 1 . . . 1234 54
2 1 32423 3713 1234 328931 26
3 1 3713 1234 123412 3253 82
4 2 4567 . . 12 93
5 2 . . . 1267 102
6 2 12 145 86 92 96
7 3 . 8214 1479 . 123
8 3 . . . . 85
9 3 987 345 7528 93842 146
;
run;
proc sql;
create table want as
select a.*, b.info as merged_info
from have as a left join have as b
on a.unique ne b.unique and a.id = b.id and
(
whichn(a.var1,.,b.var1,b.var2,b.var3,b.var4) >1 or
whichn(a.var2,.,b.var1,b.var2,b.var3,b.var4) >1 or
whichn(a.var3,.,b.var1,b.var2,b.var3,b.var4) >1 or
whichn(a.var4,.,b.var1,b.var2,b.var3,b.var4) >1
)
order by unique, id
;
quit;
This works like a charm. Thank you.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.