I normally use proc sql for merging tables. What I have is 2 tables that I want to merge together. Both tables have 2 IDs (call them ID1 and ID2). Because of time differences, etc. neither ID is 100% accurate at matching.
What I'd like is to do a left merge merging Table 2 onto Table 1 using ID1 and create a new variable equal to 1 if a match was found using ID1. After that, I'd like to do a second left merge of Table 2 onto Table 1 using ID2, but only if a match wasn't found using ID1 (and create a second new variable equal to 1 if a match was found using ID2). Although not necessary, it would also be nice if there were a way to not actually do a merge, but have a third new variable equal to 1 if a match was found using ID2 but wasn't performed because a match had already been found using ID1.
Here's an example:
Table 1:
ID1 | ID2 | Data1 |
---|---|---|
1 | ab | 80 |
2 | . | 100 |
. | b | 55 |
5 | x | 10 |
Table 2:
ID1 | ID2 | Data2 |
---|---|---|
. | a | 12 |
1 | ab | 14 |
4 | b | 15 |
5 | . | 22 |
WANTED RESULT TABLE:
ID1 | ID2 | Data1 | Data2 | Merged on ID1 | Merged on ID2 | Match with ID2 but not merged |
---|---|---|---|---|---|---|
1 | ab | 80 | 14 | 1 | 0 | 1 |
2 | . | 100 | . | 0 | 0 | 0 |
. | b | 55 | 15 | 0 | 1 | 0 |
5 | x | 10 | 22 | 1 | 0 | 0 |
There are a lot of differences between a SQL and SAS datastep approach.
Using a SAS datastep Merge SAS(R) 9.3 Statements: Reference and the IN= dataset option SAS(R) 9.3 Data Set Options: Reference is given you quite a lot of requested logic.
What you are really wanting with matched/unmatched observations or possible using a table-lookup on the additional identifier is not clear.
As you are possible in a data-cleansing stage it could you are also not knowing all details. By that you need to investigate.
This gets you your required result:
data table1;
input ID1 ID2:$ Data1;
cards;
1 ab 80
2 . 100
. b 55
5 x 10
;
run;
data table2;
input ID1 ID2:$ Data2;
cards;
. a 12
1 ab 14
4 b 15
5 . 22
;
run;
proc sort data=table1;
by ID1;
run;
proc sort data=table2;
by ID1;
run;
data
merge1 (keep=ID1 ID2 data1 data2 var1 var3)
rest1 (keep=ID1 ID2 data1)
rest2 (keep=ID1 ID21 data2 rename=(ID21=ID2))
;
merge
table1 (in=a)
table2 (
in=b
rename=(ID2=ID21)
)
;
by ID1;
if a and b and ID1 ne .
then do;
var1 = 1;
if ID2 = ID21
then var3 = 1;
else var3 = 0;
output merge1;
end;
else do;
var1 = 0;
if a
then output rest1;
if b
then do;
ID2 = ID21;
output rest2;
end;
end;
run;
proc sort data=rest1;
by ID2;
run;
proc sort data=rest2;
by ID2;
run;
data
merge2 (drop=ID11)
rest3 (drop=ID11)
;
merge
rest1 (in=a)
rest2 (in=b rename=(ID1=ID11))
;
by ID2;
if a and b
then do;
var2 = 1;
output merge2;
end;
else if ID1 ne .
then do;
if a then output rest3;
else do;
ID1 = ID11;
output rest3;
end;
end;
run;
data result;
format ID1 ID2 data1 data2 var1 var2 var3; *puts the vars in order;
set merge1 merge2 rest3;
if var1 = . then var1 = 0;
if var2 = . then var2 = 0;
if var3 = . then var3 = 0;
run;
var1 = matched by ID1
var2 = matched by ID2;
var3 = ID2 would have matched, but ID1 took preference
var2 =
While you could probably achieve this with hashing, my inclination is to use formats instead. This solution will fail if Table 2 contains duplicate values for either ID1 or ID2, but then you would expect that to be a problem case. This code assumes that ID1 and ID2 are both character, although that issue can be circumvented if necessary.
First, create two formats from Table 2:
data table2_revised;
set table2 end=done;
length label $ 9;
label=data2;
fmtname='$id1fmt';
start=id1;
output;
start=id2;
output;
if done;
hlo='O';
label='Not Found';
fmtname='$id1fmt';
output;
fmtname='$id2fmt';
output;
run;
proc sort data=table2_revised;
by fmtname;
run;
proc format cntlin=table2_revised;
run;
This gives you 2 formats to work with, $id1fmt and $id2fmt, which translate ID1 and ID2 respectively into the matching value for DATA2. There might be a little more complication if DATA 2 is numeric ... it might need to be converted to character along these lines:
label = put(data2, best9.);
At any rate, once you have the formats, the hard part is over. You can use the formats along these lines:
data want;
set table1;
match_from_id1 = put(id1, $id1fmt.);
match_from_id2 = put(id2, $id2fmt.);
run;
Then you can play with the two new variables in any way you see fit.
Good luck.
Don't miss out on SAS Innovate - Register now for the FREE Livestream!
Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.