I normally use proc sql for merging tables. What I have is 2 tables that I want to merge together. Both tables have 2 IDs (call them ID1 and ID2). Because of time differences, etc. neither ID is 100% accurate at matching.
What I'd like is to do a left merge merging Table 2 onto Table 1 using ID1 and create a new variable equal to 1 if a match was found using ID1. After that, I'd like to do a second left merge of Table 2 onto Table 1 using ID2, but only if a match wasn't found using ID1 (and create a second new variable equal to 1 if a match was found using ID2). Although not necessary, it would also be nice if there were a way to not actually do a merge, but have a third new variable equal to 1 if a match was found using ID2 but wasn't performed because a match had already been found using ID1.
Here's an example:
Table 1:
| ID1 | ID2 | Data1 |
|---|---|---|
| 1 | ab | 80 |
| 2 | . | 100 |
| . | b | 55 |
| 5 | x | 10 |
Table 2:
| ID1 | ID2 | Data2 |
|---|---|---|
| . | a | 12 |
| 1 | ab | 14 |
| 4 | b | 15 |
| 5 | . | 22 |
WANTED RESULT TABLE:
| ID1 | ID2 | Data1 | Data2 | Merged on ID1 | Merged on ID2 | Match with ID2 but not merged |
|---|---|---|---|---|---|---|
| 1 | ab | 80 | 14 | 1 | 0 | 1 |
| 2 | . | 100 | . | 0 | 0 | 0 |
| . | b | 55 | 15 | 0 | 1 | 0 |
| 5 | x | 10 | 22 | 1 | 0 | 0 |
There are a lot of differences between a SQL and SAS datastep approach.
Using a SAS datastep Merge SAS(R) 9.3 Statements: Reference and the IN= dataset option SAS(R) 9.3 Data Set Options: Reference is given you quite a lot of requested logic.
What you are really wanting with matched/unmatched observations or possible using a table-lookup on the additional identifier is not clear.
As you are possible in a data-cleansing stage it could you are also not knowing all details. By that you need to investigate.
This gets you your required result:
data table1;
input ID1 ID2:$ Data1;
cards;
1 ab 80
2 . 100
. b 55
5 x 10
;
run;
data table2;
input ID1 ID2:$ Data2;
cards;
. a 12
1 ab 14
4 b 15
5 . 22
;
run;
proc sort data=table1;
by ID1;
run;
proc sort data=table2;
by ID1;
run;
data
merge1 (keep=ID1 ID2 data1 data2 var1 var3)
rest1 (keep=ID1 ID2 data1)
rest2 (keep=ID1 ID21 data2 rename=(ID21=ID2))
;
merge
table1 (in=a)
table2 (
in=b
rename=(ID2=ID21)
)
;
by ID1;
if a and b and ID1 ne .
then do;
var1 = 1;
if ID2 = ID21
then var3 = 1;
else var3 = 0;
output merge1;
end;
else do;
var1 = 0;
if a
then output rest1;
if b
then do;
ID2 = ID21;
output rest2;
end;
end;
run;
proc sort data=rest1;
by ID2;
run;
proc sort data=rest2;
by ID2;
run;
data
merge2 (drop=ID11)
rest3 (drop=ID11)
;
merge
rest1 (in=a)
rest2 (in=b rename=(ID1=ID11))
;
by ID2;
if a and b
then do;
var2 = 1;
output merge2;
end;
else if ID1 ne .
then do;
if a then output rest3;
else do;
ID1 = ID11;
output rest3;
end;
end;
run;
data result;
format ID1 ID2 data1 data2 var1 var2 var3; *puts the vars in order;
set merge1 merge2 rest3;
if var1 = . then var1 = 0;
if var2 = . then var2 = 0;
if var3 = . then var3 = 0;
run;
var1 = matched by ID1
var2 = matched by ID2;
var3 = ID2 would have matched, but ID1 took preference
var2 =
While you could probably achieve this with hashing, my inclination is to use formats instead. This solution will fail if Table 2 contains duplicate values for either ID1 or ID2, but then you would expect that to be a problem case. This code assumes that ID1 and ID2 are both character, although that issue can be circumvented if necessary.
First, create two formats from Table 2:
data table2_revised;
set table2 end=done;
length label $ 9;
label=data2;
fmtname='$id1fmt';
start=id1;
output;
start=id2;
output;
if done;
hlo='O';
label='Not Found';
fmtname='$id1fmt';
output;
fmtname='$id2fmt';
output;
run;
proc sort data=table2_revised;
by fmtname;
run;
proc format cntlin=table2_revised;
run;
This gives you 2 formats to work with, $id1fmt and $id2fmt, which translate ID1 and ID2 respectively into the matching value for DATA2. There might be a little more complication if DATA 2 is numeric ... it might need to be converted to character along these lines:
label = put(data2, best9.);
At any rate, once you have the formats, the hard part is over. You can use the formats along these lines:
data want;
set table1;
match_from_id1 = put(id1, $id1fmt.);
match_from_id2 = put(id2, $id2fmt.);
run;
Then you can play with the two new variables in any way you see fit.
Good luck.
April 27 – 30 | Gaylord Texan | Grapevine, Texas
Walk in ready to learn. Walk out ready to deliver. This is the data and AI conference you can't afford to miss.
Register now and lock in 2025 pricing—just $495!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.