Re: Merge (conditional?) with markers?

yeaforme · Posted 05-09-2014 01:17 AM

I normally use proc sql for merging tables. What I have is 2 tables that I want to merge together. Both tables have 2 IDs (call them ID1 and ID2). Because of time differences, etc. neither ID is 100% accurate at matching.

What I'd like is to do a left merge merging Table 2 onto Table 1 using ID1 and create a new variable equal to 1 if a match was found using ID1. After that, I'd like to do a second left merge of Table 2 onto Table 1 using ID2, but only if a match wasn't found using ID1 (and create a second new variable equal to 1 if a match was found using ID2). Although not necessary, it would also be nice if there were a way to not actually do a merge, but have a third new variable equal to 1 if a match was found using ID2 but wasn't performed because a match had already been found using ID1.

Here's an example:

Table 1:

ID1	ID2	Data1
1	ab	80
2	.	100
.	b	55
5	x	10

Table 2:

ID1	ID2	Data2
.	a	12
1	ab	14
4	b	15
5	.	22

WANTED RESULT TABLE:

ID1	ID2	Data1	Data2	Merged on ID1	Merged on ID2	Match with ID2 but not merged
1	ab	80	14	1	0	1
2	.	100	.	0	0	0
.	b	55	15	0	1	0
5	x	10	22	1	0	0

jakarman · Posted 05-09-2014 02:37 AM

There are a lot of differences between a SQL and SAS datastep approach.

Using a SAS datastep Merge SAS(R) 9.3 Statements: Reference and the IN= dataset option SAS(R) 9.3 Data Set Options: Reference is given you quite a lot of requested logic.

What you are really wanting with matched/unmatched observations or possible using a table-lookup on the additional identifier is not clear.

As you are possible in a data-cleansing stage it could you are also not knowing all details. By that you need to investigate.

---->-- ja karman --<-----

Kurt_Bremser · Posted 05-09-2014 04:55 AM

This gets you your required result:

data table1;

input ID1 ID2:$ Data1;

cards;

1 ab 80

2 . 100

. b 55

5 x 10

;

run;

data table2;

input ID1 ID2:$ Data2;

cards;

. a 12

1 ab 14

4 b 15

5 . 22

;

run;

proc sort data=table1;

by ID1;

run;

proc sort data=table2;

by ID1;

run;

data

merge1 (keep=ID1 ID2 data1 data2 var1 var3)

rest1 (keep=ID1 ID2 data1)

rest2 (keep=ID1 ID21 data2 rename=(ID21=ID2))

;

merge

table1 (in=a)

table2 (

in=b

rename=(ID2=ID21)

)

;

by ID1;

if a and b and ID1 ne .

then do;

var1 = 1;

if ID2 = ID21

then var3 = 1;

else var3 = 0;

output merge1;

end;

else do;

var1 = 0;

if a

then output rest1;

if b

then do;

ID2 = ID21;

output rest2;

end;

run;

proc sort data=rest1;

by ID2;

run;

proc sort data=rest2;

by ID2;

run;

data

merge2 (drop=ID11)

rest3 (drop=ID11)

;

merge

rest1 (in=a)

rest2 (in=b rename=(ID1=ID11))

;

by ID2;

if a and b

then do;

var2 = 1;

output merge2;

end;

else if ID1 ne .

then do;

if a then output rest3;

else do;

ID1 = ID11;

output rest3;

end;

run;

data result;

format ID1 ID2 data1 data2 var1 var2 var3; *puts the vars in order;

set merge1 merge2 rest3;

if var1 = . then var1 = 0;

if var2 = . then var2 = 0;

if var3 = . then var3 = 0;

run;

var1 = matched by ID1

var2 = matched by ID2;

var3 = ID2 would have matched, but ID1 took preference

var2 =

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Astounding · Posted 05-09-2014 10:06 AM

While you could probably achieve this with hashing, my inclination is to use formats instead. This solution will fail if Table 2 contains duplicate values for either ID1 or ID2, but then you would expect that to be a problem case. This code assumes that ID1 and ID2 are both character, although that issue can be circumvented if necessary.

First, create two formats from Table 2:

data table2_revised;

set table2 end=done;

length label $ 9;

label=data2;

fmtname='$id1fmt';

start=id1;

output;

start=id2;

output;

if done;

hlo='O';

label='Not Found';

fmtname='$id1fmt';

output;

fmtname='$id2fmt';

output;

run;

proc sort data=table2_revised;

by fmtname;

run;

proc format cntlin=table2_revised;

run;

This gives you 2 formats to work with, $id1fmt and $id2fmt, which translate ID1 and ID2 respectively into the matching value for DATA2. There might be a little more complication if DATA 2 is numeric ... it might need to be converted to character along these lines:

label = put(data2, best9.);

At any rate, once you have the formats, the hard part is over. You can use the formats along these lines:

data want;

set table1;

match_from_id1 = put(id1, $id1fmt.);

match_from_id2 = put(id2, $id2fmt.);

run;

Then you can play with the two new variables in any way you see fit.

Good luck.

Merge (conditional?) with markers?