BookmarkSubscribeRSS Feed
yeaforme
Calcite | Level 5

I normally use proc sql for merging tables.  What I have is 2 tables that I want to merge together.  Both tables have 2 IDs (call them ID1 and ID2).  Because of time differences, etc. neither ID is 100% accurate at matching.

What I'd like is to do a left merge merging Table 2 onto Table 1 using ID1 and create a new variable equal to 1 if a match was found using ID1.  After that, I'd like to do a second left merge of Table 2 onto Table 1 using ID2, but only if a match wasn't found using ID1 (and create a second new variable equal to 1 if a match was found using ID2).  Although not necessary, it would also be nice if there were a way to not actually do a merge, but have a third new variable equal to 1 if a match was found using ID2 but wasn't performed because a match had already been found using ID1.

Here's an example:

Table 1:

ID1ID2Data1
1ab80
2.100
.b55
5x10

Table 2:

ID1ID2Data2
.a12
1ab14
4b15
5.22

WANTED RESULT TABLE:

ID1ID2Data1Data2Merged on ID1Merged on ID2Match with ID2 but not merged
1ab8014101
2.100.000
.b5515010
5x1022100
3 REPLIES 3
jakarman
Barite | Level 11

There are a lot of differences between a SQL and SAS datastep approach.

Using a SAS datastep Merge SAS(R) 9.3 Statements: Reference and the IN= dataset option SAS(R) 9.3 Data Set Options: Reference is given you quite a lot of requested logic.

What you are really wanting with matched/unmatched observations or possible using a table-lookup on the additional identifier is not clear.

As you are possible in a data-cleansing stage it could you are also not knowing all details. By that you need to investigate.

---->-- ja karman --<-----
Kurt_Bremser
Super User

This gets you your required result:

data table1;

input ID1 ID2:$ Data1;

cards;

1 ab 80

2 . 100

. b 55

5 x 10

;

run;

data table2;

input ID1 ID2:$ Data2;

cards;

. a 12

1 ab 14

4 b 15

5 . 22

;

run;

proc sort data=table1;

by ID1;

run;

proc sort data=table2;

by ID1;

run;

data

  merge1 (keep=ID1 ID2 data1 data2 var1 var3)

  rest1 (keep=ID1 ID2 data1)

  rest2 (keep=ID1 ID21 data2 rename=(ID21=ID2))

;

merge

  table1 (in=a)

  table2 (

    in=b

    rename=(ID2=ID21)

  )

;

by ID1;

if a and b and ID1 ne .

then do;

  var1 = 1;

  if ID2 = ID21

  then var3 = 1;

  else var3 = 0;

  output merge1;

end;

else do;

  var1 = 0;

  if a

  then output rest1;

  if b

  then do;

    ID2 = ID21;

    output rest2;

  end;

end;

run;

proc sort data=rest1;

by ID2;

run;

proc sort data=rest2;

by ID2;

run;

data

  merge2 (drop=ID11)

  rest3 (drop=ID11)

;

merge

  rest1 (in=a)

  rest2 (in=b rename=(ID1=ID11))

;

by ID2;

if a and b

then do;

  var2 = 1;

  output merge2;

end;

else if ID1 ne .

then do;

  if a then output rest3;

  else do;

    ID1 = ID11;

    output rest3;

  end;

end;

run;

data result;

format ID1 ID2 data1 data2 var1 var2 var3; *puts the vars in order;

set merge1 merge2 rest3;

if var1 = . then var1 = 0;

if var2 = . then var2 = 0;

if var3 = . then var3 = 0;

run;

var1 = matched by ID1

var2 = matched by ID2;

var3 = ID2 would have matched, but ID1 took preference

var2 =

Astounding
PROC Star

While you could probably achieve this with hashing, my inclination is to use formats instead.  This solution will fail if Table 2 contains duplicate values for either ID1 or ID2, but then you would expect that to be a problem case.  This code assumes that ID1 and ID2 are both character, although that issue can be circumvented if necessary.

First, create two formats from Table 2:

data table2_revised;

  set table2 end=done;

  length label $ 9;

  label=data2;

  fmtname='$id1fmt';

  start=id1;

  output;

  start=id2;

  output;

  if done;

  hlo='O';

  label='Not Found';

  fmtname='$id1fmt';

  output;

  fmtname='$id2fmt';

  output;

run;

proc sort data=table2_revised;

  by fmtname;

run;

proc format cntlin=table2_revised;

run;

This gives you 2 formats to work with, $id1fmt and $id2fmt, which translate ID1 and ID2 respectively into the matching value for DATA2.  There might be a little more complication if DATA 2 is numeric ... it might need to be converted to character along these lines:

label = put(data2, best9.);

At any rate, once you have the formats, the hard part is over.  You can use the formats along these lines:

data want;

  set table1;

  match_from_id1 = put(id1, $id1fmt.);

  match_from_id2 = put(id2, $id2fmt.);

run;

Then you can play with the two new variables in any way you see fit.

Good luck.

sas-innovate-2024.png

Don't miss out on SAS Innovate - Register now for the FREE Livestream!

Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.

 

Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 3 replies
  • 1288 views
  • 0 likes
  • 4 in conversation