Find observations that were eliminate in proc sql merge

teamlinerek · Posted 11-18-2023 07:37 PM

Hi Everyone,

I am merging two data sets - A and B;

Data Set A

firstname secondnameinitial thirdname

John A Doe

James B Crow

Jim C Throw

Jonathan D Brown

Data Set B

Forename1 Forename2Initial Surname ID

John A Doe 12345

John A Doe 67890

James B Crow 15123

James B Crow 12518

Jim C Throw 30987

Jane D Scott 23484

I want matches with all of the information from both tables. I am merging on first name, middle initial, and last name using the following code:

proc sql;
create table want as
select * from have a
inner join have b
on (upper(trim(a.firstname))=upper(trim(b.Forename1))) and (upper(trim(a.thirdname))=upper(trim(b.Surname)))
and (upper(trim(a.SecondNameInitial))=upper(trim(b.Forename2Initial)));
quit;

My question is this:

My output is about 10,000 names. How do I identify the duplicate matches that were eliminated during the merge process due to identical names? I can manually look at additional information to determine which names should be properly matched.

Thank you!

Kurt_Bremser · Posted 11-19-2023 03:04 AM

You show data for datasets "A" and "B", but in your SQL you do a self join of a dataset "HAVE".

Further, SQL does not eliminate duplicate matches, it creates a cartesian join.

Please clarify what you really want, and post the expected result for the data you posted.

@teamlinerek wrote:

Hi Everyone,

I am merging two data sets - A and B;

Data Set A

firstname secondnameinitial thirdname

John A Doe

John A Doe

John A Doe

James B Crow

Jim C Throw

Jonathan D Brown

Data Set B

Forename1 Forename2Initial Surname ID

John A Doe 12345

John A Doe 67890

James B Crow 15123

James B Crow 12518

Jim C Throw 30987

Jane D Scott 23484

I want matches with all of the information from both tables. I am merging on first name, middle initial, and last name using the following code:

proc sql;
create table want as
select * from have a
inner join have b
on (upper(trim(a.firstname))=upper(trim(b.Forename1))) and (upper(trim(a.thirdname))=upper(trim(b.Surname)))
and (upper(trim(a.SecondNameInitial))=upper(trim(b.Forename2Initial)));
quit;

My question is this:

My output is about 10,000 names. How do I identify the duplicate matches that were eliminated during the merge process due to identical names? I can manually look at additional information to determine which names should be properly matched.

Thank you!

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Patrick · Posted 11-19-2023 11:39 PM

I believe the query you want is as below (not tested because you didn't provide SAS code to create such sample data).

proc sql;
  create table want as
  select a.*,b.* 
  from 
    (select distinct firstname,thirdname,SecondNameInitial from table_1) a
    inner join 
    table_2 b
      on 
            (upper(trim(a.firstname))         = upper(trim(b.Forename1))) 
        and (upper(trim(a.thirdname))         = upper(trim(b.Surname)))
        and (upper(trim(a.SecondNameInitial)) = upper(trim(b.Forename2Initial)))
  ;
quit;

If you don't first dedup your table 1 then the relationship between the tables is many:many which is almost never a good situation.

Find observations that were eliminate in proc sql merge

Re: Find observations that were eliminate in proc sql merge

Re: Find observations that were eliminate in proc sql merge

Find observations that were eliminate in proc sql merge

Re: Find observations that were eliminate in proc sql merge

Re: Find observations that were eliminate in proc sql merge

Ready to join fellow brilliant minds for the SAS Hackathon?

Click image to register for webinar

Classroom Training Available!