PROC SQL to find nonmatches

PaulN · Posted 03-04-2025 03:32 PM

/* Temp data set has 2594 unique names (i.e., 2594 unique observations) */

/* Create a table for matching names and retrieve firm ID (permno) from CSRP database */

/* Permno is the best way to search CRSP database. */

/* Data set Matches has 1277 observations */

/* Bobnamesoriginal is the company name I was given by my coauthor. */

/* Bobnamesnew is the company name after I clean the name to increase chances of */

/* finding a match in the CSRP database. */

/* Company_Name_Header is the name of the company inf the CSRP database. */

proc sql;

create table Matches as select b.permno, a.bobnamesnew, b.company_name_header,

a.bobnamesoriginal from work.Temp as a inner join names.crspnames as b on

a.bobnamesnew=b.company_name_header;

quit;

/* searching for duplicates */

/* No duplicates found */

/* 1277 unique permno */

proc sort data=matches nodupkey;

by permno;

run;

/* Create a table for nonmatches */

/* Since temp has 2594 unique names and I was able to match 1277 names, */

/* there should be 1317 = 2594 - 1277 nonmatch names */

/* However, the nonmatches data set has 1375 observations/nonmatches */

/* What am I doing wrong? I can't figure it out. */

proc sql;

create table nonmatches as select a.bobnamesnew from work.temp as a left join

names.crspnames as b on a.bobnamesnew=b.company_name_header where

b.company_name_header is null;

quit;

/* searching for duplicates */

/* No duplicates found */

/* Nonmatches still has 1375 observations. */

proc sort data=nonmatches nodupkey;

by bobnamesnew;

run;

quickbluefish · Posted 03-04-2025 05:41 PM

It's a little hard to follow what you're doing - for one thing, you can comment out multiple lines like this to make things easier to read:

/*
Here is a
comment that is
3 lines long.  */

Merging / joining on strings (especially when they're just names that are entered freehand) is not ideal as you've discovered, but sometimes that's all you have. You could try using some sort of "fuzzy" merge like this:

https://blogs.sas.com/content/sgf/2021/09/21/fuzzy-matching/

As for the join, I think you will have an easier time assessing with a single LEFT join and then doing some counts on the resulting data:

PROC SQL;
create table want as
select b.permno, a.bobnamesNew, b.company_name_header, a.bobnamesOriginal
from
    work.Temp A
    LEFT JOIN
    (select distinct permno, company_name_header from names.CRSPnames) B
    on a.bobnamesNew=b.company_name_header;

title "# distinct bobNames without a match in CRSP";
select count(distinct a.bobnamesNew) from WANT where missing(permno);
title;
QUIT;

Again, you're going to either have to try a fuzzy merge or, more likely, actually manually correct the bobnames. If there are just things like differences in case, whitespace, special characters, etc., then you probably could make this more automated, but you'd have to provide some examples here in order for people to help.

LinusH · Posted 03-05-2025 06:15 AM

I usually do a full join, and then subset if any of the join columns have a missing value.

Data never sleeps

ballardw · Posted 03-05-2025 01:13 AM

I have to say that without actual example data sets and how this was actually done

/* Bobnamesnew is the company name after I clean the name to increase chances of      */
/* finding a match in the CSRP database.                                                                            */

that I might suspect the "cleaning" step of being part of a problem. I have seen some "cleaning" that created matches for other values in a data set. This may happen if you have original names that are related, such as subsidiaries. Which may have different PERMNO but similar names...

Is there a reason there is no code showing attempted matches or searches using the PERMNO?

PROC SQL to find nonmatches

Re: PROC SQL to find nonmatches

Re: PROC SQL to find nonmatches

Re: PROC SQL to find nonmatches

PROC SQL to find nonmatches

Re: PROC SQL to find nonmatches

Re: PROC SQL to find nonmatches

Re: PROC SQL to find nonmatches

Registration is open