BookmarkSubscribeRSS Feed
Eileen1496
Obsidian | Level 7

Hi all,

 

I'm doing firm name fuzzy matching. I have a table called compustat, there is a column called conm which contains company names. I also have a table called firm_ref, which contains a column called company which contains company names. After turning them into case sensitive, drop common firm suffix (e.g. co, corp, etc), dropping special letters ($&,/+-, etc), I found the same firm could still have different names in the two column due to the existence of words without real meaning. 

For instance, we could have a firm name = "TIFFANY LUXURY RETAIL" in compustat, and a firm name = "TIFFANY LUXURY AND RETAIL" in firm_ref, I want to get rid of words such as "and" that does not related to the identity of the firm, how should I do it?

I know how to remove a word from a string if this word belongs to a list (e.g. a list of such meaningless words), but my problem is I do not have such a list. Is there any package like in python that provides a list of such words?

 

Thank you!

4 REPLIES 4
mkeintz
PROC Star

Presumably you are trying to match Compustat company data with data from another vendor of company-related data.

 

Before you work directly on fuzzy matching of names, do you have other information that might help, like phone numbers, zip codes, addresses, tickers?  If so, you might be able to reduce the population of unmatched data for submission to the fuzzy matching process.  Those other variables might also be helpful in supporting or invalidating fuzzy matches.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Eileen1496
Obsidian | Level 7
Yes I do have some firm identifiers, but they miss a lot in the external
dataset. I already conduct some rounds of matches using identifiers like
GVKEY to reduce the dataset for fuzzy name matching. I also have NAICS
code, but I think they are complement criteria to verify the matching
accuracy after name matching.

And for name matching, I get rid of common firm suffixes (e.g. corp, ltd,
etc) , special characters, parenthesis (like TIFFANY(UK) becomes TIFFANY),
so I can conduct some exact name matching. Then for the rest, I could
further match if I can get rid of words like "and", "the", etc.
mkeintz
PROC Star

I am unaware of a ready-made list of words that you might use for removal for fuzzy matching of company names.  Clearly your examples "and" and "the" should be included.

 

You might be able to leverage your data to build your own list.  You could do a frequency table of all words found in CONM, by word length, and a similar table for COMPANY.  That might suggest some words that wouldn't otherwise seem likely.  I imagine most words on the removal list would be short.

 

As a general rule, I presume that false positives (i.e. erroneous fuzzy matches) would be a far more significant problem than false negatives.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Eileen1496
Obsidian | Level 7
Getting a frequency table and then decide my own list is a good advice! Indeed when I try to match using tfidf before in Python, even two firms with different names, as long as they have one words in common they have very high score. I need to figure out this later.

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 4 replies
  • 1470 views
  • 0 likes
  • 2 in conversation