Hi all,
I'm doing firm name fuzzy matching. I have a table called compustat, there is a column called conm which contains company names. I also have a table called firm_ref, which contains a column called company which contains company names. After turning them into case sensitive, drop common firm suffix (e.g. co, corp, etc), dropping special letters ($&,/+-, etc), I found the same firm could still have different names in the two column due to the existence of words without real meaning.
For instance, we could have a firm name = "TIFFANY LUXURY RETAIL" in compustat, and a firm name = "TIFFANY LUXURY AND RETAIL" in firm_ref, I want to get rid of words such as "and" that does not related to the identity of the firm, how should I do it?
I know how to remove a word from a string if this word belongs to a list (e.g. a list of such meaningless words), but my problem is I do not have such a list. Is there any package like in python that provides a list of such words?
Thank you!
Presumably you are trying to match Compustat company data with data from another vendor of company-related data.
Before you work directly on fuzzy matching of names, do you have other information that might help, like phone numbers, zip codes, addresses, tickers? If so, you might be able to reduce the population of unmatched data for submission to the fuzzy matching process. Those other variables might also be helpful in supporting or invalidating fuzzy matches.
I am unaware of a ready-made list of words that you might use for removal for fuzzy matching of company names. Clearly your examples "and" and "the" should be included.
You might be able to leverage your data to build your own list. You could do a frequency table of all words found in CONM, by word length, and a similar table for COMPANY. That might suggest some words that wouldn't otherwise seem likely. I imagine most words on the removal list would be short.
As a general rule, I presume that false positives (i.e. erroneous fuzzy matches) would be a far more significant problem than false negatives.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.