10-12-2013 06:58 PM
I have two data sets. I want to merge them with company names. But there are some variations in the company names. For example, mgmt vs. management, company vs. co, limited vs. ltd, and vs. &, etc. Also, there might be extra blanks in the names. Can someone tell me how to deal with this issue? And some reference I can use to do this merge? THanks.
10-13-2013 04:02 AM
there will be a supporting variable for company name in numeric (i think) if it is in this cases. check for that variable which will have some code for each company;
if there no such thing just make it through hard coding like,
if companyname='management' then companycode=1
in both datasets then you can merge them by this new variable created.
10-13-2013 06:18 PM
I would suggest you undertake some analysis to identify and replace strings such as 'co' with 'company' and the like. SAS(R) 9.2 Language Reference: Dictionary, Fourth Edition may help you replace these.
As far as merging by name is concerned you could sing a fuzzy merge using COMPGED http://www.nesug.org/proceedings/nesug07/ap/ap23.pdf. This will require some analysis and feel for your data, but I have has success with it in the past.
10-13-2013 07:37 PM
You need to clean the data. But that is not easy. So there are two ways.
1. sort and summarise the data by the company name. So you should be able to see the differences and then use this result to clean the company names using tranwrd function applying to the main data.
You can use all of the above in the same data step.
after the cleaning again summarise and check and clean. You might have to summarise few times until the entire data is cleaned.
2. Depending on you data, you can also follow the above procedure but before doing that you can split the company name in two or three parts using the scan function and then summarise on those words. This way you should be able to find the different spellings and clean them.