SAS Data Integration Studio, DataFlux Data Management Studio, SAS/ACCESS, SAS Data Loader for Hadoop and others

Merging two databases with two common variables, one of common variables have inexact characters

Reply
New Contributor FA1
New Contributor
Posts: 2

Merging two databases with two common variables, one of common variables have inexact characters

Hello,

I want two merge two databases:1) KLD, 2)Compustat

They have two common variables: 1)Year and 2)Company Name. 

Because my data are panel data I have to use both common variables to merge. 

However, company name is not exactly similar in two databases (they may different in terms of  capital letters, abbreviations, spelling errors, etc.)

Could you please let me know the coding of how merging these two databases.

I am not familiar with SAS, but I have found out SAS can solve this problem. 

If somebody give me the whole codes for this purpose, I would be very thankful.

 

Kind regards,

FA1 

 

Super Contributor
Posts: 498

Re: Merging two databases with two common variables, one of common variables have inexact characters

You will have to fix all those defects to merge the datasets. Depending on the modules you have licensed, this is more or less painful.

  1. Change company name to all upcase chars.
  2. Replace abbreviations
  3. Find and fix spelling errors.

Sorry, but data-cleaning is not the most joyful task one can have.

Super User
Posts: 9,549

Re: Merging two databases with two common variables, one of common variables have inexact characters

Your first step will be to get to know your data and identify the data quality problems in detail. From there you can develop the necessary steps to rectify them.

Run proc contents first to see if you have discrepancies in terms of variable types and variable sizes.

Then run proc sort with nodupkey on your datasets (use the out= option to create new, intermediate datasets) and then compare the results, and look if you have multiple entries for the same business object.

 

Now, if you have SAS Data Quality licensed, this tool can do most of this work for you.

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
How to post code
New Contributor FA1
New Contributor
Posts: 2

Re: Merging two databases with two common variables, one of common variables have inexact characters

Posted in reply to KurtBremser

Dear KurBremser                    

Thank you so Much.

Yes, SAS Data Quality Server is the best option for me to solve this problem.

Actually there are two ways to overcome this problem 1)Scoring method, 2) Fuzzy matching using SAS Data Quality Server. I have attached a pdf that explains these two ways very well. It was very helpful for me. 

 

Kind regards,

FA1

Trusted Advisor
Posts: 1,283

Re: Merging two databases with two common variables, one of common variables have inexact characters

KLD and Compustat also have other common variables:  CIK (Central Index Key, issued by the SEC), and stock ticker.  Now stock ticker for a given company can change over time, so you would have to be aware of synchronization issues.  And the database vendors may not have fully populated CIK. But I'd suggest getting those variables for matching purposes.

Ask a Question
Discussion stats
  • 4 replies
  • 163 views
  • 0 likes
  • 4 in conversation