05-18-2016 07:23 PM
I have an excel that I imported into SAS. The excel file contains information on companies that I need to separate into columns and then I want to match the observation on those company names that I have in another data file.
There are at least 2 company names in each cell and can be up to 19 company names. Ex;
row1 Microsoft Inc
row 2 General electric
Row 3 General motors
What I want is ---->
cname1 cname2 cname3
row1 Microsoft Inc Google Inc
row2 General electric General Motors
row3 General motors Ford General electric
AND THEN what I want to do is match the observation on company names stored in another file. However, the company names in the other file wont always be exact matches per se.....
e.g., Microsoft Inc = Microsoft or Google inc = Google or Ford = Ford motors or Motorola Inc = motorola communications or du pont = ei dupont de nemours
Ive attached some screenshots so that you can get a sense of what Im working with..Thanks for any help in advance!
05-18-2016 10:16 PM
05-18-2016 10:37 PM
Fuzzy matching isn't straightforward.
For example will you be matching on the various variables you've now split out or only the first one.
What happens if you have an exact match? How close is good enough in fuzzy matching for you? It's a bit of a trial and error process that gets customized to your particular situation. Do you want it iterative, first loop looks for exact, second uses soundex, third uses compged, etc?
There's a bunch of SAS papers on the topic available here:
05-19-2016 12:22 AM
Thanks for your reply.
I will be matching on both of the variables eventually, but only one at a time.
I now have two datasets, each have a firm name and a year. And I need to merge them on those two variables because I want to merge in the rest of the information that each observation contains. They are firm names, so the names are similar, but not always exact.
Data set 1 Data set 2
Google Inc Google
Ford Motor Ford
Motorola holdings Motorola solutions
Sometimes they will be an exact match, sometimes they will not be. I would like a reasonable error variance, but obviously not too much where it's creating too many unnecessary matches. And yes, ideally it would be iterative. Bear in mind that I am by no means an advanced programmer, but do use sas to get around (probably often using simpler code in more steps than someone else could do in one step). I usually use a command like this to merge
create table adtr.a1 as select a.*, b.address1, b.address2
from adtr.stage as a left join adtr.test as b
on a.code= b.code and a.year=b.year;
I would like to fuzzy match on firm name and also on year.
Thanks for your help!
05-19-2016 12:43 AM
Sometimes they will be an exact match, sometimes they will not be. I would like a reasonable error variance, but obviously not too much where it's creating too many unnecessary matches. And yes, ideally it would be iterative.
Thanks for your help!
Define "reasonable error variance"?
05-19-2016 12:46 AM
See the solution from Fried Egg here:
Another alternative is to clean up your names/standardize them somehow so you can do an exact merge.
05-18-2016 10:42 PM
Fuzzy matching is very tough task. Here is an example :
data have; input stores $32.; cards; ABC STORE APPLE COMPUTERS MICROSOFT CORPORATION COSTCO STORE WALMART SAMS CLUB ; run; proc sql; select a.*,b.stores as matched_stores from have as a,have as b group by a.stores having spedis(a.stores,b.stores)=min(spedis(a.stores,b.stores)); quit;
05-19-2016 12:14 AM
05-19-2016 04:54 AM
How big are these two tables ?
Suggestion is spliting the first table adtr.stage into lots of small tables ,and make a macro go through all these sub-table,and remerge them all back together .