SAS Data Integration Studio, DataFlux Data Management Studio, SAS/ACCESS, SAS Data Loader for Hadoop and others

Separating names in a cell and then fuzzy merging

Reply
Contributor
Posts: 51

Separating names in a cell and then fuzzy merging

Hello everyone,

 

I have an excel that I imported into SAS. The excel file contains information on companies that I need to separate into columns and then I want to match the observation on those company names that I have in another data file. 

 

There are at least 2 company names in each cell and can be up to 19 company names. Ex;

 

                   cname 

 

row1            Microsoft Inc

                    Google Inc 

 

row 2           General electric

                    General motors

 

Row 3          General motors

                    Ford 

                    General electric

 

What I want is ----> 

 

             cname1                  cname2                    cname3

row1      Microsoft Inc           Google Inc  

row2      General electric       General Motors

row3      General motors        Ford                    General electric 

 

 

AND THEN what I want to do is match the observation on company names stored in another file. However, the company names in the other file wont always be exact matches per se.....   

 

e.g.,   Microsoft Inc = Microsoft            or      Google inc = Google         or          Ford = Ford motors           or    Motorola Inc = motorola communications           or         du pont =  ei dupont de nemours

 

 

Ive attached some screenshots so that you can get a sense of what Im working with..Thanks for any help in advance! 

 


Screenshot (25).png
Contributor
Posts: 51

Re: Separating names in a cell and then fuzzy merging


Screenshot (2).png
Super User
Posts: 10,023

Re: Separating names in a cell and then fuzzy merging

It depends on the delimiter between these company name. Try take  '0D0A'x  as delimiter.

Contributor
Posts: 51

Re: Separating names in a cell and then fuzzy merging

Thanks, Ksharp - that part worked great! Still struggling with the fuzzy match part, however.
Super User
Posts: 10,023

Re: Separating names in a cell and then fuzzy merging

There are a couple of functions can do fuzzy matching in SAS.

Some like  SPEDIS() , COMPLEV(),COMPGEN() ...........

Contributor
Posts: 51

Re: Separating names in a cell and then fuzzy merging

Yes, I did some reading about those.. but I dont know how to go about this / how it looks. Im fairly used to these type of merge functions:

proc sql;
create table adtr.a1 as select a.*, b.acode, b.bcode
from adtr.bld2 as a left join adtr.all4 as b
on a.year_code = b.year_code;
quit;


However, I would want to merge on the firm name (which wouldnt always be identical) and the year.
Super User
Posts: 19,787

Re: Separating names in a cell and then fuzzy merging

Fuzzy matching isn't straightforward.

 

For example will you be matching on the various variables you've now split out or only the first one.


What happens if you have an exact match? How close is good enough in fuzzy matching for you? It's a bit of a trial and error process that gets customized to your particular situation. Do you want it iterative, first loop looks for exact, second uses soundex, third uses compged, etc?

 

There's a bunch of SAS papers on the topic available here:

http://www.lexjansen.com/search/searchresults.php?q=fuzzy%20matching

Contributor
Posts: 51

Re: Separating names in a cell and then fuzzy merging

Thanks for your reply. 

 

I will be matching on both of the variables eventually, but only one at a time. 

 

I now have two datasets, each have a firm name and a year. And I need to merge them on those two variables because I want to merge in the rest of the information that each observation contains. They are firm names, so the names are similar, but not always exact. 

 

Something like 

Data set 1                                       Data set 2

Firm                                                 Firm 

Google Inc                                     Google 

Ford Motor                                      Ford

Motorola holdings                            Motorola solutions 

 

 

Sometimes they will be an exact match, sometimes they will not be. I would like a reasonable error variance, but obviously not too much where it's creating too many unnecessary matches. And yes, ideally it would be iterative. Bear in mind that I am by no means an advanced programmer, but do use sas to get around (probably often using simpler code in more steps than someone else could do in one step). I usually use a command like this to merge 

 

proc sql;
create table adtr.a1 as select a.*, b.address1, b.address2
from adtr.stage as a left join adtr.test as b
on a.code= b.code and a.year=b.year; 
quit;

 

 I would like to fuzzy match on firm name and also on year. 

 

Thanks for your help! 

Super User
Posts: 19,787

Re: Separating names in a cell and then fuzzy merging


r4321 wrote:

 

 

Sometimes they will be an exact match, sometimes they will not be. I would like a reasonable error variance, but obviously not too much where it's creating too many unnecessary matches. And yes, ideally it would be iterative.

 

Thanks for your help! 


Define "reasonable error variance"?

Super User
Posts: 19,787

Re: Separating names in a cell and then fuzzy merging

See the solution from Fried Egg here:

 

https://communities.sas.com/t5/SAS-Procedures/Name-matching/td-p/82780

 

Another alternative is to clean up your names/standardize them somehow so you can do an exact merge.

 

Contributor
Posts: 51

Re: Separating names in a cell and then fuzzy merging

At least a few letters, but the important thing is that it's matching on the main part of the name. e.g.,
Ford = ford motor
Super User
Posts: 10,023

Re: Separating names in a cell and then fuzzy merging

Fuzzy matching is very tough task. Here is an example :

 

 

data have;
input stores $32.;
cards;
ABC STORE
APPLE COMPUTERS
MICROSOFT CORPORATION
COSTCO STORE
WALMART
SAMS CLUB
;
run;
proc sql;
 select a.*,b.stores as matched_stores
  from have as a,have as b
   group by a.stores
    having spedis(a.stores,b.stores)=min(spedis(a.stores,b.stores));
quit;
Contributor
Posts: 51

Re: Separating names in a cell and then fuzzy merging

Took a very long time to run ..Maybe an hour... and screwed up on an error...Kinda stuck on this.

3092 proc sql;
3093 select a.*,b.firm1 as matched_stores
3094 from adtr.stage as a,adtr.test as b
3095 group by a.conm
3096 having spedis(a.conm,b.firm1)=min(spedis(a.conm,b.firm1));
NOTE: The execution of this query involves performing one or more Cartesian product joins that
can not be optimized.
NOTE: The query requires remerging summary statistics back with the original data.
NOTE: Writing HTML Body file: sashtml.htm
ERROR: Sort execution failure.
3097 quit;
NOTE: The SAS System stopped processing this step because of errors.
NOTE: PROCEDURE SQL used (Total process time):
real time 1:08:53.67
cpu time 26:32.45

Super User
Posts: 10,023

Re: Separating names in a cell and then fuzzy merging

How big are these two tables ?

Suggestion is spliting the first table adtr.stage into lots of small tables ,and make a macro go through all these sub-table,and remerge them all back together .

Ask a Question
Discussion stats
  • 13 replies
  • 516 views
  • 1 like
  • 3 in conversation