BookmarkSubscribeRSS Feed
r4321
Pyrite | Level 9

Hello everyone,

 

I have an excel that I imported into SAS. The excel file contains information on companies that I need to separate into columns and then I want to match the observation on those company names that I have in another data file. 

 

There are at least 2 company names in each cell and can be up to 19 company names. Ex;

 

                   cname 

 

row1            Microsoft Inc

                    Google Inc 

 

row 2           General electric

                    General motors

 

Row 3          General motors

                    Ford 

                    General electric

 

What I want is ----> 

 

             cname1                  cname2                    cname3

row1      Microsoft Inc           Google Inc  

row2      General electric       General Motors

row3      General motors        Ford                    General electric 

 

 

AND THEN what I want to do is match the observation on company names stored in another file. However, the company names in the other file wont always be exact matches per se.....   

 

e.g.,   Microsoft Inc = Microsoft            or      Google inc = Google         or          Ford = Ford motors           or    Motorola Inc = motorola communications           or         du pont =  ei dupont de nemours

 

 

Ive attached some screenshots so that you can get a sense of what Im working with..Thanks for any help in advance! 

 


Screenshot (25).png
13 REPLIES 13
Ksharp
Super User

It depends on the delimiter between these company name. Try take  '0D0A'x  as delimiter.

r4321
Pyrite | Level 9
Thanks, Ksharp - that part worked great! Still struggling with the fuzzy match part, however.
Ksharp
Super User

There are a couple of functions can do fuzzy matching in SAS.

Some like  SPEDIS() , COMPLEV(),COMPGEN() ...........

r4321
Pyrite | Level 9
Yes, I did some reading about those.. but I dont know how to go about this / how it looks. Im fairly used to these type of merge functions:

proc sql;
create table adtr.a1 as select a.*, b.acode, b.bcode
from adtr.bld2 as a left join adtr.all4 as b
on a.year_code = b.year_code;
quit;


However, I would want to merge on the firm name (which wouldnt always be identical) and the year.
Reeza
Super User

Fuzzy matching isn't straightforward.

 

For example will you be matching on the various variables you've now split out or only the first one.


What happens if you have an exact match? How close is good enough in fuzzy matching for you? It's a bit of a trial and error process that gets customized to your particular situation. Do you want it iterative, first loop looks for exact, second uses soundex, third uses compged, etc?

 

There's a bunch of SAS papers on the topic available here:

http://www.lexjansen.com/search/searchresults.php?q=fuzzy%20matching

r4321
Pyrite | Level 9

Thanks for your reply. 

 

I will be matching on both of the variables eventually, but only one at a time. 

 

I now have two datasets, each have a firm name and a year. And I need to merge them on those two variables because I want to merge in the rest of the information that each observation contains. They are firm names, so the names are similar, but not always exact. 

 

Something like 

Data set 1                                       Data set 2

Firm                                                 Firm 

Google Inc                                     Google 

Ford Motor                                      Ford

Motorola holdings                            Motorola solutions 

 

 

Sometimes they will be an exact match, sometimes they will not be. I would like a reasonable error variance, but obviously not too much where it's creating too many unnecessary matches. And yes, ideally it would be iterative. Bear in mind that I am by no means an advanced programmer, but do use sas to get around (probably often using simpler code in more steps than someone else could do in one step). I usually use a command like this to merge 

 

proc sql;
create table adtr.a1 as select a.*, b.address1, b.address2
from adtr.stage as a left join adtr.test as b
on a.code= b.code and a.year=b.year; 
quit;

 

 I would like to fuzzy match on firm name and also on year. 

 

Thanks for your help! 

Reeza
Super User

@r4321 wrote:

 

 

Sometimes they will be an exact match, sometimes they will not be. I would like a reasonable error variance, but obviously not too much where it's creating too many unnecessary matches. And yes, ideally it would be iterative.

 

Thanks for your help! 


Define "reasonable error variance"?

Reeza
Super User

See the solution from Fried Egg here:

 

https://communities.sas.com/t5/SAS-Procedures/Name-matching/td-p/82780

 

Another alternative is to clean up your names/standardize them somehow so you can do an exact merge.

 

r4321
Pyrite | Level 9
At least a few letters, but the important thing is that it's matching on the main part of the name. e.g.,
Ford = ford motor
Ksharp
Super User

Fuzzy matching is very tough task. Here is an example :

 

 

data have;
input stores $32.;
cards;
ABC STORE
APPLE COMPUTERS
MICROSOFT CORPORATION
COSTCO STORE
WALMART
SAMS CLUB
;
run;
proc sql;
 select a.*,b.stores as matched_stores
  from have as a,have as b
   group by a.stores
    having spedis(a.stores,b.stores)=min(spedis(a.stores,b.stores));
quit;
r4321
Pyrite | Level 9
Took a very long time to run ..Maybe an hour... and screwed up on an error...Kinda stuck on this.

3092 proc sql;
3093 select a.*,b.firm1 as matched_stores
3094 from adtr.stage as a,adtr.test as b
3095 group by a.conm
3096 having spedis(a.conm,b.firm1)=min(spedis(a.conm,b.firm1));
NOTE: The execution of this query involves performing one or more Cartesian product joins that
can not be optimized.
NOTE: The query requires remerging summary statistics back with the original data.
NOTE: Writing HTML Body file: sashtml.htm
ERROR: Sort execution failure.
3097 quit;
NOTE: The SAS System stopped processing this step because of errors.
NOTE: PROCEDURE SQL used (Total process time):
real time 1:08:53.67
cpu time 26:32.45

Ksharp
Super User

How big are these two tables ?

Suggestion is spliting the first table adtr.stage into lots of small tables ,and make a macro go through all these sub-table,and remerge them all back together .

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to connect to databases in SAS Viya

Need to connect to databases in SAS Viya? SAS’ David Ghan shows you two methods – via SAS/ACCESS LIBNAME and SAS Data Connector SASLIBS – in this video.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 13 replies
  • 1993 views
  • 1 like
  • 3 in conversation