Solved: Re: Matching Company Names

cmoore · Posted 02-23-2018 07:45 AM

Hi,

I have two data sources of information that I want to merge together. The common matchkey across both sources is the Client Name. However, there are different variations of spellings for the same Client. Please see below a small example of the two datasets I have and what I would like it to be transformed to;

Have1		Have2

Client_Name	Sector	Client_Name	Sub_Sector
3i Group Plc	Global Financial Investors	3i	Private Equity
3m Company	Industrials	3M Company (NYSE:MMM)	Manufacturing industries
Ag Barr Plc	Consumer and Healthcare Group	A.P. Møller - Maersk A/S (CPSE:MAERSK B)	Transport
Maersk	Energy, Resources and Infrastructure	AB Volvo (publ) (OM:VOLV B)	Automotive
Aa Plc	Industrials	Abertis	Infrastructure funds
Electrolux	Consumer and Healthcare Group	BT	Telecoms
Volvo	Industrials	3M Holding Company	Manufacturing industries
Abb Ltd	Industrials	AB Electrolux (publ) (OM:ELUX B)	Consumer
Abbott Laboratories	Consumer and Healthcare Group	RBS	Banks
Abbvie Inc	Consumer and Healthcare Group	ABB	Manufacturing industries
Aberdeen Asset Management Plc	Financial Institutions Group	AAM PLC (LSE:ADN)	Asset managers
Aberforth Smaller Companies	Financial Institutions Group	AbbVie Incorp	Health
Abertis Infraestructuras Sa	Global Financial Investors	Abbott Laboratories (NYSE:ABT)	Health
Abn Amro Group Nv	Financial Institutions Group	Abertis Infraestructuras, S.A. (BME:ABE)	Infrastructure funds
Britsh Telecommunications	Technology, Media and Telecoms	AG BAR	Consumer
Royal Bank of Scotland PLC	Financial Institutions Group	AA	Business Services
The Coca Cola Company	Consumer and Healthcare Group	Coca cola	Consumer

And the final output should look like;

Want

Client Name	Sector	Sub_Sector
3i Group Plc	Global Financial Investors	Private Equity
3m Company	Industrials	Manufacturing industries
Ag Barr Plc	Consumer and Healthcare Group	Consumer
Maersk	Energy, Resources and Infrastructure	Transport
Aa Plc	Industrials	Business Services
Electrolux	Consumer and Healthcare Group	Consumer
Volvo	Industrials	"Unknown"
Abb Ltd	Industrials	Manufacturing industries
Abbott Laboratories	Consumer and Healthcare Group	Health
Abbvie Inc	Consumer and Healthcare Group	Health
Aberdeen Asset Management Plc	Financial Institutions Group	Asset managers
Aberforth Smaller Companies	Financial Institutions Group	Infrastructure funds
Abertis Infraestructuras Sa	Global Financial Investors	Infrastructure funds
Abn Amro Group Nv	Financial Institutions Group	"Unknown"
Britsh Telecommunications	Technology, Media and Telecoms	Telecoms
Royal Bank of Scotland PLC	Financial Institutions Group	Banks
The Coca Cola Company	Consumer and Healthcare Group	Consumer

Is there a reliable way for me to fuzzy match across the different client names? Any help would be greatly appreciated.

Many thanks

Chris

ChrisNZ · Posted 02-27-2018 12:02 AM

@Reeza I would add one step to reduce the size of the cartesian product

6. Do fuzzy matching using COMPGED/COMPLEV/SOUNDS LIKE/SPEDIS and equality of first letter

7. Do fuzzy matching using COMPGED/COMPLEV/SOUNDS LIKE

COMPGED works best in my experience

High-Performance SAS Coding - Third Edition

View solution in original post

LinusH · Posted 02-23-2018 08:19 AM

How reliable do you need to be?

The most managed way to this is to use Data Management Studio.

Data never sleeps

cmoore · Posted 02-23-2018 09:07 AM

It doesn't have to be perfect to be honest. 70% plus maybe. Just need a good way to fuzzy match the example data set then take it from there.

LinusH · Posted 02-23-2018 11:10 AM

70% what kind of analysis can you do with that little accuracy?
Given your examples (and my limited knowledge in fuzzy logic) I think together can apply some basic rules like matching on uppercase first two words (normalized), or create an extensive mapping table (or a combination of both).
Or don't reinvent the well a pay for some capable SW.

Data never sleeps

Reeza · Posted 02-23-2018 03:04 PM

This is a non-trivial problem because you end up doing N*M calculations which can be quite big.

Some tips:

1. Upcase everything

2. Change all & or other symbols such as @ from text

3. Ensure Co and/or LTD are standardized, i.e. LTD vs LTD. vs Limited vs Limited

4. Do an exact match and remove those records

5. Remove all spaces and try a second round of exact matches

6. Do fuzzy matching using COMPGED/COMPLEV/SOUNDS LIKE

ChrisNZ · Posted 02-27-2018 12:02 AM

@Reeza I would add one step to reduce the size of the cartesian product

6. Do fuzzy matching using COMPGED/COMPLEV/SOUNDS LIKE/SPEDIS and equality of first letter

7. Do fuzzy matching using COMPGED/COMPLEV/SOUNDS LIKE

COMPGED works best in my experience

High-Performance SAS Coding - Third Edition

cmoore · Posted 02-28-2018 06:13 AM

Thanks for the information. I have a list of headline titles containing company names in and company lookup table. Please see below;

Have 1	Have 2
Headline_Title	Client_Name	FF_Sector
3I in takeover talks	3i Group Plc	Global Financial Investors
AP Moller Maersk to invest in AI technology	Maersk	Energy, Resources and Infrastructure
to form partnership with EE BT Group	BT Group PLC	Power, Utilities & Renewables
in fines due to miss selling EDF	EDF	Power, Utilities & Renewables
Abbott labs to sell off Africa arm	Abbot Laboratories	Consumer and Healthcare Group
Bank RBS post profit	RBS	Banks

I want to extract out the list of company names from the text and match them to a company name lookup table that I have with cleansed company names. I want the table to look like this;

Want
Headline_Title	Client_Name	FF_Sector
3I in takeover talks	3i Group Plc	Global Financial Investors
AP Moller Maersk to invest in AI technology	Maersk	Energy, Resources and Infrastructure
to form partnership with EE BT Group	BT Group PLC	Power, Utilities & Renewables
in fines due to miss selling EDF	EDF	Power, Utilities & Renewables
Abbott labs to sell off Africa arm	Abbot Laboratories	Consumer and Healthcare Group
Bank RBS post profit	RBS	Banks

I have thousands of records to match to and this is just an example.

Would you happen to know what the best way is to do this?

Thanks

Chris

ChrisNZ · Posted 02-28-2018 04:19 PM

Answers were already given. What is unclear in all the explanations above?

High-Performance SAS Coding - Third Edition

EJAA · Posted 01-10-2019 02:36 AM

Hello cmoore,

I have a similar problem like this your problem you posted and want to ask if you could share the codes with me so i modify to suit my work. Am trying to do director name matching from two different database. Could you please help me out with your codes that helped you solve this your problem. Thanks very.

cmoore · Posted 01-10-2019 03:07 AM

Yes I'll help you know problem. Please could you provide me with 10 examples of data that you are trying to match.

Thanks

Chris

EJAA · Posted 01-10-2019 03:48 AM

Hello cmoore, please find below details of my tables, just a small portion of it. I want to merge the two table using the director names which runs into thousands of rows.Thanks in advance for the assistance. Regards Emmanuel.

TABLE 1
exec_fname	exec_fullname	exec_lname
William	William P. Sullivan	Sullivan
Michael	Michael R. McMullen	McMullen
Didier	Didier Hirsch	Hirsch
Ronald	Ronald S. Nersesian	Nersesian
Adrian	Adrian T. Dillon	Dillon
Nicolas	Nicolas H. Roelofs,	Roelofs,
Ronald	Ronald S. Nersesian	Nersesian
Nicolas	Nicolas H. Roelofs,	Roelofs,
Michael	Michael R. McMullen	McMullen

Table 2
DIRECTORNAME	Degree
William P. Sullivan	MBA
Michael R. McMullen	BSC
Didier Hirsch	BSC
Ronald S. Nersesian	PHD
Adrian T. Dillon	MSC
Nicolas H. Roelofs,	MA
Ronald S. Nersesian	MBA
Nicolas H. Roelofs,	MBA
Michael R. McMullen	MBA

cmoore · Posted 01-10-2019 05:43 AM

Are all the names the same spellings across the tables? Here is some rough code that cleanses the name on both tables then joins on it. If the spellings werent the same you would need to use fuzzy matching techniques.

data s1;

set s1;

format Cleaned_exec_fullname $255.;

/*Name*/
Cleaned_exec_fullname = scan(exec_fullname, 1,'()');
Cleaned_exec_fullname = upcase(Cleaned_exec_fullname);
Cleaned_exec_fullname = compress(Cleaned_exec_fullname,",,.,/,*,',","");
Cleaned_exec_fullname = compress(Cleaned_exec_fullname,"""");

Cleaned_exec_fullname = strip(Cleaned_exec_fullname);
Cleaned_exec_fullname = compbl(Cleaned_exec_fullname);

run;

data s2;

set s2;

format Cleaned_directorname $255.;

/* Name*/
Cleaned_directorname = scan(directorname, 1,'()');
Cleaned_directorname = upcase(Cleaned_directorname);
Cleaned_directorname = compress(Cleaned_directorname,",,.,/,*,',","");
Cleaned_directorname = compress(Cleaned_directorname,"""");

Cleaned_directorname = strip(Cleaned_directorname);
Cleaned_directorname = compbl(Cleaned_directorname);

run;

proc sql noprint;

**Selecting exact matches;
create table Exact_matches_Comp as
select s1.exec_fullname as Original_Exec_Name,
s1.Cleaned_exec_fullname as Cleaned_Exec_Name,
s2.directorname as Original_Director_Name,
s2.Cleaned_directorname as Cleaned_Director_Name,
s2.degree

from s1 inner join s2
on s1.Cleaned_exec_fullname = s2.Cleaned_directorname
where s1.Cleaned_exec_fullname is not null;
quit;

proc sort data=Exact_matches_Comp nodupkey; by Cleaned_Director_Name degree; run;

EJAA · Posted 01-13-2019 06:45 PM

hi Cmoore,

Thanks for sending through the codes and i apologise for the late update. Had no access to my computer during the weekend hence the late reply. Am now going to try the codes you sent.

However, the names in both dataset are not almost the same, some differs. What i sent was to aid your understanding of how my dataset looked like. I think fuzzy matching will be the best way out as you mentioned. Any help on that as well? Thanks in advance for the help. Regards EJAA.

ChrisNZ · Posted 01-13-2019 09:46 PM

> the names in both dataset are not almost the same, some differs.

Why do you supply identical names as examples then? You make it hard to help you.

High-Performance SAS Coding - Third Edition

EJAA · Posted 01-13-2019 10:22 PM

Thanks ChrisNZ.

Advice well taken for future correspondence.

Kindly find attached sample of my dataset and thanks in advance for the assistance.regards.ejaa.

SAS Innovate 2025: Save the Date

SAS Training: Just a Click Away