BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
cmoore
Obsidian | Level 7

Hi,

 

I have two data sources of information that I want to merge together. The common matchkey across both sources is the Client Name. However, there are different variations of spellings for the same Client. Please see below a small example of the two datasets I have and what I would like it to be transformed to;

 

 

Have1  Have2 
     
Client_NameSector Client_NameSub_Sector
3i Group PlcGlobal Financial Investors 3iPrivate Equity
3m CompanyIndustrials 3M Company (NYSE:MMM)Manufacturing industries
Ag Barr PlcConsumer and Healthcare Group A.P. Møller - Maersk A/S (CPSE:MAERSK B)Transport
MaerskEnergy, Resources and Infrastructure AB Volvo (publ) (OM:VOLV B)Automotive
Aa PlcIndustrials AbertisInfrastructure funds
ElectroluxConsumer and Healthcare Group BTTelecoms
VolvoIndustrials 3M Holding CompanyManufacturing industries
Abb LtdIndustrials AB Electrolux (publ) (OM:ELUX B)Consumer
Abbott LaboratoriesConsumer and Healthcare Group RBSBanks
Abbvie IncConsumer and Healthcare Group ABBManufacturing industries
Aberdeen Asset Management PlcFinancial Institutions Group AAM PLC (LSE:ADN)Asset managers
Aberforth Smaller CompaniesFinancial Institutions Group AbbVie IncorpHealth
Abertis Infraestructuras SaGlobal Financial Investors Abbott Laboratories (NYSE:ABT)Health
Abn Amro Group NvFinancial Institutions Group Abertis Infraestructuras, S.A. (BME:ABE)Infrastructure funds
Britsh TelecommunicationsTechnology, Media and Telecoms AG BARConsumer
Royal Bank of Scotland PLCFinancial Institutions Group AABusiness Services
The Coca Cola CompanyConsumer and Healthcare Group Coca colaConsumer

 

And the final output should look like;

 

Want  
   
Client NameSectorSub_Sector
3i Group PlcGlobal Financial InvestorsPrivate Equity
3m CompanyIndustrialsManufacturing industries
Ag Barr PlcConsumer and Healthcare GroupConsumer
MaerskEnergy, Resources and InfrastructureTransport
Aa PlcIndustrialsBusiness Services
ElectroluxConsumer and Healthcare GroupConsumer
VolvoIndustrials"Unknown"
Abb LtdIndustrialsManufacturing industries
Abbott LaboratoriesConsumer and Healthcare GroupHealth
Abbvie IncConsumer and Healthcare GroupHealth
Aberdeen Asset Management PlcFinancial Institutions GroupAsset managers
Aberforth Smaller CompaniesFinancial Institutions GroupInfrastructure funds
Abertis Infraestructuras SaGlobal Financial InvestorsInfrastructure funds
Abn Amro Group NvFinancial Institutions Group"Unknown"
Britsh TelecommunicationsTechnology, Media and TelecomsTelecoms
Royal Bank of Scotland PLCFinancial Institutions GroupBanks
The Coca Cola CompanyConsumer and Healthcare Group

Consumer

 

Is there a reliable way for me to fuzzy match across the different client names? Any help would be greatly appreciated.

 

Many thanks

 

Chris

1 ACCEPTED SOLUTION

Accepted Solutions
ChrisNZ
Tourmaline | Level 20

@Reeza I would add one step to reduce the size of the cartesian product

 

6. Do fuzzy matching using COMPGED/COMPLEV/SOUNDS LIKE/SPEDIS and equality of first letter

7. Do fuzzy matching using COMPGED/COMPLEV/SOUNDS LIKE

 

COMPGED works best in my experience

 

View solution in original post

23 REPLIES 23
LinusH
Tourmaline | Level 20

How reliable do you need to be?

The most managed way to this is to use Data Management Studio.

Data never sleeps
cmoore
Obsidian | Level 7

It doesn't have to be perfect to be honest. 70% plus maybe. Just need a good way to fuzzy match the example data set then take it from there.

LinusH
Tourmaline | Level 20
70% what kind of analysis can you do with that little accuracy?
Given your examples (and my limited knowledge in fuzzy logic) I think together can apply some basic rules like matching on uppercase first two words (normalized), or create an extensive mapping table (or a combination of both).
Or don't reinvent the well a pay for some capable SW.
Data never sleeps
Reeza
Super User

This is a non-trivial problem because you end up doing N*M calculations which can be quite big. 

Some tips:

1. Upcase everything

2. Change all & or other symbols such as @ from text

3. Ensure Co and/or LTD are standardized, i.e. LTD vs LTD. vs Limited vs Limited

4. Do an exact match and remove those records

5. Remove all spaces and try a second round of exact matches

6. Do fuzzy matching using COMPGED/COMPLEV/SOUNDS LIKE

 

 

ChrisNZ
Tourmaline | Level 20

@Reeza I would add one step to reduce the size of the cartesian product

 

6. Do fuzzy matching using COMPGED/COMPLEV/SOUNDS LIKE/SPEDIS and equality of first letter

7. Do fuzzy matching using COMPGED/COMPLEV/SOUNDS LIKE

 

COMPGED works best in my experience

 

cmoore
Obsidian | Level 7

Thanks for the information. I have a list of headline titles containing company names in and company lookup table. Please see below;

 

 

Have 1Have 2
Headline_TitleClient_NameFF_Sector
3I in takeover talks3i Group PlcGlobal Financial Investors
AP Moller Maersk to invest in AI technologyMaerskEnergy, Resources and Infrastructure
to form partnership with EE BT GroupBT Group PLCPower, Utilities & Renewables
in fines due to miss selling EDF EDFPower, Utilities & Renewables
Abbott labs to sell off Africa armAbbot LaboratoriesConsumer and Healthcare Group
Bank RBS post profitRBSBanks

 

I want to extract out the list of company names from the text and match them to a company name lookup table that I have with cleansed company names. I want the table to look like this;

 

Want
Headline_TitleClient_NameFF_Sector
3I in takeover talks3i Group PlcGlobal Financial Investors
AP Moller Maersk to invest in AI technologyMaerskEnergy, Resources and Infrastructure
to form partnership with EE BT GroupBT Group PLCPower, Utilities & Renewables
in fines due to miss selling EDF EDFPower, Utilities & Renewables
Abbott labs to sell off Africa armAbbot LaboratoriesConsumer and Healthcare Group
Bank RBS post profitRBSBanks

 

I have thousands of records to match to and this is just an example.

 

Would you happen to know what the best way is to do this?

 

Thanks

Chris

ChrisNZ
Tourmaline | Level 20

Answers were already given. What is unclear in all the explanations above? 

EJAA
Obsidian | Level 7

Hello cmoore,

I have a similar problem like this your problem you posted and want to ask if you could share the codes with me so i modify to suit my work. Am trying to do director name matching from two different database. Could you please help me out with your codes that helped you solve this your problem. Thanks very. 

cmoore
Obsidian | Level 7

 

Yes I'll help you know problem. Please could you provide me with 10 examples of data that you are trying to match.

 

Thanks

Chris

EJAA
Obsidian | Level 7

Hello cmoore, please find below details of my tables, just a small portion of it. I want to merge the two table using the director names which runs into thousands of rows.Thanks in advance for the assistance. Regards Emmanuel.

 

TABLE 1

 

 

exec_fname

exec_fullname

exec_lname

William

William P. Sullivan

Sullivan

Michael

Michael R. McMullen

McMullen

Didier

Didier Hirsch

Hirsch

Ronald

Ronald S. Nersesian

Nersesian

Adrian

Adrian T. Dillon

Dillon

Nicolas

Nicolas H. Roelofs, 

Roelofs, 

Ronald

Ronald S. Nersesian

Nersesian

Nicolas

Nicolas H. Roelofs, 

Roelofs, 

Michael

Michael R. McMullen

McMullen

 

 

Table 2

 

DIRECTORNAME

Degree

William P. Sullivan

MBA

Michael R. McMullen

BSC

Didier Hirsch

BSC

Ronald S. Nersesian

PHD

Adrian T. Dillon

MSC

Nicolas H. Roelofs, 

MA

Ronald S. Nersesian

MBA

Nicolas H. Roelofs, 

MBA

Michael R. McMullen

MBA

cmoore
Obsidian | Level 7

Are all the names the same spellings across the tables? Here is some rough code that cleanses the name on both tables then joins on it. If the spellings werent the same you would need to use fuzzy matching techniques.

 

data s1;

set s1;

format Cleaned_exec_fullname $255.;

/*Name*/
Cleaned_exec_fullname = scan(exec_fullname, 1,'()');
Cleaned_exec_fullname = upcase(Cleaned_exec_fullname);
Cleaned_exec_fullname = compress(Cleaned_exec_fullname,",,.,/,*,',","");
Cleaned_exec_fullname = compress(Cleaned_exec_fullname,"""");

Cleaned_exec_fullname = strip(Cleaned_exec_fullname);
Cleaned_exec_fullname = compbl(Cleaned_exec_fullname);

run;

 

data s2;

set s2;

format Cleaned_directorname $255.;

/* Name*/
Cleaned_directorname = scan(directorname, 1,'()');
Cleaned_directorname = upcase(Cleaned_directorname);
Cleaned_directorname = compress(Cleaned_directorname,",,.,/,*,',","");
Cleaned_directorname = compress(Cleaned_directorname,"""");

Cleaned_directorname = strip(Cleaned_directorname);
Cleaned_directorname = compbl(Cleaned_directorname);

run;

proc sql noprint;


**Selecting exact matches;
create table Exact_matches_Comp as
select s1.exec_fullname as Original_Exec_Name,
s1.Cleaned_exec_fullname as Cleaned_Exec_Name,
s2.directorname as Original_Director_Name,
s2.Cleaned_directorname as Cleaned_Director_Name,
s2.degree

from s1 inner join s2
on s1.Cleaned_exec_fullname = s2.Cleaned_directorname
where s1.Cleaned_exec_fullname is not null;
quit;

proc sort data=Exact_matches_Comp nodupkey; by Cleaned_Director_Name degree; run;

EJAA
Obsidian | Level 7

hi Cmoore, 

 

Thanks for sending through the codes and i apologise for the late update. Had no access to my computer during  the weekend hence the late reply. Am now going to try the codes you sent. 

 

However, the names in both dataset are not almost the same, some differs. What i sent was to aid your understanding of how my dataset looked like. I think fuzzy matching will be the best way out as you mentioned. Any help on that as well? Thanks in advance for the help. Regards EJAA. 

ChrisNZ
Tourmaline | Level 20

> the names in both dataset are not almost the same, some differs.

 

Why do you supply identical names as examples then? You make it hard to help you.

EJAA
Obsidian | Level 7

Thanks ChrisNZ. 

Advice well taken for future correspondence. 

Kindly find attached sample of my dataset and thanks in advance for the assistance.regards.ejaa. 

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 23 replies
  • 7756 views
  • 5 likes
  • 6 in conversation