Hi,
I have two data sources of information that I want to merge together. The common matchkey across both sources is the Client Name. However, there are different variations of spellings for the same Client. Please see below a small example of the two datasets I have and what I would like it to be transformed to;
Have1 | Have2 | |||
Client_Name | Sector | Client_Name | Sub_Sector | |
3i Group Plc | Global Financial Investors | 3i | Private Equity | |
3m Company | Industrials | 3M Company (NYSE:MMM) | Manufacturing industries | |
Ag Barr Plc | Consumer and Healthcare Group | A.P. Møller - Maersk A/S (CPSE:MAERSK B) | Transport | |
Maersk | Energy, Resources and Infrastructure | AB Volvo (publ) (OM:VOLV B) | Automotive | |
Aa Plc | Industrials | Abertis | Infrastructure funds | |
Electrolux | Consumer and Healthcare Group | BT | Telecoms | |
Volvo | Industrials | 3M Holding Company | Manufacturing industries | |
Abb Ltd | Industrials | AB Electrolux (publ) (OM:ELUX B) | Consumer | |
Abbott Laboratories | Consumer and Healthcare Group | RBS | Banks | |
Abbvie Inc | Consumer and Healthcare Group | ABB | Manufacturing industries | |
Aberdeen Asset Management Plc | Financial Institutions Group | AAM PLC (LSE:ADN) | Asset managers | |
Aberforth Smaller Companies | Financial Institutions Group | AbbVie Incorp | Health | |
Abertis Infraestructuras Sa | Global Financial Investors | Abbott Laboratories (NYSE:ABT) | Health | |
Abn Amro Group Nv | Financial Institutions Group | Abertis Infraestructuras, S.A. (BME:ABE) | Infrastructure funds | |
Britsh Telecommunications | Technology, Media and Telecoms | AG BAR | Consumer | |
Royal Bank of Scotland PLC | Financial Institutions Group | AA | Business Services | |
The Coca Cola Company | Consumer and Healthcare Group | Coca cola | Consumer |
And the final output should look like;
Want | ||
Client Name | Sector | Sub_Sector |
3i Group Plc | Global Financial Investors | Private Equity |
3m Company | Industrials | Manufacturing industries |
Ag Barr Plc | Consumer and Healthcare Group | Consumer |
Maersk | Energy, Resources and Infrastructure | Transport |
Aa Plc | Industrials | Business Services |
Electrolux | Consumer and Healthcare Group | Consumer |
Volvo | Industrials | "Unknown" |
Abb Ltd | Industrials | Manufacturing industries |
Abbott Laboratories | Consumer and Healthcare Group | Health |
Abbvie Inc | Consumer and Healthcare Group | Health |
Aberdeen Asset Management Plc | Financial Institutions Group | Asset managers |
Aberforth Smaller Companies | Financial Institutions Group | Infrastructure funds |
Abertis Infraestructuras Sa | Global Financial Investors | Infrastructure funds |
Abn Amro Group Nv | Financial Institutions Group | "Unknown" |
Britsh Telecommunications | Technology, Media and Telecoms | Telecoms |
Royal Bank of Scotland PLC | Financial Institutions Group | Banks |
The Coca Cola Company | Consumer and Healthcare Group | Consumer |
Is there a reliable way for me to fuzzy match across the different client names? Any help would be greatly appreciated.
Many thanks
Chris
@Reeza I would add one step to reduce the size of the cartesian product
6. Do fuzzy matching using COMPGED/COMPLEV/SOUNDS LIKE/SPEDIS and equality of first letter
7. Do fuzzy matching using COMPGED/COMPLEV/SOUNDS LIKE
COMPGED works best in my experience
How reliable do you need to be?
The most managed way to this is to use Data Management Studio.
It doesn't have to be perfect to be honest. 70% plus maybe. Just need a good way to fuzzy match the example data set then take it from there.
This is a non-trivial problem because you end up doing N*M calculations which can be quite big.
Some tips:
1. Upcase everything
2. Change all & or other symbols such as @ from text
3. Ensure Co and/or LTD are standardized, i.e. LTD vs LTD. vs Limited vs Limited
4. Do an exact match and remove those records
5. Remove all spaces and try a second round of exact matches
6. Do fuzzy matching using COMPGED/COMPLEV/SOUNDS LIKE
@Reeza I would add one step to reduce the size of the cartesian product
6. Do fuzzy matching using COMPGED/COMPLEV/SOUNDS LIKE/SPEDIS and equality of first letter
7. Do fuzzy matching using COMPGED/COMPLEV/SOUNDS LIKE
COMPGED works best in my experience
Thanks for the information. I have a list of headline titles containing company names in and company lookup table. Please see below;
Have 1 | Have 2 | |
Headline_Title | Client_Name | FF_Sector |
3I in takeover talks | 3i Group Plc | Global Financial Investors |
AP Moller Maersk to invest in AI technology | Maersk | Energy, Resources and Infrastructure |
to form partnership with EE BT Group | BT Group PLC | Power, Utilities & Renewables |
in fines due to miss selling EDF | EDF | Power, Utilities & Renewables |
Abbott labs to sell off Africa arm | Abbot Laboratories | Consumer and Healthcare Group |
Bank RBS post profit | RBS | Banks |
I want to extract out the list of company names from the text and match them to a company name lookup table that I have with cleansed company names. I want the table to look like this;
Want | ||
Headline_Title | Client_Name | FF_Sector |
3I in takeover talks | 3i Group Plc | Global Financial Investors |
AP Moller Maersk to invest in AI technology | Maersk | Energy, Resources and Infrastructure |
to form partnership with EE BT Group | BT Group PLC | Power, Utilities & Renewables |
in fines due to miss selling EDF | EDF | Power, Utilities & Renewables |
Abbott labs to sell off Africa arm | Abbot Laboratories | Consumer and Healthcare Group |
Bank RBS post profit | RBS | Banks |
I have thousands of records to match to and this is just an example.
Would you happen to know what the best way is to do this?
Thanks
Chris
Answers were already given. What is unclear in all the explanations above?
Hello cmoore,
I have a similar problem like this your problem you posted and want to ask if you could share the codes with me so i modify to suit my work. Am trying to do director name matching from two different database. Could you please help me out with your codes that helped you solve this your problem. Thanks very.
Yes I'll help you know problem. Please could you provide me with 10 examples of data that you are trying to match.
Thanks
Chris
Hello cmoore, please find below details of my tables, just a small portion of it. I want to merge the two table using the director names which runs into thousands of rows.Thanks in advance for the assistance. Regards Emmanuel.
TABLE 1 |
|
|
exec_fname | exec_fullname | exec_lname |
William | William P. Sullivan | Sullivan |
Michael | Michael R. McMullen | McMullen |
Didier | Didier Hirsch | Hirsch |
Ronald | Ronald S. Nersesian | Nersesian |
Adrian | Adrian T. Dillon | Dillon |
Nicolas | Nicolas H. Roelofs, | Roelofs, |
Ronald | Ronald S. Nersesian | Nersesian |
Nicolas | Nicolas H. Roelofs, | Roelofs, |
Michael | Michael R. McMullen | McMullen |
Table 2 |
|
DIRECTORNAME | Degree |
William P. Sullivan | MBA |
Michael R. McMullen | BSC |
Didier Hirsch | BSC |
Ronald S. Nersesian | PHD |
Adrian T. Dillon | MSC |
Nicolas H. Roelofs, | MA |
Ronald S. Nersesian | MBA |
Nicolas H. Roelofs, | MBA |
Michael R. McMullen | MBA |
Are all the names the same spellings across the tables? Here is some rough code that cleanses the name on both tables then joins on it. If the spellings werent the same you would need to use fuzzy matching techniques.
data s1;
set s1;
format Cleaned_exec_fullname $255.;
/*Name*/
Cleaned_exec_fullname = scan(exec_fullname, 1,'()');
Cleaned_exec_fullname = upcase(Cleaned_exec_fullname);
Cleaned_exec_fullname = compress(Cleaned_exec_fullname,",,.,/,*,',","");
Cleaned_exec_fullname = compress(Cleaned_exec_fullname,"""");
Cleaned_exec_fullname = strip(Cleaned_exec_fullname);
Cleaned_exec_fullname = compbl(Cleaned_exec_fullname);
run;
data s2;
set s2;
format Cleaned_directorname $255.;
/* Name*/
Cleaned_directorname = scan(directorname, 1,'()');
Cleaned_directorname = upcase(Cleaned_directorname);
Cleaned_directorname = compress(Cleaned_directorname,",,.,/,*,',","");
Cleaned_directorname = compress(Cleaned_directorname,"""");
Cleaned_directorname = strip(Cleaned_directorname);
Cleaned_directorname = compbl(Cleaned_directorname);
run;
proc sql noprint;
**Selecting exact matches;
create table Exact_matches_Comp as
select s1.exec_fullname as Original_Exec_Name,
s1.Cleaned_exec_fullname as Cleaned_Exec_Name,
s2.directorname as Original_Director_Name,
s2.Cleaned_directorname as Cleaned_Director_Name,
s2.degree
from s1 inner join s2
on s1.Cleaned_exec_fullname = s2.Cleaned_directorname
where s1.Cleaned_exec_fullname is not null;
quit;
proc sort data=Exact_matches_Comp nodupkey; by Cleaned_Director_Name degree; run;
hi Cmoore,
Thanks for sending through the codes and i apologise for the late update. Had no access to my computer during the weekend hence the late reply. Am now going to try the codes you sent.
However, the names in both dataset are not almost the same, some differs. What i sent was to aid your understanding of how my dataset looked like. I think fuzzy matching will be the best way out as you mentioned. Any help on that as well? Thanks in advance for the help. Regards EJAA.
> the names in both dataset are not almost the same, some differs.
Why do you supply identical names as examples then? You make it hard to help you.
Thanks ChrisNZ.
Advice well taken for future correspondence.
Kindly find attached sample of my dataset and thanks in advance for the assistance.regards.ejaa.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.