Solved: Re: Matching Company Names - Page 2

ChrisNZ · Posted 01-13-2019 11:04 PM

There is no commonality between your two files.

How is this supposed to help us help you?

High-Performance SAS Coding - Third Edition

ChrisNZ · Posted 01-13-2019 11:10 PM

More to the point, have you tried fuzzy matching using the recommended steps above?

Some tips:

1. Upcase everything

2. Change all & or other symbols such as @ from text

3. Ensure Co and/or LTD are standardized, i.e. LTD vs LTD. vs Limited vs Limited => This could be replaced to deal with middle names and/or initials, in the case of names

4. Do an exact match and remove those records

5. Remove all spaces and try a second round of exact matches

6. Do fuzzy matching using COMPGED/COMPLEV/SOUNDS LIKE/SPEDIS and equality of first letter

7. Do fuzzy matching using COMPGED/COMPLEV/SOUNDS LIKE

COMPGED works best in my experience

as well as @cmoore's cleaning steps (extract, upcase, compress, justify, compl) to implement point 2 above?

Cleaned_directorname = scan(directorname, 1,'()');

Cleaned_directorname = upcase(Cleaned_directorname);

Cleaned_directorname = compress(Cleaned_directorname,",,.,/,*,',","");

Cleaned_directorname = compress(Cleaned_directorname,"""");

Cleaned_directorname = strip(Cleaned_directorname);

Cleaned_directorname = compbl(Cleaned_directorname);

[Edited formatting]

High-Performance SAS Coding - Third Edition

cmoore · Posted 01-14-2019 04:01 AM

Hi,

The kind of coding you could you use to fuzzy match on name can be seen below. You just need to amend your variables to fit your data tables;

/*Non-Exact Matches - Use the Algorithm to match variables - on Company Names*/

proc sql;

create table NonExact_matches_Comp as

select s1.company_name as Original_Company_Name,
s1.cleaned_company_name as Cleaned_Company_name,
s2.cleaned_company_name as connect_company_name,
s2.cleaned_group_name as connect_group_name,
s2.client_group_number,
compged(s1.cleaned_company_name,s2.cleaned_company_name,999,'LN')as compged,
spedis(s1.cleaned_company_name,s2.cleaned_company_name)as spedis

from s1
inner join
s2
on (compged(s1.cleaned_company_name,s2.cleaned_company_name,999,'LN') < 200
and spedis(s1.cleaned_company_name,s2.cleaned_company_name) < 50)
group by 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17
having calculated compged = min(compged(s1.cleaned_company_name,s2.cleaned_company_name,999,'LN'))
;
quit;

ChrisNZ · Posted 01-14-2019 04:31 PM

@cmoore This should be a multi-step process, going from best to worst match quality, as outlined above.

This ensures faster processing as well as better understanding of the quality of the results.

High-Performance SAS Coding - Third Edition

cmoore · Posted 01-15-2019 03:47 AM

@ChrisNZ yes total agree with your logic and those are the steps that I used to write my matching programme. The code above is just a snippet of the code that I used at the very end of the process where I needed to use fuzzy logic on poor data that was left after data cleansing and removing exact matches.

EJAA · Posted 01-15-2019 05:03 AM

Hello @cmoore and @ChrisNZ

Thanks very much for the directions. My problem resolved now. Appreciate the the assistance.

Regards

EJAA

France · Posted 01-16-2019 10:25 AM

hello EJAA,

I am trying to match two datasets by company name now. could you please share your code with me ?

thanks in advacne
best regards, France

EJAA · Posted 01-16-2019 10:45 PM

Hello France,

This the final code what i used to do the matching. Just do the necessary changes.It worked for me. Hope it helps. Regard EJAA.

data s1;

set s1;

format Cleaned_exec_fullname $255.;

/*Name*/
Cleaned_exec_fullname = scan(exec_fullname, 1,'()');
Cleaned_exec_fullname = upcase(Cleaned_exec_fullname);
Cleaned_exec_fullname = compress(Cleaned_exec_fullname,",,.,/,*,',","");
Cleaned_exec_fullname = compress(Cleaned_exec_fullname,"""");

Cleaned_exec_fullname = strip(Cleaned_exec_fullname);
Cleaned_exec_fullname = compbl(Cleaned_exec_fullname);

run;

data s2;

set s2;

format Cleaned_directorname $255.;

/* Name*/
Cleaned_directorname = scan(directorname, 1,'()');
Cleaned_directorname = upcase(Cleaned_directorname);
Cleaned_directorname = compress(Cleaned_directorname,",,.,/,*,',","");
Cleaned_directorname = compress(Cleaned_directorname,"""");

Cleaned_directorname = strip(Cleaned_directorname);
Cleaned_directorname = compbl(Cleaned_directorname);

run;

proc sql noprint;

**Selecting exact matches;
create table Exact_matches_Comp as
select s1.exec_fullname as Original_Exec_Name,
s1.Cleaned_exec_fullname as Cleaned_Exec_Name,
s2.directorname as Original_Director_Name,
s2.Cleaned_directorname as Cleaned_Director_Name,
s2.degree

from s1 inner join s2
on s1.Cleaned_exec_fullname = s2.Cleaned_directorname
where s1.Cleaned_exec_fullname is not null;
quit;

proc sort data=Exact_matches_Comp nodupkey; by Cleaned_Director_Name degree; run;

France · Posted 01-16-2019 11:46 PM

Hello cmoore,

I am trying to match two datasets by company name as well. could you please share your code with me ?

thanks in advacne
best regards, France

SAS Innovate 2025: Call for Content

Classroom Training Available!