BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
ChrisNZ
Tourmaline | Level 20

There is no commonality between your two files. 

How is this supposed to help us help you?

ChrisNZ
Tourmaline | Level 20

More to the point, have you tried fuzzy matching using the recommended steps above?

 

Some tips:

1. Upcase everything

2. Change all & or other symbols such as @ from text

3. Ensure Co and/or LTD are standardized, i.e. LTD vs LTD. vs Limited vs Limited => This could be replaced to deal with middle names and/or initials, in the case of names

4. Do an exact match and remove those records

5. Remove all spaces and try a second round of exact matches

6. Do fuzzy matching using COMPGED/COMPLEV/SOUNDS LIKE/SPEDIS and equality of first letter

7. Do fuzzy matching using COMPGED/COMPLEV/SOUNDS LIKE

 

COMPGED works best in my experience

 

as well as @cmoore's cleaning steps (extract, upcase, compress, justify, compl) to implement point 2 above?

 

Cleaned_directorname = scan(directorname, 1,'()');

Cleaned_directorname = upcase(Cleaned_directorname);

Cleaned_directorname = compress(Cleaned_directorname,",,.,/,*,',","");

Cleaned_directorname = compress(Cleaned_directorname,"""");

Cleaned_directorname = strip(Cleaned_directorname);

Cleaned_directorname = compbl(Cleaned_directorname);

 

[Edited formatting]

cmoore
Obsidian | Level 7

Hi,

 

The kind of coding you could you use to fuzzy match on name can be seen below. You just need to amend your variables to fit your data tables;

 

/*Non-Exact Matches - Use the Algorithm to match variables - on Company Names*/

proc sql;

create table NonExact_matches_Comp as

select s1.company_name as Original_Company_Name,
s1.cleaned_company_name as Cleaned_Company_name,
s2.cleaned_company_name as connect_company_name,
s2.cleaned_group_name as connect_group_name,
s2.client_group_number,
compged(s1.cleaned_company_name,s2.cleaned_company_name,999,'LN')as compged,
spedis(s1.cleaned_company_name,s2.cleaned_company_name)as spedis

from s1
inner join
s2
on (compged(s1.cleaned_company_name,s2.cleaned_company_name,999,'LN') < 200
and spedis(s1.cleaned_company_name,s2.cleaned_company_name) < 50)
group by 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17
having calculated compged = min(compged(s1.cleaned_company_name,s2.cleaned_company_name,999,'LN'))
;
quit;

 

ChrisNZ
Tourmaline | Level 20

@cmoore  This should be a multi-step process, going from best to worst match quality, as outlined above.

 

This ensures faster processing as well as better understanding of the quality of the results.

cmoore
Obsidian | Level 7

@ChrisNZ yes total agree with your logic and those are the steps that I used to write my matching programme. The code above is just a snippet of the code that I used at the very end of the process where I needed to use fuzzy logic on poor data that was left after data cleansing and removing exact matches.

EJAA
Obsidian | Level 7

Hello @cmoore and @ChrisNZ

Thanks very much for the directions. My problem resolved now. Appreciate the the assistance. 

Regards

EJAA

France
Quartz | Level 8
hello EJAA,

I am trying to match two datasets by company name now. could you please share your code with me ?

thanks in advacne
best regards, France
EJAA
Obsidian | Level 7

Hello France,

This the final code what i used to do the matching. Just do the necessary changes.It worked for me.  Hope it helps. Regard EJAA. 

 

data s1;

set s1;

format Cleaned_exec_fullname $255.;

/*Name*/
Cleaned_exec_fullname = scan(exec_fullname, 1,'()');
Cleaned_exec_fullname = upcase(Cleaned_exec_fullname);
Cleaned_exec_fullname = compress(Cleaned_exec_fullname,",,.,/,*,',","");
Cleaned_exec_fullname = compress(Cleaned_exec_fullname,"""");

Cleaned_exec_fullname = strip(Cleaned_exec_fullname);
Cleaned_exec_fullname = compbl(Cleaned_exec_fullname);

run;

 

data s2;

set s2;

format Cleaned_directorname $255.;

/* Name*/
Cleaned_directorname = scan(directorname, 1,'()');
Cleaned_directorname = upcase(Cleaned_directorname);
Cleaned_directorname = compress(Cleaned_directorname,",,.,/,*,',","");
Cleaned_directorname = compress(Cleaned_directorname,"""");

Cleaned_directorname = strip(Cleaned_directorname);
Cleaned_directorname = compbl(Cleaned_directorname);

run;

proc sql noprint;


**Selecting exact matches;
create table Exact_matches_Comp as
select s1.exec_fullname as Original_Exec_Name,
s1.Cleaned_exec_fullname as Cleaned_Exec_Name,
s2.directorname as Original_Director_Name,
s2.Cleaned_directorname as Cleaned_Director_Name,
s2.degree

from s1 inner join s2
on s1.Cleaned_exec_fullname = s2.Cleaned_directorname
where s1.Cleaned_exec_fullname is not null;
quit;

proc sort data=Exact_matches_Comp nodupkey; by Cleaned_Director_Name degree; run;

France
Quartz | Level 8
Hello cmoore,

I am trying to match two datasets by company name as well. could you please share your code with me ?

thanks in advacne
best regards, France

hackathon24-white-horiz.png

2025 SAS Hackathon: There is still time!

Good news: We've extended SAS Hackathon registration until Sept. 12, so you still have time to be part of our biggest event yet – our five-year anniversary!

Register Now

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 23 replies
  • 12264 views
  • 5 likes
  • 6 in conversation