BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
ChrisNZ
Tourmaline | Level 20

There is no commonality between your two files. 

How is this supposed to help us help you?

ChrisNZ
Tourmaline | Level 20

More to the point, have you tried fuzzy matching using the recommended steps above?

 

Some tips:

1. Upcase everything

2. Change all & or other symbols such as @ from text

3. Ensure Co and/or LTD are standardized, i.e. LTD vs LTD. vs Limited vs Limited => This could be replaced to deal with middle names and/or initials, in the case of names

4. Do an exact match and remove those records

5. Remove all spaces and try a second round of exact matches

6. Do fuzzy matching using COMPGED/COMPLEV/SOUNDS LIKE/SPEDIS and equality of first letter

7. Do fuzzy matching using COMPGED/COMPLEV/SOUNDS LIKE

 

COMPGED works best in my experience

 

as well as @cmoore's cleaning steps (extract, upcase, compress, justify, compl) to implement point 2 above?

 

Cleaned_directorname = scan(directorname, 1,'()');

Cleaned_directorname = upcase(Cleaned_directorname);

Cleaned_directorname = compress(Cleaned_directorname,",,.,/,*,',","");

Cleaned_directorname = compress(Cleaned_directorname,"""");

Cleaned_directorname = strip(Cleaned_directorname);

Cleaned_directorname = compbl(Cleaned_directorname);

 

[Edited formatting]

cmoore
Obsidian | Level 7

Hi,

 

The kind of coding you could you use to fuzzy match on name can be seen below. You just need to amend your variables to fit your data tables;

 

/*Non-Exact Matches - Use the Algorithm to match variables - on Company Names*/

proc sql;

create table NonExact_matches_Comp as

select s1.company_name as Original_Company_Name,
s1.cleaned_company_name as Cleaned_Company_name,
s2.cleaned_company_name as connect_company_name,
s2.cleaned_group_name as connect_group_name,
s2.client_group_number,
compged(s1.cleaned_company_name,s2.cleaned_company_name,999,'LN')as compged,
spedis(s1.cleaned_company_name,s2.cleaned_company_name)as spedis

from s1
inner join
s2
on (compged(s1.cleaned_company_name,s2.cleaned_company_name,999,'LN') < 200
and spedis(s1.cleaned_company_name,s2.cleaned_company_name) < 50)
group by 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17
having calculated compged = min(compged(s1.cleaned_company_name,s2.cleaned_company_name,999,'LN'))
;
quit;

 

ChrisNZ
Tourmaline | Level 20

@cmoore  This should be a multi-step process, going from best to worst match quality, as outlined above.

 

This ensures faster processing as well as better understanding of the quality of the results.

cmoore
Obsidian | Level 7

@ChrisNZ yes total agree with your logic and those are the steps that I used to write my matching programme. The code above is just a snippet of the code that I used at the very end of the process where I needed to use fuzzy logic on poor data that was left after data cleansing and removing exact matches.

EJAA
Obsidian | Level 7

Hello @cmoore and @ChrisNZ

Thanks very much for the directions. My problem resolved now. Appreciate the the assistance. 

Regards

EJAA

France
Quartz | Level 8
hello EJAA,

I am trying to match two datasets by company name now. could you please share your code with me ?

thanks in advacne
best regards, France
EJAA
Obsidian | Level 7

Hello France,

This the final code what i used to do the matching. Just do the necessary changes.It worked for me.  Hope it helps. Regard EJAA. 

 

data s1;

set s1;

format Cleaned_exec_fullname $255.;

/*Name*/
Cleaned_exec_fullname = scan(exec_fullname, 1,'()');
Cleaned_exec_fullname = upcase(Cleaned_exec_fullname);
Cleaned_exec_fullname = compress(Cleaned_exec_fullname,",,.,/,*,',","");
Cleaned_exec_fullname = compress(Cleaned_exec_fullname,"""");

Cleaned_exec_fullname = strip(Cleaned_exec_fullname);
Cleaned_exec_fullname = compbl(Cleaned_exec_fullname);

run;

 

data s2;

set s2;

format Cleaned_directorname $255.;

/* Name*/
Cleaned_directorname = scan(directorname, 1,'()');
Cleaned_directorname = upcase(Cleaned_directorname);
Cleaned_directorname = compress(Cleaned_directorname,",,.,/,*,',","");
Cleaned_directorname = compress(Cleaned_directorname,"""");

Cleaned_directorname = strip(Cleaned_directorname);
Cleaned_directorname = compbl(Cleaned_directorname);

run;

proc sql noprint;


**Selecting exact matches;
create table Exact_matches_Comp as
select s1.exec_fullname as Original_Exec_Name,
s1.Cleaned_exec_fullname as Cleaned_Exec_Name,
s2.directorname as Original_Director_Name,
s2.Cleaned_directorname as Cleaned_Director_Name,
s2.degree

from s1 inner join s2
on s1.Cleaned_exec_fullname = s2.Cleaned_directorname
where s1.Cleaned_exec_fullname is not null;
quit;

proc sort data=Exact_matches_Comp nodupkey; by Cleaned_Director_Name degree; run;

France
Quartz | Level 8
Hello cmoore,

I am trying to match two datasets by company name as well. could you please share your code with me ?

thanks in advacne
best regards, France

sas-innovate-2024.png

Don't miss out on SAS Innovate - Register now for the FREE Livestream!

Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 23 replies
  • 7987 views
  • 5 likes
  • 6 in conversation