Re: Comparing company names: compged vs complev vs spedis

somebody · Posted 05-31-2020 04:24 AM

I have to compare between 2 company name variables to match data. I am using the three functions in the title.

compged = compged(name,name2);
complev = complev(name,name2);
spedis = spedis(name,name2);

What are the minimum and maximum values for each function? and what does it mean? To my understanding, the lower the better. 0 means an exact match. However, COMPGED returns a very high number (600 or 1000) for a closely matched obs.

The attached photo shows some examples. The first 4 rows would be a match and the bottom 2 would not be a match. The 3 functions return very high numbers. I am aware that this is because of missing letters in variable NAME2 for the first 4 rows. But what I want to achieve is a return of a match for the first 4 rows, and no match for the last 2. However, using the values from the 3 functions cannot help to determine that.

What would be a better approach?

Ksharp · Posted 05-31-2020 07:06 AM

spedis() stand for Spell Distance .
comp*() stand for Edited Distance
There is something different between them ,depend on different scenario .

ChrisHemedinger · Posted 05-31-2020 09:27 AM

Some articles that might help:

Register for SAS Innovate 2025!! The premier event for SAS users, May 6-9 in Orlando FL. Sign up now for the best deals!

ballardw · Posted 05-31-2020 02:13 PM

From a quick look at your example you should seriously consider using the options to ignore case. Some of the results you get are being inflated because of case differences.

The documentation on the functions tells you what is considered and used to assign values. One note from the documentation is that Compged and Complev are faster than Spedis.

And Compged can work with Call Compcost so you set the rules for how much certain of the rules set though not an exercise for the faint of heart. This might be a serious advantage if you know a lot about some behaviors between the sources of the strings.

Mahult · Posted 10-30-2024 12:19 PM

As has been noted by others, adding the ‘i’ modifier to the COMPLEV and COMPGED functions to ignore case will lower those values. But there’s always a risk of false positives like with your last two rows.

Assume you’re looking for a match for “Minnesota Power Company” and among your rows of data you have “MN Power Co.” and “Minnesota Shower Company”. “Minnesota Shower Company” requires only two small changes to become “Minnesota Power Company” and so is the closest match. It’s a problem that can be alleviated a bit by using fuzzy logic in addition to other join or search criteria, i.e., match on other fields if available and if those match then names can be fuzzy matched more safely. For example, you might have 8 companies at the same address with the only difference being unique PO #s that aren’t in the data, so all the addresses are the same. Now if our fuzzy functions find two company names at the same address to be very similar you can more safely assume they’re a valid match. But false positives are still possible. Unfortunately, “fuzzy” really is “fuzzy” and not the warm kind. This logic can be helpful but comes with inherent risks.

With that said, in this case, using the ‘i’ modifier and filtering on COMPGED < 2500 gets you what you want - first four rows included, last two excluded:

data test;

infile cards dsd missover;

input (Name1 Name2) (:$75.);

spedis=spedis(Name1,Name2);

complev=complev(Name1,Name2,'i');

compged=compged(Name1,Name2,'i');

if compged < 2500;

cards;

Northern Property Real Estate Investment Trust, NORTHERN PPTY REAL ESTATE INVT TR

Mapletree Commercial Trust Units Real Estate Investment Trust Reg, MAPLETREE COMMERCI

Soilbuild Business Space REIT Units Real Estate Investment Trust, SOILBUILD BUSINESS

Northern Property real Estate Investment Trust / NorSerCo. Inc., NORTHERN PPTY REAL ESTATE INVT TR

Shanghai Tonva Petrochemical Co Ltd H Shares, Shanghai Dasheng Agriculture Finance Technology Co Ltd Class H

China Rongsheng Heavy Industries Group Holdings Ltd. H Shares, China Huarong Energy Co Ltd

;

run;

Output:

This kind of stuff requires lots of testing, possibly extra logic and maybe the use of more than one fuzzy technique, and finally a decision about the level of risk you’re willing to accept.

Ksharp · Posted 10-30-2024 10:23 PM

Here is a fuzz matched algorithm I wrote before (could be apply to Chinese or Japanese ..... Characters).

data a;
infile cards truncover;
input a $200.;
cards;
Jon                                                                                                                      
John                                                                                                                  
Johnny                                                                                                             
Will                                                                                                                  
Willy                                                                                                                  
Willie                                                                                                            
Milly 
;
data b;
infile cards truncover;
input b $200.;
cards;
Johnathan                                                                                                                       
William  
;

proc fcmp outlib=work.math.func;
function fuzz_match(a $,b $)  ;
length new_a new_b str_short str_long  $ 200;
new_a=kcompress(a,,'kad');
new_b=kcompress(b,,'kad');
if klength(a)<klength(b) then do;
 str_short=new_a;str_long=new_b;
end;
else do;
  str_short=new_b;str_long=new_a;
end;

array count{10} _temporary_  (0 0 0  0  0  0  0   0    0    0);  /*The count of one string ,two string , three string....*/
array weight{10} _temporary_ (2 4 8 16 32 64 128 256 512 1024);  /*The weight of one string ,two string , three string....*/
max_str_len=min(10,klength(a),klength(b));  /*Search the max length of str is 10*/
do len=1 to max_str_len;      /*Search one string ,two string , three string .......*/
 n=0;
 do start=1 to klength(str_short)+1-len;  /*Search sub-string in  long string*/
  if kfind(str_long,strip(ksubstr(str_short,start,len))) then n+1;
 end;
 count{len}=n; 
end;


sum=0;w=0;mean=0;
do k=1 to max_str_len;
  if count{k} ne 0 then do;sum+count{k}*weight{k}; w+weight{k};end;
end;
/*Calculated weight mean
if w=0 then mean=0;
 else mean=sum/w;
*/
if a=b then sum=constant('exactint');
return (sum); /*return weight sum or weight mean(mean)*/
endsub;
run;

options cmplib=work.math;
proc sql;
create table want as
select a,b,fuzz_match(a,b) as weight
 from  a, b
  order by a,weight desc;
quit;

Comparing company names: compged vs complev vs spedis