I have to compare between 2 company name variables to match data. I am using the three functions in the title.
compged = compged(name,name2);
complev = complev(name,name2);
spedis = spedis(name,name2);
What are the minimum and maximum values for each function? and what does it mean? To my understanding, the lower the better. 0 means an exact match. However, COMPGED returns a very high number (600 or 1000) for a closely matched obs.
The attached photo shows some examples. The first 4 rows would be a match and the bottom 2 would not be a match. The 3 functions return very high numbers. I am aware that this is because of missing letters in variable NAME2 for the first 4 rows. But what I want to achieve is a return of a match for the first 4 rows, and no match for the last 2. However, using the values from the 3 functions cannot help to determine that.
What would be a better approach?
Some articles that might help:
From a quick look at your example you should seriously consider using the options to ignore case. Some of the results you get are being inflated because of case differences.
The documentation on the functions tells you what is considered and used to assign values. One note from the documentation is that Compged and Complev are faster than Spedis.
And Compged can work with Call Compcost so you set the rules for how much certain of the rules set though not an exercise for the faint of heart. This might be a serious advantage if you know a lot about some behaviors between the sources of the strings.
As has been noted by others, adding the ‘i’ modifier to the COMPLEV and COMPGED functions to ignore case will lower those values. But there’s always a risk of false positives like with your last two rows.
Assume you’re looking for a match for “Minnesota Power Company” and among your rows of data you have “MN Power Co.” and “Minnesota Shower Company”. “Minnesota Shower Company” requires only two small changes to become “Minnesota Power Company” and so is the closest match. It’s a problem that can be alleviated a bit by using fuzzy logic in addition to other join or search criteria, i.e., match on other fields if available and if those match then names can be fuzzy matched more safely. For example, you might have 8 companies at the same address with the only difference being unique PO #s that aren’t in the data, so all the addresses are the same. Now if our fuzzy functions find two company names at the same address to be very similar you can more safely assume they’re a valid match. But false positives are still possible. Unfortunately, “fuzzy” really is “fuzzy” and not the warm kind. This logic can be helpful but comes with inherent risks.
With that said, in this case, using the ‘i’ modifier and filtering on COMPGED < 2500 gets you what you want - first four rows included, last two excluded:
data test;
infile cards dsd missover;
input (Name1 Name2) (:$75.);
spedis=spedis(Name1,Name2);
complev=complev(Name1,Name2,'i');
compged=compged(Name1,Name2,'i');
if compged < 2500;
cards;
Northern Property Real Estate Investment Trust, NORTHERN PPTY REAL ESTATE INVT TR
Mapletree Commercial Trust Units Real Estate Investment Trust Reg, MAPLETREE COMMERCI
Soilbuild Business Space REIT Units Real Estate Investment Trust, SOILBUILD BUSINESS
Northern Property real Estate Investment Trust / NorSerCo. Inc., NORTHERN PPTY REAL ESTATE INVT TR
Shanghai Tonva Petrochemical Co Ltd H Shares, Shanghai Dasheng Agriculture Finance Technology Co Ltd Class H
China Rongsheng Heavy Industries Group Holdings Ltd. H Shares, China Huarong Energy Co Ltd
;
run;
Output:
This kind of stuff requires lots of testing, possibly extra logic and maybe the use of more than one fuzzy technique, and finally a decision about the level of risk you’re willing to accept.
Here is a fuzz matched algorithm I wrote before (could be apply to Chinese or Japanese ..... Characters).
data a;
infile cards truncover;
input a $200.;
cards;
Jon
John
Johnny
Will
Willy
Willie
Milly
;
data b;
infile cards truncover;
input b $200.;
cards;
Johnathan
William
;
proc fcmp outlib=work.math.func;
function fuzz_match(a $,b $) ;
length new_a new_b str_short str_long $ 200;
new_a=kcompress(a,,'kad');
new_b=kcompress(b,,'kad');
if klength(a)<klength(b) then do;
str_short=new_a;str_long=new_b;
end;
else do;
str_short=new_b;str_long=new_a;
end;
array count{10} _temporary_ (0 0 0 0 0 0 0 0 0 0); /*The count of one string ,two string , three string....*/
array weight{10} _temporary_ (2 4 8 16 32 64 128 256 512 1024); /*The weight of one string ,two string , three string....*/
max_str_len=min(10,klength(a),klength(b)); /*Search the max length of str is 10*/
do len=1 to max_str_len; /*Search one string ,two string , three string .......*/
n=0;
do start=1 to klength(str_short)+1-len; /*Search sub-string in long string*/
if kfind(str_long,strip(ksubstr(str_short,start,len))) then n+1;
end;
count{len}=n;
end;
sum=0;w=0;mean=0;
do k=1 to max_str_len;
if count{k} ne 0 then do;sum+count{k}*weight{k}; w+weight{k};end;
end;
/*Calculated weight mean
if w=0 then mean=0;
else mean=sum/w;
*/
if a=b then sum=constant('exactint');
return (sum); /*return weight sum or weight mean(mean)*/
endsub;
run;
options cmplib=work.math;
proc sql;
create table want as
select a,b,fuzz_match(a,b) as weight
from a, b
order by a,weight desc;
quit;
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.