How can I generate a numeric score of similarity between records in a dataset that would help me find records that might be duplicates but have typos from one entry to the next. I also would not know in which fields this would occur. The dataset would contain both numeric and character variables.
It sounds like you are seeking a method of "fuzzy matching" -- identifying values that are close to the same, maybe intended to be the same, but due to the quality of the data, they are different.
Here's a thorough paper by Stephen Sloan that describes an approach and includes code that you can use as a process. For a more narrow tactic using "spelling distance", see this article by Ron Cody.
SAS offers data quality tools to help with this problem for certain types of data, such as names and addresses. These apply a standardization approach that would recognize values like "Bob Smith" and "Robert Smith" as being probably the same entity. Here's another paper specifically about addresses and some base SAS approaches.
The first thing is combining all the variables into one variables .
a.k.a make a column for a row.
Here could give you a start.
data have;
set sashelp.heart(obs=100);
length row $ 200;
obs+1;
row=catx('|',of _all_);
keep obs row;
run;
proc fcmp outlib=work.math.func;
function fuzz_match(a $,b $) ;
length new_a new_b str_short str_long $ 200; /*only support the max length of string is 200*/
new_a=kcompress(a,'|','kad');
new_b=kcompress(b,'|','kad');
if klength(a)<klength(b) then do;
str_short=new_a;str_long=new_b;
end;
else do;
str_short=new_b;str_long=new_a;
end;
array count{10} _temporary_ (0 0 0 0 0 0 0 0 0 0); /*The count of one string ,two string , three string....*/
array weight{10} _temporary_ (2 4 8 16 32 64 128 256 512 1024); /*The weight of one string ,two string , three string....*/
max_str_len=min(10,klength(a),klength(b)); /*Search the max length of str is 10*/
do len=1 to max_str_len; /*Search string which's length is 1, 2, 3.......*/
n=0;
do start=1 to klength(str_short)+1-len; /*Search sub-string in long string*/
if kfind(str_long,strip(ksubstr(str_short,start,len))) then n+1;
end;
count{len}=n;
end;
sum=0;w=0;mean=0;
do k=1 to max_str_len;
if count{k} ne 0 then do;sum+count{k}*weight{k}; w+weight{k};end;
end;
/*Calculated weight mean
if w=0 then mean=0;
else mean=sum/w;
*/
if a=b then sum=constant('exactint');
return (sum); /*return weight sum or weight mean(mean)*/
endsub;
run;
options cmplib=work.math;
proc sql;
create table want as
select a.obs,a.row as row1,b.obs as obs2,b.row as row2,fuzz_match(a.row,b.row) as weight
from have as a,have as b
where a.obs ne b.obs
order by a.obs,weight desc;
quit;
April 27 – 30 | Gaylord Texan | Grapevine, Texas
Walk in ready to learn. Walk out ready to deliver. This is the data and AI conference you can't afford to miss.
Register now and lock in 2025 pricing—just $495!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.