BookmarkSubscribeRSS Feed
_Hopper
Obsidian | Level 7

How can I generate a numeric score of similarity between records in a dataset that would help me find records that might be duplicates but have typos from one entry to the next. I also would not know in which fields this would occur. The dataset would contain both numeric and character variables.

2 REPLIES 2
ChrisHemedinger
Community Manager

It sounds like you are seeking a method of "fuzzy matching" -- identifying values that are close to the same, maybe intended to be the same, but due to the quality of the data, they are different.

 

Here's a thorough paper by Stephen Sloan that describes an approach  and includes code that you can use as a process. For a more narrow tactic using "spelling distance", see this article by Ron Cody

 

SAS offers data quality tools to help with this problem for certain types of data, such as names and addresses. These apply a standardization approach that would recognize values like "Bob Smith" and "Robert Smith" as being probably the same entity. Here's another paper specifically about addresses and some base SAS approaches.

SAS For Dummies 3rd Edition! Check out the new edition, covering SAS 9.4, SAS Viya, and all of the modern ways to use SAS!
Ksharp
Super User

The first thing is combining all the variables into one variables .

a.k.a make a column for a row.

Here could give you a start.

 

data have;
 set sashelp.heart(obs=100);
 length row $ 200;
 obs+1;
 row=catx('|',of _all_);
 keep obs row;
 run;






 
proc fcmp outlib=work.math.func;
function fuzz_match(a $,b $)  ;
length new_a new_b str_short str_long  $ 200;  /*only support the max length of string is 200*/
new_a=kcompress(a,'|','kad');
new_b=kcompress(b,'|','kad');
if klength(a)<klength(b) then do;
 str_short=new_a;str_long=new_b;
end;
else do;
  str_short=new_b;str_long=new_a;
end;

array count{10} _temporary_  (0 0 0  0  0  0  0   0    0    0);  /*The count of one string ,two string , three string....*/
array weight{10} _temporary_ (2 4 8 16 32 64 128 256 512 1024);  /*The weight of one string ,two string , three string....*/
max_str_len=min(10,klength(a),klength(b));  /*Search the max length of str is 10*/
do len=1 to max_str_len;      /*Search  string which's length is 1, 2, 3.......*/
 n=0;
 do start=1 to klength(str_short)+1-len;  /*Search sub-string in  long string*/
  if kfind(str_long,strip(ksubstr(str_short,start,len))) then n+1;
 end;
 count{len}=n; 
end;


sum=0;w=0;mean=0;
do k=1 to max_str_len;
  if count{k} ne 0 then do;sum+count{k}*weight{k}; w+weight{k};end;
end;
/*Calculated weight mean
if w=0 then mean=0;
 else mean=sum/w;
*/
if a=b then sum=constant('exactint');
return (sum); /*return weight sum or weight mean(mean)*/
endsub;
run;



options cmplib=work.math;
proc sql;
create table want as
select a.obs,a.row as row1,b.obs as obs2,b.row as row2,fuzz_match(a.row,b.row) as weight
 from  have as a,have as b
  where a.obs ne b.obs
  order by a.obs,weight desc;
quit;

 

 

sas-innovate-2026-white.png



April 27 – 30 | Gaylord Texan | Grapevine, Texas

Registration is open

Walk in ready to learn. Walk out ready to deliver. This is the data and AI conference you can't afford to miss.
Register now and lock in 2025 pricing—just $495!

Register now

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 2 replies
  • 113 views
  • 0 likes
  • 3 in conversation