BookmarkSubscribeRSS Feed
_Hopper
Obsidian | Level 7

How can I generate a numeric score of similarity between records in a dataset that would help me find records that might be duplicates but have typos from one entry to the next. I also would not know in which fields this would occur. The dataset would contain both numeric and character variables.

5 REPLIES 5
ChrisHemedinger
Community Manager

It sounds like you are seeking a method of "fuzzy matching" -- identifying values that are close to the same, maybe intended to be the same, but due to the quality of the data, they are different.

 

Here's a thorough paper by Stephen Sloan that describes an approach  and includes code that you can use as a process. For a more narrow tactic using "spelling distance", see this article by Ron Cody

 

SAS offers data quality tools to help with this problem for certain types of data, such as names and addresses. These apply a standardization approach that would recognize values like "Bob Smith" and "Robert Smith" as being probably the same entity. Here's another paper specifically about addresses and some base SAS approaches.

SAS For Dummies 3rd Edition! Check out the new edition, covering SAS 9.4, SAS Viya, and all of the modern ways to use SAS!
Ksharp
Super User

The first thing is combining all the variables into one variables .

a.k.a make a column for a row.

Here could give you a start.

 

data have;
 set sashelp.heart(obs=100);
 length row $ 200;
 obs+1;
 row=catx('|',of _all_);
 keep obs row;
 run;






 
proc fcmp outlib=work.math.func;
function fuzz_match(a $,b $)  ;
length new_a new_b str_short str_long  $ 200;  /*only support the max length of string is 200*/
new_a=kcompress(a,'|','kad');
new_b=kcompress(b,'|','kad');
if klength(a)<klength(b) then do;
 str_short=new_a;str_long=new_b;
end;
else do;
  str_short=new_b;str_long=new_a;
end;

array count{10} _temporary_  (0 0 0  0  0  0  0   0    0    0);  /*The count of one string ,two string , three string....*/
array weight{10} _temporary_ (2 4 8 16 32 64 128 256 512 1024);  /*The weight of one string ,two string , three string....*/
max_str_len=min(10,klength(a),klength(b));  /*Search the max length of str is 10*/
do len=1 to max_str_len;      /*Search  string which's length is 1, 2, 3.......*/
 n=0;
 do start=1 to klength(str_short)+1-len;  /*Search sub-string in  long string*/
  if kfind(str_long,strip(ksubstr(str_short,start,len))) then n+1;
 end;
 count{len}=n; 
end;


sum=0;w=0;mean=0;
do k=1 to max_str_len;
  if count{k} ne 0 then do;sum+count{k}*weight{k}; w+weight{k};end;
end;
/*Calculated weight mean
if w=0 then mean=0;
 else mean=sum/w;
*/
if a=b then sum=constant('exactint');
return (sum); /*return weight sum or weight mean(mean)*/
endsub;
run;



options cmplib=work.math;
proc sql;
create table want as
select a.obs,a.row as row1,b.obs as obs2,b.row as row2,fuzz_match(a.row,b.row) as weight
 from  have as a,have as b
  where a.obs ne b.obs
  order by a.obs,weight desc;
quit;

 

 

_Hopper
Obsidian | Level 7

Thanks this helps. For longer strings the array size must increase correct? Also the string length can and is quite long, what other adjustments are required to account for that?

 

Lastly, is a lower score a better match or a higher one?

Ksharp
Super User
"For longer strings the array size must increase correct? "
Nope. array is used for checking sub-string ,not whole string.
Here I check sub-string which's length is 1,2,3,4,...10.


"Also the string length can and is quite long, what other adjustments are required to account for that?"
You need to change the following line:
length new_a new_b str_short str_long $ 200;
As I commented that max length of string is 200. if you have string longer, change 200 into other number.


"Lastly, is a lower score a better match or a higher one?"
A higher score match better.
I already sort it by WEIGHT decending. The first obs in each group (a.obs) is the best match , you can just keep it and drop others .
Stu_SAS
SAS Employee

If you have SAS Data Quality, PROC DQMATCH will do this for you. You can specify which variable(s) and match definition to use, and optionally apply multiple conditions. This generates a match code for a condition. If you do not apply a condition statement then it is a single composite condition that generates one match code. If the match codes between rows are identical, there is a degree of confidence that those entities are the same.

 

Here is an example using both names and addresses together.

 

Name Address
Bill Bond
202 E Cedar St Cary NC 27511-3440
William L. Bond
202 East Cedar Street, Cary, North Carolina, 27511
Prais Hilton 100 SAS Campus Dr, Cary, NC 27513
Paris Hilton 100 SAS campus Drive Cary NC, 27513

 

In our data, we know that rows 1 and 2 are the same person, and 3 and 4 are the same person. Row 3 has a mispelled name, while Row 4 has the correct name. The addresses also differ in how they've been input, and one includes a ZIP+4. We need to programmatically determine who is who.

 

We can use the Name (with Suggestions) match definition to find similar names with mispellings, and the Address match definition for the addresses. If the combination of those two generate the same match code, we will assign them the same group.

 

You may need to adjust the sensitivity (between 50-100) to get the desired results. Lower sensitivity increases fuzziness, while higher sensitivity decreases fuzziness. You can mix sensitivity on a per-variable basis.

 

 

data customers;
    length Name Address $50.;
    infile datalines dlm='|';
    input Name$ Address$;
    datalines;
Bill Bond|202 E Cedar St Cary NC 27511-3440
William L. Bond|202 East Cedar Street, Cary, North Carolina, 27511 
Prais Hilton|100 SAS Campus Dr, Cary, NC 27513
Paris Hilton|100 SAS campus Drive Cary NC, 27513
;
run;

/* Load the USA English locale */
%dqload(dqlocale=(enusa));

proc dqmatch 
    data = customers 
    out  = customers2 
    matchcode = match_cd   /* Output variable name of match code */
    cluster   = group_id;  /* Output variable name of a cluster of same match codes */

    criteria var=Name    matchdef='Name (with Suggestions)' sensitivity=50; /* Name must match */
    criteria var=Address matchdef='Address' sensitivity=60;                 /* and Address must match */
run;

 

GROUP_ID	Name	          Address	                                        MATCH_CD
1	        Bill Bond	  202 E Cedar St Cary NC 27511-3440	                M&P$$$$$$$$$$$$$$$$$$!$$$$$$$$$$$$$$H00$$$$$$$$$$$4&8&Y$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
1	        William L. Bond	  202 East Cedar Street, Cary, North Carolina, 27511    M&P$$$$$$$$$$$$$$$$$$!$$$$$$$$$$$$$$H00$$$$$$$$$$$4&8&Y$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
2	        Prais Hilton	  100 SAS Campus Dr, Cary, NC 27513	                2&W$$$$$$$$$$$$$$$$$$!$$$$$$$$$$$$$$I00$$$$$$$$$$$4&43&$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
2	        Paris Hilton	  100 SAS campus Drive Cary NC, 27513	                2&W$$$$$$$$$$$$$$$$$$!$$$$$$$$$$$$$$I00$$$$$$$$$$$4&43&$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$

 

 

 

sas-innovate-2026-white.png



April 27 – 30 | Gaylord Texan | Grapevine, Texas

Registration is open

Walk in ready to learn. Walk out ready to deliver. This is the data and AI conference you can't afford to miss.
Register now and lock in 2025 pricing—just $495!

Register now

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 5 replies
  • 257 views
  • 2 likes
  • 4 in conversation