topic Re: Looking for records which are "close" but won't be detected with NODUPKEY or NODUPRECS in SAS Programming

Looking for records which are "close" but won't be detected with NODUPKEY or NODUPRECS

_Hopper — Fri, 12 Dec 2025 19:46:42 GMT

How can I generate a numeric score of similarity between records in a dataset that would help me find records that might be duplicates but have typos from one entry to the next. I also would not know in which fields this would occur. The dataset would contain both numeric and character variables.

Re: Looking for records which are "close" but won't be detected with NODUPKEY or NODUPRECS

ChrisHemedinger — Fri, 12 Dec 2025 20:29:34 GMT

It sounds like you are seeking a method of "fuzzy matching" -- identifying values that are close to the same, maybe intended to be the same, but due to the quality of the data, they are different.

Here's a thorough paper by Stephen Sloan that describes an approach and includes code that you can use as a process. For a more narrow tactic using "spelling distance", see this article by Ron Cody.

SAS offers data quality tools to help with this problem for certain types of data, such as names and addresses. These apply a standardization approach that would recognize values like "Bob Smith" and "Robert Smith" as being probably the same entity. Here's another paper specifically about addresses and some base SAS approaches.

Re: Looking for records which are "close" but won't be detected with NODUPKEY or NODUPRECS

Ksharp — Sat, 13 Dec 2025 09:23:03 GMT

The first thing is combining all the variables into one variables .

a.k.a make a column for a row.

Here could give you a start.

data have;
 set sashelp.heart(obs=100);
 length row $ 200;
 obs+1;
 row=catx('|',of _all_);
 keep obs row;
 run;






 
proc fcmp outlib=work.math.func;
function fuzz_match(a $,b $)  ;
length new_a new_b str_short str_long  $ 200;  /*only support the max length of string is 200*/
new_a=kcompress(a,'|','kad');
new_b=kcompress(b,'|','kad');
if klength(a)<klength(b) then do;
 str_short=new_a;str_long=new_b;
end;
else do;
  str_short=new_b;str_long=new_a;
end;

array count{10} _temporary_  (0 0 0  0  0  0  0   0    0    0);  /*The count of one string ,two string , three string....*/
array weight{10} _temporary_ (2 4 8 16 32 64 128 256 512 1024);  /*The weight of one string ,two string , three string....*/
max_str_len=min(10,klength(a),klength(b));  /*Search the max length of str is 10*/
do len=1 to max_str_len;      /*Search  string which's length is 1, 2, 3.......*/
 n=0;
 do start=1 to klength(str_short)+1-len;  /*Search sub-string in  long string*/
  if kfind(str_long,strip(ksubstr(str_short,start,len))) then n+1;
 end;
 count{len}=n; 
end;


sum=0;w=0;mean=0;
do k=1 to max_str_len;
  if count{k} ne 0 then do;sum+count{k}*weight{k}; w+weight{k};end;
end;
/*Calculated weight mean
if w=0 then mean=0;
 else mean=sum/w;
*/
if a=b then sum=constant('exactint');
return (sum); /*return weight sum or weight mean(mean)*/
endsub;
run;



options cmplib=work.math;
proc sql;
create table want as
select a.obs,a.row as row1,b.obs as obs2,b.row as row2,fuzz_match(a.row,b.row) as weight
 from  have as a,have as b
  where a.obs ne b.obs
  order by a.obs,weight desc;
quit;

Re: Looking for records which are "close" but won't be detected with NODUPKEY or NODUPRECS

_Hopper — Mon, 15 Dec 2025 14:56:42 GMT

Thanks this helps. For longer strings the array size must increase correct? Also the string length can and is quite long, what other adjustments are required to account for that?

Lastly, is a lower score a better match or a higher one?

Re: Looking for records which are "close" but won't be detected with NODUPKEY or NODUPRECS

Stu_SAS — Tue, 16 Dec 2025 01:55:29 GMT

If you have SAS Data Quality, PROC DQMATCH will do this for you. You can specify which variable(s) and match definition to use, and optionally apply multiple conditions. This generates a match code for a condition. If you do not apply a condition statement then it is a single composite condition that generates one match code. If the match codes between rows are identical, there is a degree of confidence that those entities are the same.

Here is an example using both names and addresses together.

Name	Address
Bill Bond	202 E Cedar St Cary NC 27511-3440
William L. Bond	202 East Cedar Street, Cary, North Carolina, 27511
Prais Hilton	100 SAS Campus Dr, Cary, NC 27513
Paris Hilton	100 SAS campus Drive Cary NC, 27513

In our data, we know that rows 1 and 2 are the same person, and 3 and 4 are the same person. Row 3 has a mispelled name, while Row 4 has the correct name. The addresses also differ in how they've been input, and one includes a ZIP+4. We need to programmatically determine who is who.

We can use the Name (with Suggestions) match definition to find similar names with mispellings, and the Address match definition for the addresses. If the combination of those two generate the same match code, we will assign them the same group.

You may need to adjust the sensitivity (between 50-100) to get the desired results. Lower sensitivity increases fuzziness, while higher sensitivity decreases fuzziness. You can mix sensitivity on a per-variable basis.

data customers;
    length Name Address $50.;
    infile datalines dlm='|';
    input Name$ Address$;
    datalines;
Bill Bond|202 E Cedar St Cary NC 27511-3440
William L. Bond|202 East Cedar Street, Cary, North Carolina, 27511 
Prais Hilton|100 SAS Campus Dr, Cary, NC 27513
Paris Hilton|100 SAS campus Drive Cary NC, 27513
;
run;

/* Load the USA English locale */
%dqload(dqlocale=(enusa));

proc dqmatch 
    data = customers 
    out  = customers2 
    matchcode = match_cd   /* Output variable name of match code */
    cluster   = group_id;  /* Output variable name of a cluster of same match codes */

    criteria var=Name    matchdef='Name (with Suggestions)' sensitivity=50; /* Name must match */
    criteria var=Address matchdef='Address' sensitivity=60;                 /* and Address must match */
run;

GROUP_ID	Name	          Address	                                        MATCH_CD
1	        Bill Bond	  202 E Cedar St Cary NC 27511-3440	                M&P$$$$$$$$$$$$$$$$$$!$$$$$$$$$$$$$$H00$$$$$$$$$$$4&8&Y$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
1	        William L. Bond	  202 East Cedar Street, Cary, North Carolina, 27511    M&P$$$$$$$$$$$$$$$$$$!$$$$$$$$$$$$$$H00$$$$$$$$$$$4&8&Y$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
2	        Prais Hilton	  100 SAS Campus Dr, Cary, NC 27513	                2&W$$$$$$$$$$$$$$$$$$!$$$$$$$$$$$$$$I00$$$$$$$$$$$4&43&$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
2	        Paris Hilton	  100 SAS campus Drive Cary NC, 27513	                2&W$$$$$$$$$$$$$$$$$$!$$$$$$$$$$$$$$I00$$$$$$$$$$$4&43&$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$

Re: Looking for records which are "close" but won't be detected with NODUPKEY or NODUPRECS

Ksharp — Tue, 16 Dec 2025 07:06:55 GMT

"For longer strings the array size must increase correct? "
Nope. array is used for checking sub-string ,not whole string.
Here I check sub-string which's length is 1,2,3,4,...10.

"Also the string length can and is quite long, what other adjustments are required to account for that?"
You need to change the following line:
length new_a new_b str_short str_long $ 200;
As I commented that max length of string is 200. if you have string longer, change 200 into other number.

"Lastly, is a lower score a better match or a higher one?"
A higher score match better.
I already sort it by WEIGHT decending. The first obs in each group (a.obs) is the best match , you can just keep it and drop others .