How can I generate a numeric score of similarity between records in a dataset that would help me find records that might be duplicates but have typos from one entry to the next. I also would not know in which fields this would occur. The dataset would contain both numeric and character variables.
It sounds like you are seeking a method of "fuzzy matching" -- identifying values that are close to the same, maybe intended to be the same, but due to the quality of the data, they are different.
Here's a thorough paper by Stephen Sloan that describes an approach and includes code that you can use as a process. For a more narrow tactic using "spelling distance", see this article by Ron Cody.
SAS offers data quality tools to help with this problem for certain types of data, such as names and addresses. These apply a standardization approach that would recognize values like "Bob Smith" and "Robert Smith" as being probably the same entity. Here's another paper specifically about addresses and some base SAS approaches.
The first thing is combining all the variables into one variables .
a.k.a make a column for a row.
Here could give you a start.
data have;
set sashelp.heart(obs=100);
length row $ 200;
obs+1;
row=catx('|',of _all_);
keep obs row;
run;
proc fcmp outlib=work.math.func;
function fuzz_match(a $,b $) ;
length new_a new_b str_short str_long $ 200; /*only support the max length of string is 200*/
new_a=kcompress(a,'|','kad');
new_b=kcompress(b,'|','kad');
if klength(a)<klength(b) then do;
str_short=new_a;str_long=new_b;
end;
else do;
str_short=new_b;str_long=new_a;
end;
array count{10} _temporary_ (0 0 0 0 0 0 0 0 0 0); /*The count of one string ,two string , three string....*/
array weight{10} _temporary_ (2 4 8 16 32 64 128 256 512 1024); /*The weight of one string ,two string , three string....*/
max_str_len=min(10,klength(a),klength(b)); /*Search the max length of str is 10*/
do len=1 to max_str_len; /*Search string which's length is 1, 2, 3.......*/
n=0;
do start=1 to klength(str_short)+1-len; /*Search sub-string in long string*/
if kfind(str_long,strip(ksubstr(str_short,start,len))) then n+1;
end;
count{len}=n;
end;
sum=0;w=0;mean=0;
do k=1 to max_str_len;
if count{k} ne 0 then do;sum+count{k}*weight{k}; w+weight{k};end;
end;
/*Calculated weight mean
if w=0 then mean=0;
else mean=sum/w;
*/
if a=b then sum=constant('exactint');
return (sum); /*return weight sum or weight mean(mean)*/
endsub;
run;
options cmplib=work.math;
proc sql;
create table want as
select a.obs,a.row as row1,b.obs as obs2,b.row as row2,fuzz_match(a.row,b.row) as weight
from have as a,have as b
where a.obs ne b.obs
order by a.obs,weight desc;
quit;
Thanks this helps. For longer strings the array size must increase correct? Also the string length can and is quite long, what other adjustments are required to account for that?
Lastly, is a lower score a better match or a higher one?
If you have SAS Data Quality, PROC DQMATCH will do this for you. You can specify which variable(s) and match definition to use, and optionally apply multiple conditions. This generates a match code for a condition. If you do not apply a condition statement then it is a single composite condition that generates one match code. If the match codes between rows are identical, there is a degree of confidence that those entities are the same.
Here is an example using both names and addresses together.
| Name | Address |
| Bill Bond |
202 E Cedar St Cary NC 27511-3440
|
| William L. Bond |
202 East Cedar Street, Cary, North Carolina, 27511
|
| Prais Hilton | 100 SAS Campus Dr, Cary, NC 27513 |
| Paris Hilton | 100 SAS campus Drive Cary NC, 27513 |
In our data, we know that rows 1 and 2 are the same person, and 3 and 4 are the same person. Row 3 has a mispelled name, while Row 4 has the correct name. The addresses also differ in how they've been input, and one includes a ZIP+4. We need to programmatically determine who is who.
We can use the Name (with Suggestions) match definition to find similar names with mispellings, and the Address match definition for the addresses. If the combination of those two generate the same match code, we will assign them the same group.
You may need to adjust the sensitivity (between 50-100) to get the desired results. Lower sensitivity increases fuzziness, while higher sensitivity decreases fuzziness. You can mix sensitivity on a per-variable basis.
data customers;
length Name Address $50.;
infile datalines dlm='|';
input Name$ Address$;
datalines;
Bill Bond|202 E Cedar St Cary NC 27511-3440
William L. Bond|202 East Cedar Street, Cary, North Carolina, 27511
Prais Hilton|100 SAS Campus Dr, Cary, NC 27513
Paris Hilton|100 SAS campus Drive Cary NC, 27513
;
run;
/* Load the USA English locale */
%dqload(dqlocale=(enusa));
proc dqmatch
data = customers
out = customers2
matchcode = match_cd /* Output variable name of match code */
cluster = group_id; /* Output variable name of a cluster of same match codes */
criteria var=Name matchdef='Name (with Suggestions)' sensitivity=50; /* Name must match */
criteria var=Address matchdef='Address' sensitivity=60; /* and Address must match */
run;
GROUP_ID Name Address MATCH_CD 1 Bill Bond 202 E Cedar St Cary NC 27511-3440 M&P$$$$$$$$$$$$$$$$$$!$$$$$$$$$$$$$$H00$$$$$$$$$$$4&8&Y$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$ 1 William L. Bond 202 East Cedar Street, Cary, North Carolina, 27511 M&P$$$$$$$$$$$$$$$$$$!$$$$$$$$$$$$$$H00$$$$$$$$$$$4&8&Y$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$ 2 Prais Hilton 100 SAS Campus Dr, Cary, NC 27513 2&W$$$$$$$$$$$$$$$$$$!$$$$$$$$$$$$$$I00$$$$$$$$$$$4&43&$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$ 2 Paris Hilton 100 SAS campus Drive Cary NC, 27513 2&W$$$$$$$$$$$$$$$$$$!$$$$$$$$$$$$$$I00$$$$$$$$$$$4&43&$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
April 27 – 30 | Gaylord Texan | Grapevine, Texas
Walk in ready to learn. Walk out ready to deliver. This is the data and AI conference you can't afford to miss.
Register now and lock in 2025 pricing—just $495!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.