If you have SAS Data Quality, PROC DQMATCH will do this for you. You can specify which variable(s) and match definition to use, and optionally apply multiple conditions. This generates a match code for a condition. If you do not apply a condition statement then it is a single composite condition that generates one match code. If the match codes between rows are identical, there is a degree of confidence that those entities are the same.
Here is an example using both names and addresses together.
Name
Address
Bill Bond
202 E Cedar St Cary NC 27511-3440
William L. Bond
202 East Cedar Street, Cary, North Carolina, 27511
Prais Hilton
100 SAS Campus Dr, Cary, NC 27513
Paris Hilton
100 SAS campus Drive Cary NC, 27513
In our data, we know that rows 1 and 2 are the same person, and 3 and 4 are the same person. Row 3 has a mispelled name, while Row 4 has the correct name. The addresses also differ in how they've been input, and one includes a ZIP+4. We need to programmatically determine who is who.
We can use the Name (with Suggestions) match definition to find similar names with mispellings, and the Address match definition for the addresses. If the combination of those two generate the same match code, we will assign them the same group.
You may need to adjust the sensitivity (between 50-100) to get the desired results. Lower sensitivity increases fuzziness, while higher sensitivity decreases fuzziness. You can mix sensitivity on a per-variable basis.
data customers;
length Name Address $50.;
infile datalines dlm='|';
input Name$ Address$;
datalines;
Bill Bond|202 E Cedar St Cary NC 27511-3440
William L. Bond|202 East Cedar Street, Cary, North Carolina, 27511
Prais Hilton|100 SAS Campus Dr, Cary, NC 27513
Paris Hilton|100 SAS campus Drive Cary NC, 27513
;
run;
/* Load the USA English locale */
%dqload(dqlocale=(enusa));
proc dqmatch
data = customers
out = customers2
matchcode = match_cd /* Output variable name of match code */
cluster = group_id; /* Output variable name of a cluster of same match codes */
criteria var=Name matchdef='Name (with Suggestions)' sensitivity=50; /* Name must match */
criteria var=Address matchdef='Address' sensitivity=60; /* and Address must match */
run;
GROUP_ID Name Address MATCH_CD
1 Bill Bond 202 E Cedar St Cary NC 27511-3440 M&P$$$$$$$$$$$$$$$$$$!$$$$$$$$$$$$$$H00$$$$$$$$$$$4&8&Y$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
1 William L. Bond 202 East Cedar Street, Cary, North Carolina, 27511 M&P$$$$$$$$$$$$$$$$$$!$$$$$$$$$$$$$$H00$$$$$$$$$$$4&8&Y$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
2 Prais Hilton 100 SAS Campus Dr, Cary, NC 27513 2&W$$$$$$$$$$$$$$$$$$!$$$$$$$$$$$$$$I00$$$$$$$$$$$4&43&$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
2 Paris Hilton 100 SAS campus Drive Cary NC, 27513 2&W$$$$$$$$$$$$$$$$$$!$$$$$$$$$$$$$$I00$$$$$$$$$$$4&43&$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
... View more