I have a data set with about 5-10% duplicates, but they are not always obvious duplicates because it's user-entered data and there are misspellings. I need to filter out all potential duplicates, which will include both real duplicates and people who merely look like duplicates; I'd rather overmatch than miss anyone.
Once I have the list of potential duplicates, I will take its Cartesian product and use fuzzy matching to identify the actual duplicates. The real data is too large for a full Cartesian product, hence the need to subset first.
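For that second step, what I have in mind is roughly a PROC SQL self-join scored with COMPGED, something like the sketch below (the table name want, the column names, and the cutoff of 250 are placeholders; the cutoff would need tuning on the real data):

```sas
/* Self-join the flagged subset and keep pairs whose generalized
   edit distance on the full name falls under a cutoff.  The
   cutoff (250) is a guess and needs tuning on real data. */
proc sql;
  create table fuzzy_pairs as
  select a.parid as parid_a,
         b.parid as parid_b,
         compged(catx(' ', a.firstname, a.lastname),
                 catx(' ', b.firstname, b.lastname)) as score
  from want as a, want as b
  where a.parid < b.parid
    and calculated score <= 250;
quit;
```

The a.parid < b.parid condition keeps each pair once and drops self-matches, so the join is only half the full Cartesian product.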
I faked some data that is predominantly duplicates; the real data has far fewer, but this helps cover more scenarios.
Here is my solution. It seems rather clunky, though, and I was wondering whether anyone has a better approach?
data have;
  input race $ dob firstname $ lastname $ A B C PARID;
  datalines;
W 1 Dav Hoff 1 1 1 1
H 2 Qyi Kla 1 1 1 2
H 3 Tri Fla 1 1 1 3
W 4 . . 1 1 1 4
B 5 Lip . 1 1 1 5
B 5 Lip Star 1 1 1 6
H 7 . Lin 1 1 1 7
H 7 Kai Lin 1 1 1 8
B 2 Rip Baf 1 1 1 9
W 2 Rip Baff 1 1 1 10
. 3 Sup Lass 1 1 1 11
. 3 Sup Lass 1 1 1 12
W . Fit . 1 1 1 13
W . Fit . 1 1 1 14
. 8 Kaa . 1 1 1 15
. 8 Kaa . 1 1 1 16
;
run;
proc print data=have;
run;
/*data want*/
proc sql;
  create table new1 as
  select *, cats(firstname, lastname) as id1, count(*) as n1
  from have
  group by id1
  having count(*) > 1
  order by parid;
quit;

proc sql;
  create table new2 as
  select *, cats(race, dob) as id2, count(*) as n2
  from have
  group by id2
  having count(*) > 1
  order by parid;
quit;

proc sql;
  create table new3 as
  select *, cats(lastname, dob) as id3, count(*) as n3
  from have
  group by id3
  having count(*) > 1
  order by parid;
quit;

proc sql;
  create table new4 as
  select *, cats(firstname, dob) as id4, count(*) as n4
  from have
  group by id4
  having count(*) > 1
  order by parid;
quit;

/* Union of all flagged records: merge the four subsets by parid.
   Naming the counts (n1-n4) avoids relying on the _TEMG001 name
   that PROC SQL generates for an unnamed count(*) column. */
data want;
  merge new1-new4;
  by parid;
  drop id1-id4 n1-n4;
run;
proc print data=want;
var parid firstname lastname race dob;
run;
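The only tidying I could come up with is driving the four near-identical PROC SQL steps from a macro loop over the key pairs, along these lines (the macro name %flag_dups and the var1*var2 parameter convention are my own, and this is a sketch rather than production code):

```sas
/* Loop over key pairs written as var1*var2: build one flagged
   subset per pair, then merge the subsets by parid. */
%macro flag_dups(pairs);
  %local n i pair v1 v2;
  %let n = %sysfunc(countw(&pairs, %str( )));
  %do i = 1 %to &n;
    %let pair = %scan(&pairs, &i, %str( ));
    %let v1 = %scan(&pair, 1, *);
    %let v2 = %scan(&pair, 2, *);
    proc sql;
      create table new&i as
      select *, cats(&v1, &v2) as id, count(*) as n
      from have
      group by id
      having count(*) > 1
      order by parid;
    quit;
  %end;
  data want;
    merge new1-new&n;
    by parid;
    drop id n;
  run;
%mend flag_dups;

%flag_dups(firstname*lastname race*dob lastname*dob firstname*dob)
```

But that still feels like the same brute-force idea with less typing, so I'd welcome a fundamentally better approach.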