I have a data set with about 5-10% duplicates, but they are not always obvious duplicates because it's user-entered data and there are misspellings. I need to filter out all potential duplicates, which will include both real duplicates and people who merely look like duplicates; I'd rather overmatch than miss anyone.
Once I have the list of potential duplicates, I will take its Cartesian product and use fuzzy matching to identify the actual duplicates. The real data is too large for a full Cartesian product, hence the need to subset first.
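For that second step, what I have in mind is roughly a PROC SQL self-join scored with COMPGED, something like the sketch below (the table name want, the column names, and the cutoff of 250 are placeholders; the cutoff would need tuning on the real data):

```sas
/* Self-join the flagged subset and keep pairs whose generalized
   edit distance on the full name falls under a cutoff.  The
   cutoff (250) is a guess and needs tuning on real data. */
proc sql;
  create table fuzzy_pairs as
  select a.parid as parid_a,
         b.parid as parid_b,
         compged(catx(' ', a.firstname, a.lastname),
                 catx(' ', b.firstname, b.lastname)) as score
  from want as a, want as b
  where a.parid < b.parid
    and calculated score <= 250;
quit;
```

The a.parid < b.parid condition keeps each pair once and drops self-matches, so the join is only half the full Cartesian product.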
I faked some data that is predominantly duplicates; the real data has far fewer, but this helps cover more scenarios.
Here is my solution. It seems rather clunky, though, and I was wondering whether anyone has a better approach?
data have;
  input race $ dob firstname $ lastname $ A B C PARID;
  datalines;
W 1 Dav Hoff 1 1 1 1
H 2 Qyi Kla 1 1 1 2
H 3 Tri Fla 1 1 1 3
W 4 . . 1 1 1 4
B 5 Lip . 1 1 1 5
B 5 Lip Star 1 1 1 6
H 7 . Lin 1 1 1 7
H 7 Kai Lin 1 1 1 8
B 2 Rip Baf 1 1 1 9
W 2 Rip Baff 1 1 1 10
. 3 Sup Lass 1 1 1 11
. 3 Sup Lass 1 1 1 12
W . Fit . 1 1 1 13
W . Fit . 1 1 1 14
. 8 Kaa . 1 1 1 15
. 8 Kaa . 1 1 1 16
;
run;
proc print data=have;
run;
/*data want*/
proc sql;
  create table new1 as
  select *, cats(firstname, lastname) as id1, count(*) as n1
  from have
  group by id1
  having count(*) > 1
  order by parid;
quit;

proc sql;
  create table new2 as
  select *, cats(race, dob) as id2, count(*) as n2
  from have
  group by id2
  having count(*) > 1
  order by parid;
quit;

proc sql;
  create table new3 as
  select *, cats(lastname, dob) as id3, count(*) as n3
  from have
  group by id3
  having count(*) > 1
  order by parid;
quit;

proc sql;
  create table new4 as
  select *, cats(firstname, dob) as id4, count(*) as n4
  from have
  group by id4
  having count(*) > 1
  order by parid;
quit;

/* Union of all flagged records: merge the four subsets by parid.
   Naming the counts (n1-n4) avoids relying on the _TEMG001 name
   that PROC SQL generates for an unnamed count(*) column. */
data want;
  merge new1-new4;
  by parid;
  drop id1-id4 n1-n4;
run;
proc print data=want;
var parid firstname lastname race dob;
run;
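The only tidying I could come up with is driving the four near-identical PROC SQL steps from a macro loop over the key pairs, along these lines (the macro name %flag_dups and the var1*var2 parameter convention are my own, and this is a sketch rather than production code):

```sas
/* Loop over key pairs written as var1*var2: build one flagged
   subset per pair, then merge the subsets by parid. */
%macro flag_dups(pairs);
  %local n i pair v1 v2;
  %let n = %sysfunc(countw(&pairs, %str( )));
  %do i = 1 %to &n;
    %let pair = %scan(&pairs, &i, %str( ));
    %let v1 = %scan(&pair, 1, *);
    %let v2 = %scan(&pair, 2, *);
    proc sql;
      create table new&i as
      select *, cats(&v1, &v2) as id, count(*) as n
      from have
      group by id
      having count(*) > 1
      order by parid;
    quit;
  %end;
  data want;
    merge new1-new&n;
    by parid;
    drop id n;
  run;
%mend flag_dups;

%flag_dups(firstname*lastname race*dob lastname*dob firstname*dob)
```

But that still feels like the same brute-force idea with less typing, so I'd welcome a fundamentally better approach.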