BookmarkSubscribeRSS Feed
catnipper
Calcite | Level 5

I have a data set that has about 5-10%  duplicates, however they are not always obvious duplicates because its user entry data and there are mispellings. I need to filter out any potential duplicates, that will include real duplicates as well as people that only look like duplicates. I rather overmatch than forget anyone. 

 

Once I have a list of duplicates, I will get the Cartesian product and use fuzzy matching to get the actual duplicates. The real data is too large to do a Cartesian product, hence I need to subset before that step.

 

I faked some data to include predominantly duplicates, the real data has way less dups but it helps to cover more scenarios.

 

Here is my solution, however it seems rather clunky and I was wondering if anyone has a better approach for this?

 

data have;
input race $ dob  firstname $ lastname $ A B C PARID;
datalines;
W	1	Dav	Hoff	        1	1	1	1
H	2	Qyi	Kla		1	1	1	2
H	3	Tri	Fla		1	1	1	3
W	4	.	.		1	1	1	4
B	5 	Lip     .		1	1	1	5
B	5 	Lip     Star	        1	1	1	6
H	7	.	Lin		1	1	1	7
H	7	Kai	Lin		1	1	1	8
B	2	Rip	Baf		1	1	1	9
W	2	Rip	Baff	        1	1	1	10
.	3	Sup	Lass	        1	1	1	11
.	3	Sup	Lass	        1	1	1	12
W	.	Fit	.		1	1	1	13
W	.	Fit	.		1	1	1	14
.	8	Kaa	.		1	1	1	15
.	8	Kaa	.		1	1	1	16
;
run;

proc print data=have;
run;

/*data want*/
proc sql;
create table new1 as
SELECT *, CATs(firstname, lastname) as ID1, COUNT(*)
FROM have
GROUP BY ID1
HAVING COUNT(*) > 1
order by parid;
quit;

proc sql;
create table new2 as
SELECT *, CATs(Race,DOB) as ID2, COUNT(*)
FROM have
GROUP BY ID2
HAVING COUNT(*) > 1
order by parid;
quit;
proc sql;
create table new3 as
SELECT *, CATs(lastname,DOB) as ID3, COUNT(*)
FROM have
GROUP BY ID3
HAVING COUNT(*) > 1
order by parid;
quit;
proc sql;
create table new4 as
SELECT *, CATs(firstname,DOB) as ID4, COUNT(*)
FROM have
GROUP BY ID4
HAVING COUNT(*) > 1
order by parid;
quit;

data want;
merge new1-new4;
by parid;
drop ID1 _TEMG001 ID2 ID3 ID4;
run;

proc print data=want;
var parid firstname lastname race dob;
run;

 

 

2 REPLIES 2
ChrisBrooks
Ammonite | Level 13

The only time I ever had to look for duplicate names with possible misspellings I used the SOUNDEX function - it's not perfect and only really works with names of European origin but it might be worth a try. Once you've identified possible duplicates from that you could go on to match against race and DOB.

catnipper
Calcite | Level 5
I have used it too but it won't work well with many of the names I am having and the many missing values. I have to account all/most possible scenarios. I may add it as ID5 though and merge an additional table in.

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

Mastering the WHERE Clause in PROC SQL

SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 2 replies
  • 447 views
  • 0 likes
  • 2 in conversation