<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: List of potential duplicate values without removing them - Better Ways in New SAS User</title>
    <link>https://communities.sas.com/t5/New-SAS-User/List-of-potential-duplicate-values-without-removing-them-Better/m-p/530827#M5812</link>
    <description>I have used it too but it won't work well with many of the names I am having and the many missing values. I have to account all/most possible scenarios. I may add it as ID5 though and merge an additional table in.</description>
    <pubDate>Tue, 29 Jan 2019 00:32:25 GMT</pubDate>
    <dc:creator>catnipper</dc:creator>
    <dc:date>2019-01-29T00:32:25Z</dc:date>
    <item>
      <title>List of potential duplicate values without removing them - Better Ways</title>
      <link>https://communities.sas.com/t5/New-SAS-User/List-of-potential-duplicate-values-without-removing-them-Better/m-p/530824#M5809</link>
      <description>&lt;P&gt;I have a data set that has about 5-10%&amp;nbsp; duplicates, however they are not always obvious duplicates because its user entry data and there are mispellings. I need to filter out any &lt;U&gt;potential&lt;/U&gt; duplicates, that will include real duplicates as well as people that only look like duplicates. I rather overmatch than forget anyone.&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Once I have a list of duplicates, I will get the&amp;nbsp;&lt;SPAN&gt;Cartesian product and use fuzzy matching to get the actual duplicates. The real data is too large to do a&amp;nbsp;Cartesian product, hence I need to subset before that step.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;I faked some data to include predominantly duplicates, the real data has way less dups but it helps to cover more scenarios.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Here is my solution, however it seems rather clunky and I was wondering if anyone has a better approach for this?&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;PRE&gt;data have;
input race $ dob  firstname $ lastname $ A B C PARID;
datalines;
W	1	Dav	Hoff	        1	1	1	1
H	2	Qyi	Kla		1	1	1	2
H	3	Tri	Fla		1	1	1	3
W	4	.	.		1	1	1	4
B	5 	Lip     .		1	1	1	5
B	5 	Lip     Star	        1	1	1	6
H	7	.	Lin		1	1	1	7
H	7	Kai	Lin		1	1	1	8
B	2	Rip	Baf		1	1	1	9
W	2	Rip	Baff	        1	1	1	10
.	3	Sup	Lass	        1	1	1	11
.	3	Sup	Lass	        1	1	1	12
W	.	Fit	.		1	1	1	13
W	.	Fit	.		1	1	1	14
.	8	Kaa	.		1	1	1	15
.	8	Kaa	.		1	1	1	16
;
run;

proc print data=have;
run;

/*data want*/
proc sql;
create table new1 as
SELECT *, CATs(firstname, lastname) as ID1, COUNT(*)
FROM have
GROUP BY ID1
HAVING COUNT(*) &amp;gt; 1
order by parid;
quit;

proc sql;
create table new2 as
SELECT *, CATs(Race,DOB) as ID2, COUNT(*)
FROM have
GROUP BY ID2
HAVING COUNT(*) &amp;gt; 1
order by parid;
quit;
proc sql;
create table new3 as
SELECT *, CATs(lastname,DOB) as ID3, COUNT(*)
FROM have
GROUP BY ID3
HAVING COUNT(*) &amp;gt; 1
order by parid;
quit;
proc sql;
create table new4 as
SELECT *, CATs(firstname,DOB) as ID4, COUNT(*)
FROM have
GROUP BY ID4
HAVING COUNT(*) &amp;gt; 1
order by parid;
quit;

data want;
merge new1-new4;
by parid;
drop ID1 _TEMG001 ID2 ID3 ID4;
run;

proc print data=want;
var parid firstname lastname race dob;
run;

&lt;/PRE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 29 Jan 2019 00:13:07 GMT</pubDate>
      <guid>https://communities.sas.com/t5/New-SAS-User/List-of-potential-duplicate-values-without-removing-them-Better/m-p/530824#M5809</guid>
      <dc:creator>catnipper</dc:creator>
      <dc:date>2019-01-29T00:13:07Z</dc:date>
    </item>
    <item>
      <title>Re: List of potential duplicate values without removing them - Better Ways</title>
      <link>https://communities.sas.com/t5/New-SAS-User/List-of-potential-duplicate-values-without-removing-them-Better/m-p/530825#M5810</link>
      <description>&lt;P&gt;The only time I ever had to look for duplicate names with possible misspellings I used the SOUNDEX function - it's not perfect and only really works with names of European origin but it might be worth a try. Once you've identified possible duplicates from that you could go on to match against race and DOB.&lt;/P&gt;</description>
      <pubDate>Tue, 29 Jan 2019 00:24:25 GMT</pubDate>
      <guid>https://communities.sas.com/t5/New-SAS-User/List-of-potential-duplicate-values-without-removing-them-Better/m-p/530825#M5810</guid>
      <dc:creator>ChrisBrooks</dc:creator>
      <dc:date>2019-01-29T00:24:25Z</dc:date>
    </item>
    <item>
      <title>Re: List of potential duplicate values without removing them - Better Ways</title>
      <link>https://communities.sas.com/t5/New-SAS-User/List-of-potential-duplicate-values-without-removing-them-Better/m-p/530827#M5812</link>
      <description>I have used it too but it won't work well with many of the names I am having and the many missing values. I have to account all/most possible scenarios. I may add it as ID5 though and merge an additional table in.</description>
      <pubDate>Tue, 29 Jan 2019 00:32:25 GMT</pubDate>
      <guid>https://communities.sas.com/t5/New-SAS-User/List-of-potential-duplicate-values-without-removing-them-Better/m-p/530827#M5812</guid>
      <dc:creator>catnipper</dc:creator>
      <dc:date>2019-01-29T00:32:25Z</dc:date>
    </item>
  </channel>
</rss>

