<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Matching data with messy names in SAS Programming</title>
    <link>https://communities.sas.com/t5/SAS-Programming/Matching-data-with-messy-names/m-p/762695#M241502</link>
    <description>&lt;P&gt;Do you have any other data in the two data sets in common such as date of birth, residence (city, state/province, postal code), telephone number or such?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;If you have additional information such as above you may want to investigate Link Plus from the CDC. Website:&lt;/P&gt;
&lt;P&gt;&lt;A href="https://www.cdc.gov/cancer/npcr/tools/registryplus/lp.htm" target="_blank"&gt;https://www.cdc.gov/cancer/npcr/tools/registryplus/lp.htm&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;This is a probabilistic matching program that will use multiple pieces of information and provide a probability of a match between two records. It will report the probability that two records match so you might be able to reduce manual coding a bit and only spend time with the problem cases. I've never used it this tool with only names but have used it to identify when a data base was migrated and determined individuals whose names were spelled differently or had changes in date of birth, race, ethnicity or sex when the data was transferred between data systems. One thing is that the software is &lt;STRONG&gt;free&lt;/STRONG&gt;. It does want text files with column headers (SAS makes those pretty easy with Proc Export) and you provide the rules for which fields to compare between two files. Example one file might have date of birth information in a column DOB and the other BirthDate. So you tell the program those columns are to be compared.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;If you have tens of thousands of cases and only names I would call this a "job security" problem because there is so much garbage you are going to have to deal with. You haven't even addresses things like Robert, Bob and Bobby or similar nickname patterns.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;One problem I have with eliminating practically any give word like "Old" from these fields are some of the actual names. Consider celebrity children names like Moon Unit Zappa, or anglicized Native American names. There are also parents out there that go a long way to get interesting names for children, Though matches on those name may be more obvious it might also mean you don't want to just throw out words.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;A proactive approach would be to go the source of the data and say they have to watch their data entry standards a bit more closely.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I say this from dealing with things like Address data for geocoding that had addresses with text like "see the woman in the building behind the main house"&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;You may also want to make sure you aren't dealing with multiple character sets such as accented letters before attempting this.&lt;/P&gt;</description>
    <pubDate>Thu, 19 Aug 2021 21:02:26 GMT</pubDate>
    <dc:creator>ballardw</dc:creator>
    <dc:date>2021-08-19T21:02:26Z</dc:date>
    <item>
      <title>Matching data with messy names</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Matching-data-with-messy-names/m-p/762672#M241492</link>
      <description>&lt;P&gt;I have two datasets that I need to match based on names. However, the names can be really messy.&lt;/P&gt;&lt;P&gt;I have two main issues.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;First,&lt;/STRONG&gt; for example,&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Dataset A&lt;/STRONG&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;&lt;STRONG&gt;Dataset B&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Name&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Name&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;De la Rosa&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Maria Rosa&lt;/P&gt;&lt;P&gt;De la Maria&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Rosa&lt;/P&gt;&lt;P&gt;Maria de la Rosa&amp;nbsp; &amp;nbsp; &amp;nbsp;Maria de Rosa&lt;/P&gt;&lt;P&gt;Maria Laura&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Maria&lt;/P&gt;&lt;P&gt;Maria Gabriel Rosa&amp;nbsp; &amp;nbsp;Gabriel&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;The standard that I consider true match will be De la Rosa and Rosa, De la Rosa and Maria Rosa, Maria de la Rosa and Maria de Rosa, and Maria Laura and Maria. And Maria Gabriel Rosa would be a match with either Maria, Gabriel or Rosa. But Maria Laura and Maria Rosa will not be considered as a match.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;So basically if a name has more than one part with meanings, eg.&lt;STRONG&gt; Maria Rosa&lt;/STRONG&gt; (de la is not considered as meaningful), it would be a match with names of either of the individual part, eg. &lt;STRONG&gt;Maria&lt;/STRONG&gt;. But it would not be a match with another name with two meaningful parts, like &lt;STRONG&gt;Maria Laura&lt;/STRONG&gt;.&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;What I think may help is to get ride of strings like "de", " de la" (I have a bunch of others with no meanings in Hispanic names) and create new names like:&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Dataset A&lt;/STRONG&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;&amp;nbsp;&lt;STRONG&gt;Dataset B&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;name&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; new_name&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;name&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;new_name&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;De la Rosa&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Rosa&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Maria Rosa&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Maria Rosa&lt;/P&gt;&lt;P&gt;De la Maria&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Maria&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Rosa&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Rosa&lt;/P&gt;&lt;P&gt;Maria de la Rosa&amp;nbsp; &amp;nbsp; Maria Rosa&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Maria de Rosa&amp;nbsp; &amp;nbsp; Maria Rosa&lt;/P&gt;&lt;P&gt;Maria Laura&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Maria Laura&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Maria&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Maria&lt;/P&gt;&lt;P&gt;Maria Gabriel Rosa Maria Gabriel Rosa&amp;nbsp; &amp;nbsp;Gabriel&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Gabriel&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I think i can use prxchange to do that.&lt;/P&gt;&lt;P&gt;However,&lt;STRONG&gt; I still don't know&lt;/STRONG&gt; how I can make sure all the matched are found, considering there are names with equal and more than two meaningful parts (I have names up to four meaningful parts).&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;The &lt;STRONG&gt;second&lt;/STRONG&gt; issue is that there are a bunch of noises in the names, eg. Do not use, Duplicate, Gone, no existing file.&lt;/P&gt;&lt;P&gt;They do not always stand alone. It can be "Maria Do not use", "Jose (Gone)", [OLD]John. I'm thinking to use macro to get rid of them and add newly found ones to the macro for future years of data match. I wrote this:&lt;/P&gt;&lt;P&gt;%LET NOISE = "DO NOT USE","GONE", "DUPLICATE";&lt;/P&gt;&lt;P&gt;If find (name, "NOISE")&amp;gt;0, then do;&lt;/P&gt;&lt;P&gt;new_name=prxchange("s/(&amp;amp;NOISE)"/ /,name);&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;On top of that is that I have really large files with tens of thousands of records. I'm just really stuck.&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thank you for any suggestion.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Thu, 19 Aug 2021 19:52:35 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Matching-data-with-messy-names/m-p/762672#M241492</guid>
      <dc:creator>GingerJJ</dc:creator>
      <dc:date>2021-08-19T19:52:35Z</dc:date>
    </item>
    <item>
      <title>Re: Matching data with messy names</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Matching-data-with-messy-names/m-p/762674#M241493</link>
      <description>&lt;P&gt;Generalized edit distance:&amp;nbsp; The COMPGED() function is made for this type of thing.&amp;nbsp; But this changes nothing, this fuzzy math is still difficult.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;A href="https://documentation.sas.com/doc/en/vdmmlcdc/8.1/lefunctionsref/p1r4l9jwgatggtn1ko81fyjys4s7.htm" target="_blank"&gt;SAS Help Center: COMPGED Function&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 19 Aug 2021 19:57:36 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Matching-data-with-messy-names/m-p/762674#M241493</guid>
      <dc:creator>PhilC</dc:creator>
      <dc:date>2021-08-19T19:57:36Z</dc:date>
    </item>
    <item>
      <title>Re: Matching data with messy names</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Matching-data-with-messy-names/m-p/762695#M241502</link>
      <description>&lt;P&gt;Do you have any other data in the two data sets in common such as date of birth, residence (city, state/province, postal code), telephone number or such?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;If you have additional information such as above you may want to investigate Link Plus from the CDC. Website:&lt;/P&gt;
&lt;P&gt;&lt;A href="https://www.cdc.gov/cancer/npcr/tools/registryplus/lp.htm" target="_blank"&gt;https://www.cdc.gov/cancer/npcr/tools/registryplus/lp.htm&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;This is a probabilistic matching program that will use multiple pieces of information and provide a probability of a match between two records. It will report the probability that two records match so you might be able to reduce manual coding a bit and only spend time with the problem cases. I've never used it this tool with only names but have used it to identify when a data base was migrated and determined individuals whose names were spelled differently or had changes in date of birth, race, ethnicity or sex when the data was transferred between data systems. One thing is that the software is &lt;STRONG&gt;free&lt;/STRONG&gt;. It does want text files with column headers (SAS makes those pretty easy with Proc Export) and you provide the rules for which fields to compare between two files. Example one file might have date of birth information in a column DOB and the other BirthDate. So you tell the program those columns are to be compared.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;If you have tens of thousands of cases and only names I would call this a "job security" problem because there is so much garbage you are going to have to deal with. You haven't even addresses things like Robert, Bob and Bobby or similar nickname patterns.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;One problem I have with eliminating practically any give word like "Old" from these fields are some of the actual names. Consider celebrity children names like Moon Unit Zappa, or anglicized Native American names. There are also parents out there that go a long way to get interesting names for children, Though matches on those name may be more obvious it might also mean you don't want to just throw out words.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;A proactive approach would be to go the source of the data and say they have to watch their data entry standards a bit more closely.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I say this from dealing with things like Address data for geocoding that had addresses with text like "see the woman in the building behind the main house"&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;You may also want to make sure you aren't dealing with multiple character sets such as accented letters before attempting this.&lt;/P&gt;</description>
      <pubDate>Thu, 19 Aug 2021 21:02:26 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Matching-data-with-messy-names/m-p/762695#M241502</guid>
      <dc:creator>ballardw</dc:creator>
      <dc:date>2021-08-19T21:02:26Z</dc:date>
    </item>
  </channel>
</rss>

