<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: regular expression to identify names in SAS Programming</title>
    <link>https://communities.sas.com/t5/SAS-Programming/regular-expression-to-identify-names/m-p/185072#M35141</link>
    <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;Yep.&amp;nbsp; I had that in there.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="font-family: 'Helvetica Neue', Helvetica, Arial, 'Lucida Grande', sans-serif; background-color: #ffffff;"&gt;nameregx = prxparse ('/^[A-Z]*\.? [A-Z]\.? [A-Z]{2,}&lt;/SPAN&gt;&lt;STRONG style="font-family: 'Helvetica Neue', Helvetica, Arial, 'Lucida Grande', sans-serif;"&gt;$&lt;/STRONG&gt;&lt;SPAN style="font-family: 'Helvetica Neue', Helvetica, Arial, 'Lucida Grande', sans-serif; background-color: #ffffff;"&gt;/i');&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;name = trim(name);&lt;/P&gt;&lt;P&gt;pos = prxmatch (nameregx, name);&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
    <pubDate>Wed, 04 Jun 2014 19:50:59 GMT</pubDate>
    <dc:creator>Squashman</dc:creator>
    <dc:date>2014-06-04T19:50:59Z</dc:date>
    <item>
      <title>regular expression to identify names</title>
      <link>https://communities.sas.com/t5/SAS-Programming/regular-expression-to-identify-names/m-p/185067#M35136</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;I was tasked with cleaning up some client data.&amp;nbsp; They had the name of the person and the company they work for in one of 4 different possible fields. We have software to do name parsing, and that software was able to loop through each of the 4 fields and identify which field had the person's name versus a job title or an address.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;One issue we have with our name parsing software is that it likes to identify real names as company names, because the software uses a corporate identity table to determine if the name is a company.&lt;/P&gt;&lt;P&gt;So it sees names like the following as company names.&lt;/P&gt;&lt;P&gt;John A Christian&lt;/P&gt;&lt;P&gt;Shirley L Church&lt;/P&gt;&lt;P&gt;Robert R Grill&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Obviously you and I can see that these are not company names, but because the words Church, Grill and Christian are in the corporate identity table they get coded as company names.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;So I thought about using SAS and a regular expression to try and weed out a few more true names.&amp;nbsp; The client isn't expecting us to be able to clean up 100% of their data but they are hoping to get about 95% of it cleaned up.&amp;nbsp; I am already at that threshold.&amp;nbsp; I started with 1.2 million records and have it down to about 10 thousand now, but I like to go that extra mile.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;So I just wrote a simple regular expression to try and find these 3 criteria within the name field:&lt;/P&gt;&lt;P&gt;1) Starts with a name or just an initial with an optional period&lt;/P&gt;&lt;P&gt;2) Middle initial with an optional period&lt;/P&gt;&lt;P&gt;3) A last name that has at least two characters.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;nameregx = prxparse ('/^[A-Z]*\.? [A-Z]\.? [A-Z]{2,}/i');&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;The problem with this expression is that it also matches company names like these:&lt;/P&gt;&lt;P&gt;J D MILANE HOLDINGS LTD&lt;/P&gt;&lt;P&gt;S B PROVIDENCE INC&lt;/P&gt;&lt;P&gt;C J SMITH &amp;amp; ASSOC INC&lt;/P&gt;&lt;P&gt;E S PIKE MEMORIAL HOSPITAL INC&lt;/P&gt;&lt;P&gt;J H SMITH PACKING COMPANY&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;But the kicker is that I still need it to find names that have last names with spaces in them:&lt;/P&gt;&lt;P&gt;JOHN L DE VINE&lt;/P&gt;&lt;P&gt;SUSAN T MC COMB&lt;/P&gt;&lt;P&gt;BILL E VAN CAMP&lt;/P&gt;&lt;P&gt;I have no idea why VINE, COMB and CAMP are considered company name parts. I have no control over that, but we can pretty much deduce that these are real people's names.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Now I suppose I could just make another regular expression to reject all the records that end in " INC" or " LTD" or " COMPANY", but I am trying to make a single regular expression do as much as possible.&lt;/P&gt;&lt;P&gt;Just looking for some ideas on how to tighten up this regular expression to get rid of the company names but still keep the true names. It may take more than one regular expression plus some flags, and that would be fine as well.&lt;/P&gt;&lt;P&gt;I know there are a lot of caveats to doing this, but I am hoping for a decent solution to get a few more records with true names.&amp;nbsp; I know it is not going to be perfect.&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Wed, 04 Jun 2014 16:42:49 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/regular-expression-to-identify-names/m-p/185067#M35136</guid>
      <dc:creator>Squashman</dc:creator>
      <dc:date>2014-06-04T16:42:49Z</dc:date>
    </item>
    <item>
      <title>Re: regular expression to identify names</title>
      <link>https://communities.sas.com/t5/SAS-Programming/regular-expression-to-identify-names/m-p/185068#M35137</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;Congratulations on a very well stated problem. I would start by taking care of names that end with INC, CO, COMPANY, LTD, LIMITED, etc., simply because those are the most obvious cases and, as a consequence, the most embarrassing to get wrong.&lt;/P&gt;&lt;P&gt;Then you could add a list of optional name prefixes (DE|DEL|VAN|VON|MC|MAC)? to your pattern and maybe some suffixes (JR\.?|SR\.?)?&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;PG&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Wed, 04 Jun 2014 17:55:45 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/regular-expression-to-identify-names/m-p/185068#M35137</guid>
      <dc:creator>PGStats</dc:creator>
      <dc:date>2014-06-04T17:55:45Z</dc:date>
    </item>
    <item>
      <title>Re: regular expression to identify names</title>
      <link>https://communities.sas.com/t5/SAS-Programming/regular-expression-to-identify-names/m-p/185069#M35138</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;Our name parsing software will never code a name that ends in a valid suffix as a company.&amp;nbsp; The software works back to front.&amp;nbsp; If it sees JR, SR, MD, PHD, etc. at the end of the name, it will code it as a normal name.&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Wed, 04 Jun 2014 18:06:33 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/regular-expression-to-identify-names/m-p/185069#M35138</guid>
      <dc:creator>Squashman</dc:creator>
      <dc:date>2014-06-04T18:06:33Z</dc:date>
    </item>
    <item>
      <title>Re: regular expression to identify names</title>
      <link>https://communities.sas.com/t5/SAS-Programming/regular-expression-to-identify-names/m-p/185070#M35139</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;I tried adding a $ to tighten it up so that it only matches three-part names like John A Church, but I got no hits with this change:&lt;/P&gt;&lt;P&gt;&lt;SPAN style="font-family: 'Helvetica Neue', Helvetica, Arial, 'Lucida Grande', sans-serif; background-color: #ffffff;"&gt;nameregx = prxparse ('/^[A-Z]*\.? [A-Z]\.? [A-Z]{2,}&lt;STRONG&gt;$&lt;/STRONG&gt;/i');&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="font-family: 'Helvetica Neue', Helvetica, Arial, 'Lucida Grande', sans-serif; background-color: #ffffff;"&gt;I thought for sure this would match these types of names:&lt;/SPAN&gt;&lt;/P&gt;&lt;P style="font-family: 'Helvetica Neue', Helvetica, Arial, 'Lucida Grande', sans-serif; background-color: #ffffff;"&gt;John A Christian&lt;/P&gt;&lt;P style="font-family: 'Helvetica Neue', Helvetica, Arial, 'Lucida Grande', sans-serif; background-color: #ffffff;"&gt;Shirley L Church&lt;/P&gt;&lt;P style="font-family: 'Helvetica Neue', Helvetica, Arial, 'Lucida Grande', sans-serif; background-color: #ffffff;"&gt;Robert R Grill&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Wed, 04 Jun 2014 18:58:42 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/regular-expression-to-identify-names/m-p/185070#M35139</guid>
      <dc:creator>Squashman</dc:creator>
      <dc:date>2014-06-04T18:58:42Z</dc:date>
    </item>
    <item>
      <title>Re: regular expression to identify names</title>
      <link>https://communities.sas.com/t5/SAS-Programming/regular-expression-to-identify-names/m-p/185071#M35140</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;When you use both ^ and $, you must STRIP or TRIM the target string, or match leading and trailing blanks. - PG&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Wed, 04 Jun 2014 19:45:32 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/regular-expression-to-identify-names/m-p/185071#M35140</guid>
      <dc:creator>PGStats</dc:creator>
      <dc:date>2014-06-04T19:45:32Z</dc:date>
    </item>
    <item>
      <title>Re: regular expression to identify names</title>
      <link>https://communities.sas.com/t5/SAS-Programming/regular-expression-to-identify-names/m-p/185072#M35141</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;Yep.&amp;nbsp; I had that in there.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="font-family: 'Helvetica Neue', Helvetica, Arial, 'Lucida Grande', sans-serif; background-color: #ffffff;"&gt;nameregx = prxparse ('/^[A-Z]*\.? [A-Z]\.? [A-Z]{2,}&lt;/SPAN&gt;&lt;STRONG style="font-family: 'Helvetica Neue', Helvetica, Arial, 'Lucida Grande', sans-serif;"&gt;$&lt;/STRONG&gt;&lt;SPAN style="font-family: 'Helvetica Neue', Helvetica, Arial, 'Lucida Grande', sans-serif; background-color: #ffffff;"&gt;/i');&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;name = trim(name);&lt;/P&gt;&lt;P&gt;pos = prxmatch (nameregx, name);&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Wed, 04 Jun 2014 19:50:59 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/regular-expression-to-identify-names/m-p/185072#M35141</guid>
      <dc:creator>Squashman</dc:creator>
      <dc:date>2014-06-04T19:50:59Z</dc:date>
    </item>
    <item>
      <title>Re: regular expression to identify names</title>
      <link>https://communities.sas.com/t5/SAS-Programming/regular-expression-to-identify-names/m-p/185073#M35142</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;No, that won't work. Assigning trim(name) back to NAME has no effect: NAME keeps its declared length and is padded with blanks again. You need to trim inside the call:&lt;/P&gt;&lt;P&gt;&lt;SPAN style="font-family: 'Helvetica Neue', Helvetica, Arial, 'Lucida Grande', sans-serif; font-size: 12.800000190734863px; background-color: #ffffff;"&gt;pos = prxmatch (nameregx, trim(name));&lt;/SPAN&gt;&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Wed, 04 Jun 2014 20:06:37 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/regular-expression-to-identify-names/m-p/185073#M35142</guid>
      <dc:creator>PGStats</dc:creator>
      <dc:date>2014-06-04T20:06:37Z</dc:date>
    </item>
    <item>
      <title>Re: regular expression to identify names</title>
      <link>https://communities.sas.com/t5/SAS-Programming/regular-expression-to-identify-names/m-p/185074#M35143</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;Thanks PG.&lt;/P&gt;&lt;P&gt;That worked.&lt;/P&gt;&lt;P&gt;I am still rather new to SAS.&amp;nbsp; Still learning.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Now we are getting somewhere.&lt;/P&gt;&lt;P&gt;I changed the regex to this:&lt;/P&gt;&lt;P&gt;nameregx = prxparse ('/^[A-Z]*\.? [A-Z]\.? [A-Z]{2,3}\s?[A-Z]*?$/i');&lt;/P&gt;&lt;P&gt;&lt;SPAN style="font-family: 'Helvetica Neue', Helvetica, Arial, 'Lucida Grande', sans-serif; background-color: #ffffff;"&gt;pos = prxmatch (nameregx, trim(name));&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="font-family: 'Helvetica Neue', Helvetica, Arial, 'Lucida Grande', sans-serif; background-color: #ffffff;"&gt;&lt;BR /&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="font-family: 'Helvetica Neue', Helvetica, Arial, 'Lucida Grande', sans-serif; background-color: #ffffff;"&gt;This is getting closer.&amp;nbsp; My small test file already gives better output for the examples I have shown, but I need to run it through my live file tomorrow to see if there is any weird output or anything it is not catching.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="font-family: 'Helvetica Neue', Helvetica, Arial, 'Lucida Grande', sans-serif; background-color: #ffffff;"&gt;Going offline for the day.&lt;BR /&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Wed, 04 Jun 2014 20:20:00 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/regular-expression-to-identify-names/m-p/185074#M35143</guid>
      <dc:creator>Squashman</dc:creator>
      <dc:date>2014-06-04T20:20:00Z</dc:date>
    </item>
    <item>
      <title>Re: regular expression to identify names</title>
      <link>https://communities.sas.com/t5/SAS-Programming/regular-expression-to-identify-names/m-p/185075#M35144</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;If you use keywords to identify companies you may occasionally have problems, particularly if you work in languages other than English. I have a Spanish colleague with the last name "Companys".&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Thu, 05 Jun 2014 11:12:04 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/regular-expression-to-identify-names/m-p/185075#M35144</guid>
      <dc:creator>Peter_L</dc:creator>
      <dc:date>2014-06-05T11:12:04Z</dc:date>
    </item>
    <item>
      <title>Re: regular expression to identify names</title>
      <link>https://communities.sas.com/t5/SAS-Programming/regular-expression-to-identify-names/m-p/185076#M35145</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;Hi Peter,&lt;/P&gt;&lt;P&gt;That is basically what our name software does.&amp;nbsp; It uses a Corporate Identity table to identify company names.&amp;nbsp; But our name software is smart enough not to code a name as a company if it contains a valid prefix or suffix.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Mr. John S. Company - OK&lt;/P&gt;&lt;P&gt;John S. Company Jr. - OK&lt;/P&gt;&lt;P&gt;John S. Company - coded as a company.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;So my goal is to try and use a regular expression to pull out names that have a common name pattern.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Of course it is not going to be perfect, as in the United States we have companies like H. H. Gregg and Joseph A. Bank.&amp;nbsp; One is an electronics store and the other is a men's clothing store.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;And of course the company I work for: R. R. Donnelley (&lt;SPAN style="color: #252525; font-family: sans-serif; font-size: 14px; background-color: #ffffff;"&gt;Richard Robert Donnelley&lt;/SPAN&gt;).&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Thu, 05 Jun 2014 13:19:55 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/regular-expression-to-identify-names/m-p/185076#M35145</guid>
      <dc:creator>Squashman</dc:creator>
      <dc:date>2014-06-05T13:19:55Z</dc:date>
    </item>
  </channel>
</rss>

