DATA Step, Macro, Functions and more

regular expression to identify names

Reply
Occasional Contributor
Posts: 13

regular expression to identify names

Was tasked with cleaning up some clients data.  They had the name of the person and the company they work for in one of 4 different possible fields. We have software to do name parsing and that software was able to loop through each of the 4 fields and identify which field had the persons name versus it being a job title or an address field.

One issue we have with our name parsing software is that it likes to identify real names as company names because the software uses a corporate identity table to determine if the name is a company.

So it sees names like the following as company names.

John A Christian

Shirley L Church

Robert R Grill

Obviously you and I can see that these are not company names but because the words Church, Grill and Christian are in the corporate identity table they get coded as the company name.

So I thought about using SAS and a regular expression to try and weed out a few more true names.  The client isn't expecting us to be able to clean up 100% of their data but they are hoping to get about 95% of it cleaned up.  I am already at that thresh hold.  I started with 1.2 Million records and have it down to about 10 Thousand now but I like to go that extra mile.

So I just wrote a simple regular expression to try and find these 3 criteria within the name field:

1) Starts with a name or just an initial with an optional period

2) Middle Initial with an optional period

3) A last name that has at least two characters.

nameregx = prxparse ('/^[A-Z]*\.? [A-Z]\.? [A-Z]{2,}/i');

Problem with this expression is that it matches:

J D MILANE HOLDINGS LTD

S B PROVIDENCE INC

C J SMITH & ASSOC INC

E S PIKE MEMORIAL HOSPITAL INC

J H SMITH PACKING COMPANY

But the Kicker is I still need it to find names that have last names with spaces in them:

JOHN L DE VINE

SUSAN T MC COMB

BILL E VAN CAMP

I have no idea why VINE, COMB and CAMP are considered company name parts. I have no control over that but we can pretty much deduce that these are a person's real name.

Now I suppose I could just make another regular expression to reject all the records that end in " INC" or " LTD" or " COMPANY" but I am trying to make a single regular expression to do as much as possible.

Just looking for some ideas on how to tighten up this regular expression to get rid of the company names but still keep the true names. It may take more than one regular expression with setting some flags and that would be fine as well.

I know there are lot of caveats to doing this but hoping for a decent solution to get a few more records with true names.  I know it is not going to be perfect.

Respected Advisor
Posts: 4,920

Re: regular expression to identify names

Posted in reply to Squashman

Congratulations for a very well stated problem. I would start with taking care of names that end with INC, CO, COMPANY, LTD, LIMITED, etc. Simply because those are the most obvious cases and, as a consequence, the most embarrassing to get wrong.

Then you could add a list of optional name prefixes (DE|DEL|VAN|VON|MC|MAC)? to your pattern and maybe some suffixes (JR.|SR.)?

PG

PG
Occasional Contributor
Posts: 13

Re: regular expression to identify names

Our name parsing software will never code a name that ends in a valid suffix as a company.  The software works back to front.  If it sees JR, SR, MD, PHD, etc....at the end of the name it will code it as a normal name.

Occasional Contributor
Posts: 13

Re: regular expression to identify names

Posted in reply to Squashman

I tried putting a $ in to tighten it up to just match names like John A Church thinking this would just give me names with 3 parts but I got no hits with this code addition:

nameregx = prxparse ('/^[A-Z]*\.? [A-Z]\.? [A-Z]{2,}$/i');

I thought for sure this would match these types of names:

John A Christian

Shirley L Church

Robert R Grill

Respected Advisor
Posts: 4,920

Re: regular expression to identify names

Posted in reply to Squashman

When you use both ^ and $, you must STRIP or TRIM the target string, or match leading and trailing blanks. - PG

PG
Occasional Contributor
Posts: 13

Re: regular expression to identify names

Yep.  I had that in there.

nameregx = prxparse ('/^[A-Z]*\.? [A-Z]\.? [A-Z]{2,}$/i');

name = trim(name);

pos = prxmatch (nameregx, name);

Respected Advisor
Posts: 4,920

Re: regular expression to identify names

Posted in reply to Squashman

No, that won't work. NAME keeps its length and is padded with blanks. You need

pos = prxmatch (nameregx, trim(name));

PG
Occasional Contributor
Posts: 13

Re: regular expression to identify names

Thanks PG.

That worked.

I am still rather new to SAS.  Still learning.

Now we are getting some where.

I changed the RX to this.

nameregx = prxparse ('/^[A-Z]*\.? [A-Z]\.? [A-Z]{2,3}\s?[A-Z]*?$/i');

pos = prxmatch (nameregx, trim(name));


This is getting closer.  I need to run it through my live file tomorrow to see if there is any weird output or stuff that it is not catching but just my small test file is getting some better output with the examples I have shown already.

Going offline for the day.

Contributor
Posts: 32

Re: regular expression to identify names

Posted in reply to Squashman

If you use keywords to identify companies you may occasionally have problems, particularly if you work in languages other than English. I have a Spanish colleague with the last name "Companys".

Occasional Contributor
Posts: 13

Re: regular expression to identify names

Hi Peter,

That is basically what our name software does.  It uses a Corporate Identity table to identify company names.  But our name software is smart enough to know if there is a valid Prefix or Suffix in the name that it will not code it as a company.

Mr. John S. Company - OK

John S. Company Jr. - OK

John S. Company - coded as a company.

So my goal is to try and use a regular expression to pull out names that have a common name pattern.

Of course it is not going to be perfect as in the United States we have companies like H. H. Gregg and Joseph A. Bank.  One is an electronics store and the other is a men's clothing store.

And of course the company I work for: R. R. Donnelley (Richard Robert Donnelley).

Ask a Question
Discussion stats
  • 9 replies
  • 1074 views
  • 0 likes
  • 3 in conversation