BookmarkSubscribeRSS Feed
Squashman
Calcite | Level 5

Was tasked with cleaning up some clients data.  They had the name of the person and the company they work for in one of 4 different possible fields. We have software to do name parsing and that software was able to loop through each of the 4 fields and identify which field had the persons name versus it being a job title or an address field.

One issue we have with our name parsing software is that it likes to identify real names as company names because the software uses a corporate identity table to determine if the name is a company.

So it sees names like the following as company names.

John A Christian

Shirley L Church

Robert R Grill

Obviously you and I can see that these are not company names but because the words Church, Grill and Christian are in the corporate identity table they get coded as the company name.

So I thought about using SAS and a regular expression to try and weed out a few more true names.  The client isn't expecting us to be able to clean up 100% of their data but they are hoping to get about 95% of it cleaned up.  I am already at that thresh hold.  I started with 1.2 Million records and have it down to about 10 Thousand now but I like to go that extra mile.

So I just wrote a simple regular expression to try and find these 3 criteria within the name field:

1) Starts with a name or just an initial with an optional period

2) Middle Initial with an optional period

3) A last name that has at least two characters.

nameregx = prxparse ('/^[A-Z]*\.? [A-Z]\.? [A-Z]{2,}/i');

Problem with this expression is that it matches:

J D MILANE HOLDINGS LTD

S B PROVIDENCE INC

C J SMITH & ASSOC INC

E S PIKE MEMORIAL HOSPITAL INC

J H SMITH PACKING COMPANY

But the Kicker is I still need it to find names that have last names with spaces in them:

JOHN L DE VINE

SUSAN T MC COMB

BILL E VAN CAMP

I have no idea why VINE, COMB and CAMP are considered company name parts. I have no control over that but we can pretty much deduce that these are a person's real name.

Now I suppose I could just make another regular expression to reject all the records that end in " INC" or " LTD" or " COMPANY" but I am trying to make a single regular expression to do as much as possible.

Just looking for some ideas on how to tighten up this regular expression to get rid of the company names but still keep the true names. It may take more than one regular expression with setting some flags and that would be fine as well.

I know there are lot of caveats to doing this but hoping for a decent solution to get a few more records with true names.  I know it is not going to be perfect.

9 REPLIES 9
PGStats
Opal | Level 21

Congratulations for a very well stated problem. I would start with taking care of names that end with INC, CO, COMPANY, LTD, LIMITED, etc. Simply because those are the most obvious cases and, as a consequence, the most embarrassing to get wrong.

Then you could add a list of optional name prefixes (DE|DEL|VAN|VON|MC|MAC)? to your pattern and maybe some suffixes (JR.|SR.)?

PG

PG
Squashman
Calcite | Level 5

Our name parsing software will never code a name that ends in a valid suffix as a company.  The software works back to front.  If it sees JR, SR, MD, PHD, etc....at the end of the name it will code it as a normal name.

Squashman
Calcite | Level 5

I tried putting a $ in to tighten it up to just match names like John A Church thinking this would just give me names with 3 parts but I got no hits with this code addition:

nameregx = prxparse ('/^[A-Z]*\.? [A-Z]\.? [A-Z]{2,}$/i');

I thought for sure this would match these types of names:

John A Christian

Shirley L Church

Robert R Grill

PGStats
Opal | Level 21

When you use both ^ and $, you must STRIP or TRIM the target string, or match leading and trailing blanks. - PG

PG
Squashman
Calcite | Level 5

Yep.  I had that in there.

nameregx = prxparse ('/^[A-Z]*\.? [A-Z]\.? [A-Z]{2,}$/i');

name = trim(name);

pos = prxmatch (nameregx, name);

PGStats
Opal | Level 21

No, that won't work. NAME keeps its length and is padded with blanks. You need

pos = prxmatch (nameregx, trim(name));

PG
Squashman
Calcite | Level 5

Thanks PG.

That worked.

I am still rather new to SAS.  Still learning.

Now we are getting some where.

I changed the RX to this.

nameregx = prxparse ('/^[A-Z]*\.? [A-Z]\.? [A-Z]{2,3}\s?[A-Z]*?$/i');

pos = prxmatch (nameregx, trim(name));


This is getting closer.  I need to run it through my live file tomorrow to see if there is any weird output or stuff that it is not catching but just my small test file is getting some better output with the examples I have shown already.

Going offline for the day.

Peter_L
Quartz | Level 8

If you use keywords to identify companies you may occasionally have problems, particularly if you work in languages other than English. I have a Spanish colleague with the last name "Companys".

Squashman
Calcite | Level 5

Hi Peter,

That is basically what our name software does.  It uses a Corporate Identity table to identify company names.  But our name software is smart enough to know if there is a valid Prefix or Suffix in the name that it will not code it as a company.

Mr. John S. Company - OK

John S. Company Jr. - OK

John S. Company - coded as a company.

So my goal is to try and use a regular expression to pull out names that have a common name pattern.

Of course it is not going to be perfect as in the United States we have companies like H. H. Gregg and Joseph A. Bank.  One is an electronics store and the other is a men's clothing store.

And of course the company I work for: R. R. Donnelley (Richard Robert Donnelley).

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 9 replies
  • 3258 views
  • 0 likes
  • 3 in conversation