DATA Step, Macro, Functions and more

Extracting first and last words in a string

Reply
Regular Contributor
Posts: 150

Extracting first and last words in a string

I have a string with names with varying formats:

 

name

First middle_initial last

first middle last

first middle_initial last

 

I need to extract first and last names into separate variables, but the name may have 2 first names, ie "mary ann l smith" in which case I need the new variables to read:

first               last

mary ann      smith

 

or 2 last names such as "mary l smith jones, which would read

first              last

mary            smith jones

 

How can I do this taking into acount the names may or may not have a middle initial? none of the names would have a letter count of less than 2, so I feel like that would be my best bet? Perhaps 2 if statements depending on if there is a middle initial or not..

Valued Guide
Posts: 556

Re: Extracting first and last words in a string

[ Edited ]

Do you have a situation where there will not be middle initial?

If yes, then if you have a name "mary smith jones" then how can you know the first name and last name. I mean first name can be Mary smith or just Mary. You never know unless it is properly delimited. 

 

For this kind of situation the name maybe delimited by tab, single space between first names (ie: Mary ann) and separated by tab next. Consecutive blanks might represent missing values between then SCAN() function with "m" as 4th argument might work. 

data test ;
input name $50.;
Datalines;
Marry Ann	I	Jones
Marry		Jones
;
run;
/*
Marry Ann<tab>I<tab>Jones
Marry<tab><tab>Jones
*/
data want;
set test;
First=scan(name,1,'09'x,'m');
middle=scan(name,2,'09'x,'m');
last=scan(name,3,'09'x,'m');
run;

Check if your data is delimited by tab or someway to identify the first, middle and last name. As far as I know all the source systems will create the string with proper delimiters to identify. If not you may need to change the way your source data is sent. 

Thanks,
Suryakiran
Super User
Posts: 13,283

Re: Extracting first and last words in a string

Last names from some areas may also have spaces such as "Le Blanc" "Von Braun" and just plain creative people who name there children things like "Moonbeam Glory Morning Sunshine", 3, 4 or 5 "first" and "middle" names.

 

Letter count less than 2 occurs in some cases of Hispanic "last names" such as "y" signifying the family was from 2 (or more) properties in their history.

 

Have fun. Names are the second worst data I deal with.

Occasional Contributor
Posts: 17

Re: Extracting first and last words in a string

Hi,

 

I really did not got the scenario when it will be like the name is having a middle name or not
I mean if name is Mary Le Blanc
how can we say that Mary Le is the first name and Blanc is the last name or if mary is the first name Le is the middle and Blanc is the last name.

Super User
Posts: 13,283

Re: Extracting first and last words in a string

Posted in reply to ruchi11dec

@ruchi11dec wrote:

Hi,

 

I really did not got the scenario when it will be like the name is having a middle name or not
I mean if name is Mary Le Blanc
how can we say that Mary Le is the first name and Blanc is the last name or if mary is the first name Le is the middle and Blanc is the last name.


Which could well be that the first name is Mary and the last name is Le Blanc and no middle name at all. Which was part of the point of bringing it out. Naming conventions vary by culture or country of origin and data that collects "name" without data entry limits runs into messy processing.

Super User
Posts: 9,867

Re: Extracting first and last words in a string

If you can't make up a rule that works if you do it on paper yourself, then you have nothing that can be translated into code. This is true for any programming language.

 

Find that rule, once you have it, we can help in converting it to SAS code.

 

This is the reason why any decent database for people always has separate fields for surname(s) and given name. There's no other way of keeping names that works.

 

 

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
How to post code
Regular Contributor
Posts: 150

Re: Extracting first and last words in a string

Posted in reply to KurtBremser

Data is dirty often and sometimes you just have to deal with it best as possible. I already developed a strategy for the best possible scenario. It is not perfect, but it is optimal. I don't answer questions on this board to listen to people complain about them. These are real life situations and datasets from reputable companies.

Super User
Posts: 13,283

Re: Extracting first and last words in a string


@Melk wrote:

Data is dirty often and sometimes you just have to deal with it best as possible. I already developed a strategy for the best possible scenario. It is not perfect, but it is optimal. I don't answer questions on this board to listen to people complain about them. These are real life situations and datasets from reputable companies.


I don't think anyone is "complaining" but pointing out potential issues in the process.

 

And those potential issues have come from very real data from equally reputable sources.

Examples of the name data I have had to process had all "name" information in a single field. Since the data related to children some times both parents last names were included in the name field with no regularity for order of mother or father last names. Date were entered as:

first name, middle name, last name

first name, last name

last name, first name, middle name

last name, first name, last name, middle initial

last name-lastname (hyphenated), middle initial, first name (and the same very distinctive last names hyphenated in different order)

with occasional sprinklings of Junior, second, third, II, III and such after first names or middle names or last names.

In my data for this project of roughly 15,000 names only about 30 percent  was I comfortable with assuming the data were first name, middle name, last name.

And for extra joy the unique identifier assigned would sometimes have a somewhat different name either spelling or dropping one of the "last names" associated or adding a middle name or initial.

 

And I had to match this data set to a separate data source on name, date of birth and gender.

 

Luckily after dealing with that data source for a year or so they transitioned to a collection system that actually collected the data into first name, middle name(s), last name (singular) data entry. This was not in the 1980's but as recent as 2012 when data collection folks should have had the word 30 years ago that a single name field is poor design.

 

Other "name" data I have had to process actually included data comments like "see grandma in Apt B" or "the green trailer".

Ask a Question
Discussion stats
  • 7 replies
  • 114 views
  • 8 likes
  • 5 in conversation