BookmarkSubscribeRSS Feed
Melk
Lapis Lazuli | Level 10

I have a string with names with varying formats:

 

name

First middle_initial last

first middle last

first middle_initial last

 

I need to extract first and last names into separate variables, but the name may have 2 first names, ie "mary ann l smith" in which case I need the new variables to read:

first               last

mary ann      smith

 

or 2 last names such as "mary l smith jones, which would read

first              last

mary            smith jones

 

How can I do this taking into acount the names may or may not have a middle initial? none of the names would have a letter count of less than 2, so I feel like that would be my best bet? Perhaps 2 if statements depending on if there is a middle initial or not..

8 REPLIES 8
SuryaKiran
Meteorite | Level 14

Do you have a situation where there will not be middle initial?

If yes, then if you have a name "mary smith jones" then how can you know the first name and last name. I mean first name can be Mary smith or just Mary. You never know unless it is properly delimited. 

 

For this kind of situation the name maybe delimited by tab, single space between first names (ie: Mary ann) and separated by tab next. Consecutive blanks might represent missing values between then SCAN() function with "m" as 4th argument might work. 

data test ;
input name $50.;
Datalines;
Marry Ann	I	Jones
Marry		Jones
;
run;
/*
Marry Ann<tab>I<tab>Jones
Marry<tab><tab>Jones
*/
data want;
set test;
First=scan(name,1,'09'x,'m');
middle=scan(name,2,'09'x,'m');
last=scan(name,3,'09'x,'m');
run;

Check if your data is delimited by tab or someway to identify the first, middle and last name. As far as I know all the source systems will create the string with proper delimiters to identify. If not you may need to change the way your source data is sent. 

Thanks,
Suryakiran
ballardw
Super User

Last names from some areas may also have spaces such as "Le Blanc" "Von Braun" and just plain creative people who name there children things like "Moonbeam Glory Morning Sunshine", 3, 4 or 5 "first" and "middle" names.

 

Letter count less than 2 occurs in some cases of Hispanic "last names" such as "y" signifying the family was from 2 (or more) properties in their history.

 

Have fun. Names are the second worst data I deal with.

ruchi11dec
Obsidian | Level 7

Hi,

 

I really did not got the scenario when it will be like the name is having a middle name or not
I mean if name is Mary Le Blanc
how can we say that Mary Le is the first name and Blanc is the last name or if mary is the first name Le is the middle and Blanc is the last name.

ballardw
Super User

@ruchi11dec wrote:

Hi,

 

I really did not got the scenario when it will be like the name is having a middle name or not
I mean if name is Mary Le Blanc
how can we say that Mary Le is the first name and Blanc is the last name or if mary is the first name Le is the middle and Blanc is the last name.


Which could well be that the first name is Mary and the last name is Le Blanc and no middle name at all. Which was part of the point of bringing it out. Naming conventions vary by culture or country of origin and data that collects "name" without data entry limits runs into messy processing.

Kurt_Bremser
Super User

If you can't make up a rule that works if you do it on paper yourself, then you have nothing that can be translated into code. This is true for any programming language.

 

Find that rule, once you have it, we can help in converting it to SAS code.

 

This is the reason why any decent database for people always has separate fields for surname(s) and given name. There's no other way of keeping names that works.

 

 

Melk
Lapis Lazuli | Level 10

Data is dirty often and sometimes you just have to deal with it best as possible. I already developed a strategy for the best possible scenario. It is not perfect, but it is optimal. I don't answer questions on this board to listen to people complain about them. These are real life situations and datasets from reputable companies.

ballardw
Super User

@Melk wrote:

Data is dirty often and sometimes you just have to deal with it best as possible. I already developed a strategy for the best possible scenario. It is not perfect, but it is optimal. I don't answer questions on this board to listen to people complain about them. These are real life situations and datasets from reputable companies.


I don't think anyone is "complaining" but pointing out potential issues in the process.

 

And those potential issues have come from very real data from equally reputable sources.

Examples of the name data I have had to process had all "name" information in a single field. Since the data related to children some times both parents last names were included in the name field with no regularity for order of mother or father last names. Date were entered as:

first name, middle name, last name

first name, last name

last name, first name, middle name

last name, first name, last name, middle initial

last name-lastname (hyphenated), middle initial, first name (and the same very distinctive last names hyphenated in different order)

with occasional sprinklings of Junior, second, third, II, III and such after first names or middle names or last names.

In my data for this project of roughly 15,000 names only about 30 percent  was I comfortable with assuming the data were first name, middle name, last name.

And for extra joy the unique identifier assigned would sometimes have a somewhat different name either spelling or dropping one of the "last names" associated or adding a middle name or initial.

 

And I had to match this data set to a separate data source on name, date of birth and gender.

 

Luckily after dealing with that data source for a year or so they transitioned to a collection system that actually collected the data into first name, middle name(s), last name (singular) data entry. This was not in the 1980's but as recent as 2012 when data collection folks should have had the word 30 years ago that a single name field is poor design.

 

Other "name" data I have had to process actually included data comments like "see grandma in Apt B" or "the green trailer".

tomrvincent
Rhodochrosite | Level 12
Anything further on this?

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 8 replies
  • 20676 views
  • 8 likes
  • 6 in conversation