- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I have a variable called 'FULL_NAME'. which contains a person's full name. However, I have been asked to create the following variables:
-Title (e.g. Mr, Ms etc.)
-Forename
-Middle_Names
-Full_Middle_Initial (all of the middle name initials in full)
-Reduced_Middle_Initial (just the first initial of any middle names e.g. if Peter, Owen and Dennis are the three middle names, this variable would just contain the initial of of the first middle name: P)
-Surname
For example. if there is a person called "Mr James Peter Owen Dennis Coyle-Harper", what code is best to
get the following results for the variables I have listed above? The desired outcome in my example would work for my all of my 100,000 accounts:
-Title: Mr
-Forename: James
-Middle_Names: Peter Owen Dennis
-Full_Middle_Initial: P O D
-Reduced_Middle_Initial: P
-Surname: Coyle-Harper
I have currently got code that uses the scan function, but I am not sure what the best method would be to identify the middle names as there could be multiple, and the surname would just be the last word from the 'FULL_NAME' variable.
data test;
set Name_Information;
Title=scan(FULL_NAME,1,' ');
Forename=scan(FULL_NAME,2,' ');
Middle_Names=???
Full_Middle_Initial=???
Reduced_Middle_Initial=???
Surname=???
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
If someone has the code that I could use in order to get my desired outcome/provide the appropriate code where I have put question marks in the creation of the variables, that would be greatly appreciated!
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
It isn't a simple as that, you need to consider several different variations of possibilities and account for them all. What happens if the name is Mr. John Smith? Or Mrs. Melissa A Adams? There are several different variations you need to account for. That being said, this is a common homework/job application question which is asked on here often. If you search you'll find examples.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Parsing full names into name components is hard. How many names do you have and are they reasonably 'clean' or not? What result quality is expected? Please bear in mind that surnames can be multiple words and names can contain titles and suffixes. Logic to deal with this will quickly get very complicated.
SAS Data Quality contains intelligent name parsing routines to deal with this but I doubt you have that available to you.
If a 90% job is good enough and you haven't got millions of names then you might get lucky and do the outliers manually.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I suspect you are in luck then as the cleaner and more consistent the data is the easer it is to process. I suggest you start from the left and see if you can identify all titles. Something like this should work:
data want;
set have;
if scan(upcase(full_name),1) in ('MR','MS','MISS') then Title = scan(full_name, 1);
run;
Then check what you have left out and what is selected but shouldn't be.