I have a variable called 'FULL_NAME'. which contains a person's full name. However, I have been asked to create the following variables:
-Title (e.g. Mr, Ms etc.)
-Forename
-Middle_Names
-Full_Middle_Initial (all of the middle name initials in full)
-Reduced_Middle_Initial (just the first initial of any middle names e.g. if Peter, Owen and Dennis are the three middle names, this variable would just contain the initial of of the first middle name: P)
-Surname
For example. if there is a person called "Mr James Peter Owen Dennis Coyle-Harper", what code is best to
get the following results for the variables I have listed above? The desired outcome in my example would work for my all of my 100,000 accounts:
-Title: Mr
-Forename: James
-Middle_Names: Peter Owen Dennis
-Full_Middle_Initial: P O D
-Reduced_Middle_Initial: P
-Surname: Coyle-Harper
I have currently got code that uses the scan function, but I am not sure what the best method would be to identify the middle names as there could be multiple, and the surname would just be the last word from the 'FULL_NAME' variable.
data test;
set Name_Information;
Title=scan(FULL_NAME,1,' ');
Forename=scan(FULL_NAME,2,' ');
Middle_Names=???
Full_Middle_Initial=???
Reduced_Middle_Initial=???
Surname=???
run;
If someone has the code that I could use in order to get my desired outcome/provide the appropriate code where I have put question marks in the creation of the variables, that would be greatly appreciated!
It isn't a simple as that, you need to consider several different variations of possibilities and account for them all. What happens if the name is Mr. John Smith? Or Mrs. Melissa A Adams? There are several different variations you need to account for. That being said, this is a common homework/job application question which is asked on here often. If you search you'll find examples.
Parsing full names into name components is hard. How many names do you have and are they reasonably 'clean' or not? What result quality is expected? Please bear in mind that surnames can be multiple words and names can contain titles and suffixes. Logic to deal with this will quickly get very complicated.
SAS Data Quality contains intelligent name parsing routines to deal with this but I doubt you have that available to you.
If a 90% job is good enough and you haven't got millions of names then you might get lucky and do the outliers manually.
I suspect you are in luck then as the cleaner and more consistent the data is the easer it is to process. I suggest you start from the left and see if you can identify all titles. Something like this should work:
data want;
set have;
if scan(upcase(full_name),1) in ('MR','MS','MISS') then Title = scan(full_name, 1);
run;
Then check what you have left out and what is selected but shouldn't be.
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.