I've a data like this,
Dataset name: name_split
Fullname_new |
LAWRENCE P POWERS, III |
David R. Fox |
ROBERT STEVEN SHAPIRO |
JOHN E KIRK, IV |
A JOHN KIRK |
I want data like this,
Firstname | Secondname | Lastname | Suffix |
LAWRENCE | P | POWERS | III |
David | R | Fox | |
ROBERT | STEVEN | SHAPIRO | |
JOHN | E | KIRK | IV |
JOHN | A | KIRK |
I've tried this code,
data new_name;
set name_split;
first_name=scan(fullname_new,1, ' ');
Middle_name=scan(fullname_new,2, ' ');
last_name=scan(fullname_new,3, ' ');
if length(Middle_name) gt 2 then
last_name=Middle_name;
if last_name=Middle_name then
Middle_name=' ';
run;
data new_name_suff;
set name_split;
suffix=scan(fullname_new,-1,',');
if length(suffix) gt 4 then
delete;
run;
proc sort data=new_name;
by fullname_new;
run;
proc sort data=new_name_suff;
by fullname_new;
run;
data name;
update new_name new_name_suff;
by fullname_new;
run;
data fullname;
set name;
if length(first_name) lt 2 then
Middle_name=scan(fullname_new,1, ' ');
First_name=scan(fullname_new,2, ' ');
last_name=scan(fullname_new,3, ' ');
run;
data fullname1;
update name fullname;
by fullname_new;
run;
Although this code works with the data which I've, I'm worried if either middle name or suffix exceeds (in future data) greater than 2 or 4 characters. There should not be any special characters in derived fields as well.
I would appreciate for any help you offer me.
Your question is an unsolvable one.
- How can you know whether it is a secondname or a double word lastname
- think about this one CSV comma separated. Than de decimal parts are comma delimited. What is the meaning of a comma?
Unless you have some additional information you cannot segregate.
- How can you know whether it is a secondname or a double word lastname - letter with more than two characters are treated as last name.
- think about this one CSV comma separated. Than de decimal parts are comma delimited. What is the meaning of a comma? - commas are from flat file which we need to remove during\after split but before result.
Unless you have some additional information you cannot segregate. - any possibilities to optimize my code or generate a new code to produce the result as I mentioned?
There are options in the Input statement using pointer controls SAS(R) 9.4 Statements: Reference, Third Edition
or allow reading space (& : ~) SAS(R) 9.4 Statements: Reference, Third Edition
Using fcmp with own defined formats or the automatic _infile_ variable used with SAS functions
In case the basic string /input functions of SAS are not sufficient you can use http://www2.sas.com/proceedings/sugi29/265-29.pdf
More possibilities to optimize and change as I would like to try.
If you are learning than it is getting aware of those. If you are delivering a project it is applying those with coding at cost/profit working hours. At what one are you in, I was guessing learning.
data have;
pid=prxparse('/(\w+) (\w+\.?) (\w+)\,?(\D+)/');
input fullname $50.;
if prxmatch('/^\S\b/',fullname)>0 then fullname=prxchange('s/^(\S\b)(\s\w+)(\s.*)/$2 $1$3/',-1,fullname);
if prxmatch(pid,fullname) then
firstname=prxposn(pid,1,fullname);
middlename=prxposn(pid,2,fullname);
lastname=prxposn(pid,3,fullname);
suffix=prxposn(pid,4,fullname);
cards;
LAWRENCE P POWERS, III
David R. Fox
ROBERT STEVEN SHAPIRO
JOHN E KIRK, IV
A JOHN KIRK
;
proc print;
run;
Fantastic!
May I request you to tell me about \w+ in prxparse and \S\b and /$2 $1$3/ in prxmatch?
Something else to consider besides last names with spaces are hyphenated names such as Jonh J Cartwright-Pickering.
Also assuming 2 characters is a middle name fails with many nationalities.
How many records do you have to process and will the source continue to NOT have moved into 1990's level database structures of having separate fields?
If you're going to have this going on into the future you really want to bring issues like this to whoever is collecting the data that there may be much more cost and likely errors in trying to post-process names than the cost to collect them in a better format.
data have; input Fullname_new $40.; cards; LAWRENCE P POWERS, III David R. Fox ROBERT STEVEN SHAPIRO JOHN E KIRK, IV A JOHN KIRK ; run; data want; set have; length Firstname Secondname Lastname Suffix temp $ 40; Firstname=scan(Fullname_new,1,,'ps'); Secondname=scan(Fullname_new,2,,'ps'); Lastname=scan(Fullname_new,3,,'ps'); Suffix=scan(Fullname_new,4,,'ps'); if length(Firstname)=1 then do;temp=Firstname;Firstname=Secondname;Secondname=temp;end; drop temp; run;
Xia Keshan
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.