Re: Variable split

Babloo · Posted 04-03-2015 08:23 AM

I've a data like this,

Dataset name: name_split

Fullname_new

LAWRENCE P POWERS, III

David R. Fox

ROBERT STEVEN SHAPIRO

JOHN E KIRK, IV

A JOHN KIRK

I want data like this,

Firstname	Secondname	Lastname	Suffix
LAWRENCE	P	POWERS	III
David	R	Fox
ROBERT	STEVEN	SHAPIRO
JOHN	E	KIRK	IV
JOHN	A	KIRK

I've tried this code,

data new_name;

set name_split;

first_name=scan(fullname_new,1, ' ');

Middle_name=scan(fullname_new,2, ' ');

last_name=scan(fullname_new,3, ' ');

if length(Middle_name) gt 2 then

last_name=Middle_name;

if last_name=Middle_name then

Middle_name=' ';

run;

data new_name_suff;

set name_split;

suffix=scan(fullname_new,-1,',');

if length(suffix) gt 4 then

delete;

run;

proc sort data=new_name;

by fullname_new;

run;

proc sort data=new_name_suff;

by fullname_new;

run;

data name;

update new_name new_name_suff;

by fullname_new;

run;

data fullname;

set name;

if length(first_name) lt 2 then

Middle_name=scan(fullname_new,1, ' ');

First_name=scan(fullname_new,2, ' ');

last_name=scan(fullname_new,3, ' ');

run;

data fullname1;

update name fullname;

by fullname_new;

run;

Although this code works with the data which I've, I'm worried if either middle name or suffix exceeds (in future data) greater than 2 or 4 characters. There should not be any special characters in derived fields as well.

I would appreciate for any help you offer me.

jakarman · Posted 04-03-2015 09:49 AM

Your question is an unsolvable one.

- How can you know whether it is a secondname or a double word lastname

- think about this one CSV comma separated. Than de decimal parts are comma delimited. What is the meaning of a comma?

Unless you have some additional information you cannot segregate.

---->-- ja karman --<-----

Babloo · Posted 04-03-2015 10:04 AM

- How can you know whether it is a secondname or a double word lastname - letter with more than two characters are treated as last name.

- think about this one CSV comma separated. Than de decimal parts are comma delimited. What is the meaning of a comma? - commas are from flat file which we need to remove during\after split but before result.

Unless you have some additional information you cannot segregate. - any possibilities to optimize my code or generate a new code to produce the result as I mentioned?

jakarman · Posted 04-03-2015 10:28 AM

There are options in the Input statement using pointer controls SAS(R) 9.4 Statements: Reference, Third Edition

or allow reading space (& : ~) SAS(R) 9.4 Statements: Reference, Third Edition

Using fcmp with own defined formats or the automatic _infile_ variable used with SAS functions

In case the basic string /input functions of SAS are not sufficient you can use http://www2.sas.com/proceedings/sugi29/265-29.pdf

More possibilities to optimize and change as I would like to try.

If you are learning than it is getting aware of those. If you are delivering a project it is applying those with coding at cost/profit working hours. At what one are you in, I was guessing learning.

---->-- ja karman --<-----

slchen · Posted 04-03-2015 10:34 AM

data have;

pid=prxparse('/(\w+) (\w+\.?) (\w+)\,?(\D+)/');

input fullname $50.;

if prxmatch('/^\S\b/',fullname)>0 then fullname=prxchange('s/^(\S\b)(\s\w+)(\s.*)/$2 $1$3/',-1,fullname);

if prxmatch(pid,fullname) then

firstname=prxposn(pid,1,fullname);

middlename=prxposn(pid,2,fullname);

lastname=prxposn(pid,3,fullname);

suffix=prxposn(pid,4,fullname);

cards;

LAWRENCE P POWERS, III

David R. Fox

ROBERT STEVEN SHAPIRO

JOHN E KIRK, IV

A JOHN KIRK

;

proc print;

run;

Babloo · Posted 04-03-2015 10:44 AM

Fantastic!

May I request you to tell me about \w+ in prxparse and \S\b and /$2 $1$3/ in prxmatch?

ballardw · Posted 04-03-2015 10:57 AM

Something else to consider besides last names with spaces are hyphenated names such as Jonh J Cartwright-Pickering.

Also assuming 2 characters is a middle name fails with many nationalities.

How many records do you have to process and will the source continue to NOT have moved into 1990's level database structures of having separate fields?

If you're going to have this going on into the future you really want to bring issues like this to whoever is collecting the data that there may be much more cost and likely errors in trying to post-process names than the cost to collect them in a better format.

Ksharp · Posted 04-03-2015 11:07 AM

data have;
input Fullname_new $40.;
cards;
LAWRENCE P POWERS, III
David R. Fox
ROBERT STEVEN SHAPIRO
JOHN E KIRK, IV
A JOHN KIRK
;
run;
data want;
 set have;
 length Firstname     Secondname     Lastname     Suffix temp $ 40;
  Firstname=scan(Fullname_new,1,,'ps');
  Secondname=scan(Fullname_new,2,,'ps');
  Lastname=scan(Fullname_new,3,,'ps');
  Suffix=scan(Fullname_new,4,,'ps');
 if length(Firstname)=1 then do;temp=Firstname;Firstname=Secondname;Secondname=temp;end;
 drop temp;
run;

Xia Keshan

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away