DATA Step, Macro, Functions and more

Variable split

Reply
Super Contributor
Posts: 426

Variable split

I've a data like this,

Dataset name: name_split

Fullname_new
LAWRENCE P POWERS, III
David R. Fox
ROBERT STEVEN SHAPIRO
JOHN E KIRK, IV
A JOHN KIRK

I want data like this,

FirstnameSecondnameLastnameSuffix
LAWRENCEPPOWERSIII
DavidRFox
ROBERTSTEVENSHAPIRO
JOHN EKIRKIV
JOHN AKIRK

I've tried this code,

data new_name;

        set name_split;

               

        first_name=scan(fullname_new,1, ' ');

        Middle_name=scan(fullname_new,2, ' ');

        last_name=scan(fullname_new,3, ' ');

                if length(Middle_name) gt 2 then

                last_name=Middle_name;

        if last_name=Middle_name then

                Middle_name=' ';

run;

data new_name_suff;

        set name_split;

        suffix=scan(fullname_new,-1,',');

        if length(suffix) gt 4 then

                delete;

run;

proc sort data=new_name;

        by fullname_new;

run;

proc sort data=new_name_suff;

        by fullname_new;

run;

data name;

        update new_name new_name_suff;

        by fullname_new;

run;

data fullname;

set name;

if length(first_name) lt 2 then

Middle_name=scan(fullname_new,1, ' ');

First_name=scan(fullname_new,2, ' ');

last_name=scan(fullname_new,3, ' ');

run;


data fullname1;

        update name fullname;

        by fullname_new;

run;

Although this code works with the data which I've, I'm worried if either middle name or suffix exceeds (in future data) greater than 2 or 4 characters. There should not be any special characters in derived fields as well.

I would appreciate for any help you offer me.

   

  

Valued Guide
Posts: 3,208

Re: Variable split

Your question is an unsolvable one.

- How can you know whether it is a secondname or a double word lastname

- think about this one CSV comma separated. Than de decimal parts are comma delimited.  What is the meaning of a comma?

Unless you have some additional information you cannot segregate.

---->-- ja karman --<-----
Super Contributor
Posts: 426

Re: Variable split

- How can you know whether it is a secondname or a double word lastname - letter with more than two characters are treated as last name.

- think about this one CSV comma separated. Than de decimal parts are comma delimited.  What is the meaning of a comma? - commas are from flat file which we need to remove during\after split but before result.

Unless you have some additional information you cannot segregate. - any possibilities to optimize my code or generate a new code to produce the result as I mentioned?

Valued Guide
Posts: 3,208

Re: Variable split

There are options in the Input statement using pointer controls SAS(R) 9.4 Statements: Reference, Third Edition

or allow reading space (& : ~) SAS(R) 9.4 Statements: Reference, Third Edition

Using fcmp with own defined formats or the automatic _infile_ variable used with SAS functions

In case the basic string /input functions of SAS are not sufficient you can use http://www2.sas.com/proceedings/sugi29/265-29.pdf

More possibilities to optimize and change as I would like to try.

If you are learning than it is getting aware of those. If you are delivering a project it is applying those with coding at cost/profit working hours. At what one are you in, I was guessing learning.

---->-- ja karman --<-----
Super Contributor
Posts: 275

Re: Variable split

data have;

pid=prxparse('/(\w+) (\w+\.?) (\w+)\,?(\D+)/');

input fullname $50.;

if prxmatch('/^\S\b/',fullname)>0 then fullname=prxchange('s/^(\S\b)(\s\w+)(\s.*)/$2 $1$3/',-1,fullname);

if prxmatch(pid,fullname) then

   firstname=prxposn(pid,1,fullname);

   middlename=prxposn(pid,2,fullname);

   lastname=prxposn(pid,3,fullname);

   suffix=prxposn(pid,4,fullname);

cards;

LAWRENCE P POWERS, III

David R. Fox

ROBERT STEVEN SHAPIRO

JOHN E KIRK, IV

A JOHN KIRK

;

proc print;

run;

Super Contributor
Posts: 426

Re: Variable split

Fantastic!

May I request you to tell me about \w+ in prxparse and \S\b and /$2 $1$3/ in prxmatch?

Super User
Posts: 10,466

Re: Variable split

Something else to consider besides last names with spaces are hyphenated names such as Jonh J Cartwright-Pickering.

Also assuming 2 characters is a middle name fails with many nationalities.

How many records do you have to process and will the source continue to NOT have moved into 1990's level database structures of having separate fields?

If you're going to have this going on into the future you really want to bring issues like this to whoever is collecting the data that there may be much more cost and likely errors in trying to post-process names than the cost to collect them in a better format.

Super User
Posts: 9,662

Re: Variable split

data have;
input Fullname_new $40.;
cards;
LAWRENCE P POWERS, III
David R. Fox
ROBERT STEVEN SHAPIRO
JOHN E KIRK, IV
A JOHN KIRK
;
run;
data want;
 set have;
 length Firstname     Secondname     Lastname     Suffix temp $ 40;
  Firstname=scan(Fullname_new,1,,'ps');
  Secondname=scan(Fullname_new,2,,'ps');
  Lastname=scan(Fullname_new,3,,'ps');
  Suffix=scan(Fullname_new,4,,'ps');
 if length(Firstname)=1 then do;temp=Firstname;Firstname=Secondname;Secondname=temp;end;
 drop temp;
run;


Xia Keshan

Ask a Question
Discussion stats
  • 7 replies
  • 444 views
  • 1 like
  • 5 in conversation