Solved: Extract Name

hkim3677 · Posted 03-14-2018 11:57 PM

Hi,

Could you please help me extract first name of full name?

Data test;

input id fullname $;

1 Michael R. Boyce

2 James G. Brocksmith, Jr.

3 Gerald F. Fitzgerald, Jr.

4 Norman R. Bobins, BS, MBA

5 Peter Pace, USMC, Ret.

6 Ryan, Norman A.

7 Anne Newman Foreman

;

run;

I need to create "first_name" column with middle name and without middle name and "last_name" using the fullname information.

As you can see above, In ID=6, the full name start with the last name unlike other observations. If it is the case, I need to re-order it to be consistent with others.

Like...

Data test; set test;

input ID First_name_with_middle $ Last_name $ First_name_without_middle $

datalines;

1 Michael R. Boyce Michael

2 James G. Brocksmith James

3 Gerald F. Fitzgerald Gerald

4 Norman R. Bobins Norman

5 Peter Pace Peter

6 Norman A. Ryan Norman

7 Anne Newman Foreman Anne

;

run;

That is, the first name with middle name column should contain the middle name abbreviation as well.

And, the last name column only contains last name of the full name.

Please help me!

PGStats · Posted 03-15-2018 01:47 AM

PRX to the rescue...

Data test;
input id fullname $64.;
datalines;
1 Michael R. Boyce
2 James G. Brocksmith, Jr.
3 Gerald F. Fitzgerald, Jr.
4 Norman R. Bobins, BS, MBA
5 Peter Pace, USMC, Ret.
6 Ryan, Norman A.
7 Anne Newman Foreman
8 Michael R.S. Boyce
9 Ryan, Norman A., BS
;

data want;
length firstName lastName $20 initials $6;
if not prxId1 then prxId1 + prxParse("/^(\w+)\s+(([A-Z]\.)*\s*)([^,]+)/");
if not prxId2 then prxId2 + prxParse("/^(\w+),\s*(\w+)\s+(([A-Z]\.)*)/");
set test;
if prxMatch(prxId1, fullname) then do;
    firstName = prxPosn(prxId1, 1, fullname);
    initials = prxPosn(prxId1, 2, fullname);
    lastName = prxPosn(prxId1, 4, fullname);
    end;
else if prxMatch(prxId2, fullname) then do;
    firstName = prxPosn(prxId2, 2, fullname);
    initials = prxPosn(prxId2, 3, fullname);
    lastName = prxPosn(prxId2, 1, fullname);
    end;
drop prxId: ;
run;

proc print data=want noobs; run;

      first
      Name       lastName          initials    id    fullname

      Michael    Boyce               R.         1    Michael R. Boyce
      James      Brocksmith          G.         2    James G. Brocksmith, Jr.
      Gerald     Fitzgerald          F.         3    Gerald F. Fitzgerald, Jr.
      Norman     Bobins              R.         4    Norman R. Bobins, BS, MBA
      Peter      Pace                           5    Peter Pace, USMC, Ret.
      Norman     Ryan                A.         6    Ryan, Norman A.
      Anne       Newman Foreman                 7    Anne Newman Foreman
      Michael    Boyce               R.S.       8    Michael R.S. Boyce
      Norman     Ryan                A.         9    Ryan, Norman A., BS

PG

View solution in original post

kiranv_ · Posted 03-15-2018 01:23 AM

I just got confused with variables you want in your output dataset

hkim3677 · Posted 03-15-2018 01:47 AM

Sorry. The wanted dataset should look like..

Data test;
infile datalines DSD;
input ID First_name_with_middle $ Last_name $ First_name_without_middle $ ;
datalines;
1, Michael R., Boyce, Michael
2, James G., Brocksmith, James
3, Gerald F., Fitzgerald, Gerald
4, Norman R., Bobins, Norman
5, Peter, Pace, Peter
6, Norman A., Ryan, Norman
7, Anne Newman, Foreman, Anne
;

run;

But, when I ran the code above, First_name_with_middle column shows that some observations do not have a dot after middle name abbreviation. I do not know why... but I need the dot after the middle name abbreviation if there is a middle name abbreviation.

PGStats · Posted 03-15-2018 01:47 AM

PRX to the rescue...

Data test;
input id fullname $64.;
datalines;
1 Michael R. Boyce
2 James G. Brocksmith, Jr.
3 Gerald F. Fitzgerald, Jr.
4 Norman R. Bobins, BS, MBA
5 Peter Pace, USMC, Ret.
6 Ryan, Norman A.
7 Anne Newman Foreman
8 Michael R.S. Boyce
9 Ryan, Norman A., BS
;

data want;
length firstName lastName $20 initials $6;
if not prxId1 then prxId1 + prxParse("/^(\w+)\s+(([A-Z]\.)*\s*)([^,]+)/");
if not prxId2 then prxId2 + prxParse("/^(\w+),\s*(\w+)\s+(([A-Z]\.)*)/");
set test;
if prxMatch(prxId1, fullname) then do;
    firstName = prxPosn(prxId1, 1, fullname);
    initials = prxPosn(prxId1, 2, fullname);
    lastName = prxPosn(prxId1, 4, fullname);
    end;
else if prxMatch(prxId2, fullname) then do;
    firstName = prxPosn(prxId2, 2, fullname);
    initials = prxPosn(prxId2, 3, fullname);
    lastName = prxPosn(prxId2, 1, fullname);
    end;
drop prxId: ;
run;

proc print data=want noobs; run;

      first
      Name       lastName          initials    id    fullname

      Michael    Boyce               R.         1    Michael R. Boyce
      James      Brocksmith          G.         2    James G. Brocksmith, Jr.
      Gerald     Fitzgerald          F.         3    Gerald F. Fitzgerald, Jr.
      Norman     Bobins              R.         4    Norman R. Bobins, BS, MBA
      Peter      Pace                           5    Peter Pace, USMC, Ret.
      Norman     Ryan                A.         6    Ryan, Norman A.
      Anne       Newman Foreman                 7    Anne Newman Foreman
      Michael    Boyce               R.S.       8    Michael R.S. Boyce
      Norman     Ryan                A.         9    Ryan, Norman A., BS

PG

hkim3677 · Posted 03-15-2018 01:49 AM

Thank you very much!

hkim3677 · Posted 03-15-2018 02:20 AM

Just one more quick question.

If I have the fullname.. like...

Lavizzo-Mourey, Evans

B. Evan Bayh, III

Those names are not correctly divided into firstname and lastname using the above code.

Thank you!

PGStats · Posted 03-15-2018 02:59 PM

The possibilities are endless... You can account for hyphenated names and firstnames easily, but frankly I don't even know what's what in "B. Evan Bayh, III". If such "names" are rare, you might be better sending them to a separate table and treating them by hand...

Data test;
input id fullname $64.;
datalines;
1 Michael R. Boyce
2 James G. Brocksmith, Jr.
3 Gerald F. Fitzgerald, Jr.
4 Norman R. Bobins, BS, MBA
5 Peter Pace, USMC, Ret.
6 Ryan, Norman A.
7 Anne Newman Foreman
8 Michael R.S. Boyce
9 Ryan, Norman A., BS
10 Lavizzo-Mourey, Evans
11 B. Evan Bayh, III
;

data want except(keep=id fullname);
length firstName lastName $20 initials $6;
if not prxId1 then prxId1 + prxParse("/^([\w-]+)\s+(([A-Z]\.)*\s*)([^,]+)/");
if not prxId2 then prxId2 + prxParse("/^([\w-]+),\s*([\w-]+)\s+(([A-Z]\.)*)/");
set test;
if prxMatch(prxId1, fullname) then do;
    firstName = prxPosn(prxId1, 1, fullname);
    initials = prxPosn(prxId1, 2, fullname);
    lastName = prxPosn(prxId1, 4, fullname);
    output want;
    end;
else if prxMatch(prxId2, fullname) then do;
    firstName = prxPosn(prxId2, 2, fullname);
    initials = prxPosn(prxId2, 3, fullname);
    lastName = prxPosn(prxId2, 1, fullname);
    output want;
    end;
else output except;
drop prxId: ;
run;

title "Matched names";
proc print data=want noobs; run;

title "Unmatched names";
proc print data=except noobs; run;

PG

Extract Name

Re: Extract Name

Re: Extract Name

Re: Extract Name

Re: Extract Name

Re: Extract Name

Re: Extract Name

Re: Extract Name

SAS Innovate 2025: Save the Date

SAS Training: Just a Click Away