BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Melk
Lapis Lazuli | Level 10

I have a character variable with a string of first name and possibly middle initials or Jrs, Sr, etc. I want to scan the string and output the longest text in the string, assuming this will be the first name. I was using the following code:

 

data new;
    set old;
        firstname= scan(fname,1, ' ');   
        drop fname;
run;

 

to pull the first text before the space assuming it would be the first name, but some observations have middle initial before first name.

1 ACCEPTED SOLUTION

Accepted Solutions
Astounding
PROC Star

Just continue searching.  Assuming there are no more than 5 words in FNAME:

 

data want;

set have;

firstname = scan(fname, 1, ' ');

do k=2 to 5;

   test = scan(fname, k, ' ');

   if length(test) > length(firstname) then firstname=test;

end;

drop test k fname;

run;

View solution in original post

4 REPLIES 4
Astounding
PROC Star

Just continue searching.  Assuming there are no more than 5 words in FNAME:

 

data want;

set have;

firstname = scan(fname, 1, ' ');

do k=2 to 5;

   test = scan(fname, k, ' ');

   if length(test) > length(firstname) then firstname=test;

end;

drop test k fname;

run;

ballardw
Super User

You may want to add a flag or possibly a note in the log if the longest "word" is something like III (for "the third") or any of your other suspect values like Junior or Senior spelled out in case you have values like "John Junior".

 

You may also want to look at your actual "word" count as some ethnicities may have a single name consisting or two or more "words". If the actual first name is something like "Bo Na Thone" or "Glory Moonshine" then keeping, our using Thone or Moonshine could be very incorrect.

Melk
Lapis Lazuli | Level 10

True, but if you have thousands of names, any idea on how you could efficiently be able to identify those?

ballardw
Super User

@Melk wrote:

True, but if you have thousands of names, any idea on how you could efficiently be able to identify those?


The only truly robust method I know of is to change the data collection/entry to enforce first name, last name, middle(other) name, and a field for Jr, Sr, Dr, Ms Mrs or what have you.

 

I have had to clean up such data where names were entered all in one field and identifying multiple word name segments was only one of the headaches. I did have thousands of names which took lots of hours to clean up. I noted the processing time/costs to the clients for such things.

 

I minor bit was searching the name string first for the words like II, III, IV, JR, 2nd, 3rd and moving them to another variable before the next processing. Note that II and JR (or ii) really wants to use something like FINDW or INDEXW because that can appear as part of names though unlikely as stand alone words. TRANWRD could be used to remove specific known words like this.

 

 

 

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 4 replies
  • 824 views
  • 3 likes
  • 3 in conversation