I have a character variable with a string of first name and possibly middle initials or Jrs, Sr, etc. I want to scan the string and output the longest text in the string, assuming this will be the first name. I was using the following code:
data new;
set old;
firstname= scan(fname,1, ' ');
drop fname;
run;
to pull the first text before the space assuming it would be the first name, but some observations have middle initial before first name.
Just continue searching. Assuming there are no more than 5 words in FNAME:
data want;
set have;
firstname = scan(fname, 1, ' ');
do k=2 to 5;
test = scan(fname, k, ' ');
if length(test) > length(firstname) then firstname=test;
end;
drop test k fname;
run;
Just continue searching. Assuming there are no more than 5 words in FNAME:
data want;
set have;
firstname = scan(fname, 1, ' ');
do k=2 to 5;
test = scan(fname, k, ' ');
if length(test) > length(firstname) then firstname=test;
end;
drop test k fname;
run;
You may want to add a flag or possibly a note in the log if the longest "word" is something like III (for "the third") or any of your other suspect values like Junior or Senior spelled out in case you have values like "John Junior".
You may also want to look at your actual "word" count as some ethnicities may have a single name consisting or two or more "words". If the actual first name is something like "Bo Na Thone" or "Glory Moonshine" then keeping, our using Thone or Moonshine could be very incorrect.
True, but if you have thousands of names, any idea on how you could efficiently be able to identify those?
@Melk wrote:
True, but if you have thousands of names, any idea on how you could efficiently be able to identify those?
The only truly robust method I know of is to change the data collection/entry to enforce first name, last name, middle(other) name, and a field for Jr, Sr, Dr, Ms Mrs or what have you.
I have had to clean up such data where names were entered all in one field and identifying multiple word name segments was only one of the headaches. I did have thousands of names which took lots of hours to clean up. I noted the processing time/costs to the clients for such things.
I minor bit was searching the name string first for the words like II, III, IV, JR, 2nd, 3rd and moving them to another variable before the next processing. Note that II and JR (or ii) really wants to use something like FINDW or INDEXW because that can appear as part of names though unlikely as stand alone words. TRANWRD could be used to remove specific known words like this.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.