- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I have a character variable with a string of first name and possibly middle initials or Jrs, Sr, etc. I want to scan the string and output the longest text in the string, assuming this will be the first name. I was using the following code:
data new;
set old;
firstname= scan(fname,1, ' ');
drop fname;
run;
to pull the first text before the space assuming it would be the first name, but some observations have middle initial before first name.
- Tags:
- scan
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Just continue searching. Assuming there are no more than 5 words in FNAME:
data want;
set have;
firstname = scan(fname, 1, ' ');
do k=2 to 5;
test = scan(fname, k, ' ');
if length(test) > length(firstname) then firstname=test;
end;
drop test k fname;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Just continue searching. Assuming there are no more than 5 words in FNAME:
data want;
set have;
firstname = scan(fname, 1, ' ');
do k=2 to 5;
test = scan(fname, k, ' ');
if length(test) > length(firstname) then firstname=test;
end;
drop test k fname;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
You may want to add a flag or possibly a note in the log if the longest "word" is something like III (for "the third") or any of your other suspect values like Junior or Senior spelled out in case you have values like "John Junior".
You may also want to look at your actual "word" count as some ethnicities may have a single name consisting or two or more "words". If the actual first name is something like "Bo Na Thone" or "Glory Moonshine" then keeping, our using Thone or Moonshine could be very incorrect.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
True, but if you have thousands of names, any idea on how you could efficiently be able to identify those?
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
@Melk wrote:
True, but if you have thousands of names, any idea on how you could efficiently be able to identify those?
The only truly robust method I know of is to change the data collection/entry to enforce first name, last name, middle(other) name, and a field for Jr, Sr, Dr, Ms Mrs or what have you.
I have had to clean up such data where names were entered all in one field and identifying multiple word name segments was only one of the headaches. I did have thousands of names which took lots of hours to clean up. I noted the processing time/costs to the clients for such things.
I minor bit was searching the name string first for the words like II, III, IV, JR, 2nd, 3rd and moving them to another variable before the next processing. Note that II and JR (or ii) really wants to use something like FINDW or INDEXW because that can appear as part of names though unlikely as stand alone words. TRANWRD could be used to remove specific known words like this.