Hi all,
I have a very, very large dataset with a variable that should hold full names. The names should be formatted as Last, First (separated by a comma) or First Last (separated by a space), but there is dirt. There can also be middle names after the first names in both of those examples:
v1
John Smith
Dan Fred Jones
Doe, Jane
Hamilton, Diane A
I'm trying to do a bunch of cleaning and I could use some help with how to set up one of the more complex steps.
Since in the data, last names are much, much more likely to be hyphenated than first names, I'm applying the following rules:
1. If a value has one (and only one) hyphen, check to see the format of the value
2. If the value has a comma in it, the hyphenated word should become the first word in the string (last name position) if it is not already
3. If the value has no comma, the hyphenated word moves to the end of the string if it is not already (last name position)
Values with 0 hyphens, or >1 hyphen, should be left intact.
Example:
before
v1
Jane Doe
Smith-John-F
Jones-Anderson, Dan
Kate, Simons-Hunt
Parker-Parks Ashley
after
v1
Jane Doe
Smith-John-F
Jones-Anderson, Dan
Simons-Hunt, Kate
Ashley Parker-Parks
The first two do not have one and only one hyphen so they are left alone. The third has the hyphenated name in the correct location already, so it is left alone. The next one (in red) has only one hyphen, so it needs to be cleaned. The value has a comma, so the hyphenated name should be the first word in the string, which it is not. This value would need to be cleaned. Similarly, the last record only has one hyphen. It has no comma so the hyphenated word should be the last word in the string, which it is not. This value would need to be cleaned.
Any help is much appreciated. I haven't been able to figure out an approach yet, so I can't provide what I have so far.
Since those with (0, >1) hyphens are to be left alone, I check for that first. After that, just check for a comma and position of the hyphen relative to the comma. If there is no comma, check for the position of the hyphen relative to the space. I've only reassigned the values when necessary.
data have;
  infile cards truncover;
  input Name $300.;
  cards;
Jane Doe
Smith-John-F
Jones-Anderson, Dan
Kate, Simons-Hunt
Parker-Parks Ashley
  ;
run;
data want;
  set have;
  if countc(name,'-') eq 1 then do;
      if countc(name,',') eq 1 and countc(scan(name,1,','),'-') eq 0 
          then name = catx(', ',scan(name,2,','),scan(name,1,','));
        else if countc(name,',') eq 0 and countc(scan(name,2,' '),'-') eq 0 
            then name = catx(' ',scan(name,2,' '),scan(name,1,' '));
    end;
run;Since those with (0, >1) hyphens are to be left alone, I check for that first. After that, just check for a comma and position of the hyphen relative to the comma. If there is no comma, check for the position of the hyphen relative to the space. I've only reassigned the values when necessary.
data have;
  infile cards truncover;
  input Name $300.;
  cards;
Jane Doe
Smith-John-F
Jones-Anderson, Dan
Kate, Simons-Hunt
Parker-Parks Ashley
  ;
run;
data want;
  set have;
  if countc(name,'-') eq 1 then do;
      if countc(name,',') eq 1 and countc(scan(name,1,','),'-') eq 0 
          then name = catx(', ',scan(name,2,','),scan(name,1,','));
        else if countc(name,',') eq 0 and countc(scan(name,2,' '),'-') eq 0 
            then name = catx(' ',scan(name,2,' '),scan(name,1,' '));
    end;
run;That worked, thank you!
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.
