BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Walternate
Obsidian | Level 7

Hi all,

 

I have a very, very large dataset with a variable that should hold full names. The names should be formatted as Last, First  (separated by a comma) or First Last (separated by a space), but there is dirt. There can also be middle names after the first names in both of those examples:

 

v1

John Smith

Dan Fred Jones

Doe, Jane

Hamilton, Diane A

 

I'm trying to do a bunch of cleaning and I could use some help with how to set up one of the more complex steps. 

 

Since in the data, last names are much, much more likely to be hyphenated than first names, I'm applying the following rules:

 

1. If a value has one (and only one) hyphen, check to see the format of the value

2. If the value has a comma in it, the hyphenated word should become the first word in the string (last name position) if it is not already

3. If the value has no comma, the hyphenated word moves to the end of the string if it is not already (last name position)

 

Values with 0 hyphens, or >1 hyphen, should be left intact. 

 

Example:

 

before

v1

Jane Doe

Smith-John-F

Jones-Anderson, Dan

Kate, Simons-Hunt

Parker-Parks Ashley

 

after

v1

Jane Doe

Smith-John-F

Jones-Anderson, Dan

Simons-Hunt, Kate

Ashley Parker-Parks

 

The first two do not have one and only one hyphen so they are left alone. The third has the hyphenated name in the correct location already, so it is left alone. The next one (in red) has only one hyphen, so it needs to be cleaned. The value has a comma, so the hyphenated name should be the first word in the string, which it is not. This value would need to be cleaned. Similarly, the last record only has one hyphen. It has no comma so the hyphenated word should be the last word in the string, which it is not. This value would need to be cleaned. 

 

Any help is much appreciated. I haven't been able to figure out an approach yet, so I can't provide what I have so far.

 

1 ACCEPTED SOLUTION

Accepted Solutions
Duggins
Obsidian | Level 7

Since those with (0, >1) hyphens are to be left alone, I check for that first. After that, just check for a comma and position of the hyphen relative to the comma. If there is no comma, check for the position of the hyphen relative to the space. I've only reassigned the values when necessary.

 

data have;
  infile cards truncover;
  input Name $300.;
  cards;
Jane Doe
Smith-John-F
Jones-Anderson, Dan
Kate, Simons-Hunt
Parker-Parks Ashley
  ;
run;

data want;
  set have;
  if countc(name,'-') eq 1 then do;
      if countc(name,',') eq 1 and countc(scan(name,1,','),'-') eq 0 
          then name = catx(', ',scan(name,2,','),scan(name,1,','));
        else if countc(name,',') eq 0 and countc(scan(name,2,' '),'-') eq 0 
            then name = catx(' ',scan(name,2,' '),scan(name,1,' '));
    end;
run;

View solution in original post

2 REPLIES 2
Duggins
Obsidian | Level 7

Since those with (0, >1) hyphens are to be left alone, I check for that first. After that, just check for a comma and position of the hyphen relative to the comma. If there is no comma, check for the position of the hyphen relative to the space. I've only reassigned the values when necessary.

 

data have;
  infile cards truncover;
  input Name $300.;
  cards;
Jane Doe
Smith-John-F
Jones-Anderson, Dan
Kate, Simons-Hunt
Parker-Parks Ashley
  ;
run;

data want;
  set have;
  if countc(name,'-') eq 1 then do;
      if countc(name,',') eq 1 and countc(scan(name,1,','),'-') eq 0 
          then name = catx(', ',scan(name,2,','),scan(name,1,','));
        else if countc(name,',') eq 0 and countc(scan(name,2,' '),'-') eq 0 
            then name = catx(' ',scan(name,2,' '),scan(name,1,' '));
    end;
run;
Walternate
Obsidian | Level 7

That worked, thank you!

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 2 replies
  • 960 views
  • 1 like
  • 2 in conversation