Solved: Removing duplicate rows with case sensitive values

amamiche67 · Posted 03-30-2023 08:12 PM

I have a dataset with duplicate names, yet same IDs. It looks like this. I received this dataset as is, so not sure why the names are case sensitive.

ID	FirstName	LastName
001	John	Reed
001	JOHN	REED
002	Mitchel	James
003	Steph	Ania
004	King	Mon
004	KING	MON
005	Valery	Short
012	ALMA	JACOBS
012	Alma	Jacobs
017	Josh	Ryan
017	Jash	Ryan

I need to identify those with the exact same spelling for firstname and lastname variables to remove the duplicates. I want the final dataset to look like this table below. However, given that I have around 800 rows and the names are duplicates due to being case sensitive (except for ID 017, which is a duplicate due to a typo), I'm unsure of how to approach this. Does anyone have any suggestions and/or solutions?

ID	FirstName	LastName
001	John	Reed
002	Mitchel	James
003	Steph	Ania
004	King	Mon
005	Valery	Short
012	Alma	Jacobs
017	Josh	Ryan
017	Jash	Ryan

SASKiwi · Posted 03-30-2023 09:03 PM

Something like this?

data Have;
  input @1 ID $3. @5 FirstName $8. @12 LastName $8.;
  FirstName = propcase(FirstName);
  LastName = propcase(LastName);
datalines;
001 John    Reed
001 JOHN    REED
002 Mitchel James
003 Steph   Ania
004 King    Mon
004 KING    MON
005 Valery  Short
012 ALMA    JACOBS
012 Alma    Jacobs
017 Josh    Ryan
017 Jash    Ryan
;
run;

proc sort data = Have 
          out  = Want nodupkey;
  by ID LastName FirstName;
run;

View solution in original post

SASKiwi · Posted 03-30-2023 09:03 PM

Something like this?

data Have;
  input @1 ID $3. @5 FirstName $8. @12 LastName $8.;
  FirstName = propcase(FirstName);
  LastName = propcase(LastName);
datalines;
001 John    Reed
001 JOHN    REED
002 Mitchel James
003 Steph   Ania
004 King    Mon
004 KING    MON
005 Valery  Short
012 ALMA    JACOBS
012 Alma    Jacobs
017 Josh    Ryan
017 Jash    Ryan
;
run;

proc sort data = Have 
          out  = Want nodupkey;
  by ID LastName FirstName;
run;

SimonDawson · Posted 03-30-2023 09:17 PM

SASKiwis idea will work if you don't need to retain the first case in the data. Here is what I cam up with. This uses a hash object so will only scale to the size of the available memory on your system.

data have;
  length ID 8 FirstName LastName $8;
  input ID FirstName LastName;
cards;
001 John Reed
001 JOHN REED
002 Mitchel James
003 Steph Ania
004 King Mon
004 KING MON
005 Valery Short
012 ALMA JACOBS
012 Alma Jacobs
017 Josh Ryan
017 Jash Ryan
;
run;

data want;
    if _n_ = 1 then do;
        declare hash h();
        rc = h.defineKey('ID','FirstName_upcase','LastName_upcase');
        rc = h.defineDone();
    end;

    set have;

    FirstName_upcase = upcase(FirstName);
    LastName_upcase = upcase(LastName);

    rc = h.find();
    if (rc ~= 0) then do;
        rc = h.add();
        output;
    end;

  keep ID FirstName LastName;

run;

This will keep the first observation with unique spelling and retain the case. If the varaible was "JoHn" in the first observation the propcase method would make this "John".

Removing duplicate rows with case sensitive values

Re: Removing duplicate rows with case sensitive values

Re: Removing duplicate rows with case sensitive values

Re: Removing duplicate rows with case sensitive values

Removing duplicate rows with case sensitive values

Re: Removing duplicate rows with case sensitive values

Re: Removing duplicate rows with case sensitive values

Re: Removing duplicate rows with case sensitive values

Registration is open

SAS Training: Just a Click Away