I have a dataset with duplicate names, yet same IDs. It looks like this. I received this dataset as is, so not sure why the names are case sensitive.
ID | FirstName | LastName |
001 | John | Reed |
001 | JOHN | REED |
002 | Mitchel | James |
003 | Steph | Ania |
004 | King | Mon |
004 | KING | MON |
005 | Valery | Short |
012 | ALMA | JACOBS |
012 | Alma | Jacobs |
017 | Josh | Ryan |
017 | Jash | Ryan |
I need to identify those with the exact same spelling for firstname and lastname variables to remove the duplicates. I want the final dataset to look like this table below. However, given that I have around 800 rows and the names are duplicates due to being case sensitive (except for ID 017, which is a duplicate due to a typo), I'm unsure of how to approach this. Does anyone have any suggestions and/or solutions?
ID | FirstName | LastName |
001 | John | Reed |
002 | Mitchel | James |
003 | Steph | Ania |
004 | King | Mon |
005 | Valery | Short |
012 | Alma | Jacobs |
017 | Josh | Ryan |
017 | Jash | Ryan |
Something like this?
data Have;
input @1 ID $3. @5 FirstName $8. @12 LastName $8.;
FirstName = propcase(FirstName);
LastName = propcase(LastName);
datalines;
001 John Reed
001 JOHN REED
002 Mitchel James
003 Steph Ania
004 King Mon
004 KING MON
005 Valery Short
012 ALMA JACOBS
012 Alma Jacobs
017 Josh Ryan
017 Jash Ryan
;
run;
proc sort data = Have
out = Want nodupkey;
by ID LastName FirstName;
run;
Something like this?
data Have;
input @1 ID $3. @5 FirstName $8. @12 LastName $8.;
FirstName = propcase(FirstName);
LastName = propcase(LastName);
datalines;
001 John Reed
001 JOHN REED
002 Mitchel James
003 Steph Ania
004 King Mon
004 KING MON
005 Valery Short
012 ALMA JACOBS
012 Alma Jacobs
017 Josh Ryan
017 Jash Ryan
;
run;
proc sort data = Have
out = Want nodupkey;
by ID LastName FirstName;
run;
SASKiwis idea will work if you don't need to retain the first case in the data. Here is what I cam up with. This uses a hash object so will only scale to the size of the available memory on your system.
data have;
length ID 8 FirstName LastName $8;
input ID FirstName LastName;
cards;
001 John Reed
001 JOHN REED
002 Mitchel James
003 Steph Ania
004 King Mon
004 KING MON
005 Valery Short
012 ALMA JACOBS
012 Alma Jacobs
017 Josh Ryan
017 Jash Ryan
;
run;
data want;
if _n_ = 1 then do;
declare hash h();
rc = h.defineKey('ID','FirstName_upcase','LastName_upcase');
rc = h.defineDone();
end;
set have;
FirstName_upcase = upcase(FirstName);
LastName_upcase = upcase(LastName);
rc = h.find();
if (rc ~= 0) then do;
rc = h.add();
output;
end;
keep ID FirstName LastName;
run;
This will keep the first observation with unique spelling and retain the case. If the varaible was "JoHn" in the first observation the propcase method would make this "John".
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.