BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
jfaruqui
Obsidian | Level 7

I have a dataset which has a number of variables including NAME.

I am trying to delete duplicate observations of name where one observation would be 'John Smith' and another observation would be 'smith john' .. they are clearly the same person and i want to delete the duplicate entry .. what would be the most efficient way to do it ?

considering also that the duplicate names could occur anywhere within the dataset.

 

Ex:

John Smith

Cal Harper

freddy Holt

smith john

frank waters

harper Cal

1 ACCEPTED SOLUTION

Accepted Solutions
novinosrin
Tourmaline | Level 20

@jfaruqui  Ok Lets go linear 

 


data have;
input name $50.;
cards;
John Smith
Cal Harper
freddy Holt
smith john
frank waters
harper Cal
;
run;


data t;
set have;
array t(50) $1 _temporary_;
call missing(of t(*));
n=compress(upcase(name));
do _n_=1 to length(n);
t(_n_)=char(n,_n_);
end;
call sortc(of t(*));
w=cats(of t(*));
run;
proc sort data=t out=want(drop=w n) nodupkey;
by w;
run; 

View solution in original post

24 REPLIES 24
novinosrin
Tourmaline | Level 20

I responded the same question in your other thread

https://communities.sas.com/t5/SAS-Programming/deleting-duplicates/m-p/510553/highlight/false#M13739...

 

data have;
input name $50.;
cards;
John Smith
Cal Harper
freddy Holt
smith john
frank waters
harper Cal
;
run;

data t;
set have;
array t(50) $1 _temporary_;
call missing(of t(*));
call pokelong(compress(upcase(name)),addrlong(t(1)),50);
call sortc(of t(*));
w=cats(of t(*));
run;
proc sort data=t out=want(drop=w) nodupkey;
by w;
run; 
novinosrin
Tourmaline | Level 20

I don't know how to merge the threads although I could request @Reeza / @Kurt_Bremser to help merge the duplicate threads

 

Going forward, Kindly edit in the same thread you started plz

jfaruqui
Obsidian | Level 7
thanks bro .. sorry for the duplicate threads .. unfortunately I can't test your code on my dataset in SAS University Edition as I am getting these error messages:

ERROR: The function POKELONG cannot be invoked when SAS is in the lockdown state.
ERROR: The function ADDRLONG cannot be invoked when SAS is in the lockdown state.
ERROR 251-185: The subroutine POKELONG is unknown, or cannot be accessed. Check your spelling.
Either it was not found in the path(s) of executable images, or there was incorrect or missing subroutine descriptor
information.

ERROR 68-185: The function ADDRLONG is unknown, or cannot be accessed.

But i am sure this solution would give the required result ... Thanks again bro !!
novinosrin
Tourmaline | Level 20

Hang on, if you are new/relatively new to SAS let alone APP, I beg your pardon, ignore the use of APP data management functions.

jfaruqui
Obsidian | Level 7
So new/relatively new that this is first time I heard about APP functions .. fascinating stuff though .. reading about it just now ..
novinosrin
Tourmaline | Level 20

Ok, just try the 32 bit version-->

data t;
set have;
array t(50) $1 _temporary_;
call missing(of t(*));
call poke(compress(upcase(name)),addr(t(1)),50);
call sortc(of t(*));
w=cats(of t(*));
run;
proc sort data=t out=want(drop=w) nodupkey;
by w;
run; 

 

 Test this and see if this works

jfaruqui
Obsidian | Level 7
Sorry Buddy 🙂
Same errors as before ..
I guess SAS University Edition is quite restrictive and doesn't allow access to memory .. POKE won't work either ... but I shall persevere .. highly obliged and grateful for your time and engagement to my query ..
novinosrin
Tourmaline | Level 20

@jfaruqui  Ok Lets go linear 

 


data have;
input name $50.;
cards;
John Smith
Cal Harper
freddy Holt
smith john
frank waters
harper Cal
;
run;


data t;
set have;
array t(50) $1 _temporary_;
call missing(of t(*));
n=compress(upcase(name));
do _n_=1 to length(n);
t(_n_)=char(n,_n_);
end;
call sortc(of t(*));
w=cats(of t(*));
run;
proc sort data=t out=want(drop=w n) nodupkey;
by w;
run; 
jfaruqui
Obsidian | Level 7
BINGO !!
Awesome and thank you so very much !!
jfaruqui
Obsidian | Level 7

I have a dataset which has a number of variables including NAME.

I am trying to delete duplicate observations of name where one observation would be 'John Smith' and another observation would be 'smith john' .. they are clearly the same person and i want to delete the duplicate entry .. what would be the most efficient way to do it ?

PeterClemmensen
Tourmaline | Level 20

How do you determine a duplicate then? What about "Johnn Smith"?

 

Perhaps the COMPLEV or COMPGED Function can be of help. These compute 'distances' between strings.

novinosrin
Tourmaline | Level 20
data have;
length name $50;
name='smith john';
output;
name='John smith';
output;
name='Mcdonald John';
output;
name='John Mcdonald';
output;
run;

data t;
set have;
array t(50) $1 ;
call pokelong(compress(upcase(name)),addrlong(t(1)),50);
call sortc(of t(*));
w=cats(of t(*));
drop t:;
run;
proc sort data=t out=want(drop=w) nodupkey;
by w;
run; 
novinosrin
Tourmaline | Level 20

With temporary array,

 

data t;
set have;
array t(50) $1 _temporary_;
call missing(of t(*));
call pokelong(compress(upcase(name)),addrlong(t(1)),50);
call sortc(of t(*));
w=cats(of t(*));
run;
proc sort data=t out=want(drop=w) nodupkey;
by w;
run; 
jfaruqui
Obsidian | Level 7
would this work even if the duplicates were far apart within the dataset ?? separated by many unique observations ?

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 24 replies
  • 3371 views
  • 7 likes
  • 6 in conversation