Solved: deleting mixed duplicates

jfaruqui · Posted 11-05-2018 02:13 PM

I have a dataset which has a number of variables including NAME.

I am trying to delete duplicate observations of NAME where one observation would be 'John Smith' and another would be 'smith john'. They are clearly the same person and I want to delete the duplicate entry. What would be the most efficient way to do it?

Consider also that the duplicate names could occur anywhere within the dataset.

Ex:

John Smith

Cal Harper

freddy Holt

smith john

frank waters

harper Cal


novinosrin · Posted 11-05-2018 02:19 PM

I responded to the same question in your other thread:

https://communities.sas.com/t5/SAS-Programming/deleting-duplicates/m-p/510553/highlight/false#M13739...

data have;
input name $50.;
cards;
John Smith
Cal Harper
freddy Holt
smith john
frank waters
harper Cal
;
run;

data t;
set have;
array t(50) $1 _temporary_;
call missing(of t(*));
call pokelong(compress(upcase(name)),addrlong(t(1)),50);
call sortc(of t(*));
w=cats(of t(*));
run;
proc sort data=t out=want(drop=w) nodupkey;
by w;
run;

novinosrin · Posted 11-05-2018 02:22 PM

I don't know how to merge the threads, although I could ask @Reeza / @Kurt_Bremser to help merge the duplicates.

Going forward, kindly post follow-ups in the same thread you started, please.

jfaruqui · Posted 11-05-2018 02:26 PM

Thanks bro, and sorry for the duplicate threads. Unfortunately I can't test your code on my dataset in SAS University Edition, as I am getting these error messages:

ERROR: The function POKELONG cannot be invoked when SAS is in the lockdown state.
ERROR: The function ADDRLONG cannot be invoked when SAS is in the lockdown state.
ERROR 251-185: The subroutine POKELONG is unknown, or cannot be accessed. Check your spelling.
Either it was not found in the path(s) of executable images, or there was incorrect or missing subroutine descriptor
information.

ERROR 68-185: The function ADDRLONG is unknown, or cannot be accessed.

But I am sure this solution would give the required result. Thanks again bro!!

novinosrin · Posted 11-05-2018 02:28 PM

Hang on. If you are new or relatively new to SAS, let alone the APP (ADDR, PEEK, POKE) functions, I beg your pardon: please ignore the APP data management functions.

jfaruqui · Posted 11-05-2018 02:32 PM

So new/relatively new that this is the first time I have heard of APP functions. Fascinating stuff though, reading about it just now.

novinosrin · Posted 11-05-2018 02:33 PM

OK, just try the 32-bit version:

data t;
set have;
array t(50) $1 _temporary_;
call missing(of t(*));
call poke(compress(upcase(name)),addr(t(1)),50);
call sortc(of t(*));
w=cats(of t(*));
run;
proc sort data=t out=want(drop=w) nodupkey;
by w;
run;

Test this and see whether it works.

jfaruqui · Posted 11-05-2018 02:39 PM

Sorry Buddy 🙂
Same errors as before.
I guess SAS University Edition is quite restrictive and doesn't allow access to memory, so POKE won't work either. But I shall persevere. Much obliged and grateful for your time and engagement with my query.

novinosrin · Posted 11-05-2018 02:42 PM

@jfaruqui OK, let's go linear


data have;
input name $50.;
cards;
John Smith
Cal Harper
freddy Holt
smith john
frank waters
harper Cal
;
run;


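data t;
  set have;
  array t(50) $1 _temporary_;   /* one element per character of NAME */
  call missing(of t(*));        /* temporary arrays are retained, so reset on every row */
  n=compress(upcase(name));     /* uppercase and strip blanks */
  do _n_=1 to length(n);
    t(_n_)=char(n,_n_);         /* copy one letter per element */
  end;
  call sortc(of t(*));          /* sort the letters */
  w=cats(of t(*));              /* sorted letters form the dedup key */
run;
/* keep one row per key; w and n are helper variables */
proc sort data=t out=want(drop=w n) nodupkey;
  by w;
run;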
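
A caveat on the sorted-letters key: two different names that happen to be anagrams of each other (for example 'Roland Amy' and 'Arnold May') end up with the same w and would be dropped as duplicates. If that matters for your data, one alternative is to sort whole words instead of single letters. This is only a sketch, not tested on your dataset; the names t2, want2, wd, key and i are made up for illustration, and the limit of 5 words per name is an assumption.

```sas
data t2;
  set have;
  array wd(5) $50 _temporary_;     /* assumption: at most 5 words per name */
  length key $255;
  call missing(of wd(*));          /* reset the retained temporary array */
  do i = 1 to min(countw(name, ' '), dim(wd));
    wd(i) = upcase(scan(name, i, ' '));   /* one uppercased word per element */
  end;
  call sortc(of wd(*));            /* sort the words; blank elements sort first */
  key = catx(' ', of wd(*));       /* CATX skips the blank elements */
  drop i;
run;

proc sort data=t2 out=want2(drop=key) nodupkey;
  by key;
run;
```

With the sample data, 'John Smith' and 'smith john' both get the key JOHN SMITH, while anagram pairs keep distinct keys.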

jfaruqui · Posted 11-05-2018 02:47 PM

BINGO !!
Awesome and thank you so very much !!

jfaruqui · Posted 11-05-2018 01:50 PM

I have a dataset which has a number of variables including NAME.

I am trying to delete duplicate observations of NAME where one observation would be 'John Smith' and another would be 'smith john'. They are clearly the same person and I want to delete the duplicate entry. What would be the most efficient way to do it?

PeterClemmensen · Posted 11-05-2018 01:56 PM

How do you determine a duplicate then? What about "Johnn Smith"?

Perhaps the COMPLEV or COMPGED functions can be of help. These compute 'distances' between strings.
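
As a rough illustration of what those functions report, the step below compares 'John Smith' with the misspelled 'Johnn Smith' (the two differ by one inserted letter). It is a minimal sketch; the variable names a, b, lev and ged are made up for the example. Note that edit distances target misspellings, so they do not by themselves recognize swapped word order such as 'smith john'.

```sas
data _null_;
  a = 'John Smith';
  b = 'Johnn Smith';
  lev = complev(a, b);   /* Levenshtein edit distance */
  ged = compged(a, b);   /* generalized edit distance (cost-weighted) */
  put lev= ged=;         /* write both distances to the log */
run;
```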


novinosrin · Posted 11-05-2018 02:06 PM

data have;
length name $50;
name='smith john';
output;
name='John smith';
output;
name='Mcdonald John';
output;
name='John Mcdonald';
output;
run;

data t;
set have;
array t(50) $1 ;
call pokelong(compress(upcase(name)),addrlong(t(1)),50);
call sortc(of t(*));
w=cats(of t(*));
drop t:;
run;
proc sort data=t out=want(drop=w) nodupkey;
by w;
run;

novinosrin · Posted 11-05-2018 02:15 PM

With a temporary array:

data t;
set have;
array t(50) $1 _temporary_;
call missing(of t(*));
call pokelong(compress(upcase(name)),addrlong(t(1)),50);
call sortc(of t(*));
w=cats(of t(*));
run;
proc sort data=t out=want(drop=w) nodupkey;
by w;
run;

jfaruqui · Posted 11-05-2018 02:19 PM

Would this work even if the duplicates were far apart within the dataset, separated by many unique observations?
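
On the distance question: the code above compares a derived key (w) rather than neighboring rows, and PROC SORT with NODUPKEY sorts the whole dataset by that key before dropping repeats, so how far apart the duplicates sit should not matter. The output does come back in key order, though. If the original row order should be kept, a hash object is one option. This is a sketch only, assuming HAVE with NAME $50 as in the posts above and building the key with CHAR instead of POKELONG; want_hash, seen, letters, i and rc are made-up names.

```sas
data want_hash(drop=w letters i rc);
  set have;
  array t(50) $1 _temporary_;
  length w $50;
  if _n_ = 1 then do;
    declare hash seen();           /* keys that have already been output */
    seen.defineKey('w');
    seen.defineDone();
  end;
  call missing(of t(*));           /* reset the retained temporary array */
  letters = compress(upcase(name));
  do i = 1 to length(letters);
    t(i) = char(letters, i);
  end;
  call sortc(of t(*));
  w = cats(of t(*));               /* same sorted-letters key as above */
  if seen.check() ne 0 then do;    /* nonzero return code: key not seen yet */
    rc = seen.add();
    output;                        /* keep only the first occurrence */
  end;
run;
```

This keeps the first occurrence of each name in its original position and needs no sort.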
