topic Re: elimination of duplicate rows in SAS Data Science

elimination of duplicate rows

Cyndia — Fri, 20 Apr 2012 17:20:10 GMT

what is the best way to eliminate the duplicate records when there is a dataset like this:

firstname middle lastname ID age school city

Debbie R Popular 12 13 xyz NY

Debbie R Popular 21 13 xyz NY

Debbie R Popular 12 13 xyz NY

Deb R Popular 12 13 xyz NY

note that only first and third records are duplicate, others are not the same. we only want to eliminate either the first or the third record in this case.

there is a distinct( ) in SQL, in SAS, is there a way to eliminate only the records that are duplicate in all columns? thanks.

Re: elimination of duplicate rows

Linlin — Fri, 20 Apr 2012 17:28:35 GMT

example:

data have;

input a b c;

cards;

1 2 3

3 2 1

1 2 3

;

proc sql;

create table want as

select distinct * from have;

quit;

proc print ;run;

Re: elimination of duplicate rows

Haikuo — Fri, 20 Apr 2012 17:28:50 GMT

Use nodupkey in proc sort:

data have;

input (firstname middle lastname ID) (:$10.) age (school city) (:$10.);

cards;

Debbie R Popular 12 13 xyz NY

Debbie R Popular 21 13 xyz NY

Debbie R Popular 12 13 xyz NY

Deb R Popular 12 13 xyz NY

;

proc sort data=have out=want nodupkey;

by firstname middle lastname ID age school city;

run;

proc print;run;

07:05 Friday, April 20, 2012 67

Obs firstname middle lastname ID age school city

1 Deb R Popular 12 13 xyz NY

2 Debbie R Popular 12 13 xyz NY

3 Debbie R Popular 21 13 xyz NY

Haikuo

Update: please be aware that in this case, 'nodup' option will not work, as the duplicated records are not adjacent.

Re: elimination of duplicate rows

Haikuo — Fri, 20 Apr 2012 17:33:39 GMT

Or just using data step:

data have;

input (firstname middle lastname ID) (:$10.) age (school city) (:$10.);

cards;

Debbie R Popular 12 13 xyz NY

Debbie R Popular 21 13 xyz NY

Debbie R Popular 12 13 xyz NY

Deb R Popular 12 13 xyz NY

;

data _null_;

if 0 then set have;

dcl hash h(dataset:'have');

h.definekey(all:'y');

h.definedata(all:'y');

h.definedone();

h.output(dataset: 'want');

run;

proc print;run;

07:05 Friday, April 20, 2012 68

Obs firstname middle lastname ID age school city

1 Debbie R Popular 12 13 xyz NY

2 Deb R Popular 12 13 xyz NY

3 Debbie R Popular 21 13 xyz NY

Haikuo

Re: elimination of duplicate rows

MikeZdeb — Fri, 20 Apr 2012 20:14:16 GMT

hi ...

re: "Update: please be aware that in this case, 'nodup' option will not work, as the duplicated records are not adjacent"

if you sort by all the variables, NODUP and NODUPKEY give the same result

also, if you want to sort by all the variables, you can use _ALL_ ...

proc sort data=have out=want nodup;

by _all_;

run;