what is the best way to eliminate the duplicate records when there is a dataset like this:
firstname middle lastname ID age school city
Debbie R Popular 12 13 xyz NY
Debbie R Popular 21 13 xyz NY
Debbie R Popular 12 13 xyz NY
Deb R Popular 12 13 xyz NY
note that only first and third records are duplicate, others are not the same. we only want to eliminate either the first or the third record in this case.
there is a distinct( ) in SQL, in SAS, is there a way to eliminate only the records that are duplicate in all columns? thanks.
example:
data have;
input a b c;
cards;
1 2 3
3 2 1
1 2 3
;
proc sql;
create table want as
select distinct * from have;
quit;
proc print ;run;
example:
data have;
input a b c;
cards;
1 2 3
3 2 1
1 2 3
;
proc sql;
create table want as
select distinct * from have;
quit;
proc print ;run;
Use nodupkey in proc sort:
data have;
input (firstname middle lastname ID) (:$10.) age (school city) (:$10.);
cards;
Debbie R Popular 12 13 xyz NY
Debbie R Popular 21 13 xyz NY
Debbie R Popular 12 13 xyz NY
Deb R Popular 12 13 xyz NY
;
proc sort data=have out=want nodupkey;
by firstname middle lastname ID age school city;
run;
proc print;run;
07:05 Friday, April 20, 2012 67
Obs firstname middle lastname ID age school city
1 Deb R Popular 12 13 xyz NY
2 Debbie R Popular 12 13 xyz NY
3 Debbie R Popular 21 13 xyz NY
Haikuo
Update: please be aware that in this case, 'nodup' option will not work, as the duplicated records are not adjacent.
Or just using data step:
data have;
input (firstname middle lastname ID) (:$10.) age (school city) (:$10.);
cards;
Debbie R Popular 12 13 xyz NY
Debbie R Popular 21 13 xyz NY
Debbie R Popular 12 13 xyz NY
Deb R Popular 12 13 xyz NY
;
data _null_;
if 0 then set have;
dcl hash h(dataset:'have');
h.definekey(all:'y');
h.definedata(all:'y');
h.definedone();
h.output(dataset: 'want');
run;
proc print;run;
07:05 Friday, April 20, 2012 68
Obs firstname middle lastname ID age school city
1 Debbie R Popular 12 13 xyz NY
2 Deb R Popular 12 13 xyz NY
3 Debbie R Popular 21 13 xyz NY
Haikuo
hi ...
re: "Update: please be aware that in this case, 'nodup' option will not work, as the duplicated records are not adjacent"
if you sort by all the variables, NODUP and NODUPKEY give the same result
also, if you want to sort by all the variables, you can use _ALL_ ...
proc sort data=have out=want nodup;
by _all_;
run;
Available on demand!
Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.
Use this tutorial as a handy guide to weigh the pros and cons of these commonly used machine learning algorithms.
Find more tutorials on the SAS Users YouTube channel.