what is the best way to eliminate the duplicate records when there is a dataset like this:
firstname middle lastname ID age school city
Debbie R Popular 12 13 xyz NY
Debbie R Popular 21 13 xyz NY
Debbie R Popular 12 13 xyz NY
Deb R Popular 12 13 xyz NY
note that only first and third records are duplicate, others are not the same. we only want to eliminate either the first or the third record in this case.
there is a distinct( ) in SQL, in SAS, is there a way to eliminate only the records that are duplicate in all columns? thanks.
example:
data have;
input a b c;
cards;
1 2 3
3 2 1
1 2 3
;
proc sql;
create table want as
select distinct * from have;
quit;
proc print ;run;
example:
data have;
input a b c;
cards;
1 2 3
3 2 1
1 2 3
;
proc sql;
create table want as
select distinct * from have;
quit;
proc print ;run;
Use nodupkey in proc sort:
data have;
input (firstname middle lastname ID) (:$10.) age (school city) (:$10.);
cards;
Debbie R Popular 12 13 xyz NY
Debbie R Popular 21 13 xyz NY
Debbie R Popular 12 13 xyz NY
Deb R Popular 12 13 xyz NY
;
proc sort data=have out=want nodupkey;
by firstname middle lastname ID age school city;
run;
proc print;run;
07:05 Friday, April 20, 2012 67
Obs firstname middle lastname ID age school city
1 Deb R Popular 12 13 xyz NY
2 Debbie R Popular 12 13 xyz NY
3 Debbie R Popular 21 13 xyz NY
Haikuo
Update: please be aware that in this case, 'nodup' option will not work, as the duplicated records are not adjacent.
Or just using data step:
data have;
input (firstname middle lastname ID) (:$10.) age (school city) (:$10.);
cards;
Debbie R Popular 12 13 xyz NY
Debbie R Popular 21 13 xyz NY
Debbie R Popular 12 13 xyz NY
Deb R Popular 12 13 xyz NY
;
data _null_;
if 0 then set have;
dcl hash h(dataset:'have');
h.definekey(all:'y');
h.definedata(all:'y');
h.definedone();
h.output(dataset: 'want');
run;
proc print;run;
07:05 Friday, April 20, 2012 68
Obs firstname middle lastname ID age school city
1 Debbie R Popular 12 13 xyz NY
2 Deb R Popular 12 13 xyz NY
3 Debbie R Popular 21 13 xyz NY
Haikuo
hi ...
re: "Update: please be aware that in this case, 'nodup' option will not work, as the duplicated records are not adjacent"
if you sort by all the variables, NODUP and NODUPKEY give the same result
also, if you want to sort by all the variables, you can use _ALL_ ...
proc sort data=have out=want nodup;
by _all_;
run;
Don't miss out on SAS Innovate - Register now for the FREE Livestream!
Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.
Use this tutorial as a handy guide to weigh the pros and cons of these commonly used machine learning algorithms.
Find more tutorials on the SAS Users YouTube channel.