Hello everyone,
I want to remove duplicates and save only the first 10 duplicates to a new dataset. Is that possible?
proc sort data=have nodupkey out=nodup dupout= dups10(outobs=10 ) ;
by id1 id2;
run;
*outobs=10 option does not work;
Thanks!
Hi @GeorgeSAS
You can do it using PROC SQL; the WHERE clause will hold whatever criteria you want applied. You could also do it in a DATA step; I'm just more comfortable using SQL.
proc sql outobs=10;
select distinct table_name.variable_name
from table_name
where ….;
quit;
Chris
Sounds like you want to do this:
proc sort data=have out=step1;
by id1 id2;
run;
data nodup dups10;
  set step1;
  by id1 id2;
  if first.id2 then output nodup;   /* first record of each id1/id2 group */
  else if _ndups < 10 then do;      /* keep only the first 10 duplicates overall */
    _ndups+1;
    output dups10;
  end;
  drop _ndups;
run;
proc delete data=step1;
run;
Hello,
You can't use OBS= on the output table being created; OBS= is valid only when an existing SAS data set is read. OUTOBS= is used only in PROC SQL.
This might work.
proc sort data=sashelp.class out=nodups dupout=dups nodupkey;
by sex;
run;
data dups;
set dups(obs=10);
run;
As you've discovered, you can't put a parameter in to restrict the count of observations in DUPS10 (or generally for any output dataset). Of course you could run a second data step, as @SuryaKiran demonstrated. That technique will give the first SORTED 10.
There is a way to do the complete task in a single data step using two hash objects: one (named SRTED below) yields a sorted copy of HAVE with no duplicate keys, and the second (NAM) is used to collect exactly the number of duplicates you want:
data have;
set sashelp.class;
ran=ranuni(012498105);
output;
ran=ranuni(049810444);
output;
ran=ranuni(0259866);
output;
run;
proc sort data=have out=have (drop=ran);
by ran;
run;
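data duplicates (drop=_:);
  set have;
  if _n_=1 then do;
    /* SRTED: loads HAVE, keeps the first record per NAME, writes it out in ascending NAME order */
    declare hash srted(dataset:'have',ordered:'A');
    srted.definekey('name');
    srted.definedata(all:'Y');
    srted.definedone();
    srted.output(dataset:'have_sorted');
    /* NAM: tracks the names already encountered */
    declare hash nam();
    nam.definekey('name');
    nam.definedone();
  end;
  /* First time a name is seen: remember it. Otherwise the record is a duplicate. */
  if nam.find()^=0 then nam.add();
  else do;
    output;
    _ndupes+1;
    if _ndupes>=10 then stop;
  end;
run;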
The resulting data set HAVE_SORTED will have only one record per name, because that is the default behavior of SAS hash objects. It will contain the first instance of each sort key, exactly as the NODUPKEY option in PROC SORT would give.
The NAM hash is used to track whether an incoming record (from the original HAVE dataset) has been encountered before. If it has, then it is a duplicate, to be output to DUPLICATES and counted in _NDUPES. Note that these duplicates will not be in sorted order, since they come from processing the original unsorted HAVE.
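For the id1/id2 keys in the original question, the same two-hash idea might look like the sketch below. It assumes HAVE contains variables named id1 and id2. The output names NODUP and DUPS10 come from the question, while the hash name SEEN is only an illustrative choice. It uses the CHECK method rather than FIND, since only the presence of the key matters here. Treat it as an untested sketch rather than a drop-in solution.

```sas
data dups10 (drop=_ndupes);
  set have;
  if _n_ = 1 then do;
    /* SRTED: one record per id1/id2 combination (first one encountered), */
    /* written to NODUP in ascending key order                            */
    declare hash srted(dataset:'have', ordered:'A');
    srted.definekey('id1', 'id2');
    srted.definedata(all:'Y');
    srted.definedone();
    srted.output(dataset:'nodup');

    /* SEEN (hypothetical name): key combinations already read */
    declare hash seen();
    seen.definekey('id1', 'id2');
    seen.definedone();
  end;

  /* New key combination: remember it. Otherwise it is a duplicate. */
  if seen.check() ne 0 then seen.add();
  else do;
    output;                        /* goes to DUPS10 */
    _ndupes + 1;
    if _ndupes >= 10 then stop;    /* first 10 duplicates in HAVE's original order */
  end;
run;
```

As with the NAME example, DUPS10 holds the first 10 duplicates in the order they appear in HAVE, not in sorted key order.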