GeorgeSAS
Lapis Lazuli | Level 10

Hello everyone,

 

I want to remove duplicates and save only the first 10 duplicates to a new dataset. Is that possible?

proc sort data=have nodupkey out=nodup dupout= dups10(outobs=10 ) ;
by id1 id2;
run;
*outobs=10 option does not work;

 

Thanks!

4 REPLIES
DarthPathos
Lapis Lazuli | Level 10

Hi @GeorgeSAS 

 

You can do it using PROC SQL; the WHERE clause will hold whatever criteria you want applied. You can also do it in a DATA step; I'm just more comfortable using SQL.

 

proc sql outobs=10;
select distinct table_name.variable_name
from table_name
where ….;
quit;

Chris

Tom
Super User

Sounds like you want to do this:

proc sort data=have out=step1;
  by id1 id2;
run;
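data nodup dups10;
   set step1;
   by id1 id2;
   /* the first record of each id1/id2 group is the one NODUPKEY would keep */
   if first.id2 then output nodup;
   /* every later record is a duplicate; keep only the first 10 overall */
   else if _ndups < 10 then do;
      _ndups+1;   /* sum statement: retained across rows, starts at 0 */
      output dups10;
   end;
   drop _ndups;
run;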
proc delete data=step1; 
run;
SuryaKiran
Meteorite | Level 14

Hello,

 

You can't use OBS= on an output table that is being created; OBS= is valid only when an existing SAS data set is read. OUTOBS= is used only in PROC SQL.

This might work.

proc sort data=sashelp.class out=nodups dupout=dups nodupkey;
by sex;
run;

data dups;
set dups(obs=10);
run;
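
Applied to the keys from the original question, the same two-step idea might look like the sketch below. The names HAVE, NODUP, ID1 and ID2 come from the question; DUPS_ALL is a hypothetical intermediate dataset introduced here only for illustration.

```sas
/* Step 1: split HAVE into unique keys (NODUP) and all duplicates (DUPS_ALL). */
/* DUPS_ALL is an assumed name for the intermediate dataset.                  */
proc sort data=have out=nodup dupout=dups_all nodupkey;
  by id1 id2;
run;

/* Step 2: OBS= is allowed here because DUPS_ALL is being read, not created. */
data dups10;
  set dups_all(obs=10);
run;
```

The 10 rows kept are the first 10 duplicates in sorted key order, not in the original order of HAVE.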

 

 

Thanks,
Suryakiran
mkeintz
PROC Star

As you've discovered, PROC SORT has no option to cap the number of observations written to DUPS10 (or, more generally, to any output dataset). Of course, you could run a second DATA step, as @SuryaKiran demonstrated. That technique gives the first 10 duplicates in SORTED order.

 

 

There is a way to do the complete task in a single DATA step, using two hash objects: one (named SRTED below) yields a sorted copy of HAVE with no duplicate keys, and the second (NAM) provides the exact count of duplicates you want:

 

data have;
  set sashelp.class;
  ran=ranuni(012498105);
  output;
  ran=ranuni(049810444);
  output;
  ran=ranuni(0259866);
  output;
run;
proc sort data=have out=have (drop=ran);
  by ran;
run;

data duplicates (drop=_:);
  set have;
  if _n_=1 then do;
    declare hash srted(dataset:'have',ordered:'A');
      srted.definekey('name');
      srted.definedata(all:'Y');
      srted.definedone();
      srted.output(dataset:'have_sorted');
     declare hash nam();
       nam.definekey('name');
       nam.definedone();
  end;
  if nam.find()^=0 then nam.add();
  else do;
    output;
    _ndupes+1;
    if _ndupes>=10 then stop;
  end;
run;

The resulting data set HAVE_SORTED will have only one record per name, because that is a default property of SAS hash objects. It will hold the record containing the first instance of each sort key, exactly as the NODUPKEY option in PROC SORT does.

 

The NAM hash is used to trace whether an incoming record (from the original HAVE dataset) has been encountered before. If it has, then it is a duplicate, to be output to DUPLICATES and counted in _NDUPES. Note that these duplicates will not be in sorted order, since they come from processing the original unsorted HAVE.
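
One way to check the claim about HAVE_SORTED is to build the NODUPKEY result separately and compare the two. This is only a sketch: CHECK is a hypothetical dataset name, and it assumes PROC SORT runs with its default EQUALS behavior, so the first record of each key is the one retained.

```sas
/* Build the reference result with PROC SORT; CHECK is an assumed name. */
proc sort data=have out=check nodupkey;
  by name;
run;

/* Compare the hash output against the PROC SORT output, row by row. */
proc compare base=have_sorted compare=check;
run;
```

If the two approaches agree, PROC COMPARE should report no unequal values.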

 

 

