reduce output data

GeorgeSAS · Posted 12-09-2019 11:41 AM

Hello everyone,

I want to remove duplicates and save only the first 10 duplicates to a new dataset. Is that possible?

proc sort data=have nodupkey out=nodup dupout= dups10(outobs=10 ) ;
by id1 id2;
run;
*outobs=10 option does not work;

Thanks!

DarthPathos · Posted 12-09-2019 11:50 AM

Hi @GeorgeSAS

You can do it using PROC SQL; the WHERE clause can hold whatever criteria you want applied. You can also do it in a DATA step; I'm just more comfortable using SQL.

proc sql outobs=10;
select distinct table_name.variable_name
from table_name
where ….;
quit;

Chris

Tom · Posted 12-09-2019 11:57 AM

Sounds like you want to do this:

proc sort data=have out=step1;
  by id1 id2;
run;
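data nodup dups10;
   set step1;
   by id1 id2;
   /* first row of each id1/id2 group goes to NODUP */
   if first.id2 then output nodup;
   /* any later row in the group is a duplicate; keep only the first 10 overall */
   else if _ndups < 10 then do;
      _ndups+1;   /* sum statement: _ndups is retained and starts at 0 */
      output dups10;
   end;
   drop _ndups;
run;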
proc delete data=step1; 
run;

SuryaKiran · Posted 12-09-2019 03:30 PM

Hello,

You can't use OBS= on the output table being created; OBS= is valid only when an existing SAS data set is read. OUTOBS= is only used in PROC SQL.

This might work.

proc sort data=sashelp.class out=nodups dupout=dups nodupkey;
by sex;
run;

data dups;
set dups(obs=10);
run;
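
If the goal is only to look at the first 10 duplicates rather than store them, OBS= can also be applied wherever DUPS is read by another procedure. A minimal sketch, assuming the DUPS data set written by the PROC SORT step above exists:

```sas
/* OBS= is an input data set option, so it is honored where DUPS is read */
proc print data=dups(obs=10);
run;
```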

Thanks,
Suryakiran

mkeintz · Posted 12-09-2019 04:48 PM

As you've discovered, you can't add an option to restrict the number of observations written to DUPS10 (or, in general, to any output dataset). Of course you could run a second data step, as @SuryaKiran demonstrated. That technique gives the first 10 duplicates in SORTED order.

There is a way to do the complete task in a single data step, using two hash objects: one (named SRTED below) yields a sorted copy of HAVE with no duplicate keys, and the second (NAM) provides exactly the number of duplicates you want:

data have;
  set sashelp.class;
  ran=ranuni(012498105);
  output;
  ran=ranuni(049810444);
  output;
  ran=ranuni(0259866);
  output;
run;
proc sort data=have out=have (drop=ran);
  by ran;
run;

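data duplicates (drop=_:);
  set have;
  if _n_=1 then do;
    /* SRTED: loaded from HAVE, one item per NAME, written out in ascending key order */
    declare hash srted(dataset:'have',ordered:'A');
      srted.definekey('name');
      srted.definedata(all:'Y');
      srted.definedone();
      srted.output(dataset:'have_sorted');
    /* NAM: starts empty and records each NAME the first time it is read */
    declare hash nam();
      nam.definekey('name');
      nam.definedone();
  end;
  if nam.find()^=0 then nam.add();   /* first time this NAME is seen */
  else do;                           /* NAME seen before: a duplicate */
    output;
    _ndupes+1;
    if _ndupes>=10 then stop;        /* stop after 10 duplicates */
  end;
run;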

The resulting data set HAVE_SORTED will have only one record per name, because that is the default behavior of SAS hash objects. It keeps the record containing the first instance of each sort key, exactly as the NODUPKEY option in PROC SORT does.

The NAM hash is used to track whether an incoming record (from the original HAVE dataset) has been encountered before. If it has, then it is a duplicate, to be output to DUPLICATES and counted in _NDUPES. Note that these duplicates will not be in sorted order, since they are based on processing the original unsorted HAVE.
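
One way to sanity-check both outputs, sketched on the assumption that the step above has run and that HAVE_SORTED and DUPLICATES exist (the column aliases are only illustrative):

```sas
proc sql;
  /* one row per name: the two counts should match */
  select count(*) as n_rows, count(distinct name) as n_names
    from have_sorted;
  /* never more than 10, given the STOP in the step above */
  select count(*) as n_dups
    from duplicates;
quit;
```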
