reduce output data

GeorgeSAS — Mon, 09 Dec 2019 16:41:23 GMT

Hello everyone,

I want to remove duplicates and save only the first 10 duplicates to a new dataset. Is that possible?

proc sort data=have nodupkey out=nodup dupout=dups10(outobs=10);
  by id1 id2;
run;
*outobs=10 option does not work;

Thanks!

Re: reduce output data

DarthPathos — Mon, 09 Dec 2019 16:50:15 GMT

Hi @GeorgeSAS

You can do it using PROC SQL; the WHERE clause holds whatever criteria you want applied. You could also do it in a DATA step, I'm just more comfortable using SQL.

proc sql outobs=10;
select distinct table_name.variable_name
from table_name
where ….;
quit;

Chris

Re: reduce output data

Tom — Mon, 09 Dec 2019 18:17:07 GMT

Sounds like you want to do this:

proc sort data=have out=step1;
  by id1 id2;
run;
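data nodup dups10;
  set step1;
  by id1 id2;
  if first.id2 then output nodup;   /* first row of each id1/id2 group */
  else if _ndups < 10 then do;      /* later rows are duplicates; keep the first 10 overall */
    _ndups+1;                       /* sum statement: retained across rows, starts at 0 */
    output dups10;
  end;
  drop _ndups;
run;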
proc delete data=step1; 
run;

Re: reduce output data

SuryaKiran — Mon, 09 Dec 2019 20:30:22 GMT

Hello,

You can't use OBS= on the output table being created; OBS= is valid only when an existing SAS data set is read. OUTOBS= is used only in PROC SQL.

This might work.

proc sort data=sashelp.class out=nodups dupout=dups nodupkey;
  by sex;
run;

data dups;
  set dups(obs=10);
run;

Re: reduce output data

mkeintz — Mon, 09 Dec 2019 22:23:00 GMT

As you've discovered, you can't add an option that restricts the number of observations written to DUPS10 (or to any output dataset in general). Of course, you could run a second DATA step, as @SuryaKiran demonstrated. That technique gives the first 10 duplicates in SORTED order.

There is a way to do the complete task in a single DATA step using two hash objects: one (named SRTED below) yields a sorted copy of HAVE with no duplicate keys, and the second (NAM) lets you stop at exactly the number of duplicates you want:

data have;
  set sashelp.class;
  ran=ranuni(012498105);
  output;
  ran=ranuni(049810444);
  output;
  ran=ranuni(0259866);
  output;
run;
proc sort data=have out=have (drop=ran);
  by ran;
run;

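data duplicates (drop=_:);
  set have;
  if _n_=1 then do;
    /* SRTED: loads HAVE in key order, one item per NAME, and writes HAVE_SORTED */
    declare hash srted(dataset:'have',ordered:'A');
      srted.definekey('name');
      srted.definedata(all:'Y');
      srted.definedone();
      srted.output(dataset:'have_sorted');
    /* NAM: remembers which NAME values have already been read */
    declare hash nam();
      nam.definekey('name');
      nam.definedone();
  end;
  if nam.find()^=0 then nam.add();   /* first time this NAME is seen */
  else do;                           /* NAME seen before: this row is a duplicate */
    output;
    _ndupes+1;
    if _ndupes>=10 then stop;        /* stop after 10 duplicates */
  end;
run;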

The resulting data set HAVE_SORTED will have only one record per name, because by default a SAS hash object keeps only one item per key. It will contain the first instance of each sort key, just as the NODUPKEY option in PROC SORT would.

The NAM hash is used to track whether an incoming record (from the original HAVE dataset) has been encountered before. If it has, then it is a duplicate, to be output to DUPLICATES and counted in _NDUPES. Note that these duplicates will not be in sorted order, since they come from processing the original unsorted HAVE.
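
A quick sanity check on the result can be sketched as follows. This is only an illustration: it assumes the DATA step above has already run, so that HAVE, HAVE_SORTED and DUPLICATES all exist and NAME is the key variable. The column aliases (n_sorted, n_names, n_dups) are made up for the example.

```sas
/* Sketch only: assumes HAVE, HAVE_SORTED and DUPLICATES exist from the step above. */
proc sql;
  /* one row per NAME is expected in HAVE_SORTED ... */
  select count(*) as n_sorted from have_sorted;
  /* ... so this count should match the one above */
  select count(distinct name) as n_names from have;
  /* should be no more than 10, since the step stops after the 10th duplicate */
  select count(*) as n_dups from duplicates;
quit;
```

If the first two counts agree, HAVE_SORTED holds exactly one record per key. The third count can be lower than 10 only when HAVE contains fewer than 10 duplicate records.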