topic Re: eliminating duplicates in a data step in SAS Data Management

eliminating duplicates in a data step

robm — Thu, 19 Oct 2017 06:01:53 GMT

I have a data step like this that creates a 4gb file , unfortunately it has duplicates that would reduce it too 150,000kb is there an equivalent of “distinct” for the sas syntax below.

  data ROBM.LIMITFILE1;
    set _egimle.&timeperiodvalue (keep=&limitvar1field &limitvar1fielddesc &limitvar2field &limitvar2fielddesc &limitvar3field &limitvar3fielddesc &limitvar4field
                                        &limitvar4fielddesc &limitvar5field &limitvar5fielddesc &limitvar6field &limitvar6fielddesc &limitvar7field &limitvar7fielddesc
                                        &limitvar8field &limitvar8fielddesc filter=("&trendvalue" AND "&timeperiodfilter"));

Re: eliminating duplicates in a data step

Kurt_Bremser — Thu, 19 Oct 2017 07:05:58 GMT

Filtering duplicates always involves sorting by the criteria that define the duplicates.

You can either use proc sort with nodupkey, or proc sql with select distinct.

proc sort can be more efficient, as select distinct is equivalent to the noduprec option in proc sort.

If your source dataset is already sorted by the criteria mentioned above, you can use by processing and first. or last. to only accept a single observation per by group.

Re: eliminating duplicates in a data step

Kurt_Bremser — Thu, 19 Oct 2017 07:09:05 GMT

BTW there is no filter= dataset option in SAS.

And compare this code with yours:

data ROBM.LIMITFILE1;
set _egimle.&timeperiodvalue (
  keep=
    &limitvar1field &limitvar1fielddesc
    &limitvar2field &limitvar2fielddesc
    &limitvar3field &limitvar3fielddesc
    &limitvar4field &limitvar4fielddesc
    &limitvar5field &limitvar5fielddesc 
    &limitvar6field &limitvar6fielddesc
    &limitvar7field &limitvar7fielddesc
    &limitvar8field &limitvar8fielddesc
  where=(trend="&trendvalue" AND time="&timeperiodfilter")
);

which is more readable and therefore easier to maintain?

Re: eliminating duplicates in a data step

robm — Thu, 19 Oct 2017 13:17:05 GMT

You are right it does

thanks Kurt

Re: eliminating duplicates in a data step

robm — Thu, 19 Oct 2017 13:18:11 GMT

yeah I was afraid of that , I currently am using proc sql; with a select distinct to filter them out

Re: eliminating duplicates in a data step

Reeza — Thu, 19 Oct 2017 14:52:14 GMT

Why do you have duplicates in the first place? Seems like it's worth backing up a few steps and removing them there?

Re: eliminating duplicates in a data step

robm — Thu, 19 Oct 2017 15:37:52 GMT

Hi Reza

the original dataset has unique data by the fact that the measure has a $value for classes a student took , I am trying to make a "LIMIT" dataset in the work. area on the fly for this web based stored process so it needs to just get the filter fields in a unique form. But when I use the data step method i have no way of filtering the duopes unles i use the proc sql; or sort ...does that make sense?

Re: eliminating duplicates in a data step

Reeza — Thu, 19 Oct 2017 15:41:47 GMT

It depends, what's the LIMIT data set for? If it's to create some LIMITs I assume you'd be taking some stats on the $ value, which would lead to PROC MEANS or such.

But otherwise, that does seem like the only way. PROC SORT and a data step can identify duplicates but other PROCS are easier to type out so that's a fair approach IMO.

Re: eliminating duplicates in a data step

RLigtenberg — Thu, 19 Oct 2017 21:29:44 GMT

You could consider a hash object to eliminate duplicates in a data step.

%let keep=make model type origin msrp cylinders;
data robm.limitfile1;
  if _N_=1 then do;
    if 0 then set sashelp.cars(keep=&keep);
    declare hash H(
       dataset:"sashelp.cars(keep=&keep where=(cylinders=6))",
       ordered:"ascending");
	H.definekey("make");
	H.definedata("make","model","type","origin","msrp","cylinders");
	H.definedone();
	H.output(dataset:"work.limit");
  end;
run;

The hash object rejects duplicate keys by default. Leave off the ordered parameter if order does not matter in the result.

Re: eliminating duplicates in a data step

RLigtenberg — Fri, 20 Oct 2017 00:41:54 GMT

And if you need to do additional processing on each observation (after the "filter") you can use a hiter (hash iterator) object.

%let keep=make model type origin msrp cylinders;
data limitfile1;
  length tax 8.;
  if _N_=1 then do;
    if 0 then set sashelp.cars(keep=&keep);
    declare hash H(
       dataset:"sashelp.cars(keep=&keep where=(cylinders=6))");
	H.definekey("make");
	H.definedata("make","model","type","origin","msrp","cylinders");
	H.definedone();
	declare hiter I("H");
  end;
  rc=0;
  do while(rc=0);
    rc=I.next();
	tax=msrp*.35;
	output;
  end;
  stop;
run;