Re: eliminating duplicates in a data step

robm · Posted 10-19-2017 02:01 AM

I have a data step like this that creates a 4gb file , unfortunately it has duplicates that would reduce it too 150,000kb is there an equivalent of “distinct” for the sas syntax below.

  data ROBM.LIMITFILE1;
    set _egimle.&timeperiodvalue (keep=&limitvar1field &limitvar1fielddesc &limitvar2field &limitvar2fielddesc &limitvar3field &limitvar3fielddesc &limitvar4field
                                        &limitvar4fielddesc &limitvar5field &limitvar5fielddesc &limitvar6field &limitvar6fielddesc &limitvar7field &limitvar7fielddesc
                                        &limitvar8field &limitvar8fielddesc filter=("&trendvalue" AND "&timeperiodfilter"));

Kurt_Bremser · Posted 10-19-2017 03:05 AM

Filtering duplicates always involves sorting by the criteria that define the duplicates.

You can either use proc sort with nodupkey, or proc sql with select distinct.

proc sort can be more efficient, as select distinct is equivalent to the noduprec option in proc sort.

If your source dataset is already sorted by the criteria mentioned above, you can use by processing and first. or last. to only accept a single observation per by group.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

robm · Posted 10-19-2017 09:18 AM

yeah I was afraid of that , I currently am using proc sql; with a select distinct to filter them out

Reeza · Posted 10-19-2017 10:52 AM

Why do you have duplicates in the first place? Seems like it's worth backing up a few steps and removing them there?

robm · Posted 10-19-2017 11:37 AM

Hi Reza

the original dataset has unique data by the fact that the measure has a $value for classes a student took , I am trying to make a "LIMIT" dataset in the work. area on the fly for this web based stored process so it needs to just get the filter fields in a unique form. But when I use the data step method i have no way of filtering the duopes unles i use the proc sql; or sort ...does that make sense?

Reeza · Posted 10-19-2017 11:41 AM

It depends, what's the LIMIT data set for? If it's to create some LIMITs I assume you'd be taking some stats on the $ value, which would lead to PROC MEANS or such.

But otherwise, that does seem like the only way. PROC SORT and a data step can identify duplicates but other PROCS are easier to type out so that's a fair approach IMO.

Kurt_Bremser · Posted 10-19-2017 03:09 AM

BTW there is no filter= dataset option in SAS.

And compare this code with yours:

data ROBM.LIMITFILE1;
set _egimle.&timeperiodvalue (
  keep=
    &limitvar1field &limitvar1fielddesc
    &limitvar2field &limitvar2fielddesc
    &limitvar3field &limitvar3fielddesc
    &limitvar4field &limitvar4fielddesc
    &limitvar5field &limitvar5fielddesc 
    &limitvar6field &limitvar6fielddesc
    &limitvar7field &limitvar7fielddesc
    &limitvar8field &limitvar8fielddesc
  where=(trend="&trendvalue" AND time="&timeperiodfilter")
);

which is more readable and therefore easier to maintain?

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

robm · Posted 10-19-2017 09:17 AM

You are right it does

thanks Kurt

RLigtenberg · Posted 10-19-2017 05:29 PM

You could consider a hash object to eliminate duplicates in a data step.

%let keep=make model type origin msrp cylinders;
data robm.limitfile1;
  if _N_=1 then do;
    if 0 then set sashelp.cars(keep=&keep);
    declare hash H(
       dataset:"sashelp.cars(keep=&keep where=(cylinders=6))",
       ordered:"ascending");
	H.definekey("make");
	H.definedata("make","model","type","origin","msrp","cylinders");
	H.definedone();
	H.output(dataset:"work.limit");
  end;
run;

The hash object rejects duplicate keys by default. Leave off the ordered parameter if order does not matter in the result.

RLigtenberg · Posted 10-19-2017 08:41 PM

And if you need to do additional processing on each observation (after the "filter") you can use a hiter (hash iterator) object.

%let keep=make model type origin msrp cylinders;
data limitfile1;
  length tax 8.;
  if _N_=1 then do;
    if 0 then set sashelp.cars(keep=&keep);
    declare hash H(
       dataset:"sashelp.cars(keep=&keep where=(cylinders=6))");
	H.definekey("make");
	H.definedata("make","model","type","origin","msrp","cylinders");
	H.definedone();
	declare hiter I("H");
  end;
  rc=0;
  do while(rc=0);
    rc=I.next();
	tax=msrp*.35;
	output;
  end;
  stop;
run;