
eliminating duplicates in a data step

Frequent Contributor
Posts: 84

eliminating duplicates in a data step

I have a data step like this that creates a 4 GB file. Unfortunately it has duplicates; removing them would reduce it to about 150,000 KB. Is there an equivalent of "distinct" for the SAS syntax below?


  data ROBM.LIMITFILE1;
    set _egimle.&timeperiodvalue (keep=&limitvar1field &limitvar1fielddesc &limitvar2field &limitvar2fielddesc &limitvar3field &limitvar3fielddesc &limitvar4field
                                        &limitvar4fielddesc &limitvar5field &limitvar5fielddesc &limitvar6field &limitvar6fielddesc &limitvar7field &limitvar7fielddesc
                                        &limitvar8field &limitvar8fielddesc filter=("&trendvalue" AND "&timeperiodfilter"));

 

Super User
Posts: 9,868

Re: eliminating duplicates in a data step

Filtering duplicates always involves sorting by the criteria that define the duplicates.

You can either use proc sort with nodupkey, or proc sql with select distinct.

proc sort with nodupkey can be more efficient, as select distinct is roughly equivalent to the noduprecs option of proc sort: it compares entire observations rather than just the key variables.

If your source dataset is already sorted by the criteria mentioned above, you can use by processing and first. or last. to only accept a single observation per by group.
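
For illustration, here are minimal sketches of the three approaches, assuming a hypothetical dataset HAVE where the variables ID and DATE define the duplicates:

/* 1) PROC SORT with NODUPKEY keeps the first observation per ID/DATE combination */
proc sort data=have out=want_sort nodupkey;
  by id date;
run;

/* 2) PROC SQL with SELECT DISTINCT removes rows that are identical across the selected variables */
proc sql;
  create table want_sql as
  select distinct id, date
  from have;
quit;

/* 3) BY processing with FIRST. - requires HAVE to already be sorted by ID and DATE */
data want_first;
  set have;
  by id date;
  if first.date; /* keep only the first observation of each BY group */
run;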

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
How to post code
Frequent Contributor
Posts: 84

Re: eliminating duplicates in a data step

Posted in reply to KurtBremser

Yeah, I was afraid of that. I'm currently using PROC SQL with SELECT DISTINCT to filter them out.
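
Something along these lines (just a sketch; TREND and TIME are placeholder names for whatever the real filter variables are called):

proc sql;
  create table ROBM.LIMITFILE1 as
  select distinct
    &limitvar1field, &limitvar1fielddesc,
    &limitvar2field, &limitvar2fielddesc
    /* ... remaining limit variable pairs ... */
  from _egimle.&timeperiodvalue
  where trend="&trendvalue" and time="&timeperiodfilter";
quit;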

Super User
Posts: 23,237

Re: eliminating duplicates in a data step

Why do you have duplicates in the first place? Seems like it's worth backing up a few steps and removing them there?

Frequent Contributor
Posts: 84

Re: eliminating duplicates in a data step

Hi Reza

 

The original dataset is unique because the measure carries a $ value for each class a student took. I'm trying to build a "LIMIT" dataset in the WORK area on the fly for this web-based stored process, so it just needs the filter fields in a unique form. But when I use the data step method I have no way of filtering the dupes unless I use PROC SQL or PROC SORT... does that make sense?

Super User
Posts: 23,237

Re: eliminating duplicates in a data step

It depends: what's the LIMIT data set for? If it's to create some limits, I assume you'd be taking some stats on the $ value, which would lead to PROC MEANS or similar.

 

But otherwise, that does seem like the only way. PROC SORT and a data step can identify duplicates, but the other PROCs are easier to type out, so that's a fair approach IMO.
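
For example, a rough sketch of that idea (STUDENT and CLASS_VALUE are made-up names for the student identifier and the $ value per class):

proc means data=have noprint nway;
  class student;
  var class_value;
  output out=limits sum=total_value max=max_value; /* one summary row per student */
run;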

Super User
Posts: 9,868

Re: eliminating duplicates in a data step

BTW there is no filter= dataset option in SAS.

And compare this code with yours:

data ROBM.LIMITFILE1;
set _egimle.&timeperiodvalue (
  keep=
    &limitvar1field &limitvar1fielddesc
    &limitvar2field &limitvar2fielddesc
    &limitvar3field &limitvar3fielddesc
    &limitvar4field &limitvar4fielddesc
    &limitvar5field &limitvar5fielddesc 
    &limitvar6field &limitvar6fielddesc
    &limitvar7field &limitvar7fielddesc
    &limitvar8field &limitvar8fielddesc
  where=(trend="&trendvalue" AND time="&timeperiodfilter")
);
run;

which is more readable and therefore easier to maintain?

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
How to post code
Frequent Contributor
Posts: 84

Re: eliminating duplicates in a data step

Posted in reply to KurtBremser

You are right, it does.

 

thanks Kurt

SAS Employee
Posts: 17

Re: eliminating duplicates in a data step

You could consider a hash object to eliminate duplicates in a data step.

 

%let keep=make model type origin msrp cylinders;
data _null_;
  /* define the host variables without reading any observations */
  if 0 then set sashelp.cars(keep=&keep);
  declare hash H(
     dataset:"sashelp.cars(keep=&keep where=(cylinders=6))",
     ordered:"ascending");
  H.definekey("make");
  H.definedata("make","model","type","origin","msrp","cylinders");
  H.definedone();
  /* write the de-duplicated hash contents to the target dataset */
  H.output(dataset:"robm.limitfile1");
  stop;
run;

The hash object rejects duplicate keys by default. Leave off the ordered parameter if order does not matter in the result.

 

SAS Employee
Posts: 17

Re: eliminating duplicates in a data step

Posted in reply to RobertLigtenberg

And if you need to do additional processing on each observation (after the "filter") you can use a hiter (hash iterator) object.

 

%let keep=make model type origin msrp cylinders;
data limitfile1;
  length tax 8;
  drop rc;
  /* define the host variables without reading any observations */
  if 0 then set sashelp.cars(keep=&keep);
  declare hash H(
     dataset:"sashelp.cars(keep=&keep where=(cylinders=6))");
  H.definekey("make");
  H.definedata("make","model","type","origin","msrp","cylinders");
  H.definedone();
  declare hiter I("H");
  /* walk the hash entries and process each one before output */
  rc=I.first();
  do while(rc=0);
    tax=msrp*0.35;
    output;
    rc=I.next();
  end;
  stop;
run;

 
