BookmarkSubscribeRSS Feed
robm
Quartz | Level 8

I have a data step like this that creates a 4gb file , unfortunately it has duplicates that would reduce it too 150,000kb is there an equivalent of “distinct” for the sas syntax below.

 

 

 

  data ROBM.LIMITFILE1;
    set _egimle.&timeperiodvalue (keep=&limitvar1field &limitvar1fielddesc &limitvar2field &limitvar2fielddesc &limitvar3field &limitvar3fielddesc &limitvar4field
                                        &limitvar4fielddesc &limitvar5field &limitvar5fielddesc &limitvar6field &limitvar6fielddesc &limitvar7field &limitvar7fielddesc
                                        &limitvar8field &limitvar8fielddesc filter=("&trendvalue" AND "&timeperiodfilter"));

 

9 REPLIES 9
Kurt_Bremser
Super User

Filtering duplicates always involves sorting by the criteria that define the duplicates.

You can either use proc sort with nodupkey, or proc sql with select distinct.

proc sort can be more efficient, as select distinct is equivalent to the noduprec option in proc sort.

If your source dataset is already sorted by the criteria mentioned above, you can use by processing and first. or last. to only accept a single observation per by group.

robm
Quartz | Level 8

yeah I was afraid of that , I currently am using proc sql; with a select distinct to filter them out

Reeza
Super User

Why do you have duplicates in the first place? Seems like it's worth backing up a few steps and removing them there?

robm
Quartz | Level 8

Hi Reza

 

the original dataset has unique data by the fact that the measure has a $value for classes a student took , I am trying to make a "LIMIT" dataset in the work. area on the fly for this web based stored process so it needs to just get the filter fields in a unique form. But when I use the data step method i have no way of filtering the duopes unles i use the proc sql; or sort ...does that make sense?

Reeza
Super User

It depends, what's the LIMIT data set for? If it's to create some LIMITs I assume you'd be taking some stats on the $ value, which would lead to PROC MEANS or such.  

 

But otherwise, that does seem like the only way. PROC SORT and a data step can identify duplicates but other PROCS are easier to type out so that's a fair approach IMO.

Kurt_Bremser
Super User

BTW there is no filter= dataset option in SAS.

And compare this code with yours:

data ROBM.LIMITFILE1;
set _egimle.&timeperiodvalue (
  keep=
    &limitvar1field &limitvar1fielddesc
    &limitvar2field &limitvar2fielddesc
    &limitvar3field &limitvar3fielddesc
    &limitvar4field &limitvar4fielddesc
    &limitvar5field &limitvar5fielddesc 
    &limitvar6field &limitvar6fielddesc
    &limitvar7field &limitvar7fielddesc
    &limitvar8field &limitvar8fielddesc
  where=(trend="&trendvalue" AND time="&timeperiodfilter")
);

which is more readable and therefore easier to maintain?

robm
Quartz | Level 8

You are right it does

 

thanks Kurt

RLigtenberg
SAS Employee

You could consider a hash object to eliminate duplicates in a data step.

 

%let keep=make model type origin msrp cylinders;
data robm.limitfile1;
  if _N_=1 then do;
    if 0 then set sashelp.cars(keep=&keep);
    declare hash H(
       dataset:"sashelp.cars(keep=&keep where=(cylinders=6))",
       ordered:"ascending");
	H.definekey("make");
	H.definedata("make","model","type","origin","msrp","cylinders");
	H.definedone();
	H.output(dataset:"work.limit");
  end;
run;

The hash object rejects duplicate keys by default. Leave off the ordered parameter if order does not matter in the result.

 

RLigtenberg
SAS Employee

And if you need to do additional processing on each observation (after the "filter") you can use a hiter (hash iterator) object.

 

%let keep=make model type origin msrp cylinders;
data limitfile1;
  length tax 8.;
  if _N_=1 then do;
    if 0 then set sashelp.cars(keep=&keep);
    declare hash H(
       dataset:"sashelp.cars(keep=&keep where=(cylinders=6))");
	H.definekey("make");
	H.definedata("make","model","type","origin","msrp","cylinders");
	H.definedone();
	declare hiter I("H");
  end;
  rc=0;
  do while(rc=0);
    rc=I.next();
	tax=msrp*.35;
	output;
  end;
  stop;
run;

 

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to connect to databases in SAS Viya

Need to connect to databases in SAS Viya? SAS’ David Ghan shows you two methods – via SAS/ACCESS LIBNAME and SAS Data Connector SASLIBS – in this video.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 9 replies
  • 1564 views
  • 0 likes
  • 4 in conversation