I have a data step like the one below that creates a 4 GB file. Unfortunately it contains duplicates; removing them would reduce it to about 150,000 KB. Is there an equivalent of "distinct" for the SAS syntax below?
data ROBM.LIMITFILE1; set _egimle.&timeperiodvalue (keep=&limitvar1field &limitvar1fielddesc &limitvar2field &limitvar2fielddesc &limitvar3field &limitvar3fielddesc &limitvar4field &limitvar4fielddesc &limitvar5field &limitvar5fielddesc &limitvar6field &limitvar6fielddesc &limitvar7field &limitvar7fielddesc &limitvar8field &limitvar8fielddesc filter=("&trendvalue" AND "&timeperiodfilter"));
Filtering duplicates usually involves sorting by the variables that define the duplicates.
You can either use proc sort with nodupkey, or proc sql with select distinct.
proc sort can be more efficient. Note that select distinct compares entire rows, so it is equivalent to the noduprecs option in proc sort, whereas nodupkey compares only the by variables.
If your source dataset is already sorted by the criteria mentioned above, you can use by processing and first. or last. to only accept a single observation per by group.
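A quick sketch of both approaches, using hypothetical dataset and variable names (have, want, var1, var2 -- substitute your own):

```
/* Option 1: sort and drop duplicate keys in one step */
proc sort data=have out=want nodupkey;
    by var1 var2;
run;

/* Option 2: if HAVE is already sorted by var1 var2,
   keep one observation per by group with first. */
data want;
    set have;
    by var1 var2;
    if first.var2;   /* true only for the first obs of each var1*var2 group */
run;
```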
Yeah, I was afraid of that. I'm currently using proc sql; with a select distinct to filter them out.
Why do you have duplicates in the first place? Seems like it's worth backing up a few steps and removing them there?
Hi Reza
The original dataset is unique by virtue of the measure having a $value for each class a student took. I am trying to build a "LIMIT" dataset in the WORK area on the fly for this web-based stored process, so it needs to contain just the filter fields in unique form. But with the data step method I have no way of filtering the dupes unless I use proc sql; or proc sort. Does that make sense?
It depends, what's the LIMIT data set for? If it's to create some LIMITs I assume you'd be taking some stats on the $ value, which would lead to PROC MEANS or such.
But otherwise, that does seem like the only way. PROC SORT and a data step can identify duplicates, but the other PROCs are easier to type out, so that's a fair approach IMO.
BTW there is no filter= dataset option in SAS.
And compare this code with yours:
data ROBM.LIMITFILE1;
set _egimle.&timeperiodvalue (
keep=
&limitvar1field &limitvar1fielddesc
&limitvar2field &limitvar2fielddesc
&limitvar3field &limitvar3fielddesc
&limitvar4field &limitvar4fielddesc
&limitvar5field &limitvar5fielddesc
&limitvar6field &limitvar6fielddesc
&limitvar7field &limitvar7fielddesc
&limitvar8field &limitvar8fielddesc
where=(trend="&trendvalue" AND time="&timeperiodfilter")
);
which is more readable and therefore easier to maintain?
You're right, it does.
Thanks, Kurt
You could consider a hash object to eliminate duplicates in a data step.
%let keep=make model type origin msrp cylinders;
data _null_;
    /* if 0 then set: defines the variables in the PDV without reading data */
    if 0 then set sashelp.cars(keep=&keep);
    declare hash H(
        dataset:"sashelp.cars(keep=&keep where=(cylinders=6))",
        ordered:"ascending");
    H.definekey("make");
    H.definedata("make","model","type","origin","msrp","cylinders");
    H.definedone();
    /* duplicate keys are rejected while loading, so the output is unique */
    H.output(dataset:"robm.limitfile1");
    stop;
run;
The hash object rejects duplicate keys by default. Leave off the ordered parameter if order does not matter in the result.
And if you need to do additional processing on each observation (after the "filter") you can use a hiter (hash iterator) object.
%let keep=make model type origin msrp cylinders;
data limitfile1;
    length tax 8;
    if _N_=1 then do;
        if 0 then set sashelp.cars(keep=&keep);
        declare hash H(
            dataset:"sashelp.cars(keep=&keep where=(cylinders=6))");
        H.definekey("make");
        H.definedata("make","model","type","origin","msrp","cylinders");
        H.definedone();
        declare hiter I("H");
    end;
    /* walk the de-duplicated hash entries and process each one */
    rc = I.first();
    do while (rc = 0);
        tax = msrp * .35;
        output;
        rc = I.next();
    end;
    stop;
    drop rc;
run;