SAS Programming

Ronein

Hello

Lets say I have a big data set (25 million rows and 30 columns) in a permanent library and I want to select specific rows by criteria.

I just used sashelp.class for example.

From these 4 ways which way is best (Highest run speed) and which is worst(lowest run speed)?

How would you rank these 4 methods by speed?(best to worst)?

Is there any other way to do it?

If I add Index then the 4 ways run more quickly?

Note- in real life data set is much much bigger and sashelp.class is only an example

proc sql;
create table t1 as
select *
from  sashelp.class
where sex='F'
;
quit;

Data t2;
set sashelp.class(Where=( sex='F'));
Run;

Data t3(Where=(sex='F'));
set sashelp.class;
Run;

proc append data=sashelp.class(Where=( sex='F')) base=t4;quit;

Patrick

You could run a test script in your environment to get an idea.

options fullstimer compress=no;

data work.have;
  array vars{30} $100 (30*'AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA');
  do i=1 to 1000000;
    if rand('integer',1,10)=1 then sex='F';
    else sex='M';
    output;
  end;
  stop;
run;

%macro test(compress=no);
  options compress=&compress;
  proc datasets lib=work nowarn nolist;
    delete t1 t2 t3 t4;
  quit;
  
  proc sql;
    create table t1 as select * from work.have where sex='F';
  quit;
  
  Data t2;
    set work.have(Where=(sex='F'));
  Run;
  
  Data t3(Where=(sex='F'));
    set work.have;
  Run;
  
  proc append data=work.have(Where=(sex='F')) base=t4;
  quit;
%mend;
%test();
%test();
%test(compress=yes);
%test(compress=yes);

I would expect that you'll get for all 4 methods comparable results.
Your ways 1, 2 and 4 subset the data when reading from source, your way 3 only subsets the data when writing to target.

Way 3 will process all rows from source. If you don't need them (i.e. for aggregations before you write to target) then this way 3 is sub-optimal and depending on what you do in the data step would consume more time than the other 3 methods.

Ronein

I expect that WAY4 (t4) be the best, Am I right?

Patrick

@Ronein wrote:

I expect that WAY4 (t4) be the best, Am I right?

I initially thought the same because of the way Proc Append writes data. But then running the performance test script in my environment didn't confirm the theory. May be that's due to the fact that Proc Append needs to read the source data by row for sub-setting and though can't just copy blocks from source to target.

Run the test script in your environment and compare the run times.

yabwon

is the data set sorted by that variable you want to filter out on?

Bart

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation

Kurt_Bremser

Maxim 4: Try It.

But from experience I prefer option 2, either with the WHERE= dataset option or a WHERE statement.

Unless you read from an external database, where you might want (depending on data structure) to use explicit passthrough first.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

SAS Programming

SET data set-Run speed

Re: SET data set-Run speed

Re: SET data set-Run speed

Re: SET data set-Run speed

Re: SET data set-Run speed

Re: SET data set-Run speed

[SAS 활용 FAQ] DATA SET 결합 - 세로결합

Speed dialing: Connecting SAS Enterprise Guide to a containerized SAS ...

Run Proc Report only if there are observations in a data set

SET multiuple data sets

merge two data sets with one table at a time

Follow Us

What is...

SAS Programming

Register Today!

SAS Training: Just a Click Away

Follow Us

What is...