BookmarkSubscribeRSS Feed
Ronein
Onyx | Level 15

Hello

Lets say I have a big data set (25 million  rows and 30 columns) in a permanent library and I want to select specific rows by criteria.

I just used sashelp.class for example.

From these 4 ways which way is best (Highest run speed) and which is worst(lowest run speed)?

How would you rank these 4 methods by speed?(best to worst)?

Is there any other way to do it?

If I add Index then the 4 ways run more quickly?

Note- in real life data set is much much bigger and sashelp.class is only an example

 

proc sql;
create table t1 as
select *
from  sashelp.class
where sex='F'
;
quit;

Data t2;
set sashelp.class(Where=( sex='F'));
Run;

Data t3(Where=(sex='F'));
set sashelp.class;
Run;

proc append data=sashelp.class(Where=( sex='F')) base=t4;quit;

 

5 REPLIES 5
Patrick
Opal | Level 21

You could run a test script in your environment to get an idea.

options fullstimer compress=no;

data work.have;
  array vars{30} $100 (30*'AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA');
  do i=1 to 1000000;
    if rand('integer',1,10)=1 then sex='F';
    else sex='M';
    output;
  end;
  stop;
run;

%macro test(compress=no);
  options compress=&compress;
  proc datasets lib=work nowarn nolist;
    delete t1 t2 t3 t4;
  quit;
  
  proc sql;
    create table t1 as select * from work.have where sex='F';
  quit;
  
  Data t2;
    set work.have(Where=(sex='F'));
  Run;
  
  Data t3(Where=(sex='F'));
    set work.have;
  Run;
  
  proc append data=work.have(Where=(sex='F')) base=t4;
  quit;
%mend;
%test();
%test();
%test(compress=yes);
%test(compress=yes);

I would expect that you'll get for all 4 methods comparable results. 
Your ways 1, 2 and 4 subset the data when reading from source, your way 3 only subsets the data when writing to target.

Way 3 will process all rows from source. If you don't need them (i.e. for aggregations before you write to target) then this way 3 is sub-optimal and depending on what you do in the data step would consume more time than the other 3 methods.

Ronein
Onyx | Level 15

I expect that WAY4 (t4) be the best, Am I right?

Patrick
Opal | Level 21

@Ronein wrote:

I expect that WAY4 (t4) be the best, Am I right?


I initially thought the same because of the way Proc Append writes data. But then running the performance test script in my environment didn't confirm the theory. May be that's due to the fact that Proc Append needs to read the source data by row for sub-setting and though can't just copy blocks from source to target.

 

Run the test script in your environment and compare the run times.

yabwon
Amethyst | Level 16

is the data set sorted by that variable you want to filter out on?

 

Bart

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation



Kurt_Bremser
Super User

Maxim 4: Try It.

 

But from experience I prefer option 2, either with the WHERE= dataset option or a WHERE statement.

Unless you read from an external database, where you might want (depending on data structure) to use explicit passthrough first.

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 5 replies
  • 890 views
  • 1 like
  • 4 in conversation