- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hello
Lets say I have a big data set (25 million rows and 30 columns) in a permanent library and I want to select specific rows by criteria.
I just used sashelp.class for example.
From these 4 ways which way is best (Highest run speed) and which is worst(lowest run speed)?
How would you rank these 4 methods by speed?(best to worst)?
Is there any other way to do it?
If I add Index then the 4 ways run more quickly?
Note- in real life data set is much much bigger and sashelp.class is only an example
proc sql;
create table t1 as
select *
from sashelp.class
where sex='F'
;
quit;
Data t2;
set sashelp.class(Where=( sex='F'));
Run;
Data t3(Where=(sex='F'));
set sashelp.class;
Run;
proc append data=sashelp.class(Where=( sex='F')) base=t4;quit;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
You could run a test script in your environment to get an idea.
options fullstimer compress=no;
data work.have;
array vars{30} $100 (30*'AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA');
do i=1 to 1000000;
if rand('integer',1,10)=1 then sex='F';
else sex='M';
output;
end;
stop;
run;
%macro test(compress=no);
options compress=&compress;
proc datasets lib=work nowarn nolist;
delete t1 t2 t3 t4;
quit;
proc sql;
create table t1 as select * from work.have where sex='F';
quit;
Data t2;
set work.have(Where=(sex='F'));
Run;
Data t3(Where=(sex='F'));
set work.have;
Run;
proc append data=work.have(Where=(sex='F')) base=t4;
quit;
%mend;
%test();
%test();
%test(compress=yes);
%test(compress=yes);
I would expect that you'll get for all 4 methods comparable results.
Your ways 1, 2 and 4 subset the data when reading from source, your way 3 only subsets the data when writing to target.
Way 3 will process all rows from source. If you don't need them (i.e. for aggregations before you write to target) then this way 3 is sub-optimal and depending on what you do in the data step would consume more time than the other 3 methods.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I expect that WAY4 (t4) be the best, Am I right?
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
@Ronein wrote:
I expect that WAY4 (t4) be the best, Am I right?
I initially thought the same because of the way Proc Append writes data. But then running the performance test script in my environment didn't confirm the theory. May be that's due to the fact that Proc Append needs to read the source data by row for sub-setting and though can't just copy blocks from source to target.
Run the test script in your environment and compare the run times.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
is the data set sorted by that variable you want to filter out on?
Bart
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug
"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings
SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Maxim 4: Try It.
But from experience I prefer option 2, either with the WHERE= dataset option or a WHERE statement.
Unless you read from an external database, where you might want (depending on data structure) to use explicit passthrough first.