Hi,
I am running simple data step with WHERE statement.
Since the source data is enormous(10million rows), it takes time forever.
However, I realized that my computer use only 30% of RAM (out of 16GB), and 7% of CPU.
SASV9.cfg has been modifed but still use the same capacity.
Is there anyway to increase the computer resource use for SAS so that make the process faster?
Thank you.
Since your processing is neither memory or CPU constrained, your conclusion should be that it is IO constrained - your hard drive is maxing out. If you are running this on a PC then installing an SSD (Solid State Drive) and putting your SAS data on that as well as your SAS WORK folder should speed things up.
10 million roes doesn't sound that large, so if it's simple data step processing, I think that your problems are HW related.
If you access a fairly small subset, consider creating an index on your data set.
You could try increase the MEMSIZE SAS system option, but that will probably not make any difference for sequential processing such as data steps.
To look into more detail of your resource consuption, use
options msglevel=i fullstimer;
What does your where clause look like? Separate where statement or data set option?
How many variables are in the data set? Are they all needed? Dropping variables may reduce disk IO somewhat.
There many ways to speed up I/Os.
Storage format (SPDE engine vs V9 engine, compression type) has a huge influence.
An SPDE library with binary compression could speed things up dramatically.
Other issues since you have a where clause are the table sort order and the indexes.
Also, maybe a IF clause would be faster than a WHERE clause. See here:
https://communities.sas.com/t5/Base-SAS-Programming/Using-MERGE-with-LIKE-command/m-p/301654#M63878
Another critical factor is the way you retrieve your data in terms of memory utilisation.
The data set's buffer size, the number of buffers used, whether or not Windows's Direct-IO is used all have a huge effect.
To illustrate this point, here is a benchmark example taken from this book about making SAS faster:
https://www.amazon.com/High-Performance-SAS-Coding
Table 5.5: Run times for various values of BUFSIZE, BUFNO and SGIO for a 10,000 MB data set
10,000,000 kB
|
BUFNO |
|||||
1 |
5 |
25 |
100 |
500 |
||
BUFSIZE |
SGIO |
195.45 |
. |
. |
. |
. |
0 |
no |
|||||
4k |
no |
235.25 |
236.25 |
235.56 |
239.78 |
235.39 |
yes |
1177.55 |
572.32 |
157.28 |
93.60 |
93.96 |
|
8k |
no |
163.77 |
162.05 |
164.22 |
164.05 |
161.34 |
yes |
542.85 |
268.86 |
95.45 |
81.07 |
80.84 |
|
16k |
no |
138.03 |
140.26 |
136.16 |
137.49 |
136.11 |
yes |
265.46 |
142.55 |
75.28 |
74.61 |
81.70 |
|
32k |
no |
143.70 |
148.85 |
137.96 |
147.67 |
143.98 |
yes |
140.56 |
93.84 |
73.35 |
73.33 |
75.48 |
|
64k |
no |
180.47 |
171.39 |
166.63 |
151.69 |
173.16 |
yes |
88.62 |
70.14 |
71.62 |
72.00 |
. |
|
128k |
no |
235.25 |
236.25 |
235.56 |
239.78 |
235.39 |
yes |
70.72 |
71.30 |
71.34 |
70.33 |
. |
The default run time is 195 seconds, which can be reduced to 70s by simply changing the memory usage settings.
It is best for multi-thread PROC DS2. data test; do x=1 to 12316; j=x*x; output; end; run; proc ds2; thread newton/overwrite=yes; dcl double y count; dcl bigint thisThread; drop count; method run(); set test; /* if x=1 ; */ thisThread=_threadid_; count+1; end; method term(); put '**Thread' _threadid_ 'processed' count 'rows:'; end; endthread; run; quit; proc ds2; data want; dcl thread newton frac; method run(); set from frac threads=3; end; enddata; run; quit;
@Ksharp I am pretty sure this would slow down the process considering the data is not spread onto different nodes.
I can't benchmark atm, but if you can...
Chris, No. I can't. This is just intuitive to me . Multiple sessions is best for WHERE / IF situation.
OK, I had time for a quick run.
DS2 improves things slightly because my storage is fast enough (or my CPU cores slow enough) to be CPU-bound for single-table processing.
This doesn't change whether there is a IF/WHERE subset or not, the pattern is the same.
The 10% elapse time reduction happens at a serious CPU cost (in this case, CPU usage doubles).
I still suspect that throwing more CPU at a process that is not CPU-bound would not help, and that DS2 threads would seriously degrade performance on slower storage as the I/Os would be more spread.
My numbers:
options nofullstimer compress=no;
data HAVE;
length A $200;
do I=1 to 1e8;
output;
end;
run;
proc ds2;
thread NEWTON/overwrite=yes;
method run();
set HAVE;
end;
endthread;
run;
quit;
proc ds2;
data WANT;
dcl thread NEWTON FRAC;
method run();
set from FRAC threads=3;
end;
enddata;
run;
quit;
%* real time 27.38 seconds ;
%* cpu time 51.44 seconds ;
data WANT2;
set HAVE;
run;
%* real time 29.31 seconds ;
%* cpu time 29.31 seconds ;
Different (slower, LIN64 instead of WIN64) server, (very) different results, but still no reason to use DS2 for this process.
NOTE: PROCEDURE DS2 used (Total process time):
real time 1:03.27
cpu time 1:05.44
NOTE: DATA statement used (Total process time):
real time 1:05.45
cpu time 34.15 seconds
Chris, You are right. Sad for DS2 .
Hi All,
First of all, thank you so much for your tips and advises.
It was my first time to post my question here.
I am a die-hard DATA step user.
It took me an hour to process very simple Data Step with 1 WHERE statement w several ANDs.
I could not figure it out why the memory usage is so low.
NOTE: DATA statement used (Total process time):
real time 1:01:47.84
user cpu time 27:13.95
system cpu time 1:49.85
memory 1540.78k
OS Memory 31988.00k
This is the time record when I used Proc SQL.
This takes only about 6min which is way quicker than 1 hour with high Memory usage.
NOTE: PROCEDURE SQL used (Total process time):
real time 6:19.32
user cpu time 1:37.65
system cpu time 7.61 seconds
memory 637487.35k
OS Memory 668740.00k
I am not sure what makes this huge real time difference.....
Again, I appreciate for all the replies.
Thank you.
PROC SQL is multi-session oriented PROC, maybe that is reason. Also you can use OPTIONS MSGLEVEL=I ; to check if SQL used index too.
SQL has a slightly different profile than the data step (on the faster windows server).
10% longer elapse and 10% less cpu used.
NOTE: PROCEDURE SQL used (Total process time):
real time 31.58 seconds
cpu time 25.55 seconds
It would be interesting to know more.
Index usage is indeed the first thing to come to mind for such a dramatic improvement, though there is no reason only SQL should use the index.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.