For a SAS data set like
data i;
array i(5000) i1-i5000;
do j=1 to 5000;
do k=1 to 5000;
i(k)=rannor(1);
end;
output;
end;
drop j k;
run;
one may be able to subset via KEEP and WHERE as follows.
data i1;
set i;
where i1>0 & i2>0;
keep i1-i1000;
run;
The DATA above will apply WHERE first but read and write all 5,000 variables before applying KEEP. There may be three more versions to do this as follows.
data i2;
set i(where=(i1>0 & i2>0));
keep i1-i1000;
run;
data i3;
set i(keep=i1-i1000);
where i1>0 & i2>0;
run;
data i4;
set i(keep=i1-i1000 where=(i1>0 & i2>0));
run;
1. I thought the second version will be less efficient than the first version because it implicitly takes one more step before the explicit DATA, so I have tried to avoid this unless I need separate WHEREs for multiple sets (for example, before merging multiple data sets).
2. The third version introduces one more implicit step but doesn't read and write all the 5,000 variables, so I think there will be a trade-off but am not sure.
3. The fourth version, like the third version, applies KEEP first and then WHERE but due to a different reason.
Or one can also consider PROC SQL unless the 1,000 variables above.
proc sql;
create table i5 as
select i1,i2,i3,i4,i5,i6,i7,i8,i9,i10
from i where i1>0 & i2>0;
quit;
In this case, PROC SQL may only consider the 10 variables stated so will be useful unless sequential access (such as LAG) is required.
Many documents say IF and WHERE are different, but I am not sure whether there is an important performance difference between using parentheses or not (if there is nothing more to be considered). It seems there is a trade-off between introducing one more implicit step and not rewriting all the unnecessary variables (or the performance may be only marginally different, as the following log shows).
1 data i;
2 array i(5000) i1-i5000;
3 do j=1 to 5000;
4 do k=1 to 5000;
5 i(k)=rannor(1);
6 end;
7 output;
8 end;
9 drop j k;
10 run;
NOTE: The data set WORK.I has 5000 observations and 5000
variables.
NOTE: DATA statement used (Total process time):
real time 1.70 seconds
cpu time 1.68 seconds
11
12 data i1;
13 set i;
14 where i1>0 & i2>0;
15 keep i1-i1000;
16 run;
NOTE: There were 1207 observations read from the data set
WORK.I.
WHERE (i1>0) and (i2>0);
NOTE: The data set WORK.I1 has 1207 observations and 1000
variables.
NOTE: DATA statement used (Total process time):
real time 0.17 seconds
cpu time 0.15 seconds
17
18 data i2;
19 set i(where=(i1>0 & i2>0));
20 keep i1-i1000;
21 run;
NOTE: There were 1207 observations read from the data set
WORK.I.
WHERE (i1>0) and (i2>0);
NOTE: The data set WORK.I2 has 1207 observations and 1000
variables.
NOTE: DATA statement used (Total process time):
real time 0.17 seconds
cpu time 0.15 seconds
22
23 data i3;
24 set i(keep=i1-i1000);
25 where i1>0 & i2>0;
26 run;
NOTE: There were 1207 observations read from the data set
WORK.I.
WHERE (i1>0) and (i2>0);
NOTE: The data set WORK.I3 has 1207 observations and 1000
variables.
NOTE: DATA statement used (Total process time):
real time 0.17 seconds
cpu time 0.17 seconds
27
28 data i4;
29 set i(keep=i1-i1000 where=(i1>0 & i2>0));
30 run;
NOTE: There were 1207 observations read from the data set
WORK.I.
WHERE (i1>0) and (i2>0);
NOTE: The data set WORK.I4 has 1207 observations and 1000
variables.
NOTE: DATA statement used (Total process time):
real time 0.15 seconds
cpu time 0.15 seconds
31 proc sql;
32 create table i5 as
33 select i1,i2,i3,i4,i5,i6,i7,i8,i9,i10
34 from i where i1>0 & i2>0;
NOTE: Table WORK.I5 created, with 1207 rows and 10 columns.
35 quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 0.17 seconds
cpu time 0.15 seconds
Thanks for all your help.
The deciding factor in a pure sequential processing of a dataset is always the performance of your storage. Reading and writing take most of the time, the CPU outperforms storage by orders of magnitude.
Given the structure of SAS datasets, all your steps need to read the whole dataset physically, so you will find no significant difference between all those steps.
Your question is not about parentheses, it is about the difference of the WHERE and KEEP statements vs. the WHERE= and KEEP= dataset options.
The WHERE statement and dataset option are applied to the input; the statement globally to all datasets read, while the dataset option is applied individually to each dataset.
The KEEP dataset option is applied to the input, but the KEEP statement works on the output. This means that the steps with the KEEP statement will have a larger PDV. Once again, given today's performance of CPU and RAM (especially the cache memory contained in the CPU chip), this is insignificant.
I'm not sure what the question is here? 🙂
My apologies—in short, I wonder if there is an important performance gain or loss in the following four different approaches (ceteris paribus).
data i1;
set i;
where i1>0 & i2>0;
keep i1-i1000;
run;
data i2;
set i(where=(i1>0 & i2>0));
keep i1-i1000;
run;
data i3;
set i(keep=i1-i1000);
where i1>0 & i2>0;
run;
data i4;
set i(keep=i1-i1000 where=(i1>0 & i2>0));
run;
It seems I1 rewrites all the 5,000 variables before KEEPing while I3 doesn't, but I3 takes two steps while I1 does only one.
The answer is: In general: It depends.
First off, there is no difference between the Where Statement and the Where= Data Set Option. Both are applied to the input data.
There is a difference between the Keep Statement and the Keep= Data Set Option. The Keep= Data Set Option is applied to the input data. So in i2, all the variables are read in, while in i3 only i1-i1000 are read.
In your example: No difference. Since you already have only i1-i1000 in your input data, there will be no significant performance difference between the data steps.
However, try throwing another 1000 variables (like z1-z1000) in there and increase the number of observations. Then, you'll see a performance difference 🙂
The deciding factor in a pure sequential processing of a dataset is always the performance of your storage. Reading and writing take most of the time, the CPU outperforms storage by orders of magnitude.
Given the structure of SAS datasets, all your steps need to read the whole dataset physically, so you will find no significant difference between all those steps.
Your question is not about parentheses, it is about the difference of the WHERE and KEEP statements vs. the WHERE= and KEEP= dataset options.
The WHERE statement and dataset option are applied to the input; the statement globally to all datasets read, while the dataset option is applied individually to each dataset.
The KEEP dataset option is applied to the input, but the KEEP statement works on the output. This means that the steps with the KEEP statement will have a larger PDV. Once again, given today's performance of CPU and RAM (especially the cache memory contained in the CPU chip), this is insignificant.
As much as I know,
- WHERE saves CPU time - dealing with less data
- KEEP saves I/O on writing the output and saves CPU on both input and output
- First submitting may need more time for setup preparations
- SQL may need more time for setup then a data step
Also keep in mind that your test data and SQL steps follow the creation of the dataset immediately. This means that the dataset will still be contained in the system's file cache, so no physical reads from storage have to be done; this explains the near-zero difference between real and CPU time.
Don't miss out on SAS Innovate - Register now for the FREE Livestream!
Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.