Solved: DATA Step: Subsetting Inside Versus Outside Parentheses

Junyong · Posted 01-17-2021 04:21 AM

For a SAS data set like

data i;
	array i(5000) i1-i5000;
	do j=1 to 5000;
		do k=1 to 5000;
			i(k)=rannor(1);
		end;
		output;
	end;
	drop j k;
run;

one may be able to subset via KEEP and WHERE as follows.

data i1;
	set i;
	where i1>0 & i2>0;
	keep i1-i1000;
run;

The DATA above will apply WHERE first but read and write all 5,000 variables before applying KEEP. There may be three more versions to do this as follows.

data i2;
	set i(where=(i1>0 & i2>0));
	keep i1-i1000;
run;

data i3;
	set i(keep=i1-i1000);
	where i1>0 & i2>0;
run;

data i4;
	set i(keep=i1-i1000 where=(i1>0 & i2>0));
run;

1. I thought the second version will be less efficient than the first version because it implicitly takes one more step before the explicit DATA, so I have tried to avoid this unless I need separate WHEREs for multiple sets (for example, before merging multiple data sets).

2. The third version introduces one more implicit step but doesn't read and write all the 5,000 variables, so I think there will be a trade-off but am not sure.

3. The fourth version, like the third version, applies KEEP first and then WHERE but due to a different reason.

Or one can also consider PROC SQL unless the 1,000 variables above.

proc sql;
	create table i5 as
	select i1,i2,i3,i4,i5,i6,i7,i8,i9,i10
	from i where i1>0 & i2>0;
quit;

In this case, PROC SQL may only consider the 10 variables stated so will be useful unless sequential access (such as LAG) is required.

Many documents say IF and WHERE are different, but I am not sure whether there is an important performance difference between using parentheses or not (if there is nothing more to be considered). It seems there is a trade-off between introducing one more implicit step and not rewriting all the unnecessary variables (or the performance may be only marginally different, as the following log shows).

1    data i;
2        array i(5000) i1-i5000;
3        do j=1 to 5000;
4            do k=1 to 5000;
5                i(k)=rannor(1);
6            end;
7            output;
8        end;
9        drop j k;
10   run;

NOTE: The data set WORK.I has 5000 observations and 5000
      variables.
NOTE: DATA statement used (Total process time):
      real time           1.70 seconds
      cpu time            1.68 seconds


11
12   data i1;
13       set i;
14       where i1>0 & i2>0;
15       keep i1-i1000;
16   run;

NOTE: There were 1207 observations read from the data set
      WORK.I.
      WHERE (i1>0) and (i2>0);
NOTE: The data set WORK.I1 has 1207 observations and 1000
      variables.
NOTE: DATA statement used (Total process time):
      real time           0.17 seconds
      cpu time            0.15 seconds


17
18   data i2;
19       set i(where=(i1>0 & i2>0));
20       keep i1-i1000;
21   run;

NOTE: There were 1207 observations read from the data set
      WORK.I.
      WHERE (i1>0) and (i2>0);
NOTE: The data set WORK.I2 has 1207 observations and 1000
      variables.
NOTE: DATA statement used (Total process time):
      real time           0.17 seconds
      cpu time            0.15 seconds


22
23   data i3;
24       set i(keep=i1-i1000);
25       where i1>0 & i2>0;
26   run;

NOTE: There were 1207 observations read from the data set
      WORK.I.
      WHERE (i1>0) and (i2>0);
NOTE: The data set WORK.I3 has 1207 observations and 1000
      variables.
NOTE: DATA statement used (Total process time):
      real time           0.17 seconds
      cpu time            0.17 seconds


27
28   data i4;
29       set i(keep=i1-i1000 where=(i1>0 & i2>0));
30   run;

NOTE: There were 1207 observations read from the data set
      WORK.I.
      WHERE (i1>0) and (i2>0);
NOTE: The data set WORK.I4 has 1207 observations and 1000
      variables.
NOTE: DATA statement used (Total process time):
      real time           0.15 seconds
      cpu time            0.15 seconds


31   proc sql;
32       create table i5 as
33       select i1,i2,i3,i4,i5,i6,i7,i8,i9,i10
34       from i where i1>0 & i2>0;
NOTE: Table WORK.I5 created, with 1207 rows and 10 columns.

35   quit;
NOTE: PROCEDURE SQL used (Total process time):
      real time           0.17 seconds
      cpu time            0.15 seconds

Thanks for all your help.

Kurt_Bremser · Posted 01-17-2021 05:01 AM

The deciding factor in a pure sequential processing of a dataset is always the performance of your storage. Reading and writing take most of the time, the CPU outperforms storage by orders of magnitude.

Given the structure of SAS datasets, all your steps need to read the whole dataset physically, so you will find no significant difference between all those steps.

Your question is not about parentheses, it is about the difference of the WHERE and KEEP statements vs. the WHERE= and KEEP= dataset options.

The WHERE statement and dataset option are applied to the input; the statement globally to all datasets read, while the dataset option is applied individually to each dataset.

The KEEP dataset option is applied to the input, but the KEEP statement works on the output. This means that the steps with the KEEP statement will have a larger PDV. Once again, given today's performance of CPU and RAM (especially the cache memory contained in the CPU chip), this is insignificant.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

View solution in original post

PeterClemmensen · Posted 01-17-2021 04:28 AM

I'm not sure what the question is here? 🙂

The DATA to DATA Step Macro
Blog: SASnrd

Junyong · Posted 01-17-2021 04:58 AM

My apologies—in short, I wonder if there is an important performance gain or loss in the following four different approaches (ceteris paribus).

data i1;
	set i;
	where i1>0 & i2>0;
	keep i1-i1000;
run;

data i2;
	set i(where=(i1>0 & i2>0));
	keep i1-i1000;
run;

data i3;
	set i(keep=i1-i1000);
	where i1>0 & i2>0;
run;

data i4;
	set i(keep=i1-i1000 where=(i1>0 & i2>0));
run;

It seems I1 rewrites all the 5,000 variables before KEEPing while I3 doesn't, but I3 takes two steps while I1 does only one.

PeterClemmensen · Posted 01-17-2021 05:19 AM

The answer is: In general: It depends.

First off, there is no difference between the Where Statement and the Where= Data Set Option. Both are applied to the input data.

There is a difference between the Keep Statement and the Keep= Data Set Option. The Keep= Data Set Option is applied to the input data. So in i2, all the variables are read in, while in i3 only i1-i1000 are read.

In your example: No difference. Since you already have only i1-i1000 in your input data, there will be no significant performance difference between the data steps.

However, try throwing another 1000 variables (like z1-z1000) in there and increase the number of observations. Then, you'll see a performance difference 🙂

The DATA to DATA Step Macro
Blog: SASnrd

Kurt_Bremser · Posted 01-17-2021 05:01 AM

The deciding factor in a pure sequential processing of a dataset is always the performance of your storage. Reading and writing take most of the time, the CPU outperforms storage by orders of magnitude.

Given the structure of SAS datasets, all your steps need to read the whole dataset physically, so you will find no significant difference between all those steps.

Your question is not about parentheses, it is about the difference of the WHERE and KEEP statements vs. the WHERE= and KEEP= dataset options.

The WHERE statement and dataset option are applied to the input; the statement globally to all datasets read, while the dataset option is applied individually to each dataset.

The KEEP dataset option is applied to the input, but the KEEP statement works on the output. This means that the steps with the KEEP statement will have a larger PDV. Once again, given today's performance of CPU and RAM (especially the cache memory contained in the CPU chip), this is insignificant.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Shmuel · Posted 01-17-2021 05:12 AM

As much as I know,

- WHERE saves CPU time - dealing with less data

- KEEP saves I/O on writing the output and saves CPU on both input and output

- First submitting may need more time for setup preparations

- SQL may need more time for setup then a data step

Kurt_Bremser · Posted 01-17-2021 05:24 AM

Also keep in mind that your test data and SQL steps follow the creation of the dataset immediately. This means that the dataset will still be contained in the system's file cache, so no physical reads from storage have to be done; this explains the near-zero difference between real and CPU time.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

DATA Step: Subsetting Inside Versus Outside Parentheses

Re: DATA Step: Subsetting Inside Versus Outside Parentheses

Re: DATA Step: Subsetting Inside Versus Outside Parentheses

Re: DATA Step: Subsetting Inside Versus Outside Parentheses

Re: DATA Step: Subsetting Inside Versus Outside Parentheses

Re: DATA Step: Subsetting Inside Versus Outside Parentheses

Re: DATA Step: Subsetting Inside Versus Outside Parentheses

Re: DATA Step: Subsetting Inside Versus Outside Parentheses

Registration is open

SAS Training: Just a Click Away