BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Junyong
Pyrite | Level 9

For a SAS data set like

data i;
	array i(5000) i1-i5000;
	do j=1 to 5000;
		do k=1 to 5000;
			i(k)=rannor(1);
		end;
		output;
	end;
	drop j k;
run;

one may be able to subset via KEEP and WHERE as follows.

data i1;
	set i;
	where i1>0 & i2>0;
	keep i1-i1000;
run;

The DATA above will apply WHERE first but read and write all 5,000 variables before applying KEEP. There may be three more versions to do this as follows.

data i2;
	set i(where=(i1>0 & i2>0));
	keep i1-i1000;
run;

data i3;
	set i(keep=i1-i1000);
	where i1>0 & i2>0;
run;

data i4;
	set i(keep=i1-i1000 where=(i1>0 & i2>0));
run;

1. I thought the second version will be less efficient than the first version because it implicitly takes one more step before the explicit DATA, so I have tried to avoid this unless I need separate WHEREs for multiple sets (for example, before merging multiple data sets).

2. The third version introduces one more implicit step but doesn't read and write all the 5,000 variables, so I think there will be a trade-off but am not sure.

3. The fourth version, like the third version, applies KEEP first and then WHERE but due to a different reason.

Or one can also consider PROC SQL unless the 1,000 variables above.

proc sql;
	create table i5 as
	select i1,i2,i3,i4,i5,i6,i7,i8,i9,i10
	from i where i1>0 & i2>0;
quit;

In this case, PROC SQL may only consider the 10 variables stated so will be useful unless sequential access (such as LAG) is required.

Many documents say IF and WHERE are different, but I am not sure whether there is an important performance difference between using parentheses or not (if there is nothing more to be considered). It seems there is a trade-off between introducing one more implicit step and not rewriting all the unnecessary variables (or the performance may be only marginally different, as the following log shows).

1    data i;
2        array i(5000) i1-i5000;
3        do j=1 to 5000;
4            do k=1 to 5000;
5                i(k)=rannor(1);
6            end;
7            output;
8        end;
9        drop j k;
10   run;

NOTE: The data set WORK.I has 5000 observations and 5000
      variables.
NOTE: DATA statement used (Total process time):
      real time           1.70 seconds
      cpu time            1.68 seconds


11
12   data i1;
13       set i;
14       where i1>0 & i2>0;
15       keep i1-i1000;
16   run;

NOTE: There were 1207 observations read from the data set
      WORK.I.
      WHERE (i1>0) and (i2>0);
NOTE: The data set WORK.I1 has 1207 observations and 1000
      variables.
NOTE: DATA statement used (Total process time):
      real time           0.17 seconds
      cpu time            0.15 seconds


17
18   data i2;
19       set i(where=(i1>0 & i2>0));
20       keep i1-i1000;
21   run;

NOTE: There were 1207 observations read from the data set
      WORK.I.
      WHERE (i1>0) and (i2>0);
NOTE: The data set WORK.I2 has 1207 observations and 1000
      variables.
NOTE: DATA statement used (Total process time):
      real time           0.17 seconds
      cpu time            0.15 seconds


22
23   data i3;
24       set i(keep=i1-i1000);
25       where i1>0 & i2>0;
26   run;

NOTE: There were 1207 observations read from the data set
      WORK.I.
      WHERE (i1>0) and (i2>0);
NOTE: The data set WORK.I3 has 1207 observations and 1000
      variables.
NOTE: DATA statement used (Total process time):
      real time           0.17 seconds
      cpu time            0.17 seconds


27
28   data i4;
29       set i(keep=i1-i1000 where=(i1>0 & i2>0));
30   run;

NOTE: There were 1207 observations read from the data set
      WORK.I.
      WHERE (i1>0) and (i2>0);
NOTE: The data set WORK.I4 has 1207 observations and 1000
      variables.
NOTE: DATA statement used (Total process time):
      real time           0.15 seconds
      cpu time            0.15 seconds


31   proc sql;
32       create table i5 as
33       select i1,i2,i3,i4,i5,i6,i7,i8,i9,i10
34       from i where i1>0 & i2>0;
NOTE: Table WORK.I5 created, with 1207 rows and 10 columns.

35   quit;
NOTE: PROCEDURE SQL used (Total process time):
      real time           0.17 seconds
      cpu time            0.15 seconds

Thanks for all your help.

1 ACCEPTED SOLUTION

Accepted Solutions
Kurt_Bremser
Super User

The deciding factor in a pure sequential processing of a dataset is always the performance of your storage. Reading and writing take most of the time, the CPU outperforms storage by orders of magnitude.

Given the structure of SAS datasets, all your steps need to read the whole dataset physically, so you will find no significant difference between all those steps.

 

Your question is not about parentheses, it is about the difference of the WHERE and KEEP statements vs. the WHERE= and KEEP= dataset options.

The WHERE statement and dataset option are applied to the input; the statement globally to all datasets read, while the dataset option is applied individually to each dataset.

The KEEP dataset option is applied to the input, but the KEEP statement works on the output. This means that the steps with the KEEP statement will have a larger PDV. Once again, given today's performance of CPU and RAM (especially the cache memory contained in the CPU chip), this is insignificant.

View solution in original post

6 REPLIES 6
PeterClemmensen
Tourmaline | Level 20

I'm not sure what the question is here? 🙂

Junyong
Pyrite | Level 9

My apologies—in short, I wonder if there is an important performance gain or loss in the following four different approaches (ceteris paribus).

data i1;
	set i;
	where i1>0 & i2>0;
	keep i1-i1000;
run;

data i2;
	set i(where=(i1>0 & i2>0));
	keep i1-i1000;
run;

data i3;
	set i(keep=i1-i1000);
	where i1>0 & i2>0;
run;

data i4;
	set i(keep=i1-i1000 where=(i1>0 & i2>0));
run;

It seems I1 rewrites all the 5,000 variables before KEEPing while I3 doesn't, but I3 takes two steps while I1 does only one.

PeterClemmensen
Tourmaline | Level 20

The answer is: In general: It depends. 

 

First off, there is no difference between the Where Statement and the Where= Data Set Option. Both are applied to the input data. 

 

There is a difference between the Keep Statement and the Keep= Data Set Option. The Keep= Data Set Option is applied to the input data. So in i2, all the variables are read in, while in i3 only i1-i1000 are read.

 

In your example: No difference. Since you already have only i1-i1000 in your input data, there will be no significant performance difference between the data steps.

 

However, try throwing another 1000 variables (like z1-z1000) in there and increase the number of observations. Then, you'll see a performance difference 🙂

Kurt_Bremser
Super User

The deciding factor in a pure sequential processing of a dataset is always the performance of your storage. Reading and writing take most of the time, the CPU outperforms storage by orders of magnitude.

Given the structure of SAS datasets, all your steps need to read the whole dataset physically, so you will find no significant difference between all those steps.

 

Your question is not about parentheses, it is about the difference of the WHERE and KEEP statements vs. the WHERE= and KEEP= dataset options.

The WHERE statement and dataset option are applied to the input; the statement globally to all datasets read, while the dataset option is applied individually to each dataset.

The KEEP dataset option is applied to the input, but the KEEP statement works on the output. This means that the steps with the KEEP statement will have a larger PDV. Once again, given today's performance of CPU and RAM (especially the cache memory contained in the CPU chip), this is insignificant.

Shmuel
Garnet | Level 18

As much as I know,

- WHERE saves CPU time - dealing with less data

- KEEP saves I/O on writing the output and saves CPU on both input and output

- First submitting may need more time for setup preparations

- SQL may need more time for setup then a data step

Kurt_Bremser
Super User

Also keep in mind that your test data and SQL steps follow the creation of the dataset immediately. This means that the dataset will still be contained in the system's file cache, so no physical reads from storage have to be done; this explains the near-zero difference between real and CPU time.

sas-innovate-2024.png

Don't miss out on SAS Innovate - Register now for the FREE Livestream!

Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 6 replies
  • 1028 views
  • 1 like
  • 4 in conversation