<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic DATA Step: Subsetting Inside Versus Outside Parentheses in SAS Programming</title>
    <link>https://communities.sas.com/t5/SAS-Programming/DATA-Step-Subsetting-Inside-Versus-Outside-Parentheses/m-p/711928#M219384</link>
    <description>&lt;P&gt;For a SAS data set like&lt;/P&gt;&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data i;
	array i(5000) i1-i5000;
	do j=1 to 5000;
		do k=1 to 5000;
			i(k)=rannor(1);
		end;
		output;
	end;
	drop j k;
run;&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;one may be able to subset via KEEP and WHERE as follows.&lt;/P&gt;&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data i1;
	set i;
	where i1&amp;gt;0 &amp;amp; i2&amp;gt;0;
	keep i1-i1000;
run;&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;The DATA above will apply WHERE first but read and write all 5,000 variables before applying KEEP. There may be three more versions to do this as follows.&lt;/P&gt;&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data i2;
	set i(where=(i1&amp;gt;0 &amp;amp; i2&amp;gt;0));
	keep i1-i1000;
run;

data i3;
	set i(keep=i1-i1000);
	where i1&amp;gt;0 &amp;amp; i2&amp;gt;0;
run;

data i4;
	set i(keep=i1-i1000 where=(i1&amp;gt;0 &amp;amp; i2&amp;gt;0));
run;&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;1. I thought the second version will be less efficient than the first version because it implicitly takes one more step before the explicit DATA, so I have tried to avoid this unless I need separate WHEREs for multiple sets (for example, before merging multiple data sets).&lt;/P&gt;&lt;P&gt;2. The third version introduces one more implicit step but doesn't read and write all the 5,000 variables, so I think there will be a trade-off but am not sure.&lt;/P&gt;&lt;P&gt;3. The fourth version, like the third version, applies KEEP first and then WHERE but due to a different reason.&lt;/P&gt;&lt;P&gt;Or one can also consider PROC SQL unless the 1,000 variables above.&lt;/P&gt;&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;proc sql;
	create table i5 as
	select i1,i2,i3,i4,i5,i6,i7,i8,i9,i10
	from i where i1&amp;gt;0 &amp;amp; i2&amp;gt;0;
quit;&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;In this case, PROC SQL may only consider the 10 variables stated so will be useful unless sequential access (such as LAG) is required.&lt;/P&gt;&lt;P&gt;Many documents say IF and WHERE are different, but &lt;STRONG&gt;I am not sure whether there is an important performance difference between using parentheses or not (if there is nothing more to be considered).&lt;/STRONG&gt; It seems there is a trade-off between introducing one more implicit step and not rewriting all the unnecessary variables (or the performance may be only marginally different, as the following log shows).&lt;/P&gt;&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;1    data i;
2        array i(5000) i1-i5000;
3        do j=1 to 5000;
4            do k=1 to 5000;
5                i(k)=rannor(1);
6            end;
7            output;
8        end;
9        drop j k;
10   run;

NOTE: The data set WORK.I has 5000 observations and 5000
      variables.
NOTE: DATA statement used (Total process time):
      real time           1.70 seconds
      cpu time            1.68 seconds


11
12   data i1;
13       set i;
14       where i1&amp;gt;0 &amp;amp; i2&amp;gt;0;
15       keep i1-i1000;
16   run;

NOTE: There were 1207 observations read from the data set
      WORK.I.
      WHERE (i1&amp;gt;0) and (i2&amp;gt;0);
NOTE: The data set WORK.I1 has 1207 observations and 1000
      variables.
NOTE: DATA statement used (Total process time):
      real time           0.17 seconds
      cpu time            0.15 seconds


17
18   data i2;
19       set i(where=(i1&amp;gt;0 &amp;amp; i2&amp;gt;0));
20       keep i1-i1000;
21   run;

NOTE: There were 1207 observations read from the data set
      WORK.I.
      WHERE (i1&amp;gt;0) and (i2&amp;gt;0);
NOTE: The data set WORK.I2 has 1207 observations and 1000
      variables.
NOTE: DATA statement used (Total process time):
      real time           0.17 seconds
      cpu time            0.15 seconds


22
23   data i3;
24       set i(keep=i1-i1000);
25       where i1&amp;gt;0 &amp;amp; i2&amp;gt;0;
26   run;

NOTE: There were 1207 observations read from the data set
      WORK.I.
      WHERE (i1&amp;gt;0) and (i2&amp;gt;0);
NOTE: The data set WORK.I3 has 1207 observations and 1000
      variables.
NOTE: DATA statement used (Total process time):
      real time           0.17 seconds
      cpu time            0.17 seconds


27
28   data i4;
29       set i(keep=i1-i1000 where=(i1&amp;gt;0 &amp;amp; i2&amp;gt;0));
30   run;

NOTE: There were 1207 observations read from the data set
      WORK.I.
      WHERE (i1&amp;gt;0) and (i2&amp;gt;0);
NOTE: The data set WORK.I4 has 1207 observations and 1000
      variables.
NOTE: DATA statement used (Total process time):
      real time           0.15 seconds
      cpu time            0.15 seconds


31   proc sql;
32       create table i5 as
33       select i1,i2,i3,i4,i5,i6,i7,i8,i9,i10
34       from i where i1&amp;gt;0 &amp;amp; i2&amp;gt;0;
NOTE: Table WORK.I5 created, with 1207 rows and 10 columns.

35   quit;
NOTE: PROCEDURE SQL used (Total process time):
      real time           0.17 seconds
      cpu time            0.15 seconds&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;Thanks for all your help.&lt;/P&gt;</description>
    <pubDate>Sun, 17 Jan 2021 09:46:28 GMT</pubDate>
    <dc:creator>Junyong</dc:creator>
    <dc:date>2021-01-17T09:46:28Z</dc:date>
    <item>
      <title>DATA Step: Subsetting Inside Versus Outside Parentheses</title>
      <link>https://communities.sas.com/t5/SAS-Programming/DATA-Step-Subsetting-Inside-Versus-Outside-Parentheses/m-p/711928#M219384</link>
      <description>&lt;P&gt;For a SAS data set like&lt;/P&gt;&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data i;
	array i(5000) i1-i5000;
	do j=1 to 5000;
		do k=1 to 5000;
			i(k)=rannor(1);
		end;
		output;
	end;
	drop j k;
run;&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;one may be able to subset via KEEP and WHERE as follows.&lt;/P&gt;&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data i1;
	set i;
	where i1&amp;gt;0 &amp;amp; i2&amp;gt;0;
	keep i1-i1000;
run;&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;The DATA above will apply WHERE first but read and write all 5,000 variables before applying KEEP. There may be three more versions to do this as follows.&lt;/P&gt;&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data i2;
	set i(where=(i1&amp;gt;0 &amp;amp; i2&amp;gt;0));
	keep i1-i1000;
run;

data i3;
	set i(keep=i1-i1000);
	where i1&amp;gt;0 &amp;amp; i2&amp;gt;0;
run;

data i4;
	set i(keep=i1-i1000 where=(i1&amp;gt;0 &amp;amp; i2&amp;gt;0));
run;&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;1. I thought the second version will be less efficient than the first version because it implicitly takes one more step before the explicit DATA, so I have tried to avoid this unless I need separate WHEREs for multiple sets (for example, before merging multiple data sets).&lt;/P&gt;&lt;P&gt;2. The third version introduces one more implicit step but doesn't read and write all the 5,000 variables, so I think there will be a trade-off but am not sure.&lt;/P&gt;&lt;P&gt;3. The fourth version, like the third version, applies KEEP first and then WHERE but due to a different reason.&lt;/P&gt;&lt;P&gt;Or one can also consider PROC SQL unless the 1,000 variables above.&lt;/P&gt;&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;proc sql;
	create table i5 as
	select i1,i2,i3,i4,i5,i6,i7,i8,i9,i10
	from i where i1&amp;gt;0 &amp;amp; i2&amp;gt;0;
quit;&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;In this case, PROC SQL may only consider the 10 variables stated so will be useful unless sequential access (such as LAG) is required.&lt;/P&gt;&lt;P&gt;Many documents say IF and WHERE are different, but &lt;STRONG&gt;I am not sure whether there is an important performance difference between using parentheses or not (if there is nothing more to be considered).&lt;/STRONG&gt; It seems there is a trade-off between introducing one more implicit step and not rewriting all the unnecessary variables (or the performance may be only marginally different, as the following log shows).&lt;/P&gt;&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;1    data i;
2        array i(5000) i1-i5000;
3        do j=1 to 5000;
4            do k=1 to 5000;
5                i(k)=rannor(1);
6            end;
7            output;
8        end;
9        drop j k;
10   run;

NOTE: The data set WORK.I has 5000 observations and 5000
      variables.
NOTE: DATA statement used (Total process time):
      real time           1.70 seconds
      cpu time            1.68 seconds


11
12   data i1;
13       set i;
14       where i1&amp;gt;0 &amp;amp; i2&amp;gt;0;
15       keep i1-i1000;
16   run;

NOTE: There were 1207 observations read from the data set
      WORK.I.
      WHERE (i1&amp;gt;0) and (i2&amp;gt;0);
NOTE: The data set WORK.I1 has 1207 observations and 1000
      variables.
NOTE: DATA statement used (Total process time):
      real time           0.17 seconds
      cpu time            0.15 seconds


17
18   data i2;
19       set i(where=(i1&amp;gt;0 &amp;amp; i2&amp;gt;0));
20       keep i1-i1000;
21   run;

NOTE: There were 1207 observations read from the data set
      WORK.I.
      WHERE (i1&amp;gt;0) and (i2&amp;gt;0);
NOTE: The data set WORK.I2 has 1207 observations and 1000
      variables.
NOTE: DATA statement used (Total process time):
      real time           0.17 seconds
      cpu time            0.15 seconds


22
23   data i3;
24       set i(keep=i1-i1000);
25       where i1&amp;gt;0 &amp;amp; i2&amp;gt;0;
26   run;

NOTE: There were 1207 observations read from the data set
      WORK.I.
      WHERE (i1&amp;gt;0) and (i2&amp;gt;0);
NOTE: The data set WORK.I3 has 1207 observations and 1000
      variables.
NOTE: DATA statement used (Total process time):
      real time           0.17 seconds
      cpu time            0.17 seconds


27
28   data i4;
29       set i(keep=i1-i1000 where=(i1&amp;gt;0 &amp;amp; i2&amp;gt;0));
30   run;

NOTE: There were 1207 observations read from the data set
      WORK.I.
      WHERE (i1&amp;gt;0) and (i2&amp;gt;0);
NOTE: The data set WORK.I4 has 1207 observations and 1000
      variables.
NOTE: DATA statement used (Total process time):
      real time           0.15 seconds
      cpu time            0.15 seconds


31   proc sql;
32       create table i5 as
33       select i1,i2,i3,i4,i5,i6,i7,i8,i9,i10
34       from i where i1&amp;gt;0 &amp;amp; i2&amp;gt;0;
NOTE: Table WORK.I5 created, with 1207 rows and 10 columns.

35   quit;
NOTE: PROCEDURE SQL used (Total process time):
      real time           0.17 seconds
      cpu time            0.15 seconds&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;Thanks for all your help.&lt;/P&gt;</description>
      <pubDate>Sun, 17 Jan 2021 09:46:28 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/DATA-Step-Subsetting-Inside-Versus-Outside-Parentheses/m-p/711928#M219384</guid>
      <dc:creator>Junyong</dc:creator>
      <dc:date>2021-01-17T09:46:28Z</dc:date>
    </item>
    <item>
      <title>Re: DATA Step: Subsetting Inside Versus Outside Parentheses</title>
      <link>https://communities.sas.com/t5/SAS-Programming/DATA-Step-Subsetting-Inside-Versus-Outside-Parentheses/m-p/711929#M219385</link>
      <description>&lt;P&gt;I'm not sure what the question is here? &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Sun, 17 Jan 2021 09:28:25 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/DATA-Step-Subsetting-Inside-Versus-Outside-Parentheses/m-p/711929#M219385</guid>
      <dc:creator>PeterClemmensen</dc:creator>
      <dc:date>2021-01-17T09:28:25Z</dc:date>
    </item>
    <item>
      <title>Re: DATA Step: Subsetting Inside Versus Outside Parentheses</title>
      <link>https://communities.sas.com/t5/SAS-Programming/DATA-Step-Subsetting-Inside-Versus-Outside-Parentheses/m-p/711932#M219388</link>
      <description>&lt;P&gt;My apologies—in short, I wonder if there is an important performance gain or loss in the following four different approaches (ceteris paribus).&lt;/P&gt;&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data i1;
	set i;
	where i1&amp;gt;0 &amp;amp; i2&amp;gt;0;
	keep i1-i1000;
run;

data i2;
	set i(where=(i1&amp;gt;0 &amp;amp; i2&amp;gt;0));
	keep i1-i1000;
run;

data i3;
	set i(keep=i1-i1000);
	where i1&amp;gt;0 &amp;amp; i2&amp;gt;0;
run;

data i4;
	set i(keep=i1-i1000 where=(i1&amp;gt;0 &amp;amp; i2&amp;gt;0));
run;&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;It seems I1 rewrites all the 5,000 variables before KEEPing while I3 doesn't, but I3 takes two steps while I1 does only one.&lt;/P&gt;</description>
      <pubDate>Sun, 17 Jan 2021 09:58:28 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/DATA-Step-Subsetting-Inside-Versus-Outside-Parentheses/m-p/711932#M219388</guid>
      <dc:creator>Junyong</dc:creator>
      <dc:date>2021-01-17T09:58:28Z</dc:date>
    </item>
    <item>
      <title>Re: DATA Step: Subsetting Inside Versus Outside Parentheses</title>
      <link>https://communities.sas.com/t5/SAS-Programming/DATA-Step-Subsetting-Inside-Versus-Outside-Parentheses/m-p/711933#M219389</link>
      <description>&lt;P&gt;The deciding factor in a pure sequential processing of a dataset is always the performance of your storage. Reading and writing take most of the time, the CPU outperforms storage by orders of magnitude.&lt;/P&gt;
&lt;P&gt;Given the structure of SAS datasets, all your steps need to read the whole dataset physically, so you will find no significant difference between all those steps.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Your question is not about parentheses, it is about the difference of the WHERE and KEEP &lt;EM&gt;statements&lt;/EM&gt; vs. the WHERE= and KEEP= &lt;EM&gt;dataset options&lt;/EM&gt;.&lt;/P&gt;
&lt;P&gt;The WHERE statement and dataset option are applied to the input; the &lt;EM&gt;statement&lt;/EM&gt; &lt;EM&gt;globally&lt;/EM&gt; to &lt;EM&gt;all&lt;/EM&gt; datasets read, while the &lt;EM&gt;dataset option&lt;/EM&gt; is applied &lt;EM&gt;individually&lt;/EM&gt; to each dataset.&lt;/P&gt;
&lt;P&gt;The KEEP &lt;EM&gt;dataset option&lt;/EM&gt; is applied to the &lt;EM&gt;input&lt;/EM&gt;, but the KEEP &lt;EM&gt;statement&lt;/EM&gt; works on the &lt;EM&gt;output&lt;/EM&gt;. This means that the steps with the KEEP statement will have a larger PDV. Once again, given today's performance of CPU and RAM (especially the cache memory contained in the CPU chip), this is insignificant.&lt;/P&gt;</description>
      <pubDate>Sun, 17 Jan 2021 10:01:48 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/DATA-Step-Subsetting-Inside-Versus-Outside-Parentheses/m-p/711933#M219389</guid>
      <dc:creator>Kurt_Bremser</dc:creator>
      <dc:date>2021-01-17T10:01:48Z</dc:date>
    </item>
    <item>
      <title>Re: DATA Step: Subsetting Inside Versus Outside Parentheses</title>
      <link>https://communities.sas.com/t5/SAS-Programming/DATA-Step-Subsetting-Inside-Versus-Outside-Parentheses/m-p/711934#M219390</link>
      <description>&lt;P&gt;As much as I know,&lt;/P&gt;
&lt;P&gt;- WHERE saves CPU time - dealing with less data&lt;/P&gt;
&lt;P&gt;- KEEP saves I/O on writing the output and saves CPU on both input and output&lt;/P&gt;
&lt;P&gt;- First submitting may need more time for setup preparations&lt;/P&gt;
&lt;P&gt;- SQL may need more time for setup then a data step&lt;/P&gt;</description>
      <pubDate>Sun, 17 Jan 2021 10:12:03 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/DATA-Step-Subsetting-Inside-Versus-Outside-Parentheses/m-p/711934#M219390</guid>
      <dc:creator>Shmuel</dc:creator>
      <dc:date>2021-01-17T10:12:03Z</dc:date>
    </item>
    <item>
      <title>Re: DATA Step: Subsetting Inside Versus Outside Parentheses</title>
      <link>https://communities.sas.com/t5/SAS-Programming/DATA-Step-Subsetting-Inside-Versus-Outside-Parentheses/m-p/711935#M219391</link>
      <description>&lt;P&gt;The answer is: &lt;STRONG&gt;In general:&lt;/STRONG&gt; It depends.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;First off, there is no difference between the &lt;A href="https://documentation.sas.com/?cdcId=pgmsascdc&amp;amp;cdcVersion=9.4_3.5&amp;amp;docsetId=lestmtsref&amp;amp;docsetTarget=n1xbr9r0s9veq0n137iftzxq4g7e.htm&amp;amp;locale=en" target="_self"&gt;Where Statement&lt;/A&gt; and the &lt;A href="https://documentation.sas.com/?cdcId=pgmsascdc&amp;amp;cdcVersion=9.4_3.5&amp;amp;docsetId=ledsoptsref&amp;amp;docsetTarget=p0ny9o8t8hc5zen1qn3ft9dhtsxx.htm&amp;amp;locale=en" target="_self"&gt;Where= Data Set Option&lt;/A&gt;. Both are applied to the input data.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;There &lt;EM&gt;is&amp;nbsp;&lt;/EM&gt;a difference between the &lt;A href="https://documentation.sas.com/?cdcId=pgmsascdc&amp;amp;cdcVersion=9.4_3.5&amp;amp;docsetId=lestmtsref&amp;amp;docsetTarget=n1nnrzzsw6rzrjn1p2jfky6pdv23.htm&amp;amp;locale=en" target="_self"&gt;Keep Statement&lt;/A&gt; and the &lt;A href="https://documentation.sas.com/?cdcId=pgmsascdc&amp;amp;cdcVersion=9.4_3.5&amp;amp;docsetId=ledsoptsref&amp;amp;docsetTarget=p0vw9lyyxk1cxkn0zzfemrsr3t9a.htm&amp;amp;locale=en" target="_self"&gt;Keep= Data Set Option&lt;/A&gt;. The Keep= Data Set Option is applied to the input data. So in i2, all the variables are read in, while in i3 only i1-i1000 are read.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;In your example:&amp;nbsp;&lt;/STRONG&gt;No difference. Since you already have only i1-i1000 in your input data, there will be no significant performance difference between the data steps.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;However, try throwing another 1000 variables (like z1-z1000) in there and increase the number of observations. Then, you'll see a performance difference &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Sun, 17 Jan 2021 10:19:10 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/DATA-Step-Subsetting-Inside-Versus-Outside-Parentheses/m-p/711935#M219391</guid>
      <dc:creator>PeterClemmensen</dc:creator>
      <dc:date>2021-01-17T10:19:10Z</dc:date>
    </item>
    <item>
      <title>Re: DATA Step: Subsetting Inside Versus Outside Parentheses</title>
      <link>https://communities.sas.com/t5/SAS-Programming/DATA-Step-Subsetting-Inside-Versus-Outside-Parentheses/m-p/711936#M219392</link>
      <description>&lt;P&gt;Also keep in mind that your test data and SQL steps follow the creation of the dataset immediately. This means that the dataset will still be contained in the system's file cache, so no physical reads from storage have to be done; this explains the near-zero difference between real and CPU time.&lt;/P&gt;</description>
      <pubDate>Sun, 17 Jan 2021 10:24:27 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/DATA-Step-Subsetting-Inside-Versus-Outside-Parentheses/m-p/711936#M219392</guid>
      <dc:creator>Kurt_Bremser</dc:creator>
      <dc:date>2021-01-17T10:24:27Z</dc:date>
    </item>
  </channel>
</rss>

