<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: PROC REG Output statement takes too long in Statistical Procedures</title>
    <link>https://communities.sas.com/t5/Statistical-Procedures/PROC-REG-Output-statement-takes-too-long/m-p/330133#M17436</link>
    <description>&lt;P&gt;Even better:&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;output out=&lt;/SPAN&gt;&lt;STRONG&gt;training1(keep=rstud h)&lt;/STRONG&gt;&lt;SPAN&gt; rstudent=rstud h=h ;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;You can merge with the original data later, if necessary.&lt;/P&gt;
&lt;P&gt;There will be extra time needed for the leverage and rstudent computation. &amp;nbsp;If you want to see an apples-to-apples comparison, try using the INFLUENCE option on the first MODEL statement. &amp;nbsp; Then the difference in times will be the I/O.&lt;/P&gt;</description>
    <pubDate>Mon, 06 Feb 2017 10:56:07 GMT</pubDate>
    <dc:creator>Rick_SAS</dc:creator>
    <dc:date>2017-02-06T10:56:07Z</dc:date>
    <item>
      <title>PROC REG Output statement takes too long</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/PROC-REG-Output-statement-takes-too-long/m-p/330072#M17431</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;While running several regression analyses on a 17 GB dataset (approx. 900k observations and 2500 variables) using PROC REG, I noticed a huge difference in run time between the following two programs:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;U&gt;Without OUTPUT statement:&amp;nbsp;This takes approx. 1 minute to run.&lt;/U&gt;&lt;/P&gt;&lt;P&gt;proc reg data=training&amp;nbsp;outest=pars noprint;&lt;BR /&gt;by v1; /*this by-variable has only 2 levels*/&lt;BR /&gt;model vm=&amp;amp;vars /&amp;nbsp;selection=stepwise adjrsq aic include=1 sle=0.001 sls=0.005 ; /*&amp;amp;vars contains the 2500 variables*/&lt;BR /&gt;weight invvar ;&lt;BR /&gt;run;&lt;BR /&gt;quit;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;U&gt;With OUTPUT statement:&amp;nbsp;This, however, takes approx. 1 hour&amp;nbsp;to run.&lt;/U&gt;&lt;/P&gt;&lt;P&gt;proc reg data=training noprint;&lt;BR /&gt;by v1;&amp;nbsp;&lt;SPAN&gt;/*this by-variable has only 2 levels*/&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN&gt;model vm=&amp;amp;vars /&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;selection=stepwise adjrsq aic include=1 sle=0.001 sls=0.005 ; /*&amp;amp;vars contains the 2500 variables*/&lt;/SPAN&gt;&lt;BR /&gt;weight invvar;&lt;BR /&gt;output out=training rstudent=rstud h=h ;&lt;BR /&gt;run;&lt;BR /&gt;quit;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Notice that the only differences are that the first program requests an outest= dataset (pars) but has no OUTPUT statement, while the second includes an OUTPUT statement&amp;nbsp;but does not request the outest= dataset.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Why does just including the OUTPUT statement make such a big difference? Writing data to the out= dataset alone should not&amp;nbsp;account for it, since the data step that builds&amp;nbsp;the 17 GB training dataset takes just 5 minutes to run. 
Could it be the computation of the studentized residuals and leverages that takes so long?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;By the way,&amp;nbsp;how would you shorten the time required to produce such an out= dataset?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thanks!&lt;/P&gt;</description>
      <pubDate>Mon, 06 Feb 2017 00:52:34 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/PROC-REG-Output-statement-takes-too-long/m-p/330072#M17431</guid>
      <dc:creator>tingo</dc:creator>
      <dc:date>2017-02-06T00:52:34Z</dc:date>
    </item>
    <item>
      <title>Re: PROC REG Output statement takes too long</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/PROC-REG-Output-statement-takes-too-long/m-p/330080#M17432</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Try changing the dataset name in the OUTPUT statement to something like training1. You are using the same name for the output dataset as for the input dataset in PROC REG.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;output out=&lt;STRONG&gt;training1&lt;/STRONG&gt; rstudent=rstud h=h ;&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 06 Feb 2017 02:09:33 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/PROC-REG-Output-statement-takes-too-long/m-p/330080#M17432</guid>
      <dc:creator>stat_sas</dc:creator>
      <dc:date>2017-02-06T02:09:33Z</dc:date>
    </item>
    <item>
      <title>Re: PROC REG Output statement takes too long</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/PROC-REG-Output-statement-takes-too-long/m-p/330133#M17436</link>
      <description>&lt;P&gt;Even better:&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;output out=&lt;/SPAN&gt;&lt;STRONG&gt;training1(keep=rstud h)&lt;/STRONG&gt;&lt;SPAN&gt; rstudent=rstud h=h ;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;You can merge with the original data later, if necessary.&lt;/P&gt;
&lt;P&gt;There will be extra time needed for the leverage and rstudent computation. &amp;nbsp;If you want to see an apples-to-apples comparison, try using the INFLUENCE option on the first MODEL statement. &amp;nbsp; Then the difference in times will be the I/O.&lt;/P&gt;</description>
      <pubDate>Mon, 06 Feb 2017 10:56:07 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/PROC-REG-Output-statement-takes-too-long/m-p/330133#M17436</guid>
      <dc:creator>Rick_SAS</dc:creator>
      <dc:date>2017-02-06T10:56:07Z</dc:date>
    </item>
    <item>
      <title>Re: PROC REG Output statement takes too long</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/PROC-REG-Output-statement-takes-too-long/m-p/330364#M17458</link>
      <description>&lt;P&gt;Thanks stat_sas and Rick_SAS for your suggestions.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;To try them, I downsized the data to a 5 GB dataset with 250k obs and still 2500 vars.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Now the previous PROC REG code takes 18 seconds without the OUTPUT statement and approx. 2 minutes and 5 seconds with it (requesting both rstudent and h), regardless of whether the out= dataset has&amp;nbsp;the same name as&amp;nbsp;the input data= dataset. So even at this data size the OUTPUT statement adds significantly to the run time.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;To further explore whether this increase is caused&amp;nbsp;by the computation of studentized residuals and leverages or by the I/O, I tried different OUTPUT statements:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;1) output out=training1 h=h;&amp;nbsp;Took approx. 2 minutes&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;2) output out=training1 rstudent=rstud;&amp;nbsp;Also took 2 minutes&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;3) output out=training1 (keep=rstud h) rstudent=rstud h=h;&amp;nbsp;Took 1 minute and 50 seconds&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;4) output out=training1;&amp;nbsp;Took 45 seconds&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Using the INFLUENCE&amp;nbsp;option on the code without the OUTPUT statement made no difference since the NOPRINT&amp;nbsp;option was on. 
However, I think removing NOPRINT would not give a fair comparison, since the run would then spend a long time writing influence statistics to the results window (250k lines of&amp;nbsp;2500 dfbetas plus the other statistics...).&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Anyway, the comparison between OUTPUT statements 1-4 above seems to point to the computation of the leverages and residuals taking most of the time: producing the whole out= dataset (with the 2500 vars) took just 25-30 seconds more than the code without an OUTPUT statement, but computing h and rstudent and writing them to the out= dataset (even just these 2 columns) took approx.&amp;nbsp;1 minute and a half&amp;nbsp;more. It looks like the computation of the diagonal of the H matrix (which is not needed for the stepwise selection itself but is necessary&amp;nbsp;for studentized residuals and leverages) is what makes the difference in elapsed time.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;I would appreciate any&amp;nbsp;comments on these findings, as well as further ideas for obtaining studentized residuals in less time (perhaps PROC IML...?)&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Thanks again!&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 07 Feb 2017 00:02:55 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/PROC-REG-Output-statement-takes-too-long/m-p/330364#M17458</guid>
      <dc:creator>tingo</dc:creator>
      <dc:date>2017-02-07T00:02:55Z</dc:date>
    </item>
  </channel>
</rss>

