tingo
Calcite | Level 5

Hi,

While running different regression analyses on a 17GB dataset (approx. 900k instances and 2500 variables) using PROC REG, I noticed a huge difference in run time between the following two programs:

 

Without OUTPUT statement: this takes approx. 1 minute to complete.

proc reg data=training outest=pars noprint;
by v1; /*this by-variable has only 2 levels*/
model vm=&vars / selection=stepwise adjrsq aic include=1 sle=0.001 sls=0.005 ; /*&vars contains the 2500 variables*/
weight invvar ;
run;
quit;

 

With OUTPUT statement: this, however, takes approx. 1 hour to complete.

proc reg data=training noprint;
by v1; /*this by-variable has only 2 levels*/
model vm=&vars / selection=stepwise adjrsq aic include=1 sle=0.001 sls=0.005 ; /*&vars contains the 2500 variables*/
weight invvar;
output out=training rstudent=rstud h=h ;
run;
quit;

 

Notice that the only differences are that the first program requests an outest= dataset (pars) but has no OUTPUT statement, while the second one includes an OUTPUT statement but does not request the outest= dataset.

 

Why does just including the OUTPUT statement make such a big difference? Writing data to the out= dataset alone should not account for it, since the data step that builds the 17GB training dataset takes just 5 minutes to complete. Could it be the computation of studentized residuals and leverages that takes so long?

 

By the way, how would you shorten the time required to produce such an out= dataset?

 

Thanks! 

3 REPLIES
stat_sas
Ammonite | Level 13

Hi,

 

Try again after changing the dataset name in the OUTPUT statement to something like training1. You are using the same name in the OUTPUT statement as the input dataset in PROC REG.

 

output out=training1 rstudent=rstud h=h ;

Rick_SAS
SAS Super FREQ

Even better:

output out=training1(keep=rstud h) rstudent=rstud h=h ;

You can merge with the original data later, if necessary.
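
As a rough sketch of that later merge (an illustration only, not tested on the data in this thread): because the KEEP= list drops every identifying variable, the two datasets can only be combined by position. That assumes training1 has one row for each row of training, in the same order. The name combined is made up for the example.

```sas
/* One-to-one merge: no BY statement, so rows are paired by position. */
/* Assumes training1 has the same number of rows, in the same order,  */
/* as training. "combined" is a hypothetical dataset name.            */
data combined;
   merge training training1;
run;
```

If the row order cannot be relied on, it is safer to keep a key variable in the KEEP= list and merge with a BY statement instead.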

There will be extra time needed for the leverage and rstudent computation.  If you want to see an apples-to-apples comparison, try using the INFLUENCE option on the first MODEL statement.   Then the difference in times will be the I/O.

tingo
Calcite | Level 5

Thanks stat_sas and Rick_SAS for your suggestions.

 

To try them, I downsized the data to a 5GB dataset with 250k obs and still 2500 vars.

 

Now, executing the previous PROC REG code without the OUTPUT statement takes 18 seconds, while it takes approx. 2 minutes and 5 seconds with the OUTPUT statement (requesting both rstudent and h), regardless of whether or not the out= dataset has the same name as the input data= dataset. So even at this data size, the OUTPUT statement significantly increases the computation time.

 

To further explore whether this increase is caused by the computation of studentized residuals and leverages or by the I/O, I tried different output statements:

 

1) output out=training1 h=h; Took approx. 2 minutes

2) output out=training1 rstudent=rstud; Also took 2 minutes

3) output out=training1 (keep=rstud h) rstudent=rstud h=h; Took 1 minute and 50 seconds

4) output out=training1; Took 45 seconds
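
For anyone who wants to reproduce timings like these, the figures can be read from the "real time" and "cpu time" notes in the SAS log. Here is a minimal sketch, assuming the FULLSTIMER system option is switched on to get more detailed resource notes (that setting is an assumption, not something used for the numbers above), shown with variant 4:

```sas
options fullstimer;    /* write detailed time and memory notes to the log */

proc reg data=training noprint;
   by v1;
   model vm=&vars / selection=stepwise adjrsq aic include=1 sle=0.001 sls=0.005;
   weight invvar;
   output out=training1;   /* variant 4: no statistics requested */
run;
quit;

options nofullstimer;  /* switch back to the shorter timing notes */
```

Comparing real time with CPU time for each variant gives a rough hint of whether a step is mostly waiting on I/O or mostly computing.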

 

Using the INFLUENCE option on the code without the OUTPUT statement made no difference, since the NOPRINT option was on. But I think the comparison would not be fair with NOPRINT removed, as it would then take a long time to write the influence statistics to the results window (250k lines of 2500 dfbetas plus the other statistics...).

 

Anyway, the comparison between OUTPUT statements 1-4 above seems to point to the computation of the leverages and residuals taking most of the time: producing the whole out= dataset (with the 2500 vars) took just 25-30 seconds more than the code without an OUTPUT statement, but computing the h and rstudent statistics and writing them to the out= dataset (even just these 2 columns) took approx. 1 minute and a half more. It looks like it is the computation of the diagonal of the H matrix (which is not actually needed in the stepwise selection process but is necessary for studentized residuals and leverages) that makes the difference in elapsed time.

 

I would appreciate any comments on these findings, as well as further ideas for obtaining studentized residuals in a shorter time (perhaps PROC IML...?)

 

Thanks again!      
 
