Help using Base SAS procedures

How to improve execution speed of program in Base SAS?

Accepted Solution Solved
Reply
Frequent Contributor
Posts: 81
Accepted Solution

How to improve execution speed of program in Base SAS?

I need help improving the execution speed of a numerical optimization program (see below) I've written in Base SAS.

My program works fine - it successfully calculates parameters for a penalized logistic regression using a coordinate descent routine. The program is based on the algorithm used in the glmnet package in R, which is written in Fortran. Running glmnet in R is lightning fast, however my code takes significantly longer.

With the help of others I've streamlined my program considerably - it reads my data into a two-dimensional temporary array and processes the array in a single data step before reading out the final coefficient estimates to an output dataset. So I/O processing ought to be minimized. And my calculations should be conducted in memory, which ought to minimize execution time.

For trial runs I'm running the program on the publically available diabetes dataset (http://www4.stat.ncsu.edu/~boos/var.select/diabetes.tab.txt) after standardizing the 10 predictor variables x1-x10, adding an intercept constant x0=1, and creating a binary response variable y_gt140=1 if y>140, else y_gt140=0.

Running the code below takes ~5 minutes in SAS, compared with < 1 second to run glmnet in R.

What else can I do to improve the execution time of my program?

Thanks in advance!

**********************************************************************

*    Logistic regression coordinate descent code for elastic net      *

**********************************************************************;

%let nobs=442;

%let numvars=10;

%let numvars2=11;

%let numiter=100000;

%let lambda_list=.1;

%let alpha_list=.5;

%macro parmlist; %do i = 0 %to &numvars; &&p&i %end; %mend parmlist;

%macro meanlist; %do i = 0 %to &numvars; &&mean&i %end; %mend meanlist;

%macro stdlist; %do i = 0 %to &numvars; &&std&i %end; %mend stdlist;

data coord_descent_output (keep=alpha lambda parm_unstnd_0-parm_unstnd_&numvars2.);

     array xx[0:&numvars.] x0-x&numvars.

     array x_[&nobs.,0:&numvars.] _temporary_;

     array y_[&nobs] _temporary_;

     array mean_[0:&numvars.] (%meanlist);

     array std_[0:&numvars.] (%stdlist);

     array parm_unstnd_[0:&numvars2.] (&numvars2.*0);

*** Load data into two dimensional array ***;

     do _n_ = 1 to &nobs.;

             set work.diabetes_stnd_array2 nobs=nobs;

             do j=0 to &numvars.;

          x_[_n_,j] = xx;

          y_[_n_] = y_gt140 ;

             end;

     end;

     do alpha=&alpha_list;

     do lambda=&lambda_list;

     gamma=alpha*lambda;

     array p_[0:&numvars.] (&numvars2.*1);

     do i=1 to &numiter;

           do j=0 to &numvars;        

        z = 0;

        sum_wtx_sq = 0;

               do _nn_ = 1 to &nobs;

                     yhat = p_[0];

                     do k=1 to &numvars;

                           yhat = sum(yhat, p_*x_[_nn_,k]);

                     end;

                     proby = 1/(1 + exp(-yhat));

                     if proby <= .00001 then do;

                           proby = 0;

                           weight = .00001;

                     end;

                     else if proby >= .99999 then do;

                           proby = 1;

                           weight = .00001;

                     end;

                     else weight = proby*(1 - proby);

                     z = sum(z, (x_[_nn_,j]*(y_[_nn_] - proby) + weight*p_*(x_[_nn_,j])**2));

                     sum_wtx_sq = sum(sum_wtx_sq, weight*(x_[_nn_,j])**2);

               end;

                if j=0 then do;

                     p_ = z/sum_wtx_sq;

                end;

                else if j>0 then do;

                     if (z/&nobs > 0 and gamma < abs(z/&nobs)) then p_ = (z/&nobs - gamma)/(sum_wtx_sq/&nobs + lambda - gamma);

                     else if (z/&nobs < 0 and gamma < abs(z/&nobs)) then p_ = (z/&nobs + gamma)/(sum_wtx_sq/&nobs + lambda - gamma);

                     else if gamma >= abs(z/&nobs) then p_ = 0;

                end;

          end; * Inner loop ;

     end; * Outer loop ;

     *** Calculate "destandardized" regression coeff.'s from standardized predictor variables (Mean(x)= 0, Var(x)=1). ******;

  parm_unstnd_[0] = p_[0];

     do l=1 to &numvars.;

           parm_unstnd_[0] = parm_unstnd_[0] - p_*mean_/std_;

           parm_unstnd_ = p_/std_;

     end;

     output coord_descent_output;

     put "Final parm_unstnd_

  • ="
  • parm_unstnd_
  • ;
  •      end;

         end;

    run;


    Accepted Solutions
    Solution
    ‎04-24-2015 12:20 PM
    Trusted Advisor
    Posts: 1,301

    Re: How to improve execution speed of program in Base SAS?

    Yes, but this is the main factor (I think) behind why your code is running considerably longer.  Since R will most likely stop far before it reaches the maximum iterations.

    I am not going to speak to the validity of the following as a stopping criteria, but checking the value of sum_wtx_sq against it's lag for no change is valid for at least this test.

    I made a few minor edits to you code, below, but nothing that makes any real change to the algorithm (beyond adding the stopping criteria I mention above).

    data coord_descent;

    array x_[&nobs.,0:&numvars.] _temporary_;

    array y_[&nobs.]           _temporary_;

    /* cast data set to arrays x[i,j], y*/

    retain x0-x&numvars.;

    _adr1=addrlong(x0);

    do _n_=1 to &nobs.;

       set work.diabetes_stnd_array2;

       y_[_n_]=y_gt140;

       call pokelong( peekclong(_adr1), addrlong(x_[_n_,0]), %eval(8*%eval(&numvars.+1)) );

       end;

    array p_[0:&numvars.];

    _adr2=addrlong(p_[0]);

    _ones=repeat(put(1, rb8.), &numvars.);

    array parm_unstnd_[0:&numvars.];

    _adr3=addrlong(parm_unstnd_[0]);

    _zero=repeat(put(0, rb8.), &numvars.);

    array mean_[0:&numvars.] _temporary_ (%meanlist);

    array std_[0:&numvars.] _temporary_ (%stdlist);

    do alpha=&alpha_list.;

       do lambda=&lambda_list.;

          gamma = alpha*lambda;

          call pokelong( _ones, _adr2, %eval(8*%eval(&numvars.+1)) );

          call pokelong( _zero, _adr3, %eval(8*%eval(&numvars.+1)) );

       do i=1 to &numiter. until (lag(sum_wtx_sq) eq sum_wtx_sq);

          do j=0 to &numvars.;

         z = 0;

      sum_wtx_sq = 0;

      do _n_=1 to &nobs.;

        yhat = p_[0];

        do k=1 to &numvars.;

           yhat = sum(yhat, p_*x_[_n_, k]);

           end;

        proby = 1/(1 + exp(-yhat));

        if 0.00001 <= proby >= 0.99999 then do;

           proby = round(proby, 1);

       weight = 0.00001;

           end;

        else weight = proby*(1 - proby);

        z = sum(z, (x_[_n_,j]*(y_[_n_] - proby) + weight*p_*(x_[_n_,j])**2));

        sum_wtx_sq = sum(sum_wtx_sq, weight*(x_[_n_,j])**2);

        end; *endLoop: _n_= [nobs];

      if j>0 then do;

        if gamma<abs(z/&nobs.) then do;

                      if z>0 then p_ = (z/&nobs. - gamma);

       else p_ = (z/&nobs. + gamma);

           p_ = p_/(sum_wtx_sq/&nobs. + lambda - gamma);

       end;

        else p_=0;

        end;

                else p_ = z/sum_wtx_sq;

         end; *endLoop: j= [numvars];

          end; *endLoop: i= [numiter];

      parm_unstnd_[0] = p_[0];

             do _n_=1 to &numvars.;

         parm_unstnd_[0] = parm_unstnd_[0] - ((p_[_n_]*mean_[_n_])/std_[_n_]);

      parm_unstnd_[_n_] = p_[_n_]/std_[_n_];

         end;

      output;

      put "Final parm_unstnd_

  • =" parm_unstnd_
  • ;
  •    end; *endLoop: lambda= [lambda_list];

       end; *endLoop: alpha= [alpha_list];

       keep alpha lambda parm_unstnd_:;

       stop;

    run;

    Final parm_unstnd_

  • =-6.968638061 0 0.0910300569 0.0147581702 0 0 -0.013146089 0 0.8256638121 0 0
  • NOTE: There were 442 observations read from the data set WORK.DIABETES_STND_ARRAY2.

    NOTE: The data set WORK.COORD_DESCENT has 1 observations and 13 variables.

    NOTE: DATA statement used (Total process time):

          real time           0.06 seconds

          cpu time            0.05 seconds

    R:

    library("glmnet")

    set.seed(20)

    foo <- read.table("http://www4.stat.ncsu.edu/~boos/var.select/diabetes.tab.txt", header=TRUE, sep="\t", strip.white=TRUE)

    x <- as.matrix(foo[1:10])

    y <- ifelse(foo$Y > 140, 1, 0)

    system.time(bar <- glmnet(x, y, family = "binomial", alpha = 0.5) )

       user  system elapsed

       0.03    0.00    0.03

    print(coef(bar, s=0.1))

    11 x 1 sparse Matrix of class "dgCMatrix"

                          1

    (Intercept) -6.97872002

    AGE          .        

    SEX          .        

    BMI          0.09119253

    BP           0.01478218

    S1           .        

    S2           .        

    S3          -0.01316737

    S4           .        

    S5           0.82665323

    S6           .        

    View solution in original post


    All Replies
    Super User
    Posts: 11,343

    Re: How to improve execution speed of program in Base SAS?

    Just seeing a two dimensional array makes me ask if you have access to SAS/IML. It is designed for such things in general.

    Frequent Contributor
    Posts: 81

    Re: How to improve execution speed of program in Base SAS?

    No - my company only has SAS EG loaded on the server. I'm hoping a fast & efficient program can be wriiten in base SAS without bringing in other modules.

    Super User
    Posts: 11,343

    Re: How to improve execution speed of program in Base SAS?

    I'm not quite sure but I think this line:

    do i=1 to &numiter;

    may be the issue. It appears that you are iterating the exact same code 100000 times in the second loop controlled by j.

    Do you get different results for different values of &numiter with all else the same?

    Super User
    Posts: 19,868

    Re: How to improve execution speed of program in Base SAS?

    Echoing @ballardw suggestion, usually there is some option to check if parameters have stabilized so you don't need to keep repeating. It usually ends up being a do while loop of some kind.

    Frequent Contributor
    Posts: 81

    Re: How to improve execution speed of program in Base SAS?

    Right,  I am converging to a good solution after about 1,000 iterations which takes only about 3 seconds. I chose 100,000 iterations since that's the baseline # of iterations in the R glmnet routine.

    However, for much larger datasets which I foresee I'll be running in the future, the runtime could conceivably take hours. I don't know, maybe this is the best I can do in Base SAS.

    I just saw this article by Rick Wicklin which may help out even if I'm not using proc iml:

    http://blogs.sas.com/content/iml/2013/05/15/vectorize-computations.html

    Super User
    Posts: 7,854

    Re: How to improve execution speed of program in Base SAS?

    Be aware that the data step compiler will not be on par with something like gcc or a FORTRAN compiler when it comes to optimizing code. There's too much runtime checking built in for what is basically interpreted code, so my guess is that you will never reach the performance of R for such operations.

    I'm very fond of SAS, but if ever one of my "customers" wants to try solving a problem with R, he/she will have my full support in getting the SAS data warehouse data into R.

    "How often have I told you blokes to use the right tool?" - LtCmdr Montgomery Scott

    ---------------------------------------------------------------------------------------------
    Maxims of Maximally Efficient SAS Programmers
    Frequent Contributor
    Posts: 81

    Re: How to improve execution speed of program in Base SAS?

    Posted in reply to KurtBremser

    Thanks Kurt - is there a way to curtail the runtime checking in SAS to increase performance?

    Super User
    Posts: 7,854

    Re: How to improve execution speed of program in Base SAS?

    No. The data step compiler does not have any optimization options, as data steps are usually heavily I/O bound, and SAS provides prefabricated and optimized procedures for most statistic tasks which tend to be CPU bound.

    ---------------------------------------------------------------------------------------------
    Maxims of Maximally Efficient SAS Programmers
    Trusted Advisor
    Posts: 1,301

    Re: How to improve execution speed of program in Base SAS?

    You are missing any stopping criteria for the coordinate decent algorithm to stop iterating when it has converged for a given lambda.

    Frequent Contributor
    Posts: 81

    Re: How to improve execution speed of program in Base SAS?

    Hi FriedEgg -

    No, with the fast approaching deadlines for the global forum I didn't have time to include stopping criteria, or include wrapper code for using 5 or 10-fold cross validation to find optimal values of the alpha or lambda penalty values.

    Instead I concentrated on just getting the algorithm to converge to a solution (:-P) by comparing output with glmnet in R using the diabetes test dataset for logistic and Poisson regressions. The code matches glmnet output reasonably well for ~1,000 iterations (but not perfect). I'm hoping the code is useful as a niche market solution for SAS users without access to Revolution Analytics who want to run a lasso on a dataset too big for R to handle.

    Robert

    Solution
    ‎04-24-2015 12:20 PM
    Trusted Advisor
    Posts: 1,301

    Re: How to improve execution speed of program in Base SAS?

    Yes, but this is the main factor (I think) behind why your code is running considerably longer.  Since R will most likely stop far before it reaches the maximum iterations.

    I am not going to speak to the validity of the following as a stopping criteria, but checking the value of sum_wtx_sq against it's lag for no change is valid for at least this test.

    I made a few minor edits to you code, below, but nothing that makes any real change to the algorithm (beyond adding the stopping criteria I mention above).

    data coord_descent;

    array x_[&nobs.,0:&numvars.] _temporary_;

    array y_[&nobs.]           _temporary_;

    /* cast data set to arrays x[i,j], y*/

    retain x0-x&numvars.;

    _adr1=addrlong(x0);

    do _n_=1 to &nobs.;

       set work.diabetes_stnd_array2;

       y_[_n_]=y_gt140;

       call pokelong( peekclong(_adr1), addrlong(x_[_n_,0]), %eval(8*%eval(&numvars.+1)) );

       end;

    array p_[0:&numvars.];

    _adr2=addrlong(p_[0]);

    _ones=repeat(put(1, rb8.), &numvars.);

    array parm_unstnd_[0:&numvars.];

    _adr3=addrlong(parm_unstnd_[0]);

    _zero=repeat(put(0, rb8.), &numvars.);

    array mean_[0:&numvars.] _temporary_ (%meanlist);

    array std_[0:&numvars.] _temporary_ (%stdlist);

    do alpha=&alpha_list.;

       do lambda=&lambda_list.;

          gamma = alpha*lambda;

          call pokelong( _ones, _adr2, %eval(8*%eval(&numvars.+1)) );

          call pokelong( _zero, _adr3, %eval(8*%eval(&numvars.+1)) );

       do i=1 to &numiter. until (lag(sum_wtx_sq) eq sum_wtx_sq);

          do j=0 to &numvars.;

         z = 0;

      sum_wtx_sq = 0;

      do _n_=1 to &nobs.;

        yhat = p_[0];

        do k=1 to &numvars.;

           yhat = sum(yhat, p_*x_[_n_, k]);

           end;

        proby = 1/(1 + exp(-yhat));

        if 0.00001 <= proby >= 0.99999 then do;

           proby = round(proby, 1);

       weight = 0.00001;

           end;

        else weight = proby*(1 - proby);

        z = sum(z, (x_[_n_,j]*(y_[_n_] - proby) + weight*p_*(x_[_n_,j])**2));

        sum_wtx_sq = sum(sum_wtx_sq, weight*(x_[_n_,j])**2);

        end; *endLoop: _n_= [nobs];

      if j>0 then do;

        if gamma<abs(z/&nobs.) then do;

                      if z>0 then p_ = (z/&nobs. - gamma);

       else p_ = (z/&nobs. + gamma);

           p_ = p_/(sum_wtx_sq/&nobs. + lambda - gamma);

       end;

        else p_=0;

        end;

                else p_ = z/sum_wtx_sq;

         end; *endLoop: j= [numvars];

          end; *endLoop: i= [numiter];

      parm_unstnd_[0] = p_[0];

             do _n_=1 to &numvars.;

         parm_unstnd_[0] = parm_unstnd_[0] - ((p_[_n_]*mean_[_n_])/std_[_n_]);

      parm_unstnd_[_n_] = p_[_n_]/std_[_n_];

         end;

      output;

      put "Final parm_unstnd_

  • =" parm_unstnd_
  • ;
  •    end; *endLoop: lambda= [lambda_list];

       end; *endLoop: alpha= [alpha_list];

       keep alpha lambda parm_unstnd_:;

       stop;

    run;

    Final parm_unstnd_

  • =-6.968638061 0 0.0910300569 0.0147581702 0 0 -0.013146089 0 0.8256638121 0 0
  • NOTE: There were 442 observations read from the data set WORK.DIABETES_STND_ARRAY2.

    NOTE: The data set WORK.COORD_DESCENT has 1 observations and 13 variables.

    NOTE: DATA statement used (Total process time):

          real time           0.06 seconds

          cpu time            0.05 seconds

    R:

    library("glmnet")

    set.seed(20)

    foo <- read.table("http://www4.stat.ncsu.edu/~boos/var.select/diabetes.tab.txt", header=TRUE, sep="\t", strip.white=TRUE)

    x <- as.matrix(foo[1:10])

    y <- ifelse(foo$Y > 140, 1, 0)

    system.time(bar <- glmnet(x, y, family = "binomial", alpha = 0.5) )

       user  system elapsed

       0.03    0.00    0.03

    print(coef(bar, s=0.1))

    11 x 1 sparse Matrix of class "dgCMatrix"

                          1

    (Intercept) -6.97872002

    AGE          .        

    SEX          .        

    BMI          0.09119253

    BP           0.01478218

    S1           .        

    S2           .        

    S3          -0.01316737

    S4           .        

    S5           0.82665323

    S6           .        

    Frequent Contributor
    Posts: 81

    Re: How to improve execution speed of program in Base SAS?

    Cool, thanks FriedEgg for the code edits & everyone else for the helpful comments.

    Ah, I missed the fact that glmnet is exiting the loop after the convergence criteria (1E-07) is reached, Ok it makes sense now why I'm seeing the runtime discrepancies.

    There are still a few more improvements to be made to the code. I don't need to continue updating parameters once their estimate is zero.

    Also the inner loop:

                         do k=1 to &numvars;

                   yhat = sum(yhat, p_*x_[_nn_,k]);

                         end;

    is unnecessarily repeating the same product calculation for values that aren't changing during each iteration of the loop do j=0 to &numvars;

    Trusted Advisor
    Posts: 3,215

    Re: How to improve execution speed of program in Base SAS?

    although you do not have IML you cuould have used SAS(R) 9.4 DS2 Language Reference, Fourth Edition with the matrxc package it is part of base.
    No idea it would dot the job. The difference is it should be better dot he job instead of the datamanipulation goal (base SAS). Your are too late for that anyway.

    ---->-- ja karman --<-----
    Trusted Advisor
    Posts: 3,215

    Re: How to improve execution speed of program in Base SAS?

    It helps as shown to eliminate unnecessary loops. Wel done "friedegg"

    Still the datastep and ds2 are build for data-manipulation.  When you can use the "SAS procs" they are build for analyzing/reporting, you are better with those. A better name for "SAS procs"are packages as they are pre_build solutions. However as packages are also used to indicate other programming styles that can be confusing. The matrix package and SQL package within DS2 are examples of those.
    Optimizing code is optimizing resource usage (minimizing)  and using the best code-parts for those. 

    ---->-- ja karman --<-----
    🔒 This topic is solved and locked.

    Need further help from the community? Please ask a new question.

    Discussion stats
    • 20 replies
    • 614 views
    • 9 likes
    • 6 in conversation