BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
RobF
Quartz | Level 8

I need help improving the execution speed of a numerical optimization program (see below) I've written in Base SAS.

My program works fine - it successfully calculates parameters for a penalized logistic regression using a coordinate descent routine. The program is based on the algorithm used in the glmnet package in R, which is written in Fortran. Running glmnet in R is lightning fast, however my code takes significantly longer.

With the help of others I've streamlined my program considerably - it reads my data into a two-dimensional temporary array and processes the array in a single data step before reading out the final coefficient estimates to an output dataset. So I/O processing ought to be minimized. And my calculations should be conducted in memory, which ought to minimize execution time.

For trial runs I'm running the program on the publically available diabetes dataset (http://www4.stat.ncsu.edu/~boos/var.select/diabetes.tab.txt) after standardizing the 10 predictor variables x1-x10, adding an intercept constant x0=1, and creating a binary response variable y_gt140=1 if y>140, else y_gt140=0.

Running the code below takes ~5 minutes in SAS, compared with < 1 second to run glmnet in R.

What else can I do to improve the execution time of my program?

Thanks in advance!

**********************************************************************

*    Logistic regression coordinate descent code for elastic net      *

**********************************************************************;

%let nobs=442;

%let numvars=10;

%let numvars2=11;

%let numiter=100000;

%let lambda_list=.1;

%let alpha_list=.5;

%macro parmlist; %do i = 0 %to &numvars; &&p&i %end; %mend parmlist;

%macro meanlist; %do i = 0 %to &numvars; &&mean&i %end; %mend meanlist;

%macro stdlist; %do i = 0 %to &numvars; &&std&i %end; %mend stdlist;

data coord_descent_output (keep=alpha lambda parm_unstnd_0-parm_unstnd_&numvars2.);

     array xx[0:&numvars.] x0-x&numvars.

     array x_[&nobs.,0:&numvars.] _temporary_;

     array y_[&nobs] _temporary_;

     array mean_[0:&numvars.] (%meanlist);

     array std_[0:&numvars.] (%stdlist);

     array parm_unstnd_[0:&numvars2.] (&numvars2.*0);

*** Load data into two dimensional array ***;

     do _n_ = 1 to &nobs.;

             set work.diabetes_stnd_array2 nobs=nobs;

             do j=0 to &numvars.;

          x_[_n_,j] = xx;

          y_[_n_] = y_gt140 ;

             end;

     end;

     do alpha=&alpha_list;

     do lambda=&lambda_list;

     gamma=alpha*lambda;

     array p_[0:&numvars.] (&numvars2.*1);

     do i=1 to &numiter;

           do j=0 to &numvars;        

        z = 0;

        sum_wtx_sq = 0;

               do _nn_ = 1 to &nobs;

                     yhat = p_[0];

                     do k=1 to &numvars;

                           yhat = sum(yhat, p_*x_[_nn_,k]);

                     end;

                     proby = 1/(1 + exp(-yhat));

                     if proby <= .00001 then do;

                           proby = 0;

                           weight = .00001;

                     end;

                     else if proby >= .99999 then do;

                           proby = 1;

                           weight = .00001;

                     end;

                     else weight = proby*(1 - proby);

                     z = sum(z, (x_[_nn_,j]*(y_[_nn_] - proby) + weight*p_*(x_[_nn_,j])**2));

                     sum_wtx_sq = sum(sum_wtx_sq, weight*(x_[_nn_,j])**2);

               end;

                if j=0 then do;

                     p_ = z/sum_wtx_sq;

                end;

                else if j>0 then do;

                     if (z/&nobs > 0 and gamma < abs(z/&nobs)) then p_ = (z/&nobs - gamma)/(sum_wtx_sq/&nobs + lambda - gamma);

                     else if (z/&nobs < 0 and gamma < abs(z/&nobs)) then p_ = (z/&nobs + gamma)/(sum_wtx_sq/&nobs + lambda - gamma);

                     else if gamma >= abs(z/&nobs) then p_ = 0;

                end;

          end; * Inner loop ;

     end; * Outer loop ;

     *** Calculate "destandardized" regression coeff.'s from standardized predictor variables (Mean(x)= 0, Var(x)=1). ******;

  parm_unstnd_[0] = p_[0];

     do l=1 to &numvars.;

           parm_unstnd_[0] = parm_unstnd_[0] - p_*mean_/std_;

           parm_unstnd_ = p_/std_;

     end;

     output coord_descent_output;

     put "Final parm_unstnd_

  • ="
  • parm_unstnd_
  • ;
  •      end;

         end;

    run;

    1 ACCEPTED SOLUTION

    Accepted Solutions
    FriedEgg
    SAS Employee

    Yes, but this is the main factor (I think) behind why your code is running considerably longer.  Since R will most likely stop far before it reaches the maximum iterations.

    I am not going to speak to the validity of the following as a stopping criteria, but checking the value of sum_wtx_sq against it's lag for no change is valid for at least this test.

    I made a few minor edits to you code, below, but nothing that makes any real change to the algorithm (beyond adding the stopping criteria I mention above).

    data coord_descent;

    array x_[&nobs.,0:&numvars.] _temporary_;

    array y_[&nobs.]           _temporary_;

    /* cast data set to arrays x[i,j], y*/

    retain x0-x&numvars.;

    _adr1=addrlong(x0);

    do _n_=1 to &nobs.;

       set work.diabetes_stnd_array2;

       y_[_n_]=y_gt140;

       call pokelong( peekclong(_adr1), addrlong(x_[_n_,0]), %eval(8*%eval(&numvars.+1)) );

       end;

    array p_[0:&numvars.];

    _adr2=addrlong(p_[0]);

    _ones=repeat(put(1, rb8.), &numvars.);

    array parm_unstnd_[0:&numvars.];

    _adr3=addrlong(parm_unstnd_[0]);

    _zero=repeat(put(0, rb8.), &numvars.);

    array mean_[0:&numvars.] _temporary_ (%meanlist);

    array std_[0:&numvars.] _temporary_ (%stdlist);

    do alpha=&alpha_list.;

       do lambda=&lambda_list.;

          gamma = alpha*lambda;

          call pokelong( _ones, _adr2, %eval(8*%eval(&numvars.+1)) );

          call pokelong( _zero, _adr3, %eval(8*%eval(&numvars.+1)) );

       do i=1 to &numiter. until (lag(sum_wtx_sq) eq sum_wtx_sq);

          do j=0 to &numvars.;

         z = 0;

      sum_wtx_sq = 0;

      do _n_=1 to &nobs.;

        yhat = p_[0];

        do k=1 to &numvars.;

           yhat = sum(yhat, p_*x_[_n_, k]);

           end;

        proby = 1/(1 + exp(-yhat));

        if 0.00001 <= proby >= 0.99999 then do;

           proby = round(proby, 1);

       weight = 0.00001;

           end;

        else weight = proby*(1 - proby);

        z = sum(z, (x_[_n_,j]*(y_[_n_] - proby) + weight*p_*(x_[_n_,j])**2));

        sum_wtx_sq = sum(sum_wtx_sq, weight*(x_[_n_,j])**2);

        end; *endLoop: _n_= [nobs];

      if j>0 then do;

        if gamma<abs(z/&nobs.) then do;

                      if z>0 then p_ = (z/&nobs. - gamma);

       else p_ = (z/&nobs. + gamma);

           p_ = p_/(sum_wtx_sq/&nobs. + lambda - gamma);

       end;

        else p_=0;

        end;

                else p_ = z/sum_wtx_sq;

         end; *endLoop: j= [numvars];

          end; *endLoop: i= [numiter];

      parm_unstnd_[0] = p_[0];

             do _n_=1 to &numvars.;

         parm_unstnd_[0] = parm_unstnd_[0] - ((p_[_n_]*mean_[_n_])/std_[_n_]);

      parm_unstnd_[_n_] = p_[_n_]/std_[_n_];

         end;

      output;

      put "Final parm_unstnd_

  • =" parm_unstnd_
  • ;
  •    end; *endLoop: lambda= [lambda_list];

       end; *endLoop: alpha= [alpha_list];

       keep alpha lambda parm_unstnd_:;

       stop;

    run;

    Final parm_unstnd_

  • =-6.968638061 0 0.0910300569 0.0147581702 0 0 -0.013146089 0 0.8256638121 0 0
  • NOTE: There were 442 observations read from the data set WORK.DIABETES_STND_ARRAY2.

    NOTE: The data set WORK.COORD_DESCENT has 1 observations and 13 variables.

    NOTE: DATA statement used (Total process time):

          real time           0.06 seconds

          cpu time            0.05 seconds

    R:

    library("glmnet")

    set.seed(20)

    foo <- read.table("http://www4.stat.ncsu.edu/~boos/var.select/diabetes.tab.txt", header=TRUE, sep="\t", strip.white=TRUE)

    x <- as.matrix(foo[1:10])

    y <- ifelse(foo$Y > 140, 1, 0)

    system.time(bar <- glmnet(x, y, family = "binomial", alpha = 0.5) )

       user  system elapsed

       0.03    0.00    0.03

    print(coef(bar, s=0.1))

    11 x 1 sparse Matrix of class "dgCMatrix"

                          1

    (Intercept) -6.97872002

    AGE          .        

    SEX          .        

    BMI          0.09119253

    BP           0.01478218

    S1           .        

    S2           .        

    S3          -0.01316737

    S4           .        

    S5           0.82665323

    S6           .        

    View solution in original post

    20 REPLIES 20
    ballardw
    Super User

    Just seeing a two dimensional array makes me ask if you have access to SAS/IML. It is designed for such things in general.

    RobF
    Quartz | Level 8

    No - my company only has SAS EG loaded on the server. I'm hoping a fast & efficient program can be wriiten in base SAS without bringing in other modules.

    ballardw
    Super User

    I'm not quite sure but I think this line:

    do i=1 to &numiter;

    may be the issue. It appears that you are iterating the exact same code 100000 times in the second loop controlled by j.

    Do you get different results for different values of &numiter with all else the same?

    Reeza
    Super User

    Echoing @ballardw suggestion, usually there is some option to check if parameters have stabilized so you don't need to keep repeating. It usually ends up being a do while loop of some kind.

    RobF
    Quartz | Level 8

    Right,  I am converging to a good solution after about 1,000 iterations which takes only about 3 seconds. I chose 100,000 iterations since that's the baseline # of iterations in the R glmnet routine.

    However, for much larger datasets which I foresee I'll be running in the future, the runtime could conceivably take hours. I don't know, maybe this is the best I can do in Base SAS.

    I just saw this article by Rick Wicklin which may help out even if I'm not using proc iml:

    http://blogs.sas.com/content/iml/2013/05/15/vectorize-computations.html

    Kurt_Bremser
    Super User

    Be aware that the data step compiler will not be on par with something like gcc or a FORTRAN compiler when it comes to optimizing code. There's too much runtime checking built in for what is basically interpreted code, so my guess is that you will never reach the performance of R for such operations.

    I'm very fond of SAS, but if ever one of my "customers" wants to try solving a problem with R, he/she will have my full support in getting the SAS data warehouse data into R.

    "How often have I told you blokes to use the right tool?" - LtCmdr Montgomery Scott

    RobF
    Quartz | Level 8

    Thanks Kurt - is there a way to curtail the runtime checking in SAS to increase performance?

    Kurt_Bremser
    Super User

    No. The data step compiler does not have any optimization options, as data steps are usually heavily I/O bound, and SAS provides prefabricated and optimized procedures for most statistic tasks which tend to be CPU bound.

    FriedEgg
    SAS Employee

    You are missing any stopping criteria for the coordinate decent algorithm to stop iterating when it has converged for a given lambda.

    RobF
    Quartz | Level 8

    Hi FriedEgg -

    No, with the fast approaching deadlines for the global forum I didn't have time to include stopping criteria, or include wrapper code for using 5 or 10-fold cross validation to find optimal values of the alpha or lambda penalty values.

    Instead I concentrated on just getting the algorithm to converge to a solution (:-P) by comparing output with glmnet in R using the diabetes test dataset for logistic and Poisson regressions. The code matches glmnet output reasonably well for ~1,000 iterations (but not perfect). I'm hoping the code is useful as a niche market solution for SAS users without access to Revolution Analytics who want to run a lasso on a dataset too big for R to handle.

    Robert

    FriedEgg
    SAS Employee

    Yes, but this is the main factor (I think) behind why your code is running considerably longer.  Since R will most likely stop far before it reaches the maximum iterations.

    I am not going to speak to the validity of the following as a stopping criteria, but checking the value of sum_wtx_sq against it's lag for no change is valid for at least this test.

    I made a few minor edits to you code, below, but nothing that makes any real change to the algorithm (beyond adding the stopping criteria I mention above).

    data coord_descent;

    array x_[&nobs.,0:&numvars.] _temporary_;

    array y_[&nobs.]           _temporary_;

    /* cast data set to arrays x[i,j], y*/

    retain x0-x&numvars.;

    _adr1=addrlong(x0);

    do _n_=1 to &nobs.;

       set work.diabetes_stnd_array2;

       y_[_n_]=y_gt140;

       call pokelong( peekclong(_adr1), addrlong(x_[_n_,0]), %eval(8*%eval(&numvars.+1)) );

       end;

    array p_[0:&numvars.];

    _adr2=addrlong(p_[0]);

    _ones=repeat(put(1, rb8.), &numvars.);

    array parm_unstnd_[0:&numvars.];

    _adr3=addrlong(parm_unstnd_[0]);

    _zero=repeat(put(0, rb8.), &numvars.);

    array mean_[0:&numvars.] _temporary_ (%meanlist);

    array std_[0:&numvars.] _temporary_ (%stdlist);

    do alpha=&alpha_list.;

       do lambda=&lambda_list.;

          gamma = alpha*lambda;

          call pokelong( _ones, _adr2, %eval(8*%eval(&numvars.+1)) );

          call pokelong( _zero, _adr3, %eval(8*%eval(&numvars.+1)) );

       do i=1 to &numiter. until (lag(sum_wtx_sq) eq sum_wtx_sq);

          do j=0 to &numvars.;

         z = 0;

      sum_wtx_sq = 0;

      do _n_=1 to &nobs.;

        yhat = p_[0];

        do k=1 to &numvars.;

           yhat = sum(yhat, p_*x_[_n_, k]);

           end;

        proby = 1/(1 + exp(-yhat));

        if 0.00001 <= proby >= 0.99999 then do;

           proby = round(proby, 1);

       weight = 0.00001;

           end;

        else weight = proby*(1 - proby);

        z = sum(z, (x_[_n_,j]*(y_[_n_] - proby) + weight*p_*(x_[_n_,j])**2));

        sum_wtx_sq = sum(sum_wtx_sq, weight*(x_[_n_,j])**2);

        end; *endLoop: _n_= [nobs];

      if j>0 then do;

        if gamma<abs(z/&nobs.) then do;

                      if z>0 then p_ = (z/&nobs. - gamma);

       else p_ = (z/&nobs. + gamma);

           p_ = p_/(sum_wtx_sq/&nobs. + lambda - gamma);

       end;

        else p_=0;

        end;

                else p_ = z/sum_wtx_sq;

         end; *endLoop: j= [numvars];

          end; *endLoop: i= [numiter];

      parm_unstnd_[0] = p_[0];

             do _n_=1 to &numvars.;

         parm_unstnd_[0] = parm_unstnd_[0] - ((p_[_n_]*mean_[_n_])/std_[_n_]);

      parm_unstnd_[_n_] = p_[_n_]/std_[_n_];

         end;

      output;

      put "Final parm_unstnd_

  • =" parm_unstnd_
  • ;
  •    end; *endLoop: lambda= [lambda_list];

       end; *endLoop: alpha= [alpha_list];

       keep alpha lambda parm_unstnd_:;

       stop;

    run;

    Final parm_unstnd_

  • =-6.968638061 0 0.0910300569 0.0147581702 0 0 -0.013146089 0 0.8256638121 0 0
  • NOTE: There were 442 observations read from the data set WORK.DIABETES_STND_ARRAY2.

    NOTE: The data set WORK.COORD_DESCENT has 1 observations and 13 variables.

    NOTE: DATA statement used (Total process time):

          real time           0.06 seconds

          cpu time            0.05 seconds

    R:

    library("glmnet")

    set.seed(20)

    foo <- read.table("http://www4.stat.ncsu.edu/~boos/var.select/diabetes.tab.txt", header=TRUE, sep="\t", strip.white=TRUE)

    x <- as.matrix(foo[1:10])

    y <- ifelse(foo$Y > 140, 1, 0)

    system.time(bar <- glmnet(x, y, family = "binomial", alpha = 0.5) )

       user  system elapsed

       0.03    0.00    0.03

    print(coef(bar, s=0.1))

    11 x 1 sparse Matrix of class "dgCMatrix"

                          1

    (Intercept) -6.97872002

    AGE          .        

    SEX          .        

    BMI          0.09119253

    BP           0.01478218

    S1           .        

    S2           .        

    S3          -0.01316737

    S4           .        

    S5           0.82665323

    S6           .        

    RobF
    Quartz | Level 8

    Cool, thanks FriedEgg for the code edits & everyone else for the helpful comments.

    Ah, I missed the fact that glmnet is exiting the loop after the convergence criteria (1E-07) is reached, Ok it makes sense now why I'm seeing the runtime discrepancies.

    There are still a few more improvements to be made to the code. I don't need to continue updating parameters once their estimate is zero.

    Also the inner loop:

                         do k=1 to &numvars;

                   yhat = sum(yhat, p_*x_[_nn_,k]);

                         end;

    is unnecessarily repeating the same product calculation for values that aren't changing during each iteration of the loop do j=0 to &numvars;

    jakarman
    Barite | Level 11

    although you do not have IML you cuould have used SAS(R) 9.4 DS2 Language Reference, Fourth Edition with the matrxc package it is part of base.
    No idea it would dot the job. The difference is it should be better dot he job instead of the datamanipulation goal (base SAS). Your are too late for that anyway.

    ---->-- ja karman --<-----
    jakarman
    Barite | Level 11

    It helps as shown to eliminate unnecessary loops. Wel done "friedegg"

    Still the datastep and ds2 are build for data-manipulation.  When you can use the "SAS procs" they are build for analyzing/reporting, you are better with those. A better name for "SAS procs"are packages as they are pre_build solutions. However as packages are also used to indicate other programming styles that can be confusing. The matrix package and SQL package within DS2 are examples of those.
    Optimizing code is optimizing resource usage (minimizing)  and using the best code-parts for those. 

    ---->-- ja karman --<-----

    sas-innovate-2024.png

    Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

    Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

     

    Register now!

    What is Bayesian Analysis?

    Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

    Find more tutorials on the SAS Users YouTube channel.

    Click image to register for webinarClick image to register for webinar

    Classroom Training Available!

    Select SAS Training centers are offering in-person courses. View upcoming courses for:

    View all other training opportunities.

    Discussion stats
    • 20 replies
    • 1730 views
    • 9 likes
    • 6 in conversation