
Automatic Linearization Using the OPTMODEL Procedure: Least Absolute Deviation (LAD) Regression


In my previous blog, the focus was on building an Ordinary Least Squares (OLS) regression model using SAS's algebraic modeling language, the OPTMODEL procedure, which is designed for building and solving optimization models. In this blog, we'll focus on another type of linear regression model called Least Absolute Deviation (LAD).

 

Unlike OLS regression, which minimizes the sum of the squared residuals, LAD regression minimizes the sum of the absolute value of the residuals. LAD is considered a more robust regression model since residuals are penalized linearly (as opposed to quadratically in OLS).
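
In notation, with $\beta_0$ denoting the intercept, $\beta_k$ the coefficient for predictor $k$, and $x_{ik}$ the value of predictor $k$ for observation $i$ (the same quantities the code below calls Intercept, Beta, and x), the two objectives are:

$$
\text{OLS:}\;\; \min_{\beta_0,\,\beta}\; \sum_{i} \Bigl(y_i - \beta_0 - \sum_{k}\beta_k x_{ik}\Bigr)^{2}
\qquad\qquad
\text{LAD:}\;\; \min_{\beta_0,\,\beta}\; \sum_{i} \Bigl|\, y_i - \beta_0 - \sum_{k}\beta_k x_{ik} \,\Bigr|
$$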

 

The tradeoff for increased robustness is reduced statistical efficiency: if the error distribution is actually normal, the LAD estimates have larger variance than the OLS estimates. LAD is often a preferred alternative to OLS when outliers are known to be present in the data.

 

Within the model formulation, the only difference between LAD and OLS is in the objective function.

 

The objective function for LAD is shown below, which minimizes the sum of the absolute value of the residuals:

 

jl_1_lad-obj-function.png


 

In a perfect world, we could simply modify the objective function from the OLS formulation, re-run the model, and output the new results. The problem is that LAD regression is more computationally difficult to solve than OLS because the absolute value function introduces non-smoothness into the optimization problem. The objective is not differentiable everywhere, so there is no closed-form solution analogous to the OLS normal equations, and smooth nonlinear solvers often run into convergence issues, iterating without making progress and ultimately failing to find the optimal solution.
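
To see where the non-smoothness comes from, note that the absolute value function has a kink at zero, so its derivative with respect to a residual $r$ is

$$
\frac{d}{dr}\,\lvert r\rvert \;=\;
\begin{cases}
\;\;\,1, & r > 0\\
-1, & r < 0
\end{cases}
\qquad \text{(undefined at } r = 0\text{)},
$$

which means a gradient-based nonlinear solver has no well-defined search direction whenever a residual lands exactly on zero.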

 

But don't take my word for it. Let's try it out.

 

Below are the set, parameter, and decision variable declarations copied over from the OLS formulation in my previous blog, along with the read data statement, which reads the sashelp.baseball data set into the OPTMODEL procedure and populates the sets and parameters accordingly.

proc optmodel;

/* Index sets: players (observations) and independent variables */
set <str> PLAYERS;
set IVARS = /'nHits' 'nBB'/;

/* Parameters: response (y) and predictor values (x) for each player */
num y{PLAYERS};
num x{PLAYERS,IVARS};

/* Populate PLAYERS, y, and x from sashelp.baseball */
read data sashelp.baseball into PLAYERS=[Name] y=nRuns {k in IVARS} <x[Name,k]=col(k)>;

/* Decision variables: intercept and regression coefficients */
var Intercept;
var Beta{IVARS};

 

The new LAD objective function below is the only modification made to the original OLS model from the previous blog. SAS programmers will recognize that the abs() function is being used to represent the absolute value.

 

min Obj = sum{i in PLAYERS} abs(y[i] - (sum{k in IVARS} Beta[k]*x[i,k] + Intercept));

 

The solve statement is specified below, and the create data statement creates an output data set of the observed and predicted values for each player. Recall that when you specify a standalone solve statement (with no solver named), OPTMODEL determines and applies an appropriate solver from its set of optimization solvers, including LP, MILP, NLP, and QP.

 

solve;

create data work.lad_output from [player] = 
  {i in PLAYERS} y pred=(Intercept + sum{k in IVARS} Beta[k]*x[i,k]);

quit;

 

The results below show the NLP solver was called and, for the reasons mentioned above, the iteration limit was reached without arriving at an optimal solution.

 

jl_2_lad-iter-limit.png

 

Well, that didn't work! What do we do now? Well... we increase the iteration limit, of course! The Solution Summary table shows the default maximum iteration limit is 5,000. Let's boost that up to 100,000! To do this, I'll modify the solve statement to call the NLP solver explicitly and set the maxiter= option to 100,000.
 
solve with nlp/maxiter=100000;
 
jl_3_lad-iter-limit-100k.png
jl_4_5_lad-log-100k.png

 

Both the log and the Solution Summary table indicate the model failed to converge. Interestingly, the solver suspended after only 53 iterations, suggesting that simply increasing the number of iterations isn't going to solve this problem.

 

At this point, you may be tempted to throw up your hands, say "to heck with it", and just use OLS instead. But stay with me. There's an easy workaround.
 
jl_4_hands-up-1024x682.png

 

In fact, there's another option within the OPTMODEL procedure that will quickly and easily solve LAD to optimality: the LINEARIZE option of the solve statement (solve linearize).

 

Linearization is a powerful optimization technique used to reformulate nonlinear functions as simpler linear functions, allowing us to solve previously unsolvable problems. By replacing complex nonlinear relationships (e.g., the absolute value function) with equivalent linear formulations, we can apply efficient optimization algorithms to the problem.

 

In other words, linearization simplifies the optimization process by transforming nonlinear objective functions (or constraints) into their linear equivalents, making them amenable to the analytical and numerical techniques required for finding optimal solutions.
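
As a quick illustration of the idea (a textbook example, separate from our LAD model), a minimax objective over linear functions can be replaced by a single auxiliary variable $t$ and one linear constraint per term:

$$
\min_{x}\; \max_{i}\; c_i^{\top}x
\quad\Longleftrightarrow\quad
\min_{x,\,t}\; t \quad \text{subject to} \quad c_i^{\top}x \le t \;\; \text{for all } i.
$$

The reformulated problem is entirely linear, so an LP solver can handle it directly.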

 

The solve linearize statement in the OPTMODEL procedure contains numerous linearization techniques to automatically reformulate nonlinear models into their linear equivalent form. This means that we, as optimization practitioners, do not need to be experts in, or memorize, all of the linearization techniques to take advantage of them.

 

Performing automatic linearization of our LAD model is as simple as modifying the solve statement and re-running the program:
 
solve linearize;
 
The Solution Summary table now shows the linear programming (LP) solver was called using the default Dual Simplex algorithm. For the first time, we see the solution status is Optimal with a solution time of 0.02 seconds.

 

JL_5_LAD-LP-sol-summ.png

 

Printing the decision variables (i.e., parameter estimate values) to the Results Viewer shows the new model estimates.
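
The original code doesn't show the statement that produced this table, but a simple print statement placed before the quit statement would do it (a minimal sketch):

print Intercept Beta;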

 

jl_6_LAD-param-ests.png

 

Below is a comparison to the OLS estimates from the previous blog:
 
             OLS       LAD
Intercept   -4.233    -4.861
nBB          0.3008    0.3077
nHits        0.4299    0.4336
 
Problem solved. The solve linearize statement automatically reformulated the nonlinear LAD model into its linear equivalent and quickly solved it to optimality. The output data set provides us with the observed and predicted values for each player in the sashelp.baseball data set, allowing us to further construct any of the desired plots and model diagnostics of our choosing.
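
For example, a quick observed-versus-predicted scatter plot can be built from the output data set after the OPTMODEL step ends (a minimal sketch; the column names y and pred come from the create data statement above):

proc sgplot data=work.lad_output;
   scatter x=pred y=y;          /* observed runs vs. LAD-predicted runs */
   lineparm x=0 y=0 slope=1;    /* 45-degree reference line; points on it are perfect predictions */
run;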

 

I had originally intended to conclude the blog here. However, I'd argue that half the fun is learning what the solve linearize statement did behind the scenes to allow our LAD model to solve to optimality.

 

Quick side story: I used to love visiting magic shops as a kid. I was always enamored with the seemingly impossible tricks the owner performed to entice us to buy them. They would perform a trick for you, but if you wanted to learn how it worked, you had to purchase it; only then would they reveal the secret(s) and teach you to perform it yourself. Even now, I love visiting magic shops, and The Prestige remains one of my all-time favorite movies.

 

jl_7_magician2.png

 

However, unlike magic tricks, the solve linearize statement has nothing to do with sleight of hand, hidden compartments, or smoke and mirrors. In fact, if we knew the linearization technique off the top of our heads, we could program the reformulated model directly inside the OPTMODEL procedure and bypass the solve linearize statement altogether. But for now, let's assume we don't.

 

In order to reverse engineer what this one instance of solve linearize did for us automatically, we first need to determine the new model formulation that was used. To do this, you can specify expand/linearize; directly after the solve linearize; statement:

solve linearize;
expand / linearize;

 

This will write the reformulated linear model to the Results Viewer.

 

** Pro Tip **: When doing this, first reduce the size of your problem so the expanded output remains readable.

 

To keep the expand/linearize; output manageable, the read data statement is modified to include only players from the Atlanta Braves with last names beginning with H or M. This effectively reduces the set of players from 322 to 5. The put statement writes each of the five elements in the PLAYERS set to the log.

 

read data sashelp.baseball(where=(Team='Atlanta' and substr(Name,1,1) in ('H','M'))) 
  into PLAYERS=[Name] y=nRuns {k in IVARS} <x[Name,k]=col(k)>;

put PLAYERS=;

 

jl_8_atlanta-set-5.png

 

 

Running the solve linearize statement now, followed by the expand / linearize; statement, on the reduced set of players shows the linear reformulation that the OPTMODEL procedure performed behind the scenes.
 
jl_9_expand-linearize.png

 

The expand/linearize output shows that the OPTMODEL procedure added a new decision variable for each player: _ADDED_VAR_[1], [2], ... , [5]. At optimality, each of these equals the absolute value of that player's residual.

 

Two constraints per player are also added to the model, _ADDED_CON_[1], [2], ... , [10]. They serve as lower bounds on the added decision variables: one handles the case where the observed value is greater than the predicted value, and the other handles the case where it is less.

 

Let's now replicate this new model formulation using the OPTMODEL procedure.

 

First, I'll create a new set of decision variables, Z, one for each player, corresponding to _ADDED_VAR_[1], [2], ... , [5].
 
var Z{PLAYERS};

According to the expand/linearize; output, the new objective function minimizes the sum of Z across all elements in the PLAYERS set.
 
min Obj = sum{i in PLAYERS} Z[i];

For the two sets of constraints in the expand/linearize; output, notice that for each player the linear combination of decision variables resides entirely on the left-hand side of the inequality. This is done intentionally, but from an applied modeling perspective, we can rewrite these constraints any way we'd like, so long as they're algebraically equivalent to the ones above.

 

con y_larger{i in PLAYERS}:
  Z[i] >= y[i] - (sum{k in IVARS} Beta[k]*x[i,k] + Intercept);

con yhat_larger{i in PLAYERS}:
  Z[i] >= - y[i] + (sum{k in IVARS} Beta[k]*x[i,k] + Intercept);

 

Take the first player in the reduced set, Bob Horner, as an example. Bob's observed value (nRuns) is 70, denoted as y[i] in the OPTMODEL code. Suppose the model predicts a value for Bob less than 70. Let's say it predicts 65. The residual would be +5 (i.e., 70 - 65).

 

In the first constraint, Z[i] >= 5, and in the second constraint, Z[i] >= -5. The smallest value Z[i] is allowed to take while satisfying both constraints is 5, which is the absolute value of Bob's residual.

 

If the model instead predicts a value for Bob greater than 70, say 80, then his residual is -10 (i.e., 70 - 80). In the first constraint, Z[i] >= -10, and in the second constraint, Z[i] >= 10. Again, the smallest value Z[i] is allowed to take while satisfying both constraints is 10, which is the absolute value of Bob's residual.

 

This new model formulation is the linear equivalent to the original formulation containing the nonlinear objective function, and with the help of the expand/linearize; statement, we've effectively reverse engineered what the solve linearize statement did automatically behind the scenes for us.

 

I hope you found this blog useful and intuitive, and I encourage you to explore all of the automatic linearization techniques packed into the solve linearize; statement. Others include minimax and maximin objective functions, the product of two binary variables, the product of a binary and a bounded variable, and more.
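
As one example of those other techniques, the product of two binary variables, $w = x \cdot y$, is nonlinear but has a well-known linear reformulation (this is the standard textbook version; I haven't checked whether solve linearize emits exactly this form):

$$
w \le x,\qquad w \le y,\qquad w \ge x + y - 1,\qquad w \ge 0,\qquad x, y \in \{0,1\},
$$

which forces $w = 1$ exactly when both $x$ and $y$ equal 1.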

 

See below for the complete list of programs discussed above and try them out for yourself!

 

--------------------------------------------------------------------------------------------------------------------------
 
Nonlinear (original) model formulation with and without the solve linearize; and expand/linearize; statements:

proc optmodel;

set <str> PLAYERS;
set IVARS = /'nHits' 'nBB'/;

num y{PLAYERS};
num x{PLAYERS,IVARS};

read data sashelp.baseball into PLAYERS=[Name] 
  y=nRuns {k in IVARS} <x[Name,k]=col(k)>;

*read data sashelp.baseball(where=(Team='Atlanta' and substr(Name,1,1) in ('H','M'))) 
  into PLAYERS=[Name] y=nRuns {k in IVARS} <x[Name,k]=col(k)>;

var Intercept;
var Beta{IVARS};

min Obj = sum{i in PLAYERS} abs(y[i] - (sum{k in IVARS} Beta[k]*x[i,k] + Intercept));
 
solve;
*solve with nlp/maxiter=100000;
*solve linearize;
*expand / linearize;

create data work.lad_output from [player] = 
  {i in PLAYERS} y pred=(Intercept + sum{k in IVARS} Beta[k]*x[i,k]);

quit;
 
Linear formulation:
 
proc optmodel;

set <str> PLAYERS;
set IVARS = /'nHits' 'nBB'/;

num y{PLAYERS};
num x{PLAYERS,IVARS};

read data sashelp.baseball into PLAYERS=[Name] 
  y=nRuns {k in IVARS} <x[Name,k]=col(k)>;

var Intercept;
var Beta{IVARS};

var Z{PLAYERS};

min Obj = sum{i in PLAYERS} Z[i];

con y_larger{i in PLAYERS}:
  Z[i] >= y[i] - (sum{k in IVARS} Beta[k]*x[i,k] + Intercept);

con yhat_larger{i in PLAYERS}:
  Z[i] >= - y[i] + (sum{k in IVARS} Beta[k]*x[i,k] + Intercept);

solve;

create data work.output from [player] = 
  {i in PLAYERS} y z pred=(Intercept + sum{k in IVARS} Beta[k]*x[i,k]);

quit;