
Solving Linear Regression Models via Mathematical Programming: Ordinary Least Squares

Started ‎04-22-2024 by
Modified ‎04-22-2024 by

 

The purpose of this series is to demystify the optimization model-building process by building several familiar linear regression models in SAS using the OPTMODEL procedure. This post is geared toward data scientists who are familiar with linear regression models and interested in learning more about mathematical programming.

 

The easiest way to learn a new data science topic (e.g., mathematical programming), in my opinion, is to anchor it to an existing topic that you're already familiar with, so this post assumes you understand the fundamentals of linear regression.

 

The data come from the sashelp.baseball data set, which contains 322 players. The dependent variable (y) is nRuns, the number of runs a player scored during the season. The two independent variables (x1, x2) are nHits and nBB: the number of hits and walks, respectively, for each player during the season.

 

The screenshot below shows the raw data for the first 15 players.

 

01_JL_raw-data.png

 


 

There are numerous procedures and techniques for running ordinary least squares (OLS) regression in SAS. If you prefer to work in matrix algebra, the IML procedure is an excellent resource. Other SAS procedures for OLS include REG, GLMSELECT, and GLM. The REG procedure code and output are below.

 

proc reg data=sashelp.baseball;
  model nRuns = nHits nBB;
run;
quit;
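For readers who want to sanity-check the same kind of model outside SAS, an OLS fit can be reproduced with numpy's least-squares routine. The sashelp.baseball data ships with SAS, so the sketch below uses made-up stand-in data; the variable names mirror the real columns, but the numbers are synthetic.

```python
import numpy as np

# Synthetic stand-in for sashelp.baseball (the real data ships with SAS).
rng = np.random.default_rng(0)
n = 322
nHits = rng.integers(30, 240, size=n).astype(float)
nBB = rng.integers(0, 120, size=n).astype(float)
nRuns = 5.0 + 0.4 * nHits + 0.3 * nBB + rng.normal(0, 8, size=n)

# Design matrix with an intercept column, then solve the least-squares problem.
X = np.column_stack([np.ones(n), nHits, nBB])
beta, _, _, _ = np.linalg.lstsq(X, nRuns, rcond=None)
print(beta)  # [intercept, coefficient on nHits, coefficient on nBB]
```

Because the data here are invented, the fitted coefficients won't match the REG output above; the point is only that the same two-predictor model structure carries over.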

 

The general functional form of the model is provided below.

 

02_JL_general-functional-form.png

 

Even with only two independent variables, the linear predictor appears to do an adequate job fitting the data.

 

03_JL_param-est.png

 

04_JL_model-fit.png

 

From the Parameter Estimate column above, the linear predictor is constructed from the general functional form:

 

05_JL_linear-predictor.png

 

Recall that in OLS, the objective is to find parameter estimate values that minimize the residual sum of squares.

 

06_JL_ols-obj.png

 

Or written the "long" way:

 

07_JL_functional-form-long.png

 

Substituting the actual parameter estimates from the model for the Betas in the equation above, the output from the REG procedure effectively tells us this function is minimized with the following parameter estimate values:

 

08_JL_long-way-param-est.png
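In matrix form, the RSS-minimizing betas satisfy the normal equations X'Xβ = X'y, and you can verify numerically that the solution of those equations really is a minimizer: nudging it in any coordinate increases the residual sum of squares. A quick sketch with small made-up numbers (not the baseball data):

```python
import numpy as np

# Tiny made-up example: 5 observations, intercept plus 2 predictors.
X = np.array([[1, 2., 1.], [1, 3., 0.], [1, 5., 2.], [1, 7., 1.], [1, 9., 3.]])
y = np.array([3., 4., 8., 10., 14.])

# Closed form: minimizing the residual sum of squares gives the normal equations.
beta = np.linalg.solve(X.T @ X, X.T @ y)

def rss(b):
    r = y - X @ b
    return r @ r

# Any perturbation of beta increases the RSS, confirming a minimum.
print(rss(beta) <= rss(beta + np.array([0.1, 0.0, 0.0])))  # True
```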

 

If you're still tracking with me to this point, we have everything we need to reconstruct the OLS model as a quadratic programming problem within the OPTMODEL procedure.

 

A mathematical programming problem can be broken down into basic components that are used to solve the problem. It's similar to baking a cake, where the cake can be broken down into the various ingredients used to make it. Similarly, a mathematical programming problem can be broken into its own "ingredients" used to formulate it.

 

The first "ingredient" is sets. A set is a collection of distinct objects (or elements) that share common characteristics. For this problem, we have two sets: a set of observations (i.e., players), and a set of independent variables (i.e., nHits and nBB).

 

Inside of the OPTMODEL procedure, sets can be read from SAS data sets or defined explicitly from within the procedure. For example, the code below declares a character set of PLAYERS. When you hear the word "declare", think "create a memory space for". In OPTMODEL, when the data come from external data sources, you must declare sets before populating them with values.

 

proc optmodel;

set <str> PLAYERS;

 

The Name column from sashelp.baseball above contains one distinct player per row. In mathematical programming, each distinct player name is an element in the set named PLAYERS. The set name is user-defined, and "PLAYERS" seems like a reasonable name for the elements it will contain. Recall there are 322 players in the sashelp.baseball data set, meaning there will be 322 elements in the PLAYERS set once the data have been read in. The "str" within angle brackets indicates that PLAYERS is a character set, since the Name column is character. If it were a numeric set, you would instead specify "num" within angle brackets.

 

The second set is a space-delimited list of independent variables called IVARS. Unlike the PLAYERS set, each element in IVARS is explicitly defined between a pair of forward slashes "/". When explicitly defining elements within a set, "str" or "num" within angle brackets is not required, as the OPTMODEL procedure is able to infer the type from the hard-keyed elements.

 

set IVARS = /'nHits' 'nBB'/;

 

The next ingredient is parameters. Do not confuse parameters with parameter estimates! A parameter in a mathematical programming model is simply a known constant that is used throughout the model. In other words, what input data are we using to build the model? We have our known target variable nRuns for each player, along with our known independent variables for each player, nHits and nBB.

 

We'll hold these values in two numeric parameter arrays inside the OPTMODEL procedure. To do this, we'll use the num statement.

 

The first parameter, y, corresponds to the target variable nRuns, and is indexed by PLAYERS using curly brackets { }. Think of it like an Nx1 vector, where y[1] is the number of runs scored by the first player in the PLAYERS set, y[2] is the number of runs scored by the second player in the PLAYERS set, and so on.

 

num y{PLAYERS};

 

The next parameter, x, is indexed by both the PLAYERS set and the IVARS set. Think of it like a 322x2 matrix with players down the rows and the independent variables along the columns.

 

num x{PLAYERS,IVARS};

 

The read data statement reads the sashelp.baseball data set into the OPTMODEL procedure and populates the PLAYERS set and parameters accordingly.

 

read data sashelp.baseball into PLAYERS=[Name] y=nRuns {k in IVARS} <x[Name,k]=col(k)>;
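If it helps to picture what the read data statement produces, the populated sets and parameters behave like keyed arrays. A rough Python analogy for just two players (illustrative values, not pulled from a live SAS session):

```python
# Miniature analogy of the populated OPTMODEL sets and parameters.
PLAYERS = ["Allanson, Andy", "Ashby, Alan"]   # set <str> PLAYERS
IVARS = ["nHits", "nBB"]                      # set IVARS

# y{PLAYERS}: one target value (nRuns) per player
y = {"Allanson, Andy": 30, "Ashby, Alan": 24}

# x{PLAYERS,IVARS}: one value per (player, variable) pair
x = {("Allanson, Andy", "nHits"): 66, ("Allanson, Andy", "nBB"): 14,
     ("Ashby, Alan", "nHits"): 81,   ("Ashby, Alan", "nBB"): 39}

print(x[("Ashby, Alan", "nBB")])  # 39 -- analogous to x['Ashby, Alan','nBB'] in OPTMODEL
```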

 

For a sanity check, you can now optionally write the elements in the PLAYERS set to the log by typing:

 

put PLAYERS=;

 

Additionally, you have the option to print the parameters to the Results Viewer.

 

print y;
print x;

 

The third ingredient in a mathematical programming model is the decision variables. The decision variables are the unknown variables that you want to find the optimal values of. In our OLS regression model, these are the betas.

 

In mathematical programming, decision variables are allowed to take on one of three types: continuous, integer, or binary. For continuous and integer decision variables, user-defined lower and/or upper bounds can be applied. For OLS, the betas are unbounded and allowed to take fractional values.

 

In this example, we have three decision variables: B0 (intercept), B1, and B2.

 

Decision variables are declared using the var statement in OPTMODEL. The first decision variable, Intercept, corresponds to B0 in the functional form above. The set of decision variables Beta is indexed by the IVARS set, meaning the OPTMODEL procedure will create as many Beta decision variables as there are elements in the IVARS set. Since there are only two elements in IVARS, "Beta{IVARS}" will create one decision variable for nHits and one for nBB.

 

var Intercept;
var Beta{IVARS};

 

The fourth ingredient in a mathematical programming model is the objective function. In the world of machine learning, it's sometimes referred to as the loss function, implying a minimization problem, but the objective function in a mathematical programming model can be either a maximization or a minimization function.

 

Using the sum{ } aggregation operator to represent big Sigma, the objective function in the OPTMODEL procedure mimics the functional form below, minimizing the residual sum of squares.

 

09_JL_functional-form-long.png

 

min Obj = sum{i in PLAYERS} (y[i] - (Intercept + sum{k in IVARS} Beta[k]*x[i,k]))**2; 

 

To turn it into a maximization problem, you could simply multiply the objective function above by -1.

 

max Obj = -1*(sum{i in PLAYERS} (y[i] - (Intercept + sum{k in IVARS} Beta[k]*x[i,k]))**2); 

 

The last ingredient in a mathematical programming model is constraints. Constraints are specific conditions or limitations that restrict the feasible solutions in an optimization problem. There are no constraints in the OLS model. In mathematical programming, this is called an unconstrained model.

 

The term "linear" regression means the model is linear with respect to the Betas, or decision variables. In statistics you may have heard this called "linear in the parameters", which also refers to the Betas; but as mentioned above, the term "parameter" can be confusing since it means different things to different audiences.

 

To solve this using mathematical programming, we need to use a quadratic programming solver (not a linear programming solver), due to the squared term in the objective function.
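To see why the problem is quadratic rather than linear, note that expanding the squared residuals gives an objective whose Hessian with respect to the betas is the constant matrix 2X'X, which is positive semidefinite: the hallmark of a convex quadratic program. A quick numerical check with made-up numbers:

```python
import numpy as np

# Small illustrative design matrix; the Hessian of the RSS objective is 2 * X'X,
# a constant positive-semidefinite matrix -- so this is a convex QP.
X = np.array([[1, 2., 1.], [1, 3., 0.], [1, 5., 2.]])
H = 2 * X.T @ X
eigvals = np.linalg.eigvalsh(H)
print(np.all(eigvals >= -1e-9))  # True: convex, so any local minimum is global
```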

 

For most optimization models, you can simply specify "solve;" and the OPTMODEL procedure will automatically determine and apply the appropriate solver for your model.

 

solve;

 

The remaining code below prints the optimal decision variable values to the Results Viewer, and outputs a SAS data set called work.output containing each player, along with the observed and predicted value for the target variable nRuns.

 

print Intercept;
print Beta;

create data work.output from [player] = 
  {i in PLAYERS} y pred=(Intercept + sum{k in IVARS} Beta[k]*x[i,k]);

quit;

 

Among other things, the Solution Summary table confirms the QP solver was used, specifically the Interior Point algorithm. The Solution Status indicates the solution is optimal, the objective function is minimized at 25,853.92, and the solution time was 0.01 seconds.

 

10_JL_sol-summary.png

 

The optimal decision variable values (i.e., the Betas) match the output from the REG procedure above.

 

11_JL_betas.png

 

Lastly, the first 15 observations from the work.output table are shown below, including both the observed and predicted values for the target variable nRuns.

 

12_JL_output-table.png

 

In a future post, we'll use the OPTMODEL procedure to solve another linear programming model called Least Absolute Deviation, or LAD regression. It's similar to OLS, but instead of minimizing the residual sum of squares, the objective function in LAD minimizes the sum of the absolute deviations between the observed and predicted values. LAD is considered a more robust regression model since residuals are penalized linearly (as opposed to quadratically in OLS).
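A one-dimensional preview of that robustness: in an intercept-only model, minimizing squared residuals recovers the mean, while minimizing absolute residuals recovers the median. The sketch below evaluates both objectives over a grid for a small made-up sample containing one outlier:

```python
import numpy as np

# Intercept-only models: OLS picks the mean, LAD picks the median.
y = np.array([10., 11., 12., 13., 100.])  # one outlier

grid = np.linspace(0, 110, 100001)
rss = ((y[:, None] - grid[None, :]) ** 2).sum(axis=0)  # sum of squared residuals
sad = np.abs(y[:, None] - grid[None, :]).sum(axis=0)   # sum of absolute deviations

print(grid[rss.argmin()])  # ~29.2, the mean, dragged toward the outlier
print(grid[sad.argmin()])  # ~12, the median, unaffected by the outlier
```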

 

In summary, the purpose of this post was to show how OLS can be constructed as a mathematical programming model. The OPTMODEL procedure is classified as an algebraic modeling language, which allows us to build and solve optimization models of varying size and complexity. If you're interested in building mathematical programming models with the OPTMODEL procedure but aren't sure where to start, hopefully this example provided a gentle introduction to both the modeling process and the programming syntax. For another introductory example, check out how you can optimize your fantasy football lineup using mathematical programming.

 

Full code below:

 

proc optmodel; 
 set <str> PLAYERS;
 set IVARS = /'nHits' 'nBB'/;

 num y{PLAYERS};
 num x{PLAYERS,IVARS};

 read data sashelp.baseball into PLAYERS=[Name] y=nRuns {k in IVARS} <x[Name,k]=col(k)>;

 var Intercept; 
 var Beta{IVARS};

 min Obj = sum{i in PLAYERS} (y[i] - (Intercept + sum{k in IVARS} Beta[k]*x[i,k]))**2;

 solve;

 print Intercept; 
 print Beta;

 create data work.output from [player] = 
   {i in PLAYERS} y pred=(Intercept + sum{k in IVARS} Beta[k]*x[i,k]); 
quit;

 

 

Find more articles from SAS Global Enablement and Learning here.
