Big Ideas in Machine Learning Modeling: The Bias-Variance Trade-Off
As you probably know, there is abundant need for big data analysis and machine learning modeling these days. One concept that is important in learning big data analysis is the bias-variance trade-off. In this blog, I’ll describe and illustrate the bias-variance trade-off and explain how understanding this concept can simplify your learning experience by giving you a framework to view hyperparameter tuning in your machine learning models.
Most users of machine learning models are focused on predictive modeling. Predictive models, also called supervised learning models, involve a target variable (the focus of your business or research) and inputs (the predictors that hopefully inform you about the target). Supervised machine learning models identify input-target associations in order to predict future values of the target.
The focus of predictive modeling is to make the most accurate predictions possible. This is a different goal from statistical inference, which usually involves explaining the input-target relationships. For predictive modelers, explaining the relationships among variables is of secondary importance at best. To assess the accuracy of the predictions, two common measures are the mean squared error (MSE) and the average squared error (ASE). Typically, we use MSE for measuring error in the training data and ASE when applying a model to a data set, such as validation data, that was not involved in creating the model. Whichever metric we use, we can view it as the prediction error that we want to minimize.
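For concreteness, here is a minimal sketch of computing ASE on a scored validation data set. The data set name VALID_SCORED and the predicted-value variable P_Y are hypothetical names chosen for illustration, not from this article:

/* A minimal sketch: ASE is the mean squared difference between */
/* actual (y) and predicted (p_y) values on the validation data. */
data ase_calc;
   set valid_scored;          /* hypothetical scored validation set */
   sq_err = (y - p_y)**2;     /* squared error for each observation */
run;

proc means data=ase_calc mean;
   var sq_err;                /* the mean of sq_err is the ASE */
run;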
So, all else being equal, a predictive model with lowest error can be considered a better model. How do we reduce the prediction error in our machine learning models? By optimizing the bias-variance trade-off. Both bias and variance contribute to the prediction error.
What is bias? In the context of predictions, a model makes biased predictions when the predicted values systematically underestimate or overestimate the actual values. It is the difference between the expected value of an estimate and its true value. Bias cannot be measured in real data because it requires knowing the true values which we are trying to learn in the first place. Models that are too simple to capture the true patterns in the data have high bias. Typically, we can reduce bias by increasing the complexity of the model (more parameters, higher order terms, etc.).
What is variance? In the context of predictions, a model has high variance when the predicted values are highly variable when the model is applied to different samples from the same population. You can also think of variance as the sensitivity of the predictions to the particular data set used to train the model. Variance cannot be measured from fitting the model to a single data set as is typically done, although it could be measured by fitting it to several bootstrap samples of the same data. Note that high variance here does not imply anything about making correct predictions. It is about the variability in the predictions, not how accurate they are. Models that are overly complex for the data have high variance. “Overly complex” here means that the model describes aspects of the observed data that are not representative of the population and are unique to the observed data. That is, overly complex models model the “noise” in addition to the “signal”. A model with higher complexity is more flexible and can fit the data with lower bias. But because of the increased flexibility, the model’s predictions can differ dramatically between data sets.
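Although you would rarely do this in practice, the bootstrap idea is easy to sketch in SAS. The code below is my illustration, not from the article; it assumes a data set named SAMPLE with variables X and Y (like the one created later in this post), draws 100 bootstrap replicates with PROC SURVEYSELECT, and fits the same regression to each replicate. The spread of the fitted coefficients across replicates gives a sense of the model’s variance:

/* Draw 100 bootstrap samples (unrestricted random sampling,    */
/* i.e., sampling with replacement, each the size of the data). */
proc surveyselect data=sample out=boot seed=123
                  method=urs samprate=1 reps=100 outhits;
run;

/* Fit the same model to every replicate. The variability of the */
/* estimates in FITS across replicates reflects model variance.  */
proc reg data=boot outest=fits noprint;
   by replicate;
   model y = X;
run;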
So high bias is associated with models that are too simple to describe the true patterns in the population and high variance is associated with models that are unnecessarily complex. You can imagine that as you increase model complexity from too simple to too complex, the model bias decreases and variance increases. There is a trade-off, where lowering one increases the other.
Prediction error increases with both the bias and variance:
MSE = bias² + variance + irreducible error

The irreducible error is the component of the MSE that is unrelated to model complexity, so it won’t be considered further here. Finding the model with the lowest error means finding the right amount of complexity, that is, the amounts of bias and variance that minimize the error. You won’t know that amount without some trial and error, and the trial and error should be guided by measuring model performance on a holdout validation data set. It turns out that adding a small amount of bias often produces a large reduction in variance and an overall decrease in prediction error.
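For readers who want the formal statement behind this equation, here is the standard textbook decomposition for a target y = f(x) + ε with noise variance σ² and a fitted model f̂, written in my notation rather than taken from this article:

\mathbb{E}\big[(y - \hat{f}(x))^2\big]
   = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
   + \underbrace{\mathbb{E}\Big[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\Big]}_{\text{variance}}
   + \underbrace{\sigma^2}_{\text{irreducible error}}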
Here are illustrations of models with high bias or high variance. We have a population with the relationship Y = 7 + 0.3X + 40X². I’ll draw 3 samples from this population and fit models that are too simple (high bias) and models that are too complex (high variance). Here’s how I generated the data for the first sample:
data sample;
   call streaminit(123);       /* seed the RAND random number generator */
   sample="sample 1";
   do i=1 to 50;
      X=i;
      y=7 + 0.3*X + 40*X*X + 5000*rand("normal");   /* signal plus noise */
      true_y=7 + 0.3*X + 40*X*X;                    /* the true relationship */
      output;
   end;
run;
The CALL STREAMINIT routine specifies a seed that starts the random number generator used by the RAND function. RAND(“normal”) generates a random value from a normal distribution with a mean of 0 and SD=1. Here’s a plot of the true relationship:
[Figure: a plot of the true relationship Y = 7 + 0.3X + 40X²]
When we approximate this true curvilinear relationship with a linear regression model, we are using a model that is too simple and the predictions are biased. At X-values near zero and near fifty, the predicted values (shown by the red line) are too low—they underestimate the true values. At X-values near 25, the linear regression overestimates Y.
What happens when we use a model that is too complex? I’ll approximate this curvilinear relationship with a spline function. Splines are flexible, piecewise polynomial functions joined together smoothly at join points called knots. I’ll use a spline model based on 10th degree polynomials. The predictions are no longer biased too low at the extremes and too high near the center. But this spline model is overly complex, and it captures the noise in the data. Each little “bump” in the spline is a pattern found in this specific sample and not likely to be found in the next sample drawn from the same population. The spline model predictions have low bias, but they have high variance.
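To reproduce both fits on a single sample, a PROC SGPLOT step along these lines works. This is my sketch using the SAMPLE data set created above, not code from the original article:

/* Overlay the high-bias (linear) and high-variance (degree-10 */
/* spline) fits, plus the true curve, on the first sample.     */
proc sgplot data=sample;
   scatter x=X y=y;
   reg x=X y=y / nomarkers lineattrs=(color=red);       /* too simple  */
   pbspline x=X y=y / nomarkers degree=10
                      lineattrs=(color=green);          /* too complex */
   series x=X y=true_y / lineattrs=(color=black);       /* true curve  */
run;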
Let’s see a few more high bias and high variance models fit to other samples from the same population. To generate additional samples, I changed the seed in the CALL STREAMINIT routine in the SAS code above and concatenated the 3 samples into a data set called “combined”.
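A sketch of that concatenation step might look like the following. The data set names SAMPLE2 and SAMPLE3 and the example seeds are my assumptions, chosen to match the BIAS_VAR.COMBINED data set referenced below:

/* Assumes SAMPLE2 and SAMPLE3 were created by rerunning the DATA */
/* step above with different seeds (for example, 456 and 789) and */
/* with sample="sample 2" and sample="sample 3", respectively.    */
data bias_var.combined;
   set sample sample2 sample3;
run;

I used the following PROC SGPANEL code to visualize models fit to these samples: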
title "high bias";
proc sgpanel data=bias_var.combined;
panelby sample / novarname columns=3;
scatter x=X y=y /;
reg x=X y=y / nomarkers lineattrs=(color=red);
run;
title "high variance";
proc sgpanel data=bias_var.combined;
panelby sample / novarname columns=3;
scatter x=X y=y /;
pbspline x=X y=y / nomarkers degree=10 lineattrs=(color=green);
run;
We can see the linear regression is biased. In all three samples, the average predictions differ from the true values, so this error is not a fluke. It is systematic, caused by using a model that is too simple for these data. Using a model that is too simple for the data is referred to as underfitting.
The 3 spline models demonstrate high variance. At first glance the curves might seem similar, but look at each peak and valley across the samples: they don’t match up. That is, the predictions vary considerably across different samples. Models like these can be described as overfitting the data. They are so specific to the data they were trained on that they will perform poorly at predicting new data. On average the bias is low, but in practice we will have only one sample for modeling, and we may end up with the sample, and the model, whose predictions are far from the average; there is no way of telling. Again, variance is the sensitivity of the predicted values to the particular data set used to train the model.
So why does this matter? From a practical point of view, can’t we just focus on picking the model with the lowest prediction error on a validation data set and ignore the bias-variance trade-off concept? Well, yes. But understanding bias and variance gives you a conceptual framework that helps when working with machine learning models. One situation where this comes up is hyperparameter tuning. Hyperparameters are the settings for a machine learning model that are established before the model is fit to the data. They set the guidelines for how the model goes about producing its predictions.
Many of the options in SAS Viya supervised machine learning procedures such as PROC BN, PROC FOREST, PROC GRADBOOST, PROC NNET, and PROC SVMACHINE are hyperparameters that adjust the complexity of the model and thereby the balance of bias and variance. The same hyperparameters can appear in several models (e.g., L1 regularization, also called LASSO). Adjusting these helps you find a model that avoids overfitting or underfitting. For example, a higher L1/LASSO penalty makes a model less sensitive to the training data, which reduces the amount of noise being modeled and lowers the chance of overfitting. Understanding bias and variance lets you view a dozen different hyperparameters as so many ways of making a model err on the side of too simple or too complex.
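As a concrete illustration, here is a hypothetical PROC GRADBOOST call with a few complexity-related hyperparameters set explicitly. The CAS table MYCAS.TRAIN and the variable names are assumptions made for this sketch, not from the article:

/* Gradient boosting with hyperparameters that govern complexity.  */
/* Raising LASSO= (L1 regularization) or lowering MAXDEPTH= pushes */
/* the model toward the simpler, higher-bias side; deeper trees    */
/* and more of them push toward the more flexible, higher-variance */
/* side.                                                           */
proc gradboost data=mycas.train
               ntrees=100
               maxdepth=4
               learningrate=0.1
               lasso=0.5;
   target y / level=interval;
   input X / level=interval;
run;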
To learn more about the important concepts in machine learning modeling, try the SAS class “Statistics You Need to Know for Machine Learning”. This course covers the bias-variance trade-off as well as many other important concepts that will help you on your journey as a machine learning modeler. It is also the best preparation for students interested in earning the SAS credential “SAS Certified Associate: Applied Statistics for Machine Learning”.
See you at the next SAS class!
Links:
Course: Statistics You Need to Know for Machine Learning (sas.com)
SAS Certified Associate: Applied Statistics for Machine Learning
Find more articles from SAS Global Enablement and Learning here.