There is no universally best machine learning model for every dataset or prediction problem, and for a given task it is often difficult to determine in advance which model will perform best. The Super Learner (SL) addresses this algorithm selection challenge by incorporating a wide range of user-specified models, from traditional parametric regressions to modern nonparametric machine learning methods such as neural networks, support vector machines, and decision trees. Rather than relying on a single "best" model, the Super Learner leverages the strengths of a diverse set of algorithms.
The Super Learner builds a predictive model by combining the predictions from multiple individual models, known as base learners. It evaluates these base learners and systematically chooses the best combination to enhance predictive accuracy. In short, the Super Learner is an ensemble method that combines heterogeneous base learners and uses K-fold cross-validation to assign an optimal weight to each one. Base learners with better out-of-sample performance tend to receive larger coefficients and therefore contribute more to the final predictions.
According to optimality theory, the Super Learner is guaranteed to perform at least as well as the best-performing base learner in the ensemble when the sample size is large, with performance measured using a bounded loss function (Van der Laan, Polley, and Hubbard, 2007). Super Learning is also referred to as stacked regression, stacked generalization, or weighted ensemble by different communities within statistics and data science.
Given an input dataset with a set of predictor variables x and a response variable y, the Super Learner model uses a two-layer architecture to model the relationship between the predictors x and y. The first layer consists of a library of individual base learner models:
fl(x | βl) for l ∈ {1, …, L}, where β1, β2, …, βL are the model parameters for base learners 1 through L.
The second layer contains the meta-learner model which is a function of the individual base learners in the first layer: m(f1(x∣β1), f2(x∣β2),…,fL(x∣βL)∣α), where α is the vector of meta-learner model parameters, also known as the super learner model coefficients.
Training a Super Learner involves estimating the parameters of both the base learners and the meta-learner.
Step 1: Specify the Base Learner Models (Library)
Choose a diverse set of models (e.g., linear/logistic regression, decision tree, forest, SVM, etc.). These models are called base learners and together form the model library. The parameters of each base learner are estimated by fitting the model to the entire input data set.
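To make Step 1 concrete, here is a minimal sketch in Python using scikit-learn; the toy data, the contents of the library, and all variable names are illustrative choices, not part of the algorithm's specification.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR

# Toy data standing in for the input data set (X, y).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = X[:, 0] - 2 * X[:, 1] ** 2 + rng.normal(size=500)

# The model library: a diverse set of base learners.
library = {
    "ols": LinearRegression(),
    "tree": DecisionTreeRegressor(max_depth=4, random_state=0),
    "forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "svm": SVR(),
}

# Fit every base learner to the entire input data set (reused in Step 5).
full_data_fits = {name: clone(est).fit(X, y) for name, est in library.items()}
```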
The coefficients of the Super Learner model, the weights the meta-learner assigns to each base learner, are then estimated through the following series of steps:
Step 2: Perform K-Fold Cross-Validation
(i) Split the data into K equal-sized blocks (folds). (ii) For each block, train every base learner on the remaining K − 1 blocks. (iii) Use each of these fits to predict the responses for the held-out block. Generating predictions for each fold ultimately results in one out-of-sample prediction from every base learner for each observation in the data set, as sketched below.
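Continuing the sketch above, one way to generate these out-of-sample predictions uses scikit-learn's KFold splitter (K = 10 here is an arbitrary choice):

```python
from sklearn.base import clone
from sklearn.model_selection import KFold

K = 10
cv_preds = {name: np.empty_like(y) for name in library}

for train_idx, test_idx in KFold(n_splits=K, shuffle=True, random_state=0).split(X):
    for name, est in library.items():
        # (ii) Train on the K-1 remaining blocks; (iii) predict the held-out block.
        fit = clone(est).fit(X[train_idx], y[train_idx])
        cv_preds[name][test_idx] = fit.predict(X[test_idx])

# Each observation now has exactly one out-of-sample prediction per base learner.
```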
Step 3: Create the Meta-level Data set
This is the core step of the Super Learner algorithm, where a meta-learner combines the outputs from all base learners to form a new, improved model. With cross-validated (CV) predictions now available for each observation in the dataset, the next step is to integrate these predictions back into the main data set. This allows us to evaluate and minimize a final loss function of interest, comparing the true outcomes with the corresponding CV predictions, as part of optimizing the overall predictive model. In K-fold cross-validation schemes, the meta-level data set contains the same number of observations as the input data set.
In both the meta-level data set and the input data set, the outcome variable is the same, denoted Y, but the input features differ. Instead of using the original predictors X, the meta-level data set uses meta-level covariates, referred to as Ŷ (Y-hat). These values of Ŷ are the cross-validated predictions generated by models trained on the training folds, using the original input features X. For example, ŷ1-model1 represents the predictions generated by model1 on the data from block 1, ŷk-model1 represents the predictions generated by model1 on the data from block k, and so on.
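Continuing the same sketch, assembling the meta-level data set amounts to stacking the CV predictions column by column alongside the unchanged outcome:

```python
# Columns of Z play the role of y-hat_model1, ..., y-hat_modelL;
# the outcome y is carried over from the input data set unchanged.
Z = np.column_stack([cv_preds[name] for name in library])  # shape (n, L)
```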
Step 4: Train the Meta-Learner
Fit a regression of the observed responses (Y) on the out-of-sample predictions obtained in step 2(iii) using data from all blocks. The meta-learner can be mathematically expressed as follows:
E[Y | ŷmodel1, …, ŷmodelL] = α1(ŷmodel1) + α2(ŷmodel2) + … + αL(ŷmodelL), where ŷmodel1 through ŷmodelL denote the cross-validated predicted outcomes from model1 through modelL, respectively, and Y denotes the observed response.
To determine the contribution of each candidate algorithm to the final Super Learner prediction, we apply non-negative least squares regression. This involves regressing the true outcomes on the predicted values, without an intercept, while enforcing the constraints that the coefficients remain non-negative and sum to one, i.e., α1 + α2 + … + αL = 1 and αl ≥ 0 for all l ∈ {1, 2, …, L}. It is theoretically beneficial to restrict the coefficient search space to a convex combination of the base learners.
For continuous outcomes, the Super Learner estimates the coefficients α by minimizing the cross-validated empirical risk, typically using the squared loss function. Solving this risk minimization problem under the convex constraints, where the coefficients are non-negative and sum to one, is the default meta-learning approach for continuous response variables. For a binary response, the coefficients are estimated by maximizing the binomial log-likelihood under the same constraints.
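For the continuous case, a minimal sketch of this estimation step uses SciPy's non-negative least squares solver and then rescales the solution so the weights sum to one. NNLS-then-normalize is one common way to approximate the constrained solution; implementations may instead solve the convex program directly.

```python
from scipy.optimize import nnls

# Regress the true outcomes on the CV predictions: no intercept, alpha_l >= 0.
alpha, _ = nnls(Z, y)
# Rescale so the weights form a convex combination (assumes at least one
# nonzero weight, which holds for any useful library).
alpha = alpha / alpha.sum()
weights = dict(zip(library, alpha))
print(weights)  # learners with better CV performance tend to get larger weights
```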
We estimate the coefficients for each algorithm in the model library. While we don't save the individual cross-validated model fits, we do retain the calculated weights for each algorithm, as these are the optimal ensemble weights for the data at hand. Next, we use these weights to generate the Super Learner model, which can then be applied to new data to predict the values of the response variable.
Step 5: The Super Learner Model
Recall that in Step 1, we trained each base learner on the full dataset. In this final step, we combine the weights learned by the meta-learner in Step 4 with the predictions from these base learners. The resulting ensemble constitutes the final Super Learner model, as illustrated below:
ŶSL = α1(ŷmodel1-fulldata) + α2(ŷmodel2-fulldata) + … + αL(ŷmodelL-fulldata)
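In the running sketch, this final combination reuses the full-data fits from Step 1 and the weights from Step 4:

```python
def super_learner_predict(X_new):
    """Weighted combination of the full-data base-learner predictions."""
    preds = np.column_stack(
        [full_data_fits[name].predict(X_new) for name in library]
    )
    return preds @ alpha  # Y-hat_SL

y_hat_sl = super_learner_predict(X)  # apply to new data the same way
```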
The discrete Super Learner is a variant that selects a single base learner rather than combining several: it chooses the base learner with the lowest cross-validated risk. This selected base learner is assigned a Super Learner coefficient of 1, while all other base learners receive a coefficient of 0. The discrete Super Learner is also known as the cross-validated selector.
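In the same sketch, the discrete Super Learner reduces to an argmin over cross-validated risks:

```python
# Cross-validated risk (here MSE) of each base learner, then pick the best.
cv_risk = {name: np.mean((y - cv_preds[name]) ** 2) for name in library}
best = min(cv_risk, key=cv_risk.get)  # the cross-validated selector
discrete_alpha = {name: float(name == best) for name in library}
```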
The Super Learner algorithm is a type of ensemble learning method, but it has some unique characteristics that set it apart from more traditional ensemble techniques like bagging and boosting. Let us walk through the similarities and differences:
Similarities
Like bagging and boosting, the Super Learner is an ensemble method: it trains multiple models and combines their predictions with the goal of achieving better accuracy than any single model.
Differences
Bagging and boosting typically combine many instances of the same type of model (homogeneous base learners) fit to resampled or reweighted versions of the data. The Super Learner, by contrast, combines different types of models (heterogeneous base learners) and uses cross-validation to estimate the optimal weight for each one.
Super Learner is a "black-box" algorithm, focused primarily on prediction rather than explanation. It doesn't inherently provide insight into how individual covariates contribute to the outcome. The algorithm treats all input variables equally; it doesn't distinguish between confounders, treatments, or intermediate variables. Nor does it naturally provide variable importance scores: additional post-hoc methods are required to estimate the influence of individual predictors.
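As one illustration of such a post-hoc method, the sketch below computes a simple permutation importance on top of the fitted Super Learner: shuffle one predictor at a time and measure how much the prediction error increases. This is a generic technique layered on top of the model, not part of the Super Learner algorithm itself.

```python
def sl_permutation_importance(X_eval, y_eval, n_repeats=10, seed=0):
    """MSE increase when each feature is shuffled, breaking its link to y."""
    rng = np.random.default_rng(seed)
    base_mse = np.mean((y_eval - super_learner_predict(X_eval)) ** 2)
    importance = {}
    for j in range(X_eval.shape[1]):
        losses = []
        for _ in range(n_repeats):
            Xp = X_eval.copy()
            Xp[:, j] = rng.permutation(Xp[:, j])  # shuffle feature j only
            losses.append(np.mean((y_eval - super_learner_predict(Xp)) ** 2))
        importance[j] = np.mean(losses) - base_mse
    return importance
```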
The Super Learner algorithm offers a powerful and flexible approach to predictive modeling by combining the strengths of multiple algorithms into a single, optimized model. By optimally weighting a diverse library of candidate models, ranging from simple parametric regressions to complex nonparametric algorithms, the Super Learner mitigates model selection bias and enhances generalization. Grounded in statistical learning theory and supported by performance guarantees that benchmark it against the best achievable ensemble within its candidate library, the Super Learner provides a robust, scalable framework for high-performance predictive modeling across a variety of applications.
In the next post, I will discuss how the Super Learner model is trained in SAS Viya.
References:
Van der Laan, M. J., Polley, E. C., and Hubbard, A. E. (2007). “Super Learner.” Statistical Applications in Genetics and Molecular Biology 6:25.
Find more articles from SAS Global Enablement and Learning here.