There is no universally best machine learning model for every dataset or prediction problem, and for a given task it is often difficult to determine in advance which model will perform best. The Super Learner (SL) addresses this algorithm selection challenge by incorporating a wide range of user-specified models, from traditional parametric regressions to modern nonparametric machine learning methods such as neural networks, support vector machines, and decision trees. Rather than relying on a single "best" model, the Super Learner leverages the strengths of a diverse set of algorithms.
The Super Learner builds a predictive model by combining the predictions from multiple individual models, known as base learners. It evaluates these base learners and systematically chooses the best combination to enhance predictive accuracy. In short, the Super Learner is an ensemble method that combines heterogeneous base learners and uses K-fold cross-validation to assign an optimal weight to each one. Base learners with better out-of-sample performance tend to receive larger coefficients and therefore contribute more to the final predictions.
According to optimality theory, the Super Learner is guaranteed to perform at least as well as the best-performing base learner in the ensemble when the sample size is large, with performance measured using a bounded loss function (Van der Laan, Polley, and Hubbard, 2007). Super Learning is also referred to as stacked regression, stacked generalization, or weighted ensemble by different communities within statistics and data science.
Given an input dataset with a set of predictor variables x and a response variable y, the Super Learner model uses a two-layer architecture to model the relationship between the predictors x and y. The first layer consists of a library of individual base learner models:
fl(x | βl) for l ∈ {1, …, L}, where β1, β2, …, βL are the model parameters for base learners 1 through L.
The second layer contains the meta-learner model which is a function of the individual base learners in the first layer: m(f1(x∣β1), f2(x∣β2),…,fL(x∣βL)∣α), where α is the vector of meta-learner model parameters, also known as the super learner model coefficients.
Training a Super Learner involves estimating the parameters of both the base learners and the meta-learner.
Step 1: Specify the Base Learner Models (Library)
Choose a diverse set of models (e.g., linear/logistic regression, decision tree, forest, SVM, etc.). These models are called base learners and together form the model library. The parameters of each base learner are estimated by fitting the model to the entire input data set.
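To make Step 1 concrete, here is a minimal sketch in Python using scikit-learn; the toy data, the contents of the library, and all variable names are illustrative choices, not part of the algorithm's specification.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR

# Toy data standing in for the input data set (X, y).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = X[:, 0] - 2 * X[:, 1] ** 2 + rng.normal(size=500)

# The model library: a diverse set of base learners.
library = {
    "ols": LinearRegression(),
    "tree": DecisionTreeRegressor(max_depth=4, random_state=0),
    "forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "svm": SVR(),
}

# Fit every base learner to the entire input data set (reused in Step 5).
full_data_fits = {name: clone(est).fit(X, y) for name, est in library.items()}
```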
The coefficients of the Super Learner model, the weights the meta-learner assigns to each base learner, are then estimated through the following series of steps:
Step 2: Perform K-Fold Cross-Validation
(i) Split the data into K equal-sized blocks (folds). (ii) For each block, train every base learner on the remaining K − 1 blocks. (iii) Use each of these fits to predict the responses for the held-out block. Generating predictions for each fold ultimately results in one out-of-sample prediction from every base learner for each observation in the data set, as sketched below.
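Continuing the sketch above, one way to generate these out-of-sample predictions uses scikit-learn's KFold splitter (K = 10 here is an arbitrary choice):

```python
from sklearn.base import clone
from sklearn.model_selection import KFold

K = 10
cv_preds = {name: np.empty_like(y) for name in library}

for train_idx, test_idx in KFold(n_splits=K, shuffle=True, random_state=0).split(X):
    for name, est in library.items():
        # (ii) Train on the K-1 remaining blocks; (iii) predict the held-out block.
        fit = clone(est).fit(X[train_idx], y[train_idx])
        cv_preds[name][test_idx] = fit.predict(X[test_idx])

# Each observation now has exactly one out-of-sample prediction per base learner.
```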
Step 3: Create the Meta-level Data set
This is the core step of the Super Learner algorithm, where a meta-learner combines the outputs from all base learners to form a new, improved model. With cross-validated (CV) predictions now available for each observation in the dataset, the next step is to integrate these predictions back into the main data set. This allows us to evaluate and minimize a final loss function of interest, comparing the true outcomes with the corresponding CV predictions, as part of optimizing the overall predictive model. In K-fold cross-validation schemes, the meta-level data set contains the same number of observations as the input data set.
In both the meta-level data set and the input data set, the outcome variable is the same, denoted Y, but the input features differ. Instead of using the original predictors X, the meta-level data set uses meta-level covariates, referred to as Ŷ (Y-hat). These values of Ŷ are the cross-validated predictions generated by models trained on the training folds, using the original input features X. For example, ŷ1-model1 represents the predictions generated by model1 on the data from block 1, ŷk-model1 represents the predictions generated by model1 on the data from block k, and so on.
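Continuing the same sketch, assembling the meta-level data set amounts to stacking the CV predictions column by column alongside the unchanged outcome:

```python
# Columns of Z play the role of y-hat_model1, ..., y-hat_modelL;
# the outcome y is carried over from the input data set unchanged.
Z = np.column_stack([cv_preds[name] for name in library])  # shape (n, L)
```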
Step 4: Train the Meta-Learner
Fit a regression of the observed responses (Y) on the out-of-sample predictions obtained in step 2(iii) using data from all blocks. The meta-learner can be mathematically expressed as follows:
E[Y | ŷmodel1, …, ŷmodelL] = α1(ŷmodel1) + α2(ŷmodel2) + … + αL(ŷmodelL), where ŷmodel1 through ŷmodelL denote the cross-validated predicted outcomes from model1 through modelL, respectively, and Y denotes the observed response.
To determine the contribution of each candidate algorithm to the final Super Learner prediction, we apply non-negative least squares regression. This involves regressing the true outcomes on the predicted values, without an intercept, while enforcing the constraints that the coefficients remain non-negative and sum to one, i.e., α1 + α2 + … + αL = 1 and αl ≥ 0 for all l ∈ {1, 2, …, L}. It is theoretically beneficial to restrict the coefficient search space to a convex combination of the base learners.
For continuous outcomes, the Super Learner estimates the coefficients α by minimizing the cross-validated empirical risk, typically using the squared loss function. Solving this risk minimization problem under the convex constraints, where the coefficients are non-negative and sum to one, is the default meta-learning approach for continuous response variables. For a binary response, the coefficients are estimated by maximizing the binomial log-likelihood under the same constraints.
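For the continuous case, a minimal sketch of this estimation step uses SciPy's non-negative least squares solver and then rescales the solution so the weights sum to one. NNLS-then-normalize is one common way to approximate the constrained solution; implementations may instead solve the convex program directly.

```python
from scipy.optimize import nnls

# Regress the true outcomes on the CV predictions: no intercept, alpha_l >= 0.
alpha, _ = nnls(Z, y)
# Rescale so the weights form a convex combination (assumes at least one
# nonzero weight, which holds for any useful library).
alpha = alpha / alpha.sum()
weights = dict(zip(library, alpha))
print(weights)  # learners with better CV performance tend to get larger weights
```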
We estimate the coefficients for each algorithm in the model library. While we don't save the individual cross-validated model fits, we do retain the calculated weights for each algorithm, as these are the optimal ensemble weights for the data at hand. Next, we use these weights to generate the Super Learner model, which can then be applied to new data to predict the values of the response variable.
Step 5: The Super Learner Model
Recall that in Step 1, we trained each base learner on the full dataset. In this final step, we combine the weights learned by the meta-learner in Step 4 with the predictions from these base learners. The resulting ensemble constitutes the final Super Learner model, as illustrated below:
ŶSL = α1(ŷmodel1-fulldata) + α2(ŷmodel2-fulldata) + … + αL(ŷmodelL-fulldata)
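In the running sketch, this final combination reuses the full-data fits from Step 1 and the weights from Step 4:

```python
def super_learner_predict(X_new):
    """Weighted combination of the full-data base-learner predictions."""
    preds = np.column_stack(
        [full_data_fits[name].predict(X_new) for name in library]
    )
    return preds @ alpha  # Y-hat_SL

y_hat_sl = super_learner_predict(X)  # apply to new data the same way
```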
The discrete Super Learner is a variant that selects a single base learner rather than combining several: it chooses the base learner with the lowest cross-validated risk. This selected base learner is assigned a Super Learner coefficient of 1, while all other base learners receive a coefficient of 0. The discrete Super Learner is also known as the cross-validated selector.
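In the same sketch, the discrete Super Learner reduces to an argmin over cross-validated risks:

```python
# Cross-validated risk (here MSE) of each base learner, then pick the best.
cv_risk = {name: np.mean((y - cv_preds[name]) ** 2) for name in library}
best = min(cv_risk, key=cv_risk.get)  # the cross-validated selector
discrete_alpha = {name: float(name == best) for name in library}
```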
The Super Learner algorithm is a type of ensemble learning method, but it has some unique characteristics that set it apart from more traditional ensemble techniques like bagging and boosting. Let us walk through the similarities and differences:
Similarities
Like bagging and boosting, the Super Learner is an ensemble method: it trains multiple models and combines their predictions with the goal of achieving better accuracy than any single model.
Differences
Bagging and boosting typically combine many instances of the same type of model (homogeneous base learners) fit to resampled or reweighted versions of the data. The Super Learner, by contrast, combines different types of models (heterogeneous base learners) and uses cross-validation to estimate the optimal weight for each one.
Super Learner is a "black-box" algorithm, focused primarily on prediction rather than explanation. It doesn't inherently provide insight into how individual covariates contribute to the outcome. The algorithm treats all input variables equally; it doesn't distinguish between confounders, treatments, or intermediate variables. Nor does it naturally provide variable importance scores: additional post-hoc methods are required to estimate the influence of individual predictors.
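As one illustration of such a post-hoc method, the sketch below computes a simple permutation importance on top of the fitted Super Learner: shuffle one predictor at a time and measure how much the prediction error increases. This is a generic technique layered on top of the model, not part of the Super Learner algorithm itself.

```python
def sl_permutation_importance(X_eval, y_eval, n_repeats=10, seed=0):
    """MSE increase when each feature is shuffled, breaking its link to y."""
    rng = np.random.default_rng(seed)
    base_mse = np.mean((y_eval - super_learner_predict(X_eval)) ** 2)
    importance = {}
    for j in range(X_eval.shape[1]):
        losses = []
        for _ in range(n_repeats):
            Xp = X_eval.copy()
            Xp[:, j] = rng.permutation(Xp[:, j])  # shuffle feature j only
            losses.append(np.mean((y_eval - super_learner_predict(Xp)) ** 2))
        importance[j] = np.mean(losses) - base_mse
    return importance
```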
The Super Learner algorithm offers a powerful and flexible approach to predictive modeling by combining the strengths of multiple algorithms into a single, optimized model. By optimally weighting a diverse library of candidate models, ranging from simple parametric regressions to complex nonparametric algorithms, the Super Learner mitigates model selection bias and enhances generalization. Grounded in statistical learning theory and supported by performance guarantees that benchmark it against the best achievable ensemble within its candidate library, the Super Learner provides a robust, scalable framework for high-performance predictive modeling across a variety of applications.
In the next post, I will discuss how the Super Learner model is trained in SAS Viya.
References:
Van der Laan, M. J., Polley, E. C., and Hubbard, A. E. (2007). “Super Learner.” Statistical Applications in Genetics and Molecular Biology 6:25.
Find more articles from SAS Global Enablement and Learning here.