
Stack Smarter, Not Harder: Meet the Super Learner Algorithm (Part 2)


In my previous article, I explored the theoretical foundations and framework of the Super Learner model, highlighting its flexibility, optimality, and advantages over traditional ensemble methods. In this follow-up piece, I shift the focus from theory to practice by demonstrating how to train and implement a Super Learner model with the SUPERLEARNER procedure in SAS Viya, offering a hands-on approach for practitioners who want to leverage the power of Super Learners within the SAS Viya environment. Starting with LTS 2025.03, the SUPERLEARNER procedure is available in SAS Viya as a SAS Visual Statistics procedure.

 

The SUPERLEARNER procedure implements the super learner model in SAS Viya (Van der Laan and Rose 2011; Phillips et al. 2023; Naimi and Balzer 2018). You can use this procedure to train and score predictive models for continuous and binary response variables. It allows users to specify and customize a library of base learner models, supports the training of a cross-validated selector (i.e., the discrete Super Learner), and facilitates the creation of an item store for the trained model, which is then saved as a binary object within a data table for future use or deployment.

 

In the demonstration that follows, you learn how to use the SUPERLEARNER procedure to train a super learner model.

 

Launch SAS Studio and submit the following program to start a CAS session and assign all available CAS libraries, making them accessible for data operations.

 

cas;
caslib _all_ assign;

 

Next, you load a .sashdat file into a CAS library by using the CASUTIL procedure.

 

proc casutil;
    load file="/create-export/create/homes/xxxx/casuser/VS_BANK_PART.sashdat"
         outcaslib="casuser" casout="BANK";
run;

 

The FILE= option in the LOAD statement specifies the path to the .sashdat file that you want to load. The OUTCASLIB= option tells CAS where to store the loaded data (the Casuser library in this case), and the CASOUT= option assigns the name BANK to the in-memory table once it is loaded.

 

Successful execution of this code loads the BANK table into memory, and you are ready to begin your modeling journey.
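
Optionally, you can confirm that the table is available in memory by listing the tables in the Casuser caslib:

proc casutil;
    list tables incaslib="casuser";   /* the BANK table should appear in the list */
run;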

 

 

Defining a Model by Using the SUPERLEARNER Procedure

 

When you train a model by using PROC SUPERLEARNER, at a minimum you need to define a TARGET statement, one or more INPUT statements, and at least two BASELEARNER statements.

 

 

Training a Super Learner Model on a Continuous Response

 

proc superlearner data=casuser.Bank seed=12345;
    target int_tgt / level=interval;
    input logi_rfm1-logi_rfm12 / level=interval;
    input cat_input1 cat_input2 / level=nominal;
    baselearner 'Linear Reg' regselect;
    baselearner 'BART' bart(nTree=50);
    baselearner 'Dec Tree' treesplit(criterion=Ftest);
    baselearner 'GAM' gammod;
    baselearner 'Forest' forest;
    output out=casuser.predout copyvar=int_tgt;
run;

 

You're using the BANK table stored in the Casuser CAS library. The SEED=12345 option ensures reproducibility by setting a random seed for any stochastic processes. You use the TARGET statement to specify the response variable, which is int_tgt (tgt Interval New Sales) in this case. Because int_tgt is continuous, you specify LEVEL=INTERVAL to indicate that, which tells the procedure to use regression-based learners. You typically include multiple INPUT statements to define your predictors. In this example, the first INPUT statement includes twelve RFM variables (logi_rfm1 through logi_rfm12), which are treated as interval predictors, and the second INPUT statement includes two variables (cat_input1 and cat_input2), which are treated as categorical predictors. RFM refers to the Recency, Frequency, and Monetary variables commonly used in customer behavior analysis.

 

To define a base learner by using the BASELEARNER statement, you begin by assigning a name to the base learner enclosed in single quotation marks, followed by the model type. This statement supports a variety of model types and customizable options, allowing you to tailor each base learner to your specific modeling needs. In this example, you stack five diverse base learners by using five BASELEARNER statements:

 

  • First base learner, Linear Regression (regselect): A classic parametric regression model.
  • Second base learner, Bayesian Additive Regression Trees (bart): A flexible, nonparametric model with 50 trees.
  • Third base learner, Decision Tree (treesplit): Uses the F test as the splitting criterion.
  • Fourth base learner, Generalized Additive Model (gammod): Captures nonlinear relationships.
  • Fifth base learner, Forest (forest): An ensemble of decision trees for robust predictions.

 

The predicted results are saved to the casuser.predout table, and the original target variable int_tgt is copied for evaluation. Successful execution of the code returns detailed output that provides insights into both the architecture and the performance of your ensemble model.

 

Table 1.1 displays basic information about the super learner ensemble specification, and Table 1.2 displays the estimated coefficient for each base learner.

 

01_MS_Result-Basic-Info-Int-Target.png

 Table 1.1 

 

By default, the input data are divided into two folds for cross-validation, based on the sample size of the input data. Because the target variable is continuous, the convex-constrained least squares meta-learning method is used by default to estimate the weight of each base learner in the super learner ensemble, and the quadratic loss is used to compute the cross-validated risk of each base learner.
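
In notation, writing Zᵢₖ for the cross-validated prediction of base learner k on observation i, the default meta-learning step sketched above chooses the weights by solving:

minimize over (w₁, …, wₖ):  Σᵢ ( yᵢ − Σₖ wₖ Zᵢₖ )²   subject to wₖ ≥ 0 and Σₖ wₖ = 1

and the cross-validated risk of base learner k is its average quadratic loss, (1/n) Σᵢ ( yᵢ − Zᵢₖ )².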

 

02_MS_SupLearner-Coeff_Int-Target.png

 Table 1.2 

 

The Super Learner model coefficients table highlights how much each base learner contributes to the overall prediction. The Forest model has the highest coefficient (0.93611), indicating that it contributes the most to the ensemble's predictions. It also has the lowest cross-validated risk, suggesting strong performance. The Decision Tree model plays a minor role with a small coefficient (0.06389), whereas the Linear Regression, BART, and GAM models have coefficients of 0, meaning they were not selected to contribute to the final ensemble.
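
In other words, the final super learner prediction for the continuous target is approximately:

Ŷ = 0.93611 × Forest Prediction + 0.06389 × Decision Tree Prediction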

 

Table 1.3 presents the contents of the PREDOUT table, which contains the predicted results from the SUPERLEARNER procedure. The columns P_int_tgt and int_tgt contain the predicted target values and the actual target values, respectively.

 

03_MS_PredOut_IntTarget.png

 Table 1.3 

 

This output allows you to assess how closely the Super Learner model's predictions align with the actual target values.
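
For example, you can compute a simple fit statistic such as the average squared error (ASE) from the PREDOUT table. This sketch assumes the predicted values are stored in the P_int_tgt column, as shown in Table 1.3:

data casuser.err;
    set casuser.predout;
    sq_err = (int_tgt - P_int_tgt)**2;   /* squared error for each observation */
run;

proc means data=casuser.err mean;
    var sq_err;   /* the mean of sq_err is the average squared error */
run;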

 

 

Training a Super Learner Model on a Binary Response

 

Next, you learn how to fit a super learner model to a binary target. The BANK data set has a binary target variable, b_tgt (tgt Binary New Product), that codes responders (purchasers) as 1 and non-responders as 0.

 

Just like in the previous model, you must specify a TARGET statement, at least one INPUT statement, and at least two BASELEARNER statements.

 

proc superlearner data=casuser.Bank seed=12345;
    target b_tgt / level=nominal;
    input logi_rfm1-logi_rfm12 / level=interval;
    input cat_input1 cat_input2 / level=nominal;
    baselearner 'logistic' logselect;
    baselearner 'SVM' svmachine;
    baselearner 'forest' forest(ntrees=50);
    baselearner 'GB' gradboost(ntrees=100 learningrate=0.15);
    crossvalidation kfold=5;
    output out=casuser.predprobout copyvar=(b_tgt);
    store out=casuser.SuperModel;
run;

 

In the TARGET statement, LEVEL=NOMINAL indicates a classification task with a categorical outcome. Each INPUT statement specifies predictor variables of the same type. In this example, four base learners are specified. For the first base learner, 'logistic', you specify a logistic regression model by using the LOGSELECT model type. The second base learner adds a support vector machine by specifying the SVMACHINE model type. The third base learner specifies a Forest model with 50 trees. For the fourth base learner, 'GB', you use the GRADBOOST model type, setting the number of trees to 100 and the learning rate to 0.15. The KFOLD= option in the CROSSVALIDATION statement applies 5-fold cross-validation to mitigate overfitting of the training data and obtain good generalization. In the OUTPUT statement, the OUT= option saves the predicted probabilities to the CASUSER.PREDPROBOUT table, and the COPYVAR= option copies the target variable (b_tgt) from the input data table to the output data table.

 

You use the STORE statement to save the trained super learner ensemble model as an item store. In this example, the saved item store, CASUSER.SUPERMODEL, is used later in this article to score new observations without having to retrain the model.

 

Upon successful submission of the code, the output displays basic information about the super learner specification and the estimated coefficients for each base learner.

 

04_MS_BasicInfo_BinaryTar-1024x773.png

 

05_MS_SupLearner_Coeff_Binary-MOdel.png

 

The Forest model dominates the ensemble with a weight of approximately 71%, and the gradient boosting model is the secondary contributor with a weight of approximately 29%. Overall, the Super Learner prediction is a weighted average of the Forest and GB predictions, which should give better generalization than any single model.

 

So, the final Super Learner prediction is approximately:

 

Ŷ = 0.70692 × Forest Prediction + 0.29308 × GB Prediction

 

 

Scoring

 

Once you've trained and saved a super learner ensemble model, you can use it to score new or existing data. The following statements show how to use PROC SUPERLEARNER to score data by using the previously fitted model:

 

proc superlearner data=casuser.bank restore=casuser.supermodel;
    output out=casuser.scoredData1 learnerpred;
run;

 

To score data by using a previously trained model in PROC SUPERLEARNER, you specify the saved model in the RESTORE=CASUSER.SUPERMODEL option. DATA=CASUSER.BANK identifies the input data set that contains the observations to be scored, while the OUTPUT statement defines the destination table (SCOREDDATA1) for the predicted results. Including the LEARNERPRED option ensures that the output table contains not only the ensemble predictions from the super learner but also the individual predictions from each base learner. Both the RESTORE= option and the OUTPUT statement are mandatory when you score data with a stored model.

 

The P_b_tgt0 column contains the super learner predictions, and the other columns contain the predictions from each individual base learner. An excerpt of the output table appears below:

 

06_MS_Contribution_EachModel_BinTarget-1024x494.png
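
As a quick sanity check, you could reconstruct the ensemble prediction from the individual base learner columns by using the coefficients reported earlier. The base learner column names below (P_forest and P_GB) are hypothetical placeholders; substitute the actual learner prediction column names that appear in your SCOREDDATA1 table:

data casuser.check;
    set casuser.scoredData1;
    /* hypothetical column names -- replace with the actual learner prediction columns */
    p_manual = 0.70692*P_forest + 0.29308*P_GB;
    diff = abs(P_b_tgt0 - p_manual);   /* should be near zero for every row */
run;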

 

 

By default, PROC SUPERLEARNER models the probability that the variable b_tgt takes the value 0. The name of the variable in the output data table that contains the predicted probability is P_b_tgt0. To obtain the predicted probability of b_tgt = 1, you can use a simple DATA step:

 

data casuser.predout1;
    set casuser.scoredData1;   /* scored output table that contains P_b_tgt0 */
    P_b_tgt1 = 1 - P_b_tgt0;   /* predicted probability of the event class */
run;

 

This step creates a new variable P_b_tgt1, which represents the probability of the event class (b_tgt = 1). This is particularly useful for decision-making, thresholding, or performance evaluation (e.g., ROC curves, lift charts).
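
For instance, here is a minimal sketch that applies a 0.5 probability cutoff and cross-tabulates the predicted classes against the actual classes:

data casuser.classified;
    set casuser.predout1;
    I_b_tgt = (P_b_tgt1 >= 0.5);   /* 1 = predicted responder, 0 = predicted non-responder */
run;

proc freq data=casuser.classified;
    tables I_b_tgt*b_tgt;   /* confusion matrix of predicted vs. actual classes */
run;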

 

07_MS_Contribution_on-predEvent-1024x421.png

 

Summary

 

PROC SUPERLEARNER in SAS Viya offers a powerful and flexible framework for predictive modeling. Whether you're working with categorical outcomes or continuous targets, it allows you to harness the strengths of multiple algorithms through ensemble learning. By carefully specifying input types, base learners, and scoring options, you can build models that are not only accurate but also interpretable and production-ready.

 

References:

 

  • SAS Documentation
  • Naimi, A. I., and Balzer, L. B. (2018). "Stacked Generalization: An Introduction to Super Learning." European Journal of Epidemiology 33:459–464.
  • Phillips, R. V., van der Laan, M. J., Lee, H., and Gruber, S. (2023). "Practical Considerations for Specifying a Super Learner." International Journal of Epidemiology 52:1276–1285.
  • Van der Laan, M. J., and Rose, S. (2011). Targeted Learning: Causal Inference for Observational and Experimental Data. New York: Springer.

 

 

Find more articles from SAS Global Enablement and Learning here.
