Before diving into a machine learning project, it's crucial to understand your dataset's structure and parameters. This initial insight guides the selection of the most suitable algorithm for your predictive goals. Since no single algorithm excels in every scenario, this post will explore PROC SUPERLEARNER. This powerful SAS procedure addresses the challenge of algorithm selection by expertly combining various user-specified models, ranging from traditional regressions to advanced nonparametric machine learning techniques, to achieve optimal predictions.
The PROC SUPERLEARNER model builds a powerful predictive system by combining insights from multiple individual models, called base learners. Much like an ensemble, it intelligently selects the optimal blend of these learners to significantly boost prediction accuracy. Essentially, given input data (predictors 'x') and a target variable ('y'), the Super Learner operates in two layers to precisely define their relationship.
The first layer includes a library of B individual base learner models:

ŷ_b = f_b(x; β_b),  b = 1, 2, …, B

where β_1 through β_B are the vectors of model parameters for base learners 1 through B.
The second layer of PROC SUPERLEARNER contains the meta-learner model m, which is a function of the individual base learners in the first layer:

ŷ = m(f_1(x), …, f_B(x); α)

where α is the vector of the meta-learner model parameters, also referred to as the super learner model coefficients.
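To make the two-layer idea concrete, here is a minimal sketch in Python (a conceptual illustration only, not PROC SUPERLEARNER's implementation): two hypothetical base learners produce predictions, and the meta-learner blends them with convex weights (non-negative, summing to 1), as the convex constrained least squares meta-learner does.

```python
# Conceptual two-layer super learner sketch (illustration, not SAS internals).

def base_learner_1(x):
    # hypothetical base learner: a simple linear rule
    return 0.5 * x

def base_learner_2(x):
    # hypothetical base learner: a constant-shift rule
    return x - 1.0

def super_learner(x, alphas):
    # Layer 1: collect base-learner predictions.
    preds = [base_learner_1(x), base_learner_2(x)]
    # Layer 2: blend with convex weights (non-negative, summing to 1).
    assert all(a >= 0 for a in alphas) and abs(sum(alphas) - 1.0) < 1e-9
    return sum(a * p for a, p in zip(alphas, preds))

print(super_learner(4.0, [0.25, 0.75]))  # 0.25*2.0 + 0.75*3.0 = 2.75
```

The weights here are fixed by hand for illustration; in the procedure they are estimated from cross-validated predictions.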
For this demonstration we will use the home equity dataset from the SASHELP library. The home equity dataset consists of 18 columns and 5,960 rows, with a binary target variable indicating whether a person has defaulted on their loan. The dataset provides basic information such as location, reason for the loan, home value, mortgage due amount, city, state, division, and region.
In this first step, we print a few observations from the home equity dataset. This dataset is already pre-saved in memory on the SAS server, so it will not require any loading process. However, if you'd like to save the dataset to a specific table, you can use a DATA step to write it to a library you have created.
/* Print the first 10 observations of the dataset */
proc print data=sashelp.homeequity (obs=10);
run;
The above figure shows the initial 10 rows of the Home Equity dataset, and it's clear that several columns contain missing data. Addressing these gaps is the necessary next step before proceeding with the PROC SUPERLEARNER procedure.
/* Descriptive statistics of the data */
proc means data=sashelp.homeequity;
run;
From the illustration, we can observe that a few variables have missing data. We will use PROC SQL to handle the missing data for these variables so that we don't lose valuable data points.
In this section, we handle missing data before running any machine learning algorithms. We use PROC SQL to replace each missing value with the mean of its column. Note that mean imputation has a drawback: it shrinks the variance of the imputed variables and can weaken their relationships with other columns, so it should be applied with care.
/* Replace missing numeric values with each column's mean.         */
/* Assumes a LIBNAME statement has already defined the HM library. */
proc sql;
   create table HM.homeequity as
   select BAD,
          coalesce(MORTDUE, mean(MORTDUE)) as MORTDUE,
          coalesce(VALUE,   mean(VALUE))   as VALUE,
          coalesce(YOJ,     mean(YOJ))     as YOJ,
          coalesce(DEROG,   mean(DEROG))   as DEROG,
          coalesce(DELINQ,  mean(DELINQ))  as DELINQ,
          coalesce(CLAGE,   mean(CLAGE))   as CLAGE,
          coalesce(NINQ,    mean(NINQ))    as NINQ,
          coalesce(CLNO,    mean(CLNO))    as CLNO,
          coalesce(DEBTINC, mean(DEBTINC)) as DEBTINC,
          JOB, LOAN, REASON,
          APPDATE, REGION, CITY, STATE, DIVISION
   from sashelp.homeequity;
quit;
This PROC SQL step performs a crucial data preprocessing task: imputing missing numerical values with the mean. It creates a new table, HM.homeequity, by selecting all columns from sashelp.homeequity. For the key numerical variables (MORTDUE, VALUE, YOJ, DEROG, DELINQ, CLAGE, NINQ, CLNO, and DEBTINC), the COALESCE function replaces any missing value with that column's calculated mean. This leaves the dataset complete for subsequent modeling steps, preventing the errors or biases that missing data can introduce, while passing BAD, JOB, LOAN, REASON, and the date and geographic identifiers through unchanged.
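The COALESCE(col, MEAN(col)) pattern can be sketched outside SAS as well. The following Python snippet (a conceptual illustration with hypothetical values, not the SQL step itself) shows the same substitution and the variance-shrinking side effect mentioned above.

```python
# Illustration (in Python, not SAS) of mean imputation as performed by
# the COALESCE(col, MEAN(col)) pattern in the PROC SQL step.
from statistics import mean, pstdev

mortdue = [25860.0, 70053.0, None, 97800.0]  # hypothetical column with a gap

observed = [v for v in mortdue if v is not None]
col_mean = mean(observed)  # mean of the non-missing values

# COALESCE keeps the value when present, otherwise substitutes the mean.
imputed = [v if v is not None else col_mean for v in mortdue]

print(imputed)  # the gap is filled with 64571.0, the mean of the observed values

# The drawback noted above: adding points at the mean shrinks the spread.
print(pstdev(imputed) < pstdev(observed))  # True
```

The same logic applies column by column to each of the imputed numeric variables.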
From the figure, we can see that the missing values for all our numeric variables were replaced with the mean of their respective columns. To confirm, look back at the original figure displaying the first 10 rows: row 4 had many missing values across its columns. Now that we have handled the missing values, we are ready to use the PROC SUPERLEARNER procedure.
Now that we have handled the missing values, we can structure our PROC SUPERLEARNER step. Remember that PROC SUPERLEARNER works like an ensemble model: we choose a few different machine learning algorithms, and the procedure evaluates each algorithm's accuracy.
proc superlearner data=HM.homeequity seed=1234;
   target BAD / level=nominal;       /* BAD is the binary target variable */
   input REASON JOB / level=nominal; /* REASON and JOB as nominal inputs */
   input LOAN MORTDUE VALUE YOJ DEBTINC / level=interval;
                                     /* selected numerical columns as interval inputs */
   baselearner 'forest' forest(inbagfraction=0.6);
   baselearner 'gradboost' gradboost(learningrate=0.45);
   baselearner 'lightgradboost' lightgradboost(baggingfraction=0.8 leafsize=50);
   baselearner 'svm' svmachine(kernel=polynomial);
run;
This SAS PROC SUPERLEARNER code builds a predictive model from the HM.homeequity dataset. Its goal is to predict "BAD" outcomes (treated as a nominal, binary target) by analyzing various factors. The code incorporates both categorical information, such as REASON and JOB (the loan purpose and job category), and numerical financial data such as the LOAN amount, MORTDUE, property VALUE, years on the job (YOJ), and DEBTINC (debt-to-income ratio). To achieve an accurate prediction, it employs an ensemble strategy, combining the individual strengths of four distinct machine learning algorithms: a Random Forest, a Gradient Boosting model, a Light Gradient Boosting model, and a Support Vector Machine.
The above figure summarizes how the Super Learner model was built to predict "BAD" outcomes from the home equity data. Think of it as a team of 4 different specialized models working together, trained and tested across five cross-validation folds of the data. The model predicts the probability of a "BAD" outcome by minimizing squared error, and the "team captain" (a method called convex constrained least squares) determines the best way for the learners to collaborate. Of the 5,960 data points available, 5,536 were used to train the model (likely because observations with missing values in the nominal inputs are excluded), with a specific seed ensuring the process is repeatable.
This figure displays the "report card" for the Super Learner model, showcasing the contributions of its four base learner algorithms. The table presents each individual prediction model (Forest, Gradient Boosting Tree, Light Gradient Boosting Machine, and Support Vector Machine) alongside two key metrics. The "Coefficient" indicates how much the Super Learner weighed that specific model, with a higher number signifying greater influence on the final prediction. "Cross-Validated Risk," on the other hand, measures each model's independent performance, where a lower value denotes superior accuracy. Among these, the Light Gradient Boosting Machine clearly stood out with the highest coefficient and the lowest risk, proving to be the most impactful and best-performing model in the ensemble.
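To build intuition for how those coefficients arise, here is a small sketch (an illustration with made-up predictions, not SAS internals): for just two base learners, the convex constrained least squares weight minimizing the squared error of the blend has a one-dimensional closed form, clipped to [0, 1].

```python
# Illustrative sketch of convex constrained least squares for TWO learners.
# The blend is alpha*p1 + (1 - alpha)*p2; minimizing squared error over
# alpha in [0, 1] reduces to a clipped one-dimensional least squares fit.

def convex_weight(y, p1, p2):
    # Closed-form minimizer of sum((y - (alpha*p1 + (1-alpha)*p2))^2),
    # then clipped into the convex range [0, 1].
    num = sum((yi - b) * (a - b) for yi, a, b in zip(y, p1, p2))
    den = sum((a - b) ** 2 for a, b in zip(p1, p2))
    alpha = num / den if den > 0 else 0.5
    return min(1.0, max(0.0, alpha))

y  = [1.0, 0.0, 1.0, 0.0]   # hypothetical binary targets
p1 = [0.9, 0.1, 0.8, 0.2]   # cross-validated predictions of a strong learner
p2 = [0.5, 0.5, 0.5, 0.5]   # an uninformative learner

print(convex_weight(y, p1, p2))  # the strong learner takes all the weight: 1.0
```

With B learners the procedure solves the analogous constrained problem over all coefficients at once, using the cross-validated predictions rather than in-sample fits.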
In conclusion, PROC SUPERLEARNER is a very useful tool that can be applied to many types of datasets. It offers a flexible approach to predictive modeling by combining the strengths of multiple algorithms into a single model that can be optimized simultaneously. A key advantage of PROC SUPERLEARNER is the range of models it accepts toward your analytic goal, from simple parametric regressions to complex nonparametric algorithms. It simplifies the selection process and mitigates selection bias while enhancing model generalization. Its main limitation is a lack of interpretability: it is often deemed a "black box," because its primary focus is prediction power rather than explanation of the model. For further information and a more in-depth understanding of PROC SUPERLEARNER, see the links and posts by Manoj Singh, who takes a deeper dive into the algorithm's structure and capabilities.
For more information:
Find more articles from SAS Global Enablement and Learning here.