The purpose of this post is to learn how to use the new machine learning procedures in SAS Viya Workbench to build and evaluate machine learning models. SAS Viya Workbench is a new SAS programming environment that supports the use of both the SAS and Python languages. A key goal in SAS Viya Workbench is to make some of the new SAS Viya procedures available to users who do not necessarily need the power of the SAS Viya distributed computing environment.
Many of the procedures we discuss in this post will be familiar to SAS programmers who regularly use SAS Viya but may be unfamiliar to SAS programmers who have been working exclusively with SAS 9. These procedures were originally designed to run in Viya on the distributed Cloud Analytics Services (CAS) Server but have been adapted to run in SAS Viya Workbench using data in traditional SAS libraries. SAS Viya Workbench can also be used to execute Python code and build SAS models using Python syntax, but this post will focus on the SAS language functionality.
We created this example using the new SAS Notebook format in Visual Studio Code (VSCode), which has the filetype ‘.sasnb’. This is a new way to write and run SAS code and is currently available in SAS Viya Workbench. All of the code presented in this post can be copied into a simple ‘.sas’ SAS program or into a ‘.sasnb’ SAS notebook in SAS Viya Workbench, and you can execute it to follow along with the examples.
%let csvFile = https://support.sas.com/documentation/onlinedoc/viya/exampledatasets/hmeq.csv;
/* Create an output file in the Workbench environment in the MachineLearning folder */
filename hmeq_csv "/workspaces/myfolder/MachineLearning/hmeq.csv";
/* Download and save the CSV file */
proc http
url ="&csvFile"
method ="GET"
out = hmeq_csv;
run;
We start by downloading the example data into SAS Viya Workbench using PROC HTTP. We provide the HTTP procedure with a URL to the CSV file and a SAS FILEREF pointing to where we want the CSV file to live in SAS Viya Workbench. In this case we created a folder named ‘MachineLearning’ in the default ‘myfolder’ location that is created automatically when spinning up a Workbench. If you are following along with this example, you will need to create the ‘MachineLearning’ folder; if you changed the default name of the ‘Mounting folder’ when you created your Workbench, you will also need to modify the path in the FILENAME statement to point to a folder in your workspace.
proc import datafile="/workspaces/myfolder/MachineLearning/hmeq.csv"
out=work.hmeq
dbms=csv
replace;
run;
The data is in CSV format, so we use PROC IMPORT to convert the CSV file to a SAS7BDAT file. Note that we put the output in the work library, although we could have also created our own SAS library to store this file. For the rest of this example, we will read from and write to the work library. This is notable because many of the SAS Viya Workbench procedures we use were formerly Viya-only procedures that required the use of a CAS library. In SAS Viya Workbench we can use these procedures with SAS datasets in a SAS library.
Often in machine learning projects it is a good idea to explore the data to become familiar with the input variables and the target variable. For this example we will skip that step and refer to a previous post on the same topic, but using Python in SAS Viya Workbench instead of SAS Procedures (SAS Viya Workbench: Python Machine Learning using sasviya.ml and scikit-learn).
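If you do want a quick look at the data before modeling, a minimal sketch along the following lines (not part of the original workflow, and easily extended) summarizes the interval inputs and tabulates the nominal inputs and the target:
/*optional exploration sketch: missing-value counts and summary statistics for interval inputs, frequencies for nominal inputs and the target*/
proc means data=work.hmeq n nmiss mean min max;
var LOAN MORTDUE VALUE YOJ DEROG DELINQ CLAGE CLNO NINQ DEBTINC;
run;
proc freq data=work.hmeq;
tables BAD JOB REASON / missing;
run;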
/*create a partition to separate training and validation data*/
proc partition data=work.hmeq samppct=30 seed=919 partind;
by BAD;
output out=work.hmeq;
run;
Our first step in machine learning is to partition the data into training and validation samples. In this example we use the PARTITION procedure to create a new column in the data (a partition indicator) representing which samples belong to the training data (where the partition indicator is 0) and which belong to the validation data (where the partition indicator is 1). The validation sample will contain 30% of the original data, while the training sample will contain the remaining 70%. Notice that we partition by the target variable ‘BAD’, so the training and validation samples will both have the same percentage of target events as the original data.
/*impute missing values, using the training data to calculate median and mode*/
proc varimpute data=work.hmeq(where=(_PartInd_ = 0)) seed=919;
input JOB REASON / ntech=mode;
input LOAN MORTDUE VALUE YOJ DEROG DELINQ CLAGE CLNO NINQ DEBTINC / ctech=median;
*output out=work.hmeq_train copyvars=(BAD _PartInd_);
code file="/workspaces/myfolder/MachineLearning/impute_score_code.sas";
run;
We impute missing values using the SAS Viya Workbench VARIMPUTE procedure, providing only the training data to the procedure (the where clause on the dataset name selects only rows where the partition indicator is 0, which are our training samples). This means that the imputation ‘parameters’ (the mode for the categorical inputs and the median for interval inputs) are calculated using only the training data. This ensures that we don’t have any information leakage from the validation data into the training process. The OUTPUT statement is commented out because it would write out only the imputed training data, and we ultimately want to impute the whole dataset, not just the training partition. Instead, we use a CODE statement to create a score code file that can be used to impute missing values on both the original dataset and any scoring data we encounter when we deploy the model. Throughout this example we will generate DATA Step score code for use in deploying the model. This score code can run anywhere SAS runs (it does not require SAS Viya or SAS Viya Workbench), so it can be deployed outside of SAS Viya Workbench if necessary.
/*apply the imputation to the training and validation data (using median and mode from training data)*/
data work.hmeq;
set work.hmeq;
%include '/workspaces/myfolder/MachineLearning/impute_score_code.sas';
run;
Applying the score code requires a simple DATA Step with the score code included after setting the data we want to score. In this example we use the median and mode values calculated exclusively on the training data to impute missing values on the whole dataset, including both training and validation partitions. For consistency, we will also use this same scoring method to score the machine learning models we train. This is not the only scoring method, but it is likely the simplest for many SAS users; we can also create Analytics Store (ASTORE) files containing a binary version of the scoring logic that can be deployed using the ASTORE Procedure.
/*select useful input variables*/
proc varreduce data=work.hmeq;
ods output SelectedEffects=VarSelected;
class IM_JOB IM_REASON;
reduce supervised BAD = IM_JOB IM_REASON IM_LOAN IM_MORTDUE IM_VALUE IM_YOJ IM_DEROG IM_DELINQ IM_CLAGE IM_CLNO IM_NINQ IM_DEBTINC / maxeffects=10;
display 'SelectionSummary' 'SelectedEffects';
run;
It can be useful to select relevant variables as inputs to machine learning models, especially when we have a lot of potential predictor variables. The SAS Viya Workbench VARREDUCE procedure allows us to perform both supervised and unsupervised variable selection. In this case we use the REDUCE SUPERVISED statement to perform supervised variable selection, specifying both the target variable (BAD) and the potential inputs (all the IM_ variables created when we imputed missing values). Although we display the ‘SelectionSummary’ and ‘SelectedEffects’ results tables, we also use the ODS OUTPUT statement to store the ‘SelectedEffects’ table in a SAS dataset in the work library named ‘VarSelected’. We will use that table in the next block of code to store the selected interval and nominal input variables in SAS macro variables for future use. In this example we explicitly selected the top 10 variables (as determined by the supervised selection algorithm) to use as inputs, but in practice it can be useful to include more input variables and threshold their inclusion based on the variance in the target explained by the collection of selected effects. We chose to stick with 10 inputs to demonstrate reading the variable names from the output table into SAS macro variables. This seems like a bit of unnecessary work when we only have 12 easy-to-understand input variables, but it can be very useful in settings where we might have over 100 inputs with potentially obscure names.
/*grab selected effects and store variable names in a macro for use in future procedures*/
proc sql noprint;
select Variable into :inputs separated by ' '
from work.varselected;
quit;
%put &inputs;
/*also grab nominal inputs in a separate list, we will need to specify which inputs are nominal for our machine learning models*/
proc sql noprint;
select Variable into :nominals separated by ' '
from work.varselected
where Type="CLASS";
quit;
%put &nominals;
proc sql noprint;
select Variable into :intervals separated by ' '
from work.varselected
where Type="INTERVAL";
quit;
%put &intervals;
The first PROC SQL call takes all of the selected variables from the work.varselected table and stores them in a macro variable named inputs separated by spaces. The other two PROC SQL calls do the same thing for the nominal and interval inputs, creating separate macro variables for each subset of inputs. These macro variables are lists of inputs that we will provide to the machine learning procedures. This doesn’t save much time in this example but can be very convenient when working with large data sets with many potential input columns.
/*fit a simple logistic regression model with no interactions, there are a lot of SAS procedures that can do this, we use the LOGSELECT procedure*/
proc logselect data=work.hmeq(where=(_PartInd_ = 0));
class &nominals;
model BAD(event="1") = &inputs;
code file="/workspaces/myfolder/MachineLearning/logistic_score_code.sas";
run;
The LOGSELECT procedure is a SAS Viya Workbench procedure to build logistic regression models. Logistic regression isn’t exactly what we think of when we think of a machine learning model, but it’s always a good idea to compare our complex nonlinear models to a simple linear model. In this example we use the where clause on the dataset to only provide the training data to the model, and we specify the inputs, the target, and which of the inputs are nominal. Notice that we output a code file containing score code for the logistic regression model; we will use this score code on the validation data to evaluate model performance, and we can also use it to deploy the model. We could have also generated an Analytic Store binary file for scoring, but throughout this example we will stick with score code. Another thing to note is that this procedure includes a partition statement to create a partition within the procedure in case we didn’t split the data into training and validation samples, but it is a best practice to partition before performing any data preprocessing, so we use the partition we already created instead of letting the procedure create a new one.
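For reference, the in-procedure approach would look something like the sketch below; we did not run this, and the PARTITION statement options shown reflect our reading of the typical Viya syntax, so check the LOGSELECT documentation before relying on it:
/*hypothetical sketch: let LOGSELECT hold out 30% of the data for validation itself instead of using _PartInd_*/
proc logselect data=work.hmeq;
class &nominals;
model BAD(event="1") = &inputs;
partition fraction(validate=0.3 seed=919);
run;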
/*fit a decision tree model*/
proc treesplit data=work.hmeq(where=(_PartInd_ = 0));
class BAD &nominals;
model BAD = &inputs;
prune costcomplexity;
code file="/workspaces/myfolder/MachineLearning/treesplit_score_code.sas";
run;
The TREESPLIT procedure is another SAS Viya Workbench procedure, this time to build a decision tree model. Again, we provide only the training data to the procedure, and we output a SAS DATA Step score code file instead of using an ASTORE binary file for deployment. We use cost complexity pruning to ensure that the decision tree doesn’t overfit the training data. Other pruning methods are available in SAS, but they generally require validation data so that model performance can be evaluated as leaf nodes and splits are removed from the decision tree.
/*fit a random forest model*/
proc forest data=work.hmeq(where=(_PartInd_ = 0));
target BAD / level=nominal;
input &nominals / level=nominal;
input &intervals / level=interval;
code file="/workspaces/myfolder/MachineLearning/forest_score_code.sas";
quit;
The FOREST procedure is a SAS Viya Workbench procedure to build an ensemble of independent decision trees. In this case we don’t apply any cost complexity pruning to the trees, and we accept the default values for the number of trees in the forest and the random sampling of rows and columns for each tree. With these defaults we train 100 trees, where each tree is trained on 60% of the training data, and at each split in each tree 4 of the 10 input variables are randomly selected to be considered for that split. This ensures that the trees in the forest are diverse, so our final model prediction is an average of a variety of different decision trees.
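If you would rather spell out those sampling settings than rely on the defaults, a sketch like the one below does so explicitly; the option names (NTREES=, INBAGFRACTION=, VARS_TO_TRY=) reflect our reading of the FOREST procedure syntax, and the values simply restate the defaults described above:
/*hypothetical sketch: make the forest sampling settings explicit instead of accepting them implicitly*/
proc forest data=work.hmeq(where=(_PartInd_ = 0)) ntrees=100 inbagfraction=0.6 vars_to_try=4 seed=919;
target BAD / level=nominal;
input &nominals / level=nominal;
input &intervals / level=interval;
quit;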
/*fit a tree-based gradient boosting model*/
proc gradboost data=work.hmeq(where=(_PartInd_ = 0));
target BAD / level=nominal;
input &nominals / level=nominal;
input &intervals / level=interval;
code file="/workspaces/myfolder/MachineLearning/gradboost_score_code.sas";
quit;
The GRADBOOST procedure is a SAS Viya Workbench procedure to build a sequential ensemble of decision trees where each tree in the sequence is designed to improve on the errors of the previous tree. Just like with the forest model we accept the defaults and train a sequence of 100 trees. Gradient boosting is a powerful approach to machine learning and can often yield the best results on complex data even with the default values. Of course, it can also overfit the training data so it is a good practice to both evaluate the performance of this model on validation data, and to compare model performance to simple linear models to ensure that we are not memorizing noise with this complex model. For both the forest and the gradient boosting models we output SAS DATA Step score code which we will use to score the validation data.
/*fit a support vector machine model*/
proc svmachine data=work.hmeq(where=(_PartInd_ = 0));
target BAD / level=nominal;
input &nominals / level=nominal;
input &intervals / level=interval;
kernel polynomial / degree=2;
code file="/workspaces/myfolder/MachineLearning/svm_score_code.sas";
quit;
The SVMACHINE procedure is a SAS Viya Workbench Procedure to build a support vector machine model, in this case for our binary target variable. This is normally a linear model to find the best hyperplane (a multi-dimensional generalization of a line) to separate the two target classes, but we use a nonlinear quadratic polynomial kernel to allow this model to learn a nonlinear decision boundary. As usual, we generate SAS DATA Step score code to evaluate model performance on validation data.
/*score training and validation data using the fitted models (we can assess them separately after this)*/
data work.logistic_scored;
set work.hmeq;
%include '/workspaces/myfolder/MachineLearning/logistic_score_code.sas';
run;
data work.treesplit_scored;
set work.hmeq;
%include '/workspaces/myfolder/MachineLearning/treesplit_score_code.sas';
run;
data work.forest_scored;
set work.hmeq;
%include '/workspaces/myfolder/MachineLearning/forest_score_code.sas';
run;
data work.gradboost_scored;
set work.hmeq;
%include '/workspaces/myfolder/MachineLearning/gradboost_score_code.sas';
run;
data work.svm_scored;
set work.hmeq;
%include '/workspaces/myfolder/MachineLearning/svm_score_code.sas';
run;
proc print data=work.gradboost_scored(obs=10);
var BAD P_BAD1;
run;
We use the DATA Step score code we generated from each machine learning procedure to score the entire HMEQ dataset, including both the training and the validation partitions. We can then calculate assessments on each partition separately and compare model performance on the training and validation samples to ensure that we are not overfitting the training data. Note that in this block of code we run a separate DATA Step for each model, but we could also run a single DATA Step with the score code for all the machine learning models included, scoring the data with every model in the same pass through the data. The catch is that the models generate similarly named prediction columns (P_BAD1 for most of the models, and P_BAD for the logistic regression model), so we would have to copy or rename the prediction variable after each %include statement. This would require a bit more coding but would be more computationally efficient, a concern we would want to address with a larger dataset. In this example the output after scoring is a collection of scored datasets (one for each model) containing the true value of the target, BAD, and the predicted value of the target.
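A single-pass version might look like the following sketch. The work.all_scored dataset and the model-prefixed column names are our own, and depending on how the generated score code names its temporary variables, the %include files may need minor adjustments to coexist in one DATA Step:
/*hypothetical sketch: score several models in one pass, copying the generic prediction column to a model-specific name after each %include*/
data work.all_scored;
set work.hmeq;
%include '/workspaces/myfolder/MachineLearning/treesplit_score_code.sas';
treesplit_p_bad1 = P_BAD1;
%include '/workspaces/myfolder/MachineLearning/forest_score_code.sas';
forest_p_bad1 = P_BAD1;
%include '/workspaces/myfolder/MachineLearning/gradboost_score_code.sas';
gradboost_p_bad1 = P_BAD1;
run;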
/*assess model performance on the validation data*/
proc assess data=work.logistic_scored LIFTout=logistic_lift ROCout=logistic_roc;
var P_BAD;
target BAD / event='1' level=nominal;
by _PartInd_;
run;
proc assess data=work.treesplit_scored LIFTout=treesplit_lift ROCout=treesplit_roc;
var P_BAD1;
target BAD / event='1' level=nominal;
by _PartInd_;
run;
proc assess data=work.forest_scored LIFTout=forest_lift ROCout=forest_roc;
var P_BAD1;
target BAD / event='1' level=nominal;
by _PartInd_;
run;
proc assess data=work.gradboost_scored LIFTout=gradboost_lift ROCout=gradboost_roc;
var P_BAD1;
target BAD / event='1' level=nominal;
by _PartInd_;
run;
proc assess data=work.svm_scored LIFTout=svm_lift ROCout=svm_roc;
var P_BAD1;
target BAD / event='1' level=nominal;
by _PartInd_;
run;
We use the ASSESS procedure, another SAS Viya Workbench procedure, to compare the predicted value of the target with the true value of the target. In this case we specify that the target is nominal, and we compare the true value of the target, BAD, to the predicted event probability for each model (P_BAD1, or P_BAD for the logistic regression model). Notice that we use a BY statement with the partition indicator in the ASSESS procedure to ensure that we calculate separate assessments for the training and validation samples. We will use the output from this procedure to plot cumulative lift and the ROC chart, so we store the LIFTout and ROCout tables in the work library with names corresponding to each model. Without these options the lift and ROC information would be printed to the SAS Results page instead of saved in datasets.
/*before we plot we have to merge all of the Lift and ROC output datasets*/
data work.lift;
merge work.logistic_lift(rename=(_CumLift_ = logistic_cumulative_lift))
work.treesplit_lift(rename=(_CumLift_ = treesplit_cumulative_lift))
work.forest_lift (rename=(_CumLift_ = forest_cumulative_lift))
work.gradboost_lift(rename=(_CumLift_ = gradboost_cumulative_lift))
work.svm_lift(rename=(_CumLift_ = svm_cumulative_lift));
keep _PartInd_ _Depth_ logistic_cumulative_lift treesplit_cumulative_lift forest_cumulative_lift gradboost_cumulative_lift svm_cumulative_lift;
label logistic_cumulative_lift="Logistic Regression"
treesplit_cumulative_lift="Decision Tree"
forest_cumulative_lift="Random Forest"
gradboost_cumulative_lift="Gradient Boosting"
svm_cumulative_lift = "Support Vector Machine";
run;
data work.roc;
merge work.logistic_roc(rename=(_Sensitivity_=logistic_sensitivity _FPR_=logistic_fpr))
work.treesplit_roc(rename=(_Sensitivity_=treesplit_sensitivity _FPR_=treesplit_fpr))
work.forest_roc(rename=(_Sensitivity_=forest_sensitivity _FPR_=forest_fpr))
work.gradboost_roc(rename=(_Sensitivity_=gradboost_sensitivity _FPR_=gradboost_fpr))
work.svm_roc(rename=(_Sensitivity_=svm_sensitivity _FPR_=svm_fpr));
keep _PartInd_ _Cutoff_ logistic_sensitivity logistic_fpr
treesplit_sensitivity treesplit_fpr
forest_sensitivity forest_fpr
gradboost_sensitivity gradboost_fpr
svm_sensitivity svm_fpr;
label logistic_sensitivity="Logistic Regression"
treesplit_sensitivity="Decision Tree"
forest_sensitivity="Random Forest"
gradboost_sensitivity="Gradient Boosting"
svm_sensitivity = "Support Vector Machine";
run;
After running the ASSESS procedures we have output datasets containing performance information for each model, but we would prefer to plot this information together so we can easily compare performance across models. We use the DATA Step MERGE statement to combine the model assessments into a single dataset for lift information and a single dataset for ROC information. We must rename the variables in these datasets since the ASSESS procedure generates generic names that don’t indicate which model an assessment belongs to. In this example we only keep the information we need for creating cumulative lift and ROC plots, but there are more assessments available in the individual tables for each model.
/*plot Cumulative Lift*/
ods graphics / height=8in;
ods html5 style=Illuminate;
title "Cumulative Lift for Training Data";
proc sgplot data=work.lift(where=(_PartInd_ = 0));
series x=_depth_ y=logistic_cumulative_lift;
series x=_depth_ y=treesplit_cumulative_lift;
series x=_depth_ y=forest_cumulative_lift;
series x=_depth_ y=gradboost_cumulative_lift;
series x=_depth_ y=svm_cumulative_lift;
xaxis grid label="Percentile Depth";
yaxis grid label="Cumulative Lift";
run;
title "Cumulative Lift for Validation Data";
proc sgplot data=work.lift(where=(_PartInd_ = 1));
series x=_depth_ y=logistic_cumulative_lift;
series x=_depth_ y=treesplit_cumulative_lift;
series x=_depth_ y=forest_cumulative_lift;
series x=_depth_ y=gradboost_cumulative_lift;
series x=_depth_ y=svm_cumulative_lift;
xaxis grid label="Percentile Depth";
yaxis grid label="Cumulative Lift";
run;
Now that the lift information for each model is stored in a single dataset, we can use that dataset with the SGPLOT procedure to create an assessment plot comparing the cumulative lift for each model. We use the partition indicator to create separate plots for the training and validation samples, and by inspection it doesn’t look like there is a major discrepancy in model performance between the samples, although it does look like the more complex models (random forest and gradient boosting) have better performance on training data than on validation data.
/*plot ROC*/
ods graphics / height=8in;
ods html5 style=Illuminate;
title "ROC for Training Data";
proc sgplot data=work.roc(where=(_PartInd_ = 0));
series x=logistic_fpr y=logistic_sensitivity;
series x=treesplit_fpr y=treesplit_sensitivity;
series x=forest_fpr y=forest_sensitivity;
series x=gradboost_fpr y=gradboost_sensitivity;
series x=svm_fpr y=svm_sensitivity;
lineparm x=0 y=0 slope=1 / lineattrs=(pattern=dash color=black) legendlabel='Random Guessing';
xaxis grid label="False Positive Rate";
yaxis grid label="True Positive Rate";
run;
title "ROC for Validation Data";
proc sgplot data=work.roc(where=(_PartInd_ = 1));
series x=logistic_fpr y=logistic_sensitivity;
series x=treesplit_fpr y=treesplit_sensitivity;
series x=forest_fpr y=forest_sensitivity;
series x=gradboost_fpr y=gradboost_sensitivity;
series x=svm_fpr y=svm_sensitivity;
lineparm x=0 y=0 slope=1 / lineattrs=(pattern=dash color=black) legendlabel='Random Guessing';
xaxis grid label="False Positive Rate";
yaxis grid label="True Positive Rate";
run;
We also plot the ROC information using the same SGPLOT procedure as before, and this time we also include a 45-degree “random guessing model” line on the ROC plot. This is the expected performance of a model that yields a false positive for every true positive, which is effectively what we would expect with a model that randomly assigns 0s and 1s as the target value rather than learning from the training data. It can be useful to see how much better we can do than this line, especially in situations where there is a lot of noise in the data, and it is challenging to build a model with strong performance on validation data. Again, we don’t see a major discrepancy between training and validation performance, indicating that we are not overfitting the training data.
Looking at the validation data for both the lift and the ROC plots reveals that the gradient boosting model is the best performing of the models that we have built. We just used the default settings for most of the models, so there is more work to do to ensure that we have the best possible model. We can continue to experiment with different model hyperparameters to see if we can improve the performance of the champion gradient boosting model. Note that in SAS Viya we can perform autotuning to automatically experiment with different hyperparameters and select the ones that yield the best performance, but this takes advantage of the distributed computing power in SAS Viya and is not currently available in SAS Viya Workbench.
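As a starting point for that experimentation, a sketch like the one below overrides a few GRADBOOST options; the option names reflect our reading of the procedure syntax, the values are illustrative rather than recommended, and the new score code file could then be assessed on the validation partition exactly as before:
/*hypothetical sketch: manually adjust a few gradient boosting hyperparameters and regenerate score code*/
proc gradboost data=work.hmeq(where=(_PartInd_ = 0)) ntrees=150 learningrate=0.05 maxdepth=5 seed=919;
target BAD / level=nominal;
input &nominals / level=nominal;
input &intervals / level=interval;
code file="/workspaces/myfolder/MachineLearning/gradboost_tuned_score_code.sas";
quit;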
/*Display misclassification for 'champion' model on validation data*/
title "Misclassification Rate for Gradient Boosting Model (Validation Data)";
proc print data=work.gradboost_roc;
where _PartInd_=1 and round(_Cutoff_,0.01) = 0.5;
var _Cutoff_ _MiscEvent_;
run;
The champion gradient boosting model has a misclassification rate of about 10% at the 0.5 cutoff (all predicted probabilities greater than or equal to 0.5 are classified as a 1, while predicted probabilities below 0.5 are classified as a 0).
/*sort the validation rows by misclassification into a separate dataset so the original ROC table stays intact*/
proc sort data=work.gradboost_roc(where=(_PartInd_=1)) out=work.gradboost_roc_valid;
by _MiscEvent_;
run;
proc print data=work.gradboost_roc_valid(obs=1);
var _cutoff_ _TP_ _FP_ _FN_ _TN_ _MiscEvent_;
run;
The 0.5 probability cutoff is a good starting point for a cutoff, but the lowest misclassification rate is not always found at this cutoff. In this case the best accuracy on the validation data is found at the 0.42 cutoff, with a misclassification of 0.097, which isn’t significantly different from the value found at the 0.5 cutoff. In business settings it is important to choose a cutoff that best achieves the business goals (often maximizing profit), which isn’t necessarily the one that minimizes misclassification. This can happen when the costs or consequences of false positives (_FP_) are different from the costs or consequences of false negatives (_FN_). The misclassification assessment treats these ‘mistakes’ as equal, but in practice this is not always true.
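If we do decide to classify at a cutoff other than 0.5, applying it is a one-line DATA Step calculation. In this sketch the work.gradboost_class dataset and the predicted_BAD column are our own names, and 0.42 is the cutoff discussed above:
/*hypothetical sketch: convert predicted probabilities into class decisions at a chosen cutoff*/
%let cutoff = 0.42;
data work.gradboost_class;
set work.gradboost_scored;
predicted_BAD = (P_BAD1 >= &cutoff); /*1 when the probability is at or above the cutoff, 0 otherwise*/
run;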
The last step in our analysis is to deploy the model, which can be done by combining the imputation score code with the gradient boosting score code and then applying it to new data containing the same variables as the original dataset. SAS Viya Workbench is designed as a development environment, so although we can score new data in a notebook or SAS program in the Workbench, we can also deploy the SAS DATA Step score code in a more systematic way anywhere that SAS can run, including SAS 9 deployments, SAS Viya deployments, and even some databases using the SAS Scoring Accelerator. This method of creating DATA Step score code is just one option; we could have also created SAS Analytic Store (ASTORE) binary files containing the scoring logic for deployment. These can also be deployed where SAS runs, this time using the ASTORE procedure. For both deployment methods the score code/scoring artifacts can be tracked and managed using SAS Model Manager, although this is a Viya product outside of SAS Viya Workbench.
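As a sketch of that final scoring step (work.new_applicants is a placeholder for whatever new data needs to be scored), the combined run is just another DATA Step with both %include files:
/*hypothetical sketch: score new data by chaining the imputation and champion model score code*/
data work.new_applicants_scored;
set work.new_applicants;
%include '/workspaces/myfolder/MachineLearning/impute_score_code.sas';
%include '/workspaces/myfolder/MachineLearning/gradboost_score_code.sas';
run;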