SAS Viya Workbench: Python Machine Learning using sasviya.ml and scikit-learn

This blog shows how to use the sasviya.ml and sklearn Python packages in SAS Viya Workbench to build and evaluate machine learning models. SAS Viya Workbench is a new SAS programming environment that supports both the SAS and Python languages. The Workbench environment is designed to support native Python programming, and SAS has released a proprietary Python package named sasviya.ml that contains optimized SAS machine learning algorithms designed to run in SAS Viya Workbench. SAS Viya Workbench can also execute SAS procedures, but this blog focuses on the Python functionality.

We created this example using a Python Jupyter notebook and the Visual Studio Code IDE, and the code in this example requires the use of SAS Viya Workbench. All of the code presented in this blog can be copied into a .py Python program or a .ipynb Jupyter Notebook in SAS Viya Workbench and you can execute it to follow along with the examples.

import requests

# File path and name
file_path = r"/workspaces/myfolder/MachineLearning/hmeq.csv"

# Specify the URL of the CSV file
url = r"https://support.sas.com/documentation/onlinedoc/viya/exampledatasets/hmeq.csv"

# Download the CSV file and save it to Workbench
response = requests.get(url)
response.raise_for_status()  # stop here if the download failed
with open(file_path, 'wb') as f:
    f.write(response.content)
print(f'File downloaded: {file_path}')

To make this example more portable, we start by downloading the data from a web URL using the Python requests package. We define the location in SAS Viya Workbench where we want to save the CSV file, in this case /workspaces/myfolder/MachineLearning/hmeq.csv. We then download the data from a SAS documentation page and write the response content to that file. The /workspaces/myfolder/ location was created when we initialized our Workbench environment, and we manually added the MachineLearning folder using the VS Code interface.

import pandas as pd

hmeq_df = pd.read_csv(r"/workspaces/myfolder/MachineLearning/hmeq.csv")

Although we will be using SAS Viya Workbench machine learning algorithms, this example is all in Python, so we will use pandas and scikit-learn to help load and prepare the data for machine learning. The SAS machine learning algorithms all live in the sasviya.ml Python package, and we will import them individually as we use them. We import pandas and read the CSV file into a pandas DataFrame named hmeq_df. SAS Viya Workbench includes commonly used Python packages for data science, and in general you can use pip to install your own packages, although the write permissions and available packages will be determined by your site administrator. In this example we use pandas, numpy, and scikit-learn to prepare the data and evaluate our machine learning models. We didn’t have to pip install these packages; they were already included as part of the configuration of SAS Viya Workbench.


Before we start building models, let’s quickly build some intuition about this data. This is fictitious financial services data; we operate a bank and must decide whether to accept or reject loan requests from customers. Our target in this data is the variable BAD (binary applicant default), which indicates whether our historical customers defaulted on their loan. BAD equal to 1 indicates that the customer did default on the loan; BAD equal to 0 indicates they did not. The rest of the variables in the data represent information about the customer collected by the bank, which can be used as candidate input variables for our machine learning models. These input columns contain information like the amount of the home equity loan the customer is requesting, the amount left on their mortgage, the value of the mortgaged home, the reason they are requesting a loan, their job, their years on the job, along with some information about their past credit history like the numbers of derogatory credit reports and delinquent credit lines. The important thing about this data from a technical software perspective is that we have a mix of interval variables like LOAN and MORTDUE along with categorical variables like REASON and JOB (in our data these variables contain strings). We don’t have to do any dummy coding when using the sasviya.ml package, but the SAS machine learning models will treat the categorical input variables differently from the interval variables.

hmeq_df.head(10)

We use the head() method to print out the first 10 observations in our dataset. Looking at these 10 printed rows, we can see that we have some missing values in the data, coded in Python as numpy NaN values. We want to replace these missing values, otherwise the rows with missing values won’t be included in our machine learning models, which can dramatically reduce the amount of data we have available for training. Imputing these missing values will be part of our data preprocessing, which is our next step. This is the bare minimum of preprocessing that we must do for machine learning. There are lots of data preprocessing methods available in Python packages in SAS Viya Workbench that could be useful with different kinds of data, and in general when you are working with new data it is recommended to spend some time on data exploration, which can help identify what kind of data preprocessing would be useful.

For this example we will stick with simple data preprocessing, which will include partitioning the data, imputing missing values, and selecting relevant input variables. Partitioning the data into training and validation samples is a best practice that you cannot skip in machine learning; if you don’t have a validation sample, you can’t really know whether you have fit an effective model or whether the model has just memorized the historical data used for training. We impute missing values because we have a lot of them in the data, and we want to use as much of the data as possible. Automatic variable selection is more useful when you have a lot of hard-to-understand input variables (we only have 12 easy-to-understand input variables in this dataset), but we include it in this example because it can be useful to compare your intuition about which variables might be useful in predicting the target (manual variable selection) with the results of data preprocessing routines that automatically select variables.
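Before imputing, it can help to quantify how much data is missing. The sketch below uses a small hypothetical DataFrame (toy_df, with made-up rows that only mimic a few HMEQ column names) to count missing values per column and to show how few rows would survive if incomplete rows were simply dropped. The same calls could be applied to hmeq_df.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows that mimic a few HMEQ columns (not the real data)
toy_df = pd.DataFrame({
    "LOAN": [1100, 1300, 1500, 1500],
    "MORTDUE": [25860.0, 70053.0, np.nan, np.nan],
    "REASON": ["HomeImp", "HomeImp", None, "DebtCon"],
    "JOB": ["Other", None, "Other", "Office"],
})

# Count and percentage of missing values per column
missing_count = toy_df.isna().sum()
missing_pct = (toy_df.isna().mean() * 100).round(1)
summary = pd.DataFrame({"n_missing": missing_count, "pct_missing": missing_pct})
print(summary)

# Rows that would survive if we dropped every row with any missing value
complete_rows = len(toy_df.dropna())
print("Complete rows:", complete_rows, "of", len(toy_df))
```

In this toy example only one of the four rows is complete, which illustrates why imputation can preserve far more training data than dropping incomplete rows.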

#We use sklearn to partition the data into training and validation data
from sklearn.model_selection import train_test_split

X = hmeq_df.drop('BAD', axis=1)
y = hmeq_df['BAD']
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.3, stratify=y, random_state=919)

print("Training data shape:", X_train.shape, "\n" + "Validation data shape:", X_valid.shape)

We split the HMEQ dataset into training and validation samples using the train_test_split function from the sklearn.model_selection module, sampling 30% of the data as a validation set and the remaining 70% as a training set. Scikit-learn expects us to split the inputs from the target, so we create a DataFrame X containing only the inputs and a Series y containing the target BAD. We apply the train_test_split function to these objects, specifying that we want a 30% test sample for the validation data, and that we want to stratify based on the target variable. This stratified sampling means that we will have the same proportion of BAD=1 (people who defaulted on their loans) in the training and validation data. These samples are supposed to represent the data our machine learning model will encounter in deployment, so we should preserve the target distribution from the original dataset. We also specify a random state to ensure that the sampling is reproducible when we run this notebook multiple times. This process outputs 4 different pandas objects: 2 DataFrames containing the training and validation inputs (X_train and X_valid), and 2 Series containing the training and validation targets (y_train and y_valid). After the partitioning we have 4,172 training observations and 1,788 validation observations.
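The effect of stratifying on the target can be checked directly. The following is a minimal sketch on a hypothetical toy target (y_toy, with a made-up 20% event rate, not the HMEQ data); it passes stratify to train_test_split and compares the event rate in each partition.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy target: 20% events (1) and 80% non-events (0)
y_toy = np.array([1] * 200 + [0] * 800)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)  # placeholder input column

X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=919
)

# With stratification the event rate is preserved in both partitions
print("Event rate overall:   ", y_toy.mean())
print("Event rate training:  ", y_tr.mean())
print("Event rate validation:", y_va.mean())
```

On the real data, the equivalent check would be to compare y_train.mean() and y_valid.mean() with hmeq_df['BAD'].mean().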

Now that we have built this partition, we can start using the training data to build models and do our data preprocessing, while remembering to apply all the data preprocessing steps to the validation data. Our first form of data preprocessing is to impute missing values, but we want to use the training data to calculate the median and the mode that we're going to use for imputation, and then we're going to use those values when imputing the validation data. The idea is that it's cheating to use information from the validation data to impute values on the training data, since information leaks from the validation sample into the training process. We should learn all of our parameters from the training data, and the median and mode that we use for imputation can be thought of as model parameters associated with data preprocessing. This is also more reflective of a deployment scenario, where we will also use the median and mode from the training data to impute missing values in new observations that we want to score.

#We use sklearn to impute missing values (using the median/mode from the training data to impute the validation data)
from sklearn.impute import SimpleImputer
import numpy as np

#Calculate the median and mode on the training data to prepare the imputer
imp_interval = SimpleImputer(missing_values=np.nan, strategy='median')
imp_interval.set_output(transform='pandas')
imp_interval = imp_interval.fit(X_train.select_dtypes(include='number'))

imp_nominal = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
imp_nominal.set_output(transform='pandas')
imp_nominal = imp_nominal.fit(X_train.select_dtypes(include='object'))

#Apply the imputation to the training data
X_train_imp_int = imp_interval.transform(X_train.select_dtypes(include='number'))
X_train_imp_nom = imp_nominal.transform(X_train.select_dtypes(include='object'))

X_train_imp = pd.concat([X_train_imp_int, X_train_imp_nom], axis=1)

#Apply the imputation to the validation data
X_valid_imp_int = imp_interval.transform(X_valid.select_dtypes(include='number'))
X_valid_imp_nom = imp_nominal.transform(X_valid.select_dtypes(include='object'))

X_valid_imp = pd.concat([X_valid_imp_int, X_valid_imp_nom], axis=1)

print("Training data shape:", X_train_imp.shape, "\n" + "Validation data shape:", X_valid_imp.shape)

We import the SimpleImputer class from the sklearn.impute module, and we create two imputer objects: one for the interval variables that imputes using the median, and one for the nominal variables that imputes using the mode (‘most frequent’ is the term used by sklearn). For both imputers we set the output to be a pandas DataFrame (which is not the default for the imputers). This will be important later because the sasviya.ml algorithms are easier to use when our data is in the form of a pandas DataFrame. In both cases we use the fit method to learn the imputation parameters (in this case just the median and mode for each column) based only on the training data. It is important to separate the interval and the nominal variables for this, so we use the select_dtypes method to ensure that the median imputation is calculated for the numeric variables and the mode imputation is calculated for the text variables. This could be a bit more complicated if you have nominal variables with numeric values, at which point it could be helpful to create lists of input variables separated by how they should be treated. Fitting the two imputer objects only learns the parameters from the training data, so we also must apply the imputation to both the training and validation data. We do this separately for the training and validation data, for each sample concatenating the interval and nominal input variables into a single DataFrame. At the end of this process, we still have the same size training and validation samples, but this time the 12 input columns have no missing values. We also have the imputer objects that we can use to apply the imputation to new data during the scoring process.

#Select useful input variables using chi-square test
from sklearn.feature_selection import SelectKBest, chi2

#Sklearn requires nominal variables to be dummy coded, for now let's skip this and just select from the interval variables

#Select variables based on the training data
selector = SelectKBest(chi2, k=8).fit(X_train_imp_int, y_train)
selector.set_output(transform="pandas")

#Apply the selection to the training and validation data
train_selected = selector.transform(X_train_imp_int)
valid_selected = selector.transform(X_valid_imp_int)

X_train_final = pd.concat([train_selected, X_train_imp_nom], axis=1)
X_valid_final = pd.concat([valid_selected, X_valid_imp_nom], axis=1)

print("Training data shape:", X_train_final.shape, "\n" + "Validation data shape:", X_valid_final.shape)

print("Selected interval features:", selector.get_feature_names_out())
print("Selected nominal features:", list(X_train_imp_nom.columns))

Variable selection is important for ensuring that the machine learning models are provided useful inputs. You can always select important variables based on subject matter expertise, but it’s often helpful to compare your intuition to an automatic variable selection approach. We import the SelectKBest class from scikit-learn to select the top 8 most relevant interval variables, based on a chi-square test. We only have 10 interval input variables, so this will drop the 2 interval variables that are least useful in predicting the target. Scikit-learn requires categorical variables to be dummy coded (often using one-hot encoding) before they can be used for most machine learning and data preprocessing, but the sasviya.ml package allows us to skip this step, so we will leave our categorical variables in their original form. For this example, we will just keep the 2 categorical variables as inputs and select the 8 best interval inputs using SelectKBest. Just like before we want to output a pandas DataFrame, and once we select variables based on the training data (once we fit the SelectKBest object), we apply the variable selection to the interval variables in both the training and validation data. After reducing the interval variables, we concatenate them back with the categorical variables to get the preprocessed training and validation input samples. This time these samples only have 10 columns, since we have eliminated 2 of the variables. We also print out the selected variables; we didn’t apply any variable selection to the categorical variables, so both are kept without any calculations. We can now use these preprocessed input samples to fit our machine learning models (with the training data) and evaluate and compare the models’ performance (with the validation data).

The sasviya.ml package is designed to integrate with scikit-learn, so the sasviya.ml model objects have the same syntax and most of the same functionality as the equivalent scikit-learn model objects. The major difference is in the execution: the models in the sasviya.ml package execute using optimized SAS libraries, which take advantage of multithreading and can yield much faster runtimes than the equivalent models in scikit-learn. If you are interested in comparing the performance of sklearn and sasviya.ml, there is a link to a repository of speed comparisons in the references at the end of this blog. The sasviya.ml syntax is nearly identical to the scikit-learn syntax (the constructor is the same, the objects have the same methods such as fit and predict, etc.), so if you have experience building machine learning models in scikit-learn, you can easily start building models using sasviya.ml. One convenient feature of this design is that you can take existing scikit-learn code and, without modifying the rest of the code, change the import statement to point to sasviya.ml instead of scikit-learn. This will use the optimized SAS libraries for machine learning without requiring you to do any code rewrites. One difference between scikit-learn and the sasviya.ml package to keep in mind is that the scikit-learn models require all inputs to be numeric (so we must dummy code before sending the data to the model), whereas the sasviya.ml models are designed to do the dummy coding for you if you provide categorical inputs. This is a convenience feature, so all your legacy scikit-learn code will work with sasviya.ml, but you can skip a few steps in code when using the SAS optimized libraries. The rest of this notebook will look just like a scikit-learn demo, but we will import the models from sasviya.ml instead of from sklearn.

If you have experience with scikit-learn and would prefer to organize your code differently or use features like sklearn pipelines, you can easily import your sklearn code into SAS Viya Workbench and modify it to use the optimized SAS algorithms.
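The import swap described above can be sketched as follows. The try/except fallback is our own illustration (not something the sasviya.ml package requires) so that the same script also runs in environments where sasviya.ml is not installed. The toy data and the backend variable are hypothetical, and the constructor arguments are the ones used for the logistic regression later in this post.

```python
import numpy as np
import pandas as pd

try:
    # Available inside SAS Viya Workbench
    from sasviya.ml.linear_model import LogisticRegression
    backend = "sasviya.ml"
except ImportError:
    # Fallback so the sketch also runs where sasviya.ml is not installed
    from sklearn.linear_model import LogisticRegression
    backend = "sklearn"

# Hypothetical toy data: one numeric input that separates the classes reasonably well
rng = np.random.default_rng(919)
X_toy = pd.DataFrame(
    {"x1": np.concatenate([rng.normal(0, 1, 100), rng.normal(3, 1, 100)])}
)
y_toy = pd.Series([0] * 100 + [1] * 100, name="BAD")

# Everything below the import is identical for both backends
model = LogisticRegression(solver="lbfgs", tol=1e-4, max_iter=1000)
model.fit(X_toy, y_toy)
preds = model.predict(X_toy)

print("Backend:", backend, "| predictions:", len(preds))
```

Only the import line changes between the two backends. The toy inputs are numeric on purpose, because the scikit-learn fallback would not accept string-valued categorical columns.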

#fit a simple logistic regression model
from sasviya.ml.linear_model import LogisticRegression

logreg = LogisticRegression(solver='lbfgs',
                        tol=1e-4,
                        max_iter=1000)

logreg.fit(X_train_final, y_train)

Let’s start building some different machine learning models, always making sure to compare to a simple logistic regression model. Logistic regression isn’t exactly what we think of when we think of a machine learning model, but it’s always a good idea to compare our complex nonlinear models to a simple linear model. We import the LogisticRegression object from sasviya.ml.linear_model and use the constructor to instantiate a logistic regression object named ‘logreg’. We accept most of the default settings for the logistic regression model, specifying that we want to use the LBFGS method for the likelihood maximization, with a convergence tolerance of 10^-4 and a maximum of 1000 iterations for the LBFGS solver. These settings are all about how the model estimates the parameters, but there is also an option for choosing a variable selection method (like backward or stepwise) for the logistic regression model. After we construct the ‘logreg’ object, we can apply the fit method to the training data (X_train_final and y_train) to create our model. After the fitting process the ‘logreg’ object can be used with the predict and predict_proba methods to score new data (which we will do later in the notebook when we evaluate model performance on the validation data).

#fit a decision tree model
from sasviya.ml.tree import DecisionTreeClassifier

dtree = DecisionTreeClassifier(criterion="chisquare",
                               max_depth=10,
                               ccp_alpha=0)

dtree.fit(X_train_final, y_train)

We can do the same thing with a decision tree model, importing the DecisionTreeClassifier object from sasviya.ml.tree. We use the constructor to instantiate the model object ‘dtree’, and this time we specify some key decision tree hyperparameters. We use the chi-square method as the split criterion, which is how we evaluate the quality of the splits in the decision tree. We also specify a max depth of 10, so any leaf node in the tree is no more than 10 splits below the root node. A smaller max depth leads to a simpler tree, which can be useful to avoid overfitting, but can also lead to a model that is too simple to represent the training data. Just like with the logistic regression model, we fit the dtree object using the training data, and we will use this same object later to evaluate performance on validation data.

#fit a random forest model
from sasviya.ml.tree import ForestClassifier

forest = ForestClassifier(criterion="chisquare",
                          n_estimators=100,
                          max_depth=7,
                          min_samples_leaf=5,
                          bootstrap=0.6,
                          random_state=919)

forest.fit(X_train_final, y_train)

Next, we create a random forest model by importing the ForestClassifier object from sasviya.ml.tree. A forest model is an ensemble of individual decision trees, with each tree fit to be slightly different from the other trees in the forest. When we create the forest model, we specify some hyperparameters related to how individual trees are grown, like max_depth, criterion, and min_samples_leaf. These hyperparameters are the same as the ones we use with the individual decision tree model. We can also specify hyperparameters associated with how we create the forest, like n_estimators and bootstrap. The forest model is only useful if the trees in the ensemble are different from one another, so we set the bootstrap option to 0.6 so that each tree is fit using a random sample containing 60% of the training data. Once we use the constructor to define the model hyperparameters, we can fit the forest object using the training data.
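To build intuition for why a 60% sample makes the trees differ, the sketch below draws a separate random 60% of row indices for each of a few trees and measures how much two samples overlap. This is only an illustration of the idea and not the sasviya.ml implementation; in particular, sampling without replacement is an assumption made for this sketch.

```python
import random

n_rows = 4172        # training rows reported earlier in this post
n_trees = 3          # small number for illustration (the forest above uses 100)
fraction = 0.6       # matches bootstrap=0.6

rng = random.Random(919)
sample_size = int(n_rows * fraction)

# Assumption for this sketch: each tree draws its rows without replacement
tree_samples = [set(rng.sample(range(n_rows), sample_size)) for _ in range(n_trees)]

for i, rows in enumerate(tree_samples):
    print(f"Tree {i}: {len(rows)} rows")

# Overlap between the first two trees, as a share of one tree's sample
overlap = len(tree_samples[0] & tree_samples[1]) / sample_size
print(f"Share of rows tree 0 has in common with tree 1: {overlap:.2f}")
```

Each tree sees a different subset of the rows, so the fitted trees differ, and averaging over many such trees is what gives the ensemble its stability.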

#fit a tree-based gradient boosting model
from sasviya.ml.tree import GradientBoostingClassifier

gradboost = GradientBoostingClassifier(n_estimators=100,
                                       max_depth=4,
                                       min_samples_leaf=5,
                                       learning_rate=0.1,
                                       subsample=0.8,
                                       random_state=919)

gradboost.fit(X_train_final, y_train)

We can also fit a gradient boosting ensemble model, where instead of fitting a forest of independent trees, we fit a sequence of trees where each tree tries to improve on the performance of the previous trees. The constructor for the GradientBoostingClassifier object has many of the same hyperparameters as the forest and decision tree models, along with some options specific to creating the sequence of trees like the learning_rate hyperparameter. Notice with this model we use a much smaller maximum depth than with the previous models; in general, gradient boosting models are often more powerful and accurate than individual trees, but this comes with the risk of overfitting the training data. We use a small max depth for this model to help ensure that we don’t memorize the training data. Once we construct the gradboost object, we fit it using the training data.
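The idea of a sequence of trees, each improving on the previous ones, can be illustrated with a small from-scratch sketch. This is not how GradientBoostingClassifier is implemented; it is a simplified version for a continuous target with squared error loss, using one-split trees (stumps) on simulated data. All function names and data here are hypothetical.

```python
import numpy as np

def fit_stump(x, residual):
    """Find the single threshold split on x that minimizes squared error."""
    best = None
    for threshold in np.unique(x)[:-1]:
        left = residual[x <= threshold]
        right = residual[x > threshold]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, threshold, left.mean(), right.mean())
    return best[1:]

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)

# Simulated data: a noisy sine curve
rng = np.random.default_rng(919)
x = rng.uniform(0, 10, 200)
y = np.sin(x) + rng.normal(0, 0.1, 200)

learning_rate = 0.1
prediction = np.full_like(y, y.mean())   # start from the overall mean
mse_history = [np.mean((y - prediction) ** 2)]

for _ in range(100):
    residual = y - prediction             # what the current ensemble still gets wrong
    stump = fit_stump(x, residual)        # next tree is fit to those errors
    prediction = prediction + learning_rate * predict_stump(stump, x)
    mse_history.append(np.mean((y - prediction) ** 2))

print(f"MSE before boosting: {mse_history[0]:.4f}")
print(f"MSE after 100 trees: {mse_history[-1]:.4f}")
```

The learning rate shrinks each tree's contribution, so the training error falls gradually. Left unchecked, that steady improvement on the training data is also what makes overfitting a risk.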

#fit a support vector machine classifier
from sasviya.ml.svm import SVC

svm = SVC(C=1.0,
          kernel="rbf")

svm.fit(X_train_final, y_train)

Finally, we fit a support vector machine model (in this case a support vector classifier, or SVC model which we import from sasviya.ml.svm). For binary targets this is normally a linear model that finds a hyperplane to maximize the separation between the two classes. In this example we use the radial basis kernel function (known as the RBF kernel), which is a nonlinear kernel that projects the problem into a higher-dimensional space and then fits a linear hyperplane in this projection. In the original dimensions this becomes a nonlinear decision boundary. In addition to the kernel, we specify the penalty parameter C, which determines a cost penalty for incorrectly classified cases. Once we create the svm model object, we fit it using the training data.
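The RBF kernel itself is a simple similarity function, K(x, z) = exp(-gamma * ||x - z||^2). The sketch below computes it for a few hypothetical points; the gamma value is chosen only for illustration and is not a setting taken from the model above.

```python
import math

def rbf_kernel(x, z, gamma):
    """K(x, z) = exp(-gamma * ||x - z||^2)"""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)

gamma = 0.5  # hypothetical value chosen for illustration
point = (1.0, 2.0)

k_same = rbf_kernel(point, point, gamma)        # identical points
k_near = rbf_kernel(point, (1.5, 2.0), gamma)   # nearby point
k_far = rbf_kernel(point, (4.0, 6.0), gamma)    # distant point
print(k_same, k_near, k_far)
```

The kernel equals 1 for identical points and decays toward 0 as points move apart. Because similarity is local in this way, the decision boundary can bend around clusters of observations.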

Now that we have fit our five different machine learning models using training data (well really four machine learning models and a logistic regression as a baseline comparison) we can evaluate the performance on validation data and choose a champion model. The example we show in this blog uses a binary target, but each of the models we used also has an option for data with a continuous target. For most of the models we have a regressor option, so instead of using the ForestClassifier model, we would use the ForestRegressor model for a continuous target. More details about these regressor models are available in the SAS Viya Workbench documentation.

#score training and validation data using the fitted models (we can assess them separately after this)
models = ['logreg', 'dtree','forest','gradboost','svm']
train_out = dict.fromkeys(models, None)
valid_out = dict.fromkeys(models, None)

#score the training and validation data using the models, and join the predictions to the target
for model in models:
    train_out[model] = eval(model).predict_proba(X_train_final)
    train_out[model] = train_out[model].join(y_train)

    valid_out[model] = eval(model).predict_proba(X_valid_final)
    valid_out[model] = valid_out[model].join(y_valid)

To evaluate model performance and ensure we don’t overfit the training data, we score both the training and validation data using all our models. First, we create a list of models, with the name of the model on the list equal to the name of the model object we created during training. We then create an output dictionary for training and validation data from these models, so we have a place to store the scored training and validation data. We use a Python for loop to loop over the list of models, generating the predicted probabilities using the predict_proba method applied to the model object. We join these predicted probabilities with the true value of the target and save the scored output in the model dictionary. We do this separately for training and validation data to yield two output dictionaries, each containing dataFrames with scored output for each model.

#plot ROC
from sklearn.metrics import roc_curve
from matplotlib import pyplot as plt

train_roc = dict.fromkeys(models, None)
valid_roc = dict.fromkeys(models, None)

plt.figure(figsize=(10,8))
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')

plt.title('ROC Curve for Training Data')
for model in models:
    train_roc[model] = roc_curve(train_out[model]['BAD'], train_out[model]['P_BAD1'], pos_label=1)
    plt.plot(train_roc[model][0], train_roc[model][1])
plt.legend(models)

plt.figure(figsize=(10,8))
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')

plt.title('ROC Curve for Validation Data')
for model in models:
    valid_roc[model] = roc_curve(valid_out[model]['BAD'], valid_out[model]['P_BAD1'], pos_label=1)
    plt.plot(valid_roc[model][0], valid_roc[model][1])
plt.legend(models);

We plot the ROC curve by using the roc_curve function in the sklearn.metrics module to calculate ROC information. Once again, we create empty dictionaries to store ROC information, and then we loop over each of our models, adding the ROC curve information for each model to the pyplot figure. We do this separately for training and validation data so we can compare models on both partitions. There are plenty of other assessments we could use, and we could also use other Python plotting packages like seaborn. Looking at the results, we can see that the gradient boosting model is the champion on both training and validation data. Although the performance is better on the training data, we don’t see evidence of major overfitting, because the validation and training curves look similar in the plots.
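One of the other assessments we could use is the area under the ROC curve (AUC), which summarizes each curve as a single number and makes the model comparison less dependent on reading a plot. The sketch below uses the roc_auc_score function from sklearn.metrics on hypothetical targets and predicted probabilities for two made-up models.

```python
from sklearn.metrics import roc_auc_score

# Hypothetical targets and predicted probabilities for two models
y_true = [0, 0, 1, 1, 0, 1, 0, 1]
scores = {
    "model_a": [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.3, 0.7],
    "model_b": [0.5, 0.6, 0.4, 0.7, 0.3, 0.45, 0.55, 0.65],
}

# AUC near 1.0 means events are consistently ranked above non-events
auc = {name: roc_auc_score(y_true, probs) for name, probs in scores.items()}
for name, value in auc.items():
    print(f"{name}: AUC = {value:.3f}")
```

With the scored output from this post, the same function could be called with valid_out[model]['BAD'] and valid_out[model]['P_BAD1'] inside the existing loop over models.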

#Calculate misclassification for 'champion' model on validation data
from sklearn.metrics import accuracy_score

cutoff = 0.5
valid_preds = [1 if valid_out['gradboost']['P_BAD1'][elem] > cutoff else 0 for elem in valid_out['gradboost'].index]

print('Misclassification Rate on Validation Data for Gradient Boosting at',cutoff,'cutoff:', 1-accuracy_score(valid_out['gradboost']['BAD'], valid_preds))

Before we declare the gradient boosting model the final champion model, let’s look at the misclassification rate at the 0.5 probability cutoff. This isn’t necessarily the best cutoff for all models on all data, but it is a good place to start when building and evaluating models. We import the accuracy_score function from sklearn.metrics, but before we can apply it, we have to apply the 0.5 probability cutoff to the predicted probabilities generated earlier using the predict_proba() method. We could have used the predict() method to generate these class values instead, but it’s helpful in this example to explicitly choose a cutoff (mainly so we can emphasize that the cutoff affects the outcome and has an impact on the misclassification rate). We use a Python list comprehension to create a list of class predictions on the validation data, and then we apply the accuracy_score function to this list, comparing it to the true value of the target. We print the misclassification rate (1 minus the accuracy) rather than the accuracy, and we see that the champion gradient boosting model is incorrect for about 9.34% of the observations in the validation data.
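To make the effect of the cutoff concrete, the sketch below recomputes the misclassification rate at several cutoffs. The targets and probabilities are hypothetical values made up for illustration, and misclassification_rate is a helper defined only for this sketch.

```python
# Hypothetical validation targets and predicted probabilities of BAD=1
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]
p_bad1 = [0.05, 0.10, 0.20, 0.35, 0.45, 0.60, 0.40, 0.55, 0.90, 0.15]

def misclassification_rate(y, p, cutoff):
    """Share of rows where the class implied by the cutoff differs from the target."""
    preds = [1 if prob > cutoff else 0 for prob in p]
    errors = sum(pred != actual for pred, actual in zip(preds, y))
    return errors / len(y)

rates = {c: misclassification_rate(y_true, p_bad1, c) for c in (0.3, 0.5, 0.7)}
for cutoff, rate in rates.items():
    print(f"cutoff={cutoff}: misclassification={rate:.2f}")
```

A lower cutoff flags more customers as likely defaulters, trading false negatives for false positives. Which cutoff is preferable depends on the relative cost of those two kinds of error.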

The last step is to start thinking about model deployment. To deploy these models, we just need to apply the data preprocessing and the champion model to the new data. It’s easy to score new cases in the same SAS Viya Workbench environment: we just keep the trained data preprocessing and model objects in Python memory and apply the transform and predict methods to the new data. It gets a bit more complex if we want to deploy to a different environment; in this case we would want to pickle the Python model objects and then unpack them in the other environment.
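The pickling step can be sketched with the standard library pickle module. The example below serializes a fitted scikit-learn SimpleImputer (fit on hypothetical toy data) to a temporary file, loads it back, and applies it to new data. The file name is made up, and applying the same pattern to the fitted sasviya.ml model objects is an assumption based on the description above.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.impute import SimpleImputer

# Fit a small preprocessing object (hypothetical toy data)
X_toy = np.array([[1.0], [3.0], [np.nan], [5.0]])
imputer = SimpleImputer(strategy="median").fit(X_toy)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "imputer.pkl"   # hypothetical file name

    # "Training" environment: serialize the fitted object
    with open(path, "wb") as f:
        pickle.dump(imputer, f)

    # "Scoring" environment: load it back
    with open(path, "rb") as f:
        restored = pickle.load(f)

# Apply the restored object to new data
new_data = np.array([[np.nan], [10.0]])
scored = restored.transform(new_data)
print(scored)
```

Two practical cautions apply: only unpickle files from a source you trust, and keep the package versions in the scoring environment consistent with those used for training.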

References:

SAS Viya Workbench: Python Machine Learning using sasviya.ml and scikit-learn
