In this post, we will discuss the option of using the Manage Variables node in Model Studio. In the Machine Learning Using SAS Viya course, we utilize the VARIABLE SELECTION node to help identify useful variables for predicting the target event. The MANAGE VARIABLES node allows us to change variable properties such as the role and measurement level, as well as the specific method used to transform or impute an input variable. In this demonstration, we will compare the performance of pipelines built with the Manage Variables node and the Variable Selection node.
We utilize the comms data set for this demonstration; it consists of 56,000 rows and 128 columns. The telecommunication company has been having trouble retaining its existing customers, and it can be costly to attract new ones. The goal is to determine which modeling approach outperforms the other. We start by loading the data into Model Studio.
The above image shows how we load the data into Model Studio from the "New Project" tab. We start by naming our project and selecting the project type; for this project, the type is "Data Mining and Machine Learning". We select a basic template for a class target, and the data we are using is labeled "COMMSDATA". Optionally, you can provide a brief description of the project, which is useful if you plan to share the project with colleagues. Once the data is loaded and the new project is saved, we are directed to the Data tab to preview the data. Before we can build our model, we must select the target variable for our machine learning model.
From the above image, we selected "churn" as our target event of interest. The goal is to predict the probability that customers will churn, or end business with the telecommunication company. Next, we will explore our data as it relates to the target variable and look for variables of importance.
The image above shows the DATA node and DATA EXPLORATION node. The DATA node defines all the information within the data, such as formatting, metadata, and the target event. The DATA EXPLORATION node displays statistical plots of the variables. Let's look at some of the results from the DATA EXPLORATION node and see the insights from our input variables.
The above image shows our "Important Input" table. This table provides the relative importance of our input variables as they relate to the target variable "churn". The "curr_days_susp" variable has a relative importance of 1. Other variables with some level of importance to the target include ever_days_over_plan, delinq_indicator, pymts_late_ltd, avg_days_susp, calls_care_ltd, mou_onnet_pct_MOM, and times_susp. Now that we have a sense of the important variables, we can start to build out our pipeline.
We create a new pipeline and label it "Comparison Model"; next, we build our model. We utilize the starter templates available to get a head start on the model building process. For this model, we want to use a template that we utilize in the Machine Learning Using SAS Viya course.
We will create our new pipeline and save it; this will generate our pre-built pipeline for a faster start to our analysis.
Starting from the template, we utilize several data mining and preprocessing nodes: a Replacement node, a Transformation node, and an Imputation node. After the Text Mining node, the pipeline splits into two branches, one with the Manage Variables node and one with the Variable Selection node. This is where we will attach the supervised learning nodes so that we can compare model performance between the two branches. Let's look at the Variable Selection node to see which variables were rejected; this node uses several supervised methods of variable selection to reduce the number of inputs for analysis.
In the above image, we see the selection combination summary table. This table shows which variables have been rejected and which ones have been selected for the model building effort. We use the Fast Supervised selection and Linear Regression selection methods to determine whether a variable will be selected for analysis.
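To build intuition for what a supervised selection method does, here is a hedged sketch in plain Python (not SAS, and not the node's actual Fast Supervised or Linear Regression algorithm): a simple filter that ranks inputs by absolute Pearson correlation with the target and keeps the top k. The column names and data are hypothetical.

```python
# Conceptual stand-in for supervised variable selection:
# rank candidate inputs by |correlation| with the target, keep top k.
def top_k_by_correlation(columns, target, k):
    """columns: dict of name -> list of values; target: list of values."""
    def corr(xs, ys):
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        vx = sum((x - mx) ** 2 for x in xs)
        vy = sum((y - my) ** 2 for y in ys)
        return cov / (vx * vy) ** 0.5

    ranked = sorted(columns,
                    key=lambda name: abs(corr(columns[name], target)),
                    reverse=True)
    return ranked[:k]

# A perfectly correlated input outranks a weakly correlated one.
selected = top_k_by_correlation({"a": [1, 2, 3], "b": [3, 1, 2]}, [1, 2, 3], 1)
```

The real node combines several supervised criteria and can require agreement across them before selecting a variable; this sketch only shows the rank-and-cut idea.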
The above image shows the preprocessing nodes used and the split between the Manage Variables and Variable Selection nodes. The Replacement, Transformation, Imputation, and Text Mining nodes streamline data preparation in SAS Viya. The Replacement node cleans data by substituting missing or invalid values; the Transformation node modifies variables through functions, recoding, or feature creation; the Imputation node fills missing data using statistical methods to preserve records; and the Text Mining node converts unstructured text into structured formats for analysis. Together, they form a unified pipeline that cleans, enriches, and structures data for advanced analytics. Let's look at the variables that were rejected in the Manage Variables node.
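As a minimal illustration of what an imputation step does conceptually (plain Python, not SAS; the values are made up), mean imputation replaces each missing entry with the mean of the observed values so the record can be kept:

```python
# Conceptual sketch of mean imputation, as performed by an Imputation node.
def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

print(impute_mean([10.0, None, 14.0]))  # -> [10.0, 12.0, 14.0]
```

The Imputation node offers other statistics as well (for example, the median, or a constant), but the keep-the-record-by-filling-the-gap idea is the same.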
We right-click the Manage Variables node and select Manage Variables. This allows us to change variables whose role was set to Rejected and assign them a new role of Input for further analysis. Let's look at the rejected variables and move some of them back to inputs to include in the analysis.
In the image above, we select seven variables that were rejected during analysis: bill_data_usg_mo3, bill_data_usg_mo6, data_prem_chrgs_curr, mb_data_usg_mo1, mb_data_usg_mo2, mb_data_usg_mo3, and avg_data_prem_chrgs_3m. We include these variables for further analysis in the Manage Variables node. The data usage variables may help provide more insight into why customers are churning from the telecommunication company. Now let's run the pipeline with the machine learning nodes.
Now let's add some supervised learning nodes to our model. We will add a Gradient Boosting node and a Logistic Regression node after both the Variable Selection node and the Manage Variables node. Before we run the pipeline, we need to select the list of previously rejected variables to be included for analysis in the Manage Variables node.
The above image shows that the "Comparison Model" pipeline we created has run successfully. As mentioned earlier in the demonstration, we handle some data preprocessing before moving forward. After the data preprocessing is completed, we reach the split between the Variable Selection and Manage Variables nodes. For consistency, we use the same machine learning nodes with the same hyperparameters on both branches. We utilize the Logistic Regression and Gradient Boosting nodes for this demonstration. The Logistic Regression node models the probability of a binary outcome using a linear combination of predictors, estimating coefficients via maximum likelihood. The Gradient Boosting node builds an ensemble of decision trees sequentially, where each tree corrects the errors of the previous ones, improving predictive accuracy for classification or regression tasks. These machine learning nodes suffice for our demonstration because we want to estimate the probability that a customer will churn from the telecommunication company.
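To make the logistic regression description concrete, here is a hedged sketch (plain Python, not SAS) of how a fitted logistic regression scores one customer: the linear combination of predictors is passed through the sigmoid function to yield a probability. The intercept, coefficients, and inputs below are made up for illustration; they are not the model's estimates.

```python
import math

# Score one observation with a (hypothetical) fitted logistic regression.
def churn_probability(intercept, coefs, inputs):
    """Sigmoid of the linear combination of predictors."""
    z = intercept + sum(b * x for b, x in zip(coefs, inputs))
    return 1.0 / (1.0 + math.exp(-z))

# Illustrative inputs: days suspended = 30, delinquency flag = 1.
p = churn_probability(-2.0, [0.03, 0.8], [30, 1])
```

Maximum likelihood estimation chooses the intercept and coefficients that make the observed churn outcomes most probable under this formula.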
Looking further into our machine learning nodes, we see that the Gradient Boosting (1) node for the Manage Variables branch was selected as the "Champion Model". The champion model is the top-performing model out of all the models used in the pipeline. Let's look at the results from the Gradient Boosting model selected as the champion.
Now that the pipeline has run successfully, we can look at the Model Comparison node. The Model Comparison node provides an automated, visual, and comprehensive comparison within the pipeline to identify the best-performing model based on metrics like the KS statistic, lift, and the ROC curve.
Let's look at the results provided by the Model Comparison node. All measures of assessment are computed for each of the available data partitions (train, validate, and test). You can also select which data partition to use for selecting the champion; by default, the champion model is selected using the VALIDATE partition. We right-click the Model Comparison node and select Results.
In the Model Comparison window, we can see various benchmark metrics for each supervised learning node. Our "Champion Model" was Gradient Boosting (1) from the Manage Variables branch, with an accuracy of 0.9442 and a KS of 0.6994. The Logistic Regression (1) node had an accuracy of 0.9325 with a KS of 0.5831. From the Variable Selection branch, the Gradient Boosting node had an accuracy of 0.9366 and a KS of 0.5884, and the Logistic Regression node had an accuracy of 0.9170 and a KS of 0.5413. From the Model Comparison, we can conclude that using the Manage Variables node to include previously rejected variables provided a modest improvement in accuracy.
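The KS statistic reported above measures how well a model separates churners from non-churners: it is the maximum gap between the cumulative score distributions of events and non-events. Here is a hedged sketch in plain Python (not how SAS computes it internally; the scores are hypothetical):

```python
# KS statistic: max separation between event and non-event score distributions.
def ks_statistic(event_scores, nonevent_scores):
    thresholds = sorted(set(event_scores) | set(nonevent_scores))
    best = 0.0
    for t in thresholds:
        tpr = sum(s >= t for s in event_scores) / len(event_scores)
        fpr = sum(s >= t for s in nonevent_scores) / len(nonevent_scores)
        best = max(best, abs(tpr - fpr))
    return best

# Well-separated scores give a KS near 1; fully overlapping scores give 0.
print(ks_statistic([0.9, 0.8], [0.1, 0.2]))  # -> 1.0
```

A KS of 0.6994 for the champion means that at the best threshold, the cumulative proportions of churners and non-churners differ by about 70 points.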
ROC Curve for Validation Dataset
On the right pane, you have the option to display all the models or one model at a time. You also have the option of observing only the train or validate set of the ROC curve. Above, the image displays the ROC curve of each model. We can see that the Gradient Boosting (1) ROC curve shows the strongest performance of all the models, followed by the Gradient Boosting node from the Variable Selection branch.
The above plot shows the average squared error for our training and validation data sets. As the number of trees increases, we want to see the average squared error decrease. The minimum error for the VALIDATE partition is 0.051 and occurs at 89 trees, found through early stopping.
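The idea behind early stopping can be sketched in a few lines of plain Python (a conceptual stand-in, not the node's implementation): track the validation average squared error as trees are added and keep the tree count where it bottoms out. The error values below are made up.

```python
# Early-stopping idea: choose the ensemble size that minimizes validation ASE.
def best_tree_count(validation_ase):
    """validation_ase[i] is the validation ASE after i + 1 trees."""
    best = min(range(len(validation_ase)), key=validation_ase.__getitem__)
    return best + 1

# Validation error dips, then starts rising as the model overfits.
print(best_tree_count([0.30, 0.20, 0.25]))  # -> 2
```

In the demonstration, this is how the node settles on 89 trees rather than continuing to grow the ensemble after validation error stops improving.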
In the above image, we see the ROC curves of the training and validate sets for the Gradient Boosting node. The ROC curve plots sensitivity (the true-positive rate) against 1-specificity (the false-positive rate), both of which are computed from the confusion matrix.
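For reference, one point on the ROC curve comes from the confusion matrix at a single threshold. A minimal sketch in plain Python (the counts are hypothetical):

```python
# One ROC point from confusion-matrix counts at a given threshold.
def roc_point(tp, fp, tn, fn):
    sensitivity = tp / (tp + fn)            # true-positive rate
    one_minus_specificity = fp / (fp + tn)  # false-positive rate
    return sensitivity, one_minus_specificity

# E.g., 90 true positives, 10 false positives, 90 true negatives, 10 false negatives.
print(roc_point(90, 10, 90, 10))  # -> (0.9, 0.1)
```

Sweeping the classification threshold from 0 to 1 and plotting these points traces out the full curve shown in the results.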
For more insight into the Gradient Boosting node, we enable the model interpretability plot options. We achieve this through post hoc (post-analysis) techniques that analyze feature importance, partial dependence, and individual prediction paths. For global interpretability, we select the variable importance and PD plots; for local interpretability, we select the ICE and LIME plots.
The above image provides the model's variable importance with respect to the target event "churn". This table is generated from the training performed on the original gradient boosting model. "Total Days Over Plan" had a relative importance of 1, followed by "Number of Days Suspended" with a relative importance of 0.9865 and "Handset Age Group" with a relative importance of 0.829. We display the top 15 variables with the highest relative importance to the target event.
The PD and ICE overlay plot displays the functional relationship between an input variable and the model prediction, as well as the relationship between individual observations and the model prediction. This plot displays the partial dependence (PD) and the relationship between "ever_days_over_plan" and the predicted target for each individual observation. For observation "180000285", the highest predicted probability of the customer churning is 0.24, which occurs when "Total Days Over Plan" is 39.13, or about 39 days over plan.
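The partial dependence computation itself is simple to sketch (plain Python, not SAS; the model function and rows below are toy stand-ins): for each grid value of the input of interest, set that input to the grid value in every observation, score them all, and average the predictions. An ICE curve is the same idea without the averaging, one curve per observation.

```python
# Partial dependence of a model on one input, over a grid of values.
def partial_dependence(model, rows, feature_idx, grid):
    curve = []
    for value in grid:
        preds = []
        for row in rows:
            modified = list(row)
            modified[feature_idx] = value  # force the feature to the grid value
            preds.append(model(modified))
        curve.append(sum(preds) / len(preds))  # average over all observations
    return curve

# Toy model and data purely for illustration.
toy_model = lambda r: r[0] * 0.1
print(partial_dependence(toy_model, [[1, 5], [2, 6]], 0, [0, 10]))  # -> [0.0, 1.0]
```

In the plot above, Model Studio does this against the fitted gradient boosting model, with "ever_days_over_plan" as the feature of interest.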
In conclusion, we demonstrated how the Manage Variables node in SAS Model Studio can complement traditional feature selection techniques to enhance predictive modeling. By reintegrating previously rejected variables, the analysis revealed a slight improvement in model performance, with the Gradient Boosting model from the Manage Variables pipeline emerging as the champion. The comparison highlights the value of thoughtful variable management in improving model accuracy, interpretability, and business insights. Ultimately, leveraging both the Manage Variables and Variable Selection nodes empowers data scientists to build more robust and effective churn prediction models in SAS Viya.
For more information:
Find more articles from SAS Global Enablement and Learning here.