Understanding customer churn is essential to retaining customers and improving a company's overall performance. Churn can shape a company's decisions, such as which promotions to offer, which customer satisfaction issues to address, and whether to advertise discounts to current or new customers.
In this post, we look at customer churn at a fictitious bank, using a dataset provided through Kaggle. We want to better understand the leading factors behind a customer's decision to leave the bank. The dataset consists of 14 columns and 10,001 observations. Some of the variables in this dataset are the following:
To start, we use SAS Viya to analyze the data, and we begin by identifying our target variable. The target variable for this analysis is “Exited”, a binary categorical variable that indicates whether a customer has left the bank.
After we establish the target variable for the pipeline, we next look at the data using the “Data Exploration” node. This node tells us more about which variables are important and gives us better insight into the variables within the data.
Once the pipeline finishes running, we right-click the “Data Exploration” node and open the results. The results consist of visualizations of variables such as number of products, Tenure, and Age. They help us understand how to build out the pipeline.
Relative Importance
From the figure above, we can gain insight into the relative importance of the variables. For example, the number of products has a relative importance of 1.0 and Age has a relative importance of 0.97. We also see variables that are important but have less than 50% relative importance, such as “IsActiveMember”, Balance, Geography, and Tenure.
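To make the 0-to-1 scale concrete, here is a minimal sketch that assumes relative importance is each variable's raw importance divided by the largest raw importance. That is a common convention, but this post does not state how the “Data Exploration” node computes it, and the raw scores below are made up for illustration.

```python
def relative_importance(raw_scores):
    """Scale raw importance scores so the top variable equals 1.0.

    Assumption: relative importance = raw importance / max raw importance.
    """
    top = max(raw_scores.values())
    return {name: score / top for name, score in raw_scores.items()}


# Hypothetical raw scores, not values from the bank dataset.
raw = {"Age": 40.0, "Balance": 30.0, "Tenure": 10.0}
print(relative_importance(raw))
```

Under this convention the most important variable always scores 1.0, and every other score reads as a fraction of it.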
From the figure above, we look at the cardinality of the data to understand the number of levels each variable takes. Cardinality is the number of distinct values that a data attribute can take in a dataset. We see high cardinality for Tenure, Geography, and Number of Products (NumOfProduct). We see medium cardinality for “HasCrCard” and IsActiveMember, with the number of levels between 2 and 4. We can assume that the high cardinality for “Tenure” is explained by the number of customer observations in the dataset. Next, we look at the frequency percentage of “Exited”, that is, the share of customers leaving the bank.
In the figure above, we look at how many customers have churned (1 = yes, 0 = no). We see that the bank retained 80% of its customers while 20% churned. We want to understand why customers may have churned and how to raise the retention level above 80%. Next, we replace outliers by using the “Replacement” node.
The “Replacement” node allows us to replace or remove outliers and unknown class levels with specified values. This helps reduce possible errors in the dataset. Next, we add a transformation node, which applies numerical or binning transformations to the input variables.
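The definition of cardinality above can be sketched in a few lines of Python. This is only an illustration outside SAS Viya. The `cardinality` helper and the sample rows are hypothetical, and the values are made up rather than taken from the bank dataset.

```python
def cardinality(rows, column):
    """Return the number of distinct values observed in one column."""
    return len({row[column] for row in rows})


# Hypothetical sample rows for illustration only.
sample_rows = [
    {"Geography": "France", "HasCrCard": 1, "Tenure": 2},
    {"Geography": "Spain", "HasCrCard": 0, "Tenure": 1},
    {"Geography": "Germany", "HasCrCard": 1, "Tenure": 8},
    {"Geography": "France", "HasCrCard": 1, "Tenure": 2},
]

for name in ("Geography", "HasCrCard", "Tenure"):
    print(name, cardinality(sample_rows, name))
```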
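The frequency percentages for the target can be reproduced with a simple count. The sketch below uses a made-up list of ten labels chosen to mirror the 80/20 split reported here. It is not the actual data, and `class_percentages` is a hypothetical helper.

```python
from collections import Counter


def class_percentages(labels):
    """Return the percentage of observations in each class."""
    counts = Counter(labels)
    total = len(labels)
    return {label: 100.0 * count / total for label, count in counts.items()}


# Made-up labels: 1 = churned, 0 = retained.
exited = [0, 0, 0, 0, 1, 0, 0, 0, 1, 0]
print(class_percentages(exited))
```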
For each transformation, a new variable is created, as is done with imputation. With transformations, the prefix depends on the transformation that was used. For example, variables that were log transformed begin with LOG_. The default interval input method is set to Log. The original variables are not deleted from the dataset, but by default they are set to Rejected, so they are not used in the model. The remaining inputs stay at their default settings, with the rare cutoff and the weight of evidence (WOE) adjustment value remaining at 0.5.
After setting the parameters for the transformation node, we change the interval variables' metadata “Role” to Log. This helps reduce skewness and improves the distribution of the data. After the transformation node runs successfully, we can look at the input variable statistics and the transformed variable summary.
From the above figure, we look at the interval variables to check the skewness after the replacement node. Skewness measures the asymmetry of a variable's distribution, that is, how far the distribution is stretched toward the right or left tail. A skewness of 0 indicates a symmetric distribution. For the replaced age variable the skewness is 1, which indicates a right skew but falls within the -2 to +2 range that is generally accepted. The replaced balance, credit score, and estimated salary variables have negative skewness, and their values also fall within that range.
From the above figure, the transformation summary shows the input variables that were replaced and the transformation formula used for each variable. The formula adds 1 to the input before taking the log (n + 1). This avoids an error when the input is equal to zero, which is possible because metadata limits were established to eliminate negative values for our interval variables.
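As a rough illustration of how a skewness value is obtained, the sketch below computes the moment-based coefficient (third central moment divided by the 1.5 power of the second). This is an assumption for illustration: SAS may report a bias-adjusted sample version, so its numbers can differ slightly. The `skewness` helper and the sample lists are hypothetical.

```python
def skewness(values):
    """Moment-based skewness: m3 / m2**1.5 (population form)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5


# Made-up samples, not values from the bank dataset.
symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 1, 1, 10]
print(skewness(symmetric), skewness(right_tailed))
```

A value near 0 means the two tails balance out, a positive value means a longer right tail, and a negative value means a longer left tail.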
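The effect of adding 1 before taking the log can be seen in a short Python sketch. It assumes the natural logarithm, which this post does not state explicitly, and `log_transform` is a hypothetical helper rather than anything from SAS.

```python
import math


def log_transform(value):
    """Return log(value + 1); defined at zero, unlike a plain log."""
    if value < 0:
        # Mirrors the assumption that negative inputs were already removed.
        raise ValueError("input must be non-negative")
    return math.log1p(value)  # log1p(x) computes log(1 + x)


# A zero balance maps to 0.0 instead of raising a math domain error.
print(log_transform(0))
print(log_transform(1500.0))
```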
Now that we have handled the transformation of the data, we look at missing values for class and interval inputs using the imputation node. For the imputation node, we keep the default parameters.
For imputation, the default method for class inputs is none, and for interval inputs we kept the default method of mean, with data limits for calculating values applied to all data and a data limit percentage of 5%. Before we look at the supervised learning nodes that were selected, we look at the variable selection node. The variable selection node performs unsupervised and several supervised methods of variable selection to reduce the number of inputs.
The variable selection node helps identify the input variables that are useful for predicting the target variable. The information gathered from these input variables can then be evaluated in more detail by the supervised learning nodes.
Once the pipeline has run successfully, we can evaluate the supervised learning nodes that feed the model comparison node. The logistic regression node attempts to predict the value of a binary or nominal response variable. We chose it because it can approximate the probability that an individual observation belongs to the level of interest of the target variable. The decision tree node uses the values of one or more predictor data items to predict the values of a response data item. An advantage of the decision tree node is its treatment of missing data, which can be handled through a set of rules that help generate predictions for the target variable. The gradient boosting node uses a boosting approach that resamples the analysis dataset several times to generate results that form a weighted average of the resampled datasets. Lastly, the neural network node mimics the human brain; it consists of predictors, hidden layers, a target (or output) layer, and the connections between them.
We are only going to look at the top two models: the gradient boosting node (the champion model) and the logistic regression node. They had accuracies of 86% (GB) and 85% (LR), respectively.
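As a rough illustration of what mean imputation does to an interval input (mean is the default interval method kept in this pipeline), here is a minimal Python sketch. It does not reproduce the node's data limit settings, and `impute_mean` and the sample values are hypothetical.

```python
def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed ones."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


# Made-up credit scores with one missing entry.
scores = [600.0, None, 700.0]
print(impute_mean(scores))
```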
Gradient Boosting
Looking at variable importance for the gradient boosting model, we see that Age has a relative importance of 1 and the number of products has a relative importance of 74% with respect to the target variable. This suggests that the ages of the bank's members and the number of products they hold with the bank deserve a closer look.
The error plot for the average squared error shows that training error decreases as the number of trees increases. If the validation error increases after an initial decrease, this could be a sign of overfitting. For this model, the minimum error for the VALIDATE partition is 0.104 and occurs at 27 trees, so the validation error is still decreasing at the last tree.
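The check described above, finding where the validation error bottoms out and whether that happens at the last tree, can be sketched as follows. The error curves are made up for illustration and both helper functions are hypothetical. They are not part of SAS Viya.

```python
def best_tree_count(validation_errors):
    """Return (number_of_trees, error) at the minimum validation error.

    validation_errors[i] is the error after i + 1 trees.
    """
    best_index = min(range(len(validation_errors)),
                     key=validation_errors.__getitem__)
    return best_index + 1, validation_errors[best_index]


def minimum_at_last_tree(validation_errors):
    """True when the lowest validation error occurs at the final tree."""
    trees, _ = best_tree_count(validation_errors)
    return trees == len(validation_errors)


# Made-up curves: one still improving, one that turns upward (overfitting).
improving = [0.160, 0.130, 0.110, 0.105, 0.104]
overfit = [0.160, 0.120, 0.110, 0.115, 0.130]
print(best_tree_count(improving), minimum_at_last_tree(improving))
print(best_tree_count(overfit), minimum_at_last_tree(overfit))
```

When the minimum falls at the last tree, as reported for this model, there is no sign yet of the validation error turning upward.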
Logistic Regression
We look at the logistic regression results to gain more insight into the data. The logistic regression node's accuracy was only 1 percentage point lower than the gradient boosting node's, and at 85% it would still be deemed acceptable.
From the above figure, we look at the bar chart of t values by parameter. This chart gives insight into the importance of each parameter, with blue (+) corresponding to a positive effect on the predicted probability of the target and yellow (-) to a negative effect. From the bar plot, we see that a few variables have predictive value for the target, such as Age, IsActiveMember, Geography: Germany, and Gender: Female. This helps in understanding what could be driving customer churn.
From the above figure, we look at the lift report to understand the cumulative lift related to the target event. This metric is used to evaluate how well the model identifies the positive cases (people who exited) in the population. The higher the cumulative lift, the better the model is at identifying these cases, especially in the higher predicted probability groups. A cumulative lift of 3.76 for the validate partition and 3.54 for the train partition shows that the model is very good at identifying people who are likely to exit, particularly in the top 10% of the data (according to the model's predictions). This is valuable for targeting resources, because you can focus on the top 10% group, where you are most likely to find people who will exit.
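For readers who want to see how a cumulative lift figure is derived, here is a minimal Python sketch. It assumes cumulative lift at a given depth is the event rate among the top-scored observations divided by the overall event rate. The labels and scores are made up, and `cumulative_lift` is a hypothetical helper, not a SAS function.

```python
def cumulative_lift(y_true, y_score, fraction=0.10):
    """Event rate in the top `fraction` by score, divided by the overall rate."""
    ranked = sorted(zip(y_score, y_true), key=lambda pair: pair[0], reverse=True)
    top_n = max(1, int(len(ranked) * fraction))
    top_rate = sum(label for _, label in ranked[:top_n]) / top_n
    overall_rate = sum(y_true) / len(y_true)
    return top_rate / overall_rate


# Made-up data: 10 customers, 2 of whom exited (1 = exited).
labels = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
print(cumulative_lift(labels, scores, fraction=0.2))
```

A lift of 1 means the top group contains no more exits than a random sample would, so values well above 1 indicate that the model concentrates likely churners at the top of its ranking.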
Conclusion
In conclusion, building a machine learning pipeline provides valuable insight into why a customer would churn from the bank. By using the supervised learning nodes (gradient boosting and logistic regression), we gained insight into key variables that could help reduce the churn rate, with the aim of improving the retention rate to 90%. For example, the bank could target potentially younger customers and focus ad promotions on the regions in the dataset that struggle with churn.