Hi everyone! 👋
In this article, we’ll explore how to predict customer churn in the energy sector using SAS Viya. Specifically, I worked with data from Kaggle that provides insights into Small and Medium Enterprises (SMEs) served by the PowerCO energy supplier. The goal is to identify the key factors driving customer churn and build predictive models to target retention campaigns at customers with a high likelihood of churning. The project uses two datasets:
The project follows these key steps:
By the end of this journey, you’ll have a comprehensive understanding of how SAS Viya can be used to tackle churn prediction challenges in the energy sector. I hope this article inspires you to experiment with SAS Viya and apply its powerful capabilities to solve similar challenges in your industry!
To start, we’ll import the data into the CAS server using SAS Data Explorer, accessible from the Applications menu by selecting Manage Data. Once Data Explorer is open, navigate to the Import tab and use the Local File Import section to import the two CSV files containing the client_data and price_data tables. These are imported as CAS tables named client_data_raw and price_data_raw, respectively, and stored in the casuser caslib. This process not only loads the tables into memory but also saves them as sashdat files, provided the 'Save as in-memory data only' option is left deselected (the default). You can preview the imported tables directly in Data Explorer by selecting them from the In-memory Data (available) or Sources tab.
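If you prefer code over point-and-click, the same import can also be performed programmatically from SAS Studio. Below is a minimal sketch, assuming the two CSV files are accessible to the SAS session; the file paths and the session name mySession are placeholders:

```sas
/* Start a CAS session and assign librefs for the available caslibs */
cas mySession;
caslib _all_ assign;

/* Load the two CSV files into memory as CAS tables (paths are placeholders) */
proc casutil;
   load file="/path/to/client_data.csv" outcaslib="casuser" casout="client_data_raw" replace;
   load file="/path/to/price_data.csv"  outcaslib="casuser" casout="price_data_raw"  replace;
   /* optionally persist the in-memory tables as sashdat files in the casuser caslib */
   save casdata="client_data_raw" incaslib="casuser" outcaslib="casuser" replace;
   save casdata="price_data_raw"  incaslib="casuser" outcaslib="casuser" replace;
quit;
```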
Figure 1: Example of the Sample Data tab for the client_data_raw in-memory table.
From this initial look at the two tables, the following observations can be made:
To address these issues, I transitioned to SAS Studio (accessible via Develop Code and Flows in the Applications menu) to perform the required data manipulations and create useful tables for the EDA report. These steps were organized into a flow named pre_eda.flw. Flows are one of the primary point-and-click tools in SAS Studio, allowing you to combine prebuilt and custom steps to import various file types, query, transform, and export data efficiently.
The pre_eda flow consists of multiple sub-flows, with each one placed in a different swimlane to control the submission order.
It creates and loads into memory the following CAS tables:
Figure 2: Swimlanes for the creation of the client_data and price_data tables.
Figure 3: Swimlanes for the creation of the client_numeric_summary, price_numeric_summary, client_date_summary and price_date_summary tables.
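As an illustration, a numeric summary table such as client_numeric_summary can also be produced in code with the simple.summary CAS action; the statistics returned here are the action defaults, not necessarily the exact set used in the flow:

```sas
/* Summarize all numeric variables of client_data into a CAS table
   (requires an active CAS session, such as the one started above) */
proc cas;
   simple.summary /
      table={caslib="casuser", name="client_data"}
      casOut={caslib="casuser", name="client_numeric_summary", replace=true};
quit;
```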
Figure 4: Swimlane for the creation of the id_obs_summary table.
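One plausible way to build such a table is to count the monthly price records available for each customer, for example with FedSQL (the session name mySession is a placeholder):

```sas
/* Count how many price observations exist per customer; ideally 12 per id */
proc fedsql sessref=mySession;
   create table casuser.id_obs_summary {options replace=true} as
   select id, count(*) as n_obs
   from casuser.price_data
   group by id;
quit;
```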
Figure 5: Swimlane for the creation of the joined_data table.
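The joined_data table combines the monthly price rows with client attributes such as churn and channel_sales. A minimal FedSQL sketch of this join, keeping only customers present in both tables, could look like this:

```sas
/* Inner join: keep only customers appearing in both tables */
proc fedsql sessref=mySession;
   create table casuser.joined_data {options replace=true} as
   select p.*, c.churn, c.channel_sales
   from casuser.price_data as p
   inner join casuser.client_data as c
      on p.id = c.id;
quit;
```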
Figure 6: Swimlanes for the creation of the dummy_table table and termination of the CAS session.
These tables were then used in a report in SAS Visual Analytics, accessible through the Explore and Visualize option in the Applications menu, to visualize the data and determine which types of transformations to perform on them before building predictive models for churn prediction. The report consists of four distinct pages:
1. Client EDA. This page analyzes the variables in the client_data table and includes, at the top, the number of unique IDs and two tables showing the summary statistics for numerical and date-type variables. At the bottom left, there is a bar chart for each categorical variable, with the option to filter them based on the value of the churn variable. On the bottom right, there is a correlation matrix to display the degree of correlation between pairs of numeric variables.
Figure 7: Client EDA dashboard.
2. Price EDA. This page analyzes the variables in the price_data table and has a structure very similar to the previous page, except that the bottom left shows a Time Series Plot instead of bar charts. This plot visualizes the monthly trend (obtained by averaging across all IDs) of each price variable, which can be selected through the drop-down list above it. Creating this Time Series Plot required the previously created dummy_table; for more details, refer to this article I wrote, which explains how to build this type of plot in more depth.
Figure 8: Price EDA dashboard.
3. Joined data EDA. This page leverages the joined_data table to provide insights into customer churn and price trends. At the top, a pie chart illustrates the proportions of churners and non-churners among clients present in both tables. At the top right, a table lists the IDs with fewer than 12 observations in the price_data table, indicating that price values are missing for some months for these customers. At the bottom, a scrolling container showcases time series plots for all price variables, with average trends displayed separately for churners and non-churners. Additionally, the time series plots can be filtered based on the values of the channel_sales variable, enabling more granular analysis.
Figure 9: Joined data EDA dashboard.
4. Deep-dive on low-cardinality variables. This page focuses on three numeric variables from the client_data table that exhibit a low number of distinct values: nb_prod_act, num_years_antig, and forecast_discount_energy. The scrolling container displays, on the left, the frequency distributions of the original variables converted into categorical form, and on the right, the same distributions with rare categories grouped into a single category. The idea is that this grouping mitigates noise caused by infrequent values, allowing the model to focus on meaningful patterns and improve predictive performance.
Figure 10: Deep-dive on low-cardinality variables dashboard.
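To make the grouping idea concrete, here is a minimal sketch for nb_prod_act; the levels kept and the "3+" label are purely illustrative, since in practice the cutoff comes from the frequency distributions observed on this report page:

```sas
/* Convert nb_prod_act to a categorical variable, collapsing rare counts
   into a single class (the grouping shown here is illustrative) */
data casuser.client_data_grouped;
   set casuser.client_data;
   length nb_prod_act_cat $8;
   if nb_prod_act in (1, 2) then nb_prod_act_cat = cats(nb_prod_act);
   else nb_prod_act_cat = "3+";
run;
```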
Based on the findings from the VA report, a dedicated flow (data_preparation.flw) was built in SAS Studio. This flow processes and transforms the data from two source tables (client_data and price_data) and combines them into a single resulting table, which serves as the input for SAS Model Studio to train machine learning models. Apart from the 'Create CAS Session' and 'Terminate CAS Session' swimlanes, the flow consists of three main swimlanes:
1. Client data preparation. This sub-flow starts with the client_data table and produces the client_data_final table through the following transformations:
- Removal of infrequent categories in the channel_sales and origin_up variables.
- Creation of contract_age_at_last_modif, has_modified, and days_to_end variables based on the available date variables.
- Conversion of num_years_antig, forecast_discount_energy, and nb_prod_act into categorical variables, by grouping rare classes.
Figure 11: Swimlanes for the creation of the CAS session and the client_data_final table.
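As an example of the date-based features, here is a sketch of how such variables could be derived; the exact definitions used in the flow may differ, the date column names come from the Kaggle dataset, and the reference date is an assumption:

```sas
/* Derive date-based features (assumes the date columns were imported as SAS dates) */
data casuser.client_data_final;
   set casuser.client_data;
   /* contract age (in years) at the time of the last product modification */
   contract_age_at_last_modif = intck("year", date_activ, date_modif_prod);
   /* flag: has the product ever been modified? (one possible definition) */
   has_modified = (date_modif_prod ne date_activ);
   /* days remaining until the contract end, from a reference date (assumption) */
   days_to_end = date_end - "01JAN2016"d;
run;
```

The conversion of num_years_antig, forecast_discount_energy, and nb_prod_act into grouped categorical variables follows the same pattern shown in the deep-dive section above.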
2. Price data preparation. This sub-flow processes the price_data table to create the price_data_final table, involving the following steps:
- Imputation of missing price values using the median for the corresponding months.
- Computation of summary statistics (minimum, maximum, mean, and maximum difference between consecutive months) to represent price trends.
Figure 12: Swimlane for the creation of the price_data_final table.
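To illustrate these two steps, here is a simplified sketch for a single price column (price_off_peak_var); the actual flow applies the same logic to all price variables, and the intermediate table names are placeholders:

```sas
proc sql;
   /* 1) median price per month, used to fill missing values */
   create table work.monthly_median as
   select price_date, median(price_off_peak_var) as med_price
   from casuser.price_data
   group by price_date;

   /* 2) impute missing prices with the median of the corresponding month */
   create table work.price_imputed as
   select p.id, p.price_date,
          coalesce(p.price_off_peak_var, m.med_price) as price_off_peak_var
   from casuser.price_data as p
   left join work.monthly_median as m
     on p.price_date = m.price_date;
quit;

/* 3) min, max and mean price per customer */
proc means data=work.price_imputed noprint nway;
   class id;
   var price_off_peak_var;
   output out=work.price_stats(drop=_type_ _freq_)
          min=price_min max=price_max mean=price_mean;
run;

/* 4) largest month-over-month change per customer */
proc sort data=work.price_imputed; by id price_date; run;

data work.price_diff(keep=id price_max_diff);
   set work.price_imputed;
   by id;
   retain price_max_diff;
   if first.id then price_max_diff = .;
   diff = abs(dif(price_off_peak_var));   /* change vs. the previous month */
   if not first.id then price_max_diff = max(price_max_diff, diff);
   if last.id then output;
run;

/* 5) combine into the final per-customer price table and write it to CAS */
data casuser.price_data_final;
   merge work.price_stats work.price_diff;
   by id;
run;
```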
3. Joining the two prepared tables for model building. This sub-flow combines the prepared tables into the prepared_data table through the following steps:
- Merging the client_data_final and price_data_final tables.
- Removal of numeric variables until no pairwise correlation exceeds 0.9, reducing redundancy and enhancing model interpretability.
Figure 13: Swimlane for the creation of the prepared_data table.
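To see how the highly correlated pairs can be identified in code, here is a sketch that computes the correlation matrix of the merged table and lists every pair with an absolute correlation above 0.9; the input table name merged_data is a placeholder for the table obtained right after the merge, and deciding which variable of each pair to drop remains a manual step:

```sas
/* Pairwise Pearson correlations among the numeric inputs */
proc corr data=casuser.merged_data outp=work.corr_matrix noprint;
   var _numeric_;
run;

/* List the variable pairs with |r| > 0.9 for review
   (each pair appears twice, once per direction) */
data work.high_corr(keep=_name_ with_var corr);
   set work.corr_matrix(where=(_type_ = "CORR"));
   array nums {*} _numeric_;
   do i = 1 to dim(nums);
      with_var = vname(nums[i]);
      corr = nums[i];
      if with_var ne _name_ and abs(corr) > 0.9 then output;
   end;
run;
```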
Next, we will use the prepared_data table in Model Studio, accessible through the Build Models option in the Applications menu.
To begin developing models in Model Studio, a new project was created by clicking on New Project in the upper-right corner of the Projects page and selecting the prepared_data table as the data source. During project creation, clicking the Advanced button allows you to adjust settings such as performing random undersampling or modifying the data partitioning proportions. For example, I opted for an 80/10/10 train/validation/test split to allow the model to train on more data compared to the default settings. Once configured, clicking the Save button finalizes the project creation.
After opening the project, the roles of the id and churn variables were updated in the Data tab to Key and Target, respectively. Assigning the Key role to the id variable enables its use in the Model Interpretability dashboard to specify the instances to explain, as detailed later. Three pipelines were then built and executed:
Custom pipeline
This pipeline was manually constructed from a blank template to compare Logistic Regression, Random Forest, and Gradient Boosting models across two scenarios:
Advanced template for class target with autotuning
This pipeline is one of the out-of-the-box templates you can choose from during project creation or when adding a new pipeline to the project.
SAS Automatically Generated Pipeline
This pipeline was created by selecting the Automatically generate the pipeline option when adding a new pipeline. SAS Viya uses automated machine learning to build the pipeline dynamically, adhering to a user-defined time limit. The resulting pipeline contains the top 5 models selected from a pool of candidates, which can be customized in Advanced Settings. For further details, refer to the documentation on Automated Pipeline Creation.
Figure 14: Custom pipeline.
Figure 15: Advanced template for class target with autotuning.
Figure 16: SAS Automatically Generated Pipeline.
After running the pipelines, model performance within each pipeline was compared using the Model Comparison node, while models from all pipelines were compared through the Pipeline Comparison tab. The chosen evaluation metric was the Area Under the Curve (AUC), selected by navigating to Project Settings via the cogwheel in the upper-right corner, then under Rules, setting the Class selection statistic to Area under curve (C statistic).
AUC was chosen because it evaluates the ranking quality of the probability estimates, which should assign high probabilities to the positive examples (true churners) and low probabilities to the negatives (non-churners). AUC is a well-accepted ranking metric for imbalanced datasets, and since the goal is to create a reliable scoring list for selecting customers to target in a retention campaign, ranking quality is more important than raw predictive accuracy for this use case.
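To make the ranking interpretation concrete, AUC can be read as a probability:

AUC = P(predicted churn score of a randomly chosen churner > predicted churn score of a randomly chosen non-churner)

so an AUC of 0.8, for example, means that the model ranks a true churner above a non-churner 80% of the time.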
The Pipeline Comparison tab provides a summary of results after running the three pipelines:
Figure 17: Pipeline Comparison tab.
As shown, the champion model for each pipeline is displayed. Additionally, you can add challenger models to this tab for comparison. Simply right-click on the desired Supervised Learning node and select Add Challenger Model. For instance, by examining the Model Comparison nodes of each pipeline, I identified that the Random Forest model from the custom pipeline trained on the original data achieved the highest AUC score, excluding the Ensemble models. I chose to add this model as a challenger in the Pipeline Comparison tab:
Figure 18: Pipeline Comparison tab after adding a challenger model.
For this use case, I selected the Random Forest model as the final one to score customers based on their churn likelihood. While its performance is comparable to that of the Ensemble models, the Random Forest model also provides native insights into the relative importance of input variables by considering them collectively. This differs from the approach used in Ensemble nodes, where importance is estimated separately using one-level decision trees for each variable.
The scored data from the final model was saved to the scored_data.sashdat file and loaded into memory as the scored_data table. This was accomplished by adding a Save Data node as a child of the model node.
Figure 19: Save Data node.
The Post-training Properties of the selected model node were configured to leverage SAS Viya's interpretability tools, including:
These model interpretability techniques are model-agnostic, meaning they can be applied to any model generated by a supervised learning node!
I decided to execute the model node twice to provide insights into both the five customers with the highest churn likelihood and the five with the lowest, using the Specify instances to explain section. Assigning the Key role to the id variable, as described earlier, made it easier to specify the instances for this analysis.
In the first run, data for Variable Importance, PD and ICE Overlay Plots, LIME Explanations and HyperSHAP values were downloaded from the Results pane. In the second run, the same results were downloaded, except for Variable Importance, as it provides global interpretability rather than local insights and remains unchanged when only the instances for computing local interpretability tools are modified.
Finally, this data was used in SAS Visual Analytics (VA) to build a dashboard, offering a comprehensive understanding of the selected model and its predictions.
Figure 20: Model Interpretability dashboard.
The dashboard features a bar chart on the left side that displays the relative variable importance. On the right, a scrolling container contains two distinct boxes: one showcasing the top 5 clients with the highest churn likelihood, and the other highlighting the top 5 clients with the lowest churn likelihood. Each box includes the following:
For a detailed explanation of how the HyperSHAP values and LIME Explanations bar charts have been configured to sort the bars by their absolute values, check out this article that walks you through the entire process.
Starting from the scored_data table, I simulated various retention campaign scenarios aimed at incentivizing high-risk customers to remain with the energy supplier by offering targeted discounts.
This simulation was conducted on the test set of the scored_data table to evaluate the model's behavior on unseen data. It was executed using the SAS macro %simulate_profits, which calculates profits across various combinations of discount rates, discount efficiencies, and probability thresholds. The profit represents the average annual profit and is calculated from the margin_net_pow_ele variable, which denotes the net margin on power subscriptions. This margin is assumed to be an average monthly margin and was therefore multiplied by 12 to derive the annual value.
Profit Without Campaign:
The baseline profit, if no retention campaign is conducted, is computed as:
Profit Without Campaign = Gains − Loss
where:
- Gains is the total potential profit if no customers churn, i.e. a perfect scenario in which every customer is retained.
- Loss is the total profit lost due to customers who churned, i.e. the revenue these customers no longer contribute.
This value represents the baseline scenario in which no effort is made to prevent churn, and the business simply absorbs the loss from churned customers. It serves as a reference point for evaluating the effectiveness of any retention campaign.
Prevented Loss and Discount Loss:
For each scenario, the retention campaign targets all customers with a predicted churn likelihood exceeding the probability threshold.
Prevented Loss = Sum of profits from targeted churners × Discount Efficiency
Discount Loss (Non-Churners) = Sum of profits from targeted non-churners × Discount Rate
Discount Loss (Churners) = Prevented Loss × Discount Rate
The total discount loss is the sum of these two components:
Discount Loss = Discount Loss (Non-Churners) + Discount Loss (Churners)
Final Profit for a Scenario:
For each combination of parameters (discount rate, discount efficiency, and probability threshold), the profit is calculated as:
Profit = Gains − Loss + Prevented Loss − Discount Loss
A lower probability threshold expands the retention campaign to more customers, leading to a higher Discount Loss. Conversely, a higher probability threshold excludes some likely churners from the discount offer, increasing the risk of losing them and, therefore, reducing the Prevented Loss term.
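To make these formulas concrete, here is a simplified sketch of the profit calculation for a single combination of parameters. It is not the %simulate_profits macro itself: the partition filter, the predicted-probability column p_churn, and the parameter values are assumptions for the example, and the annual margin is obtained as 12 × margin_net_pow_ele as described above.

```sas
/* Example parameter values for one scenario */
%let discount_rate       = 0.10;
%let discount_efficiency = 0.50;
%let prob_threshold      = 0.55;

data work.scored_test;
   set casuser.scored_data;
   where partition = "TEST";                 /* partition column name is an assumption */
   annual_margin = 12 * margin_net_pow_ele;  /* monthly net margin -> annual value */
   targeted = (p_churn > &prob_threshold);   /* p_churn: predicted churn probability (placeholder name) */
run;

proc sql;
   create table work.scenario_profit as
   select
      sum(annual_margin)                               as gains,           /* profit if nobody churned */
      sum(annual_margin * (churn = 1))                 as loss,            /* profit lost to actual churners */
      sum(annual_margin * (churn = 1) * targeted)
         * &discount_efficiency                        as prevented_loss,  /* saved among targeted churners */
      sum(annual_margin * (churn = 0) * targeted) * &discount_rate
         + calculated prevented_loss * &discount_rate  as discount_loss,   /* cost of the discounts */
      calculated gains - calculated loss
         + calculated prevented_loss
         - calculated discount_loss                    as profit           /* final profit for the scenario */
   from work.scored_test;
quit;
```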
The SAS macro evaluates all possible combinations of parameters to generate a comprehensive profit simulation. The results include:
These tables are integrated into a VA dashboard, which allows for interactive exploration and comparison of various retention campaign scenarios. The parameters for these scenarios can be configured using drop-down lists. The dashboard provides insights into:
Figure 21: Retention campaign scenarios dashboard.
If we had an ideal model and a probability threshold p* that perfectly separates churners from non-churners, the profit would increase as the threshold rises toward p* and then decrease beyond it, converging to the value of the no-retention-campaign scenario.
In a real-world scenario, such an ideal p* does not exist, as there is some overlap between the two client classes. Consequently, increasing the probability threshold p by 0.05 (i.e., the step size I used) could lead to two extreme cases:
These dynamics explain the fluctuations observed in the profit curve shown in Figure 21.
By leveraging this simulation and visualization, the analysis highlights the most profitable strategies for retaining high-risk customers, helping the energy supplier optimize its retention campaigns effectively. Below are some final comments on the retention campaign scenarios:
Thank you everyone for reading this article on churn prediction and prevention in the energy sector using the SAS Viya platform. I hope you found it useful! 😊
Feel free to comment on this article or reach out to me via email (Mattia.Guglielmelli@sas.com) for any feedback or questions, and I encourage you to experiment with SAS Viya to explore how its powerful tools can help solve similar challenges in your own industry!