Hi everyone! 👋
In this article, we’ll explore how to predict customer churn in the energy sector using SAS Viya. Specifically, I worked with data from Kaggle that provides insights into Small and Medium Enterprises (SMEs) served by the PowerCO energy supplier. The goal is to identify the key factors driving customer churn and build predictive models to target retention campaigns at customers with a high likelihood of churning. The project uses two datasets:
The project follows these key steps:
By the end of this journey, you’ll have a comprehensive understanding of how SAS Viya can be used to tackle churn prediction challenges in the energy sector. I hope this article inspires you to experiment with SAS Viya and apply its powerful capabilities to solve similar challenges in your industry!
To start, we’ll import the data into the CAS server using SAS Data Explorer, accessible from the Applications menu by selecting Manage Data. Once Data Explorer is open, navigate to the Import tab and use the Local File Import section to import the two CSV files containing the client_data and price_data tables. These are imported as CAS tables named client_data_raw and price_data_raw, respectively, and stored in the casuser caslib. This process not only loads the tables into memory but also saves them as sashdat files, provided the 'Save as in-memory data only' option is left deselected (the default). You can preview the imported tables directly in Data Explorer by selecting them from the In-memory Data (available) or Sources tab.
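If you prefer code over point-and-click, the same import can also be performed programmatically from SAS Studio. Below is a minimal sketch, assuming the two CSV files are accessible to the SAS session; the file paths and the session name mySession are placeholders:

```sas
/* Start a CAS session and assign librefs for the available caslibs */
cas mySession;
caslib _all_ assign;

/* Load the two CSV files into memory as CAS tables (paths are placeholders) */
proc casutil;
   load file="/path/to/client_data.csv" outcaslib="casuser" casout="client_data_raw" replace;
   load file="/path/to/price_data.csv"  outcaslib="casuser" casout="price_data_raw"  replace;
   /* optionally persist the in-memory tables as sashdat files in the casuser caslib */
   save casdata="client_data_raw" incaslib="casuser" outcaslib="casuser" replace;
   save casdata="price_data_raw"  incaslib="casuser" outcaslib="casuser" replace;
quit;
```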
Figure 1: Example of the Sample Data tab for the client_data_raw in-memory table.
From this initial look at the two tables, the following observations can be made:
To address these issues, I transitioned to SAS Studio (accessible via Develop Code and Flows in the Applications menu) to perform the required data manipulations and create useful tables for the EDA report. These steps were organized into a flow named pre_eda.flw. Flows are one of the primary point-and-click tools in SAS Studio, allowing you to combine prebuilt and custom steps to import various file types, query, transform, and export data efficiently.
The pre_eda flow consists of multiple sub-flows, with each one placed in a different swimlane to control the submission order.
It creates and loads into memory the following CAS tables:
Figure 2: Swimlanes for the creation of the client_data and price_data tables.
Figure 3: Swimlanes for the creation of the client_numeric_summary, price_numeric_summary, client_date_summary and price_date_summary tables.
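As an illustration, a numeric summary table such as client_numeric_summary can also be produced in code with the simple.summary CAS action; the statistics returned here are the action defaults, not necessarily the exact set used in the flow:

```sas
/* Summarize all numeric variables of client_data into a CAS table
   (requires an active CAS session, such as the one started above) */
proc cas;
   simple.summary /
      table={caslib="casuser", name="client_data"}
      casOut={caslib="casuser", name="client_numeric_summary", replace=true};
quit;
```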
Figure 4: Swimlane for the creation of the id_obs_summary table.
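One plausible way to build such a table is to count the monthly price records available for each customer, for example with FedSQL (the session name mySession is a placeholder):

```sas
/* Count how many price observations exist per customer; ideally 12 per id */
proc fedsql sessref=mySession;
   create table casuser.id_obs_summary {options replace=true} as
   select id, count(*) as n_obs
   from casuser.price_data
   group by id;
quit;
```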
Figure 5: Swimlane for the creation of the joined_data table.
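The joined_data table combines the monthly price rows with client attributes such as churn and channel_sales. A minimal FedSQL sketch of this join, keeping only customers present in both tables, could look like this:

```sas
/* Inner join: keep only customers appearing in both tables */
proc fedsql sessref=mySession;
   create table casuser.joined_data {options replace=true} as
   select p.*, c.churn, c.channel_sales
   from casuser.price_data as p
   inner join casuser.client_data as c
      on p.id = c.id;
quit;
```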
Figure 6: Swimlanes for the creation of the dummy_table table and termination of the CAS session.
These tables were then used in a report in SAS Visual Analytics, accessible through the Explore and Visualize option in the Applications menu, to visualize the data and determine which types of transformations to perform on them before building predictive models for churn prediction. The report consists of four distinct pages:
1. Client EDA. This page analyzes the variables in the client_data table and includes, at the top, the number of unique IDs and two tables showing the summary statistics for numerical and date-type variables. At the bottom left, there is a bar chart for each categorical variable, with the option to filter them based on the value of the churn variable. On the bottom right, there is a correlation matrix to display the degree of correlation between pairs of numeric variables.
Figure 7: Client EDA dashboard.
2. Price EDA. This page analyzes the variables in the price_data table and has a structure very similar to the previous page, except that the bottom left shows a Time Series Plot instead of bar charts. This plot visualizes the monthly trend (obtained by averaging across all IDs) of each price variable, which can be selected through the drop-down list above it. Creating this Time Series Plot required the previously created dummy_table; for more details, refer to this article I wrote, which explains how to build this type of plot in more depth.
Figure 8: Price EDA dashboard.
3. Joined data EDA. This page leverages the joined_data table to provide insights into customer churn and price trends. At the top, a pie chart illustrates the proportions of churners and non-churners among clients present in both tables. At the top right, a table lists the IDs with fewer than 12 observations in the price_data table, indicating that price values are missing for some months for these customers. At the bottom, a scrolling container showcases time series plots for all price variables, with average trends displayed separately for churners and non-churners. Additionally, the time series plots can be filtered based on the values of the channel_sales variable, enabling more granular analysis.
Figure 9: Joined data EDA dashboard.
4. Deep-dive on low-cardinality variables. This page focuses on three numeric variables from the client_data table that exhibit a low number of distinct values: nb_prod_act, num_years_antig, and forecast_discount_energy. The scrolling container displays, on the left, the frequency distributions of the original variables converted into categorical form, and on the right, the same distributions with rare categories grouped into a single category. The idea is that this grouping mitigates noise caused by infrequent values, allowing the model to focus on meaningful patterns and improve predictive performance.
Figure 10: Deep-dive on low-cardinality variables dashboard.
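To make the grouping idea concrete, here is a minimal sketch for nb_prod_act; the levels kept and the "3+" label are purely illustrative, since in practice the cutoff comes from the frequency distributions observed on this report page:

```sas
/* Convert nb_prod_act to a categorical variable, collapsing rare counts
   into a single class (the grouping shown here is illustrative) */
data casuser.client_data_grouped;
   set casuser.client_data;
   length nb_prod_act_cat $8;
   if nb_prod_act in (1, 2) then nb_prod_act_cat = cats(nb_prod_act);
   else nb_prod_act_cat = "3+";
run;
```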
Based on the findings from the VA report, a dedicated flow (data_preparation.flw) was built in SAS Studio. This flow processes and transforms the data from two source tables (client_data and price_data) and combines them into a single resulting table, which serves as the input for SAS Model Studio to train machine learning models. Apart from the 'Create CAS Session' and 'Terminate CAS Session' swimlanes, the flow consists of three main swimlanes:
1. Client data preparation. This sub-flow starts with the client_data table and produces the client_data_final table through the following transformations:
- Removal of infrequent categories in the channel_sales and origin_up variables.
- Creation of contract_age_at_last_modif, has_modified, and days_to_end variables based on the available date variables.
- Conversion of num_years_antig, forecast_discount_energy, and nb_prod_act into categorical variables, by grouping rare classes.
Figure 11: Swimlanes for the creation of the CAS session and the client_data_final table.
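As an example of the date-based features, here is a sketch of how such variables could be derived; the exact definitions used in the flow may differ, the date column names come from the Kaggle dataset, and the reference date is an assumption:

```sas
/* Derive date-based features (assumes the date columns were imported as SAS dates) */
data casuser.client_data_final;
   set casuser.client_data;
   /* contract age (in years) at the time of the last product modification */
   contract_age_at_last_modif = intck("year", date_activ, date_modif_prod);
   /* flag: has the product ever been modified? (one possible definition) */
   has_modified = (date_modif_prod ne date_activ);
   /* days remaining until the contract end, from a reference date (assumption) */
   days_to_end = date_end - "01JAN2016"d;
run;
```

The conversion of num_years_antig, forecast_discount_energy, and nb_prod_act into grouped categorical variables follows the same pattern shown in the deep-dive section above.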
2. Price data preparation. This sub-flow processes the price_data table to create the price_data_final table, involving the following steps:
- Imputation of missing price values using the median for the corresponding months.
- Computation of summary statistics (minimum, maximum, mean, and maximum difference between consecutive months) to represent price trends.
Figure 12: Swimlane for the creation of the price_data_final table.
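To illustrate these two steps, here is a simplified sketch for a single price column (price_off_peak_var); the actual flow applies the same logic to all price variables, and the intermediate table names are placeholders:

```sas
proc sql;
   /* 1) median price per month, used to fill missing values */
   create table work.monthly_median as
   select price_date, median(price_off_peak_var) as med_price
   from casuser.price_data
   group by price_date;

   /* 2) impute missing prices with the median of the corresponding month */
   create table work.price_imputed as
   select p.id, p.price_date,
          coalesce(p.price_off_peak_var, m.med_price) as price_off_peak_var
   from casuser.price_data as p
   left join work.monthly_median as m
     on p.price_date = m.price_date;
quit;

/* 3) min, max and mean price per customer */
proc means data=work.price_imputed noprint nway;
   class id;
   var price_off_peak_var;
   output out=work.price_stats(drop=_type_ _freq_)
          min=price_min max=price_max mean=price_mean;
run;

/* 4) largest month-over-month change per customer */
proc sort data=work.price_imputed; by id price_date; run;

data work.price_diff(keep=id price_max_diff);
   set work.price_imputed;
   by id;
   retain price_max_diff;
   if first.id then price_max_diff = .;
   diff = abs(dif(price_off_peak_var));   /* change vs. the previous month */
   if not first.id then price_max_diff = max(price_max_diff, diff);
   if last.id then output;
run;

/* 5) combine into the final per-customer price table and write it to CAS */
data casuser.price_data_final;
   merge work.price_stats work.price_diff;
   by id;
run;
```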
3. Joining the two prepared tables for model building. This sub-flow combines the prepared tables into the prepared_data table through the following steps:
- Merging the client_data_final and price_data_final tables.
- Removal of numeric variables until no pairwise correlation exceeds 0.9, reducing redundancy and enhancing model interpretability.
Figure 13: Swimlane for the creation of the prepared_data table.
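To see how the highly correlated pairs can be identified in code, here is a sketch that computes the correlation matrix of the merged table and lists every pair with an absolute correlation above 0.9; the input table name merged_data is a placeholder for the table obtained right after the merge, and deciding which variable of each pair to drop remains a manual step:

```sas
/* Pairwise Pearson correlations among the numeric inputs */
proc corr data=casuser.merged_data outp=work.corr_matrix noprint;
   var _numeric_;
run;

/* List the variable pairs with |r| > 0.9 for review
   (each pair appears twice, once per direction) */
data work.high_corr(keep=_name_ with_var corr);
   set work.corr_matrix(where=(_type_ = "CORR"));
   array nums {*} _numeric_;
   do i = 1 to dim(nums);
      with_var = vname(nums[i]);
      corr = nums[i];
      if with_var ne _name_ and abs(corr) > 0.9 then output;
   end;
run;
```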
Next, we will use the prepared_data table in Model Studio, accessible through the Build Models option in the Applications menu.
To begin developing models in Model Studio, a new project was created by clicking on New Project in the upper-right corner of the Projects page and selecting the prepared_data table as the data source. During project creation, clicking the Advanced button allows you to adjust settings such as performing random undersampling or modifying the data partitioning proportions. For example, I opted for an 80/10/10 train/validation/test split to allow the model to train on more data compared to the default settings. Once configured, clicking the Save button finalizes the project creation.
After opening the project, the roles of the id and churn variables were updated in the Data tab to Key and Target, respectively. Assigning the Key role to the id variable enables its use in the Model Interpretability dashboard to specify the instances to explain, as detailed later. Three pipelines were then built and executed:
Custom pipeline
This pipeline was manually constructed from a blank template to compare Logistic Regression, Random Forest, and Gradient Boosting models across two scenarios:
Advanced template for class target with autotuning
This pipeline is one of the out-of-the-box templates you can choose from during project creation or when adding a new pipeline to the project.
SAS Automatically Generated Pipeline
This pipeline was created by selecting the Automatically generate the pipeline option when adding a new pipeline. SAS Viya uses automated machine learning to build the pipeline dynamically, adhering to a user-defined time limit. The resulting pipeline contains the top 5 models selected from a pool of candidates, which can be customized in Advanced Settings. For further details, refer to the documentation on Automated Pipeline Creation.
Figure 14: Custom pipeline.
Figure 15: Advanced template for class target with autotuning.
Figure 16: SAS Automatically Generated Pipeline.
After running the pipelines, model performance within each pipeline was compared using the Model Comparison node, while models from all pipelines were compared through the Pipeline Comparison tab. The chosen evaluation metric was the Area Under the Curve (AUC), selected by navigating to Project Settings via the cogwheel in the upper-right corner, then under Rules, setting the Class selection statistic to Area under curve (C statistic).
AUC was chosen because it evaluates the ranking quality of the probability estimates, which should assign high probabilities to the positive examples (true churners) and low probabilities to the negatives (non-churners). AUC is a well-accepted ranking metric for imbalanced datasets, and since the goal is to create a reliable scoring list for selecting customers to target in a retention campaign, ranking quality is more important than raw predictive accuracy for this use case.
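To make the ranking interpretation concrete, AUC can be read as a probability:

AUC = P(predicted churn score of a randomly chosen churner > predicted churn score of a randomly chosen non-churner)

so an AUC of 0.8, for example, means that the model ranks a true churner above a non-churner 80% of the time.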
The Pipeline Comparison tab provides a summary of results after running the three pipelines:
Figure 17: Pipeline Comparison tab.
As shown, the champion model for each pipeline is displayed. Additionally, you can add challenger models to this tab for comparison. Simply right-click on the desired Supervised Learning node and select Add Challenger Model. For instance, by examining the Model Comparison nodes of each pipeline, I identified that the Random Forest model from the custom pipeline trained on the original data achieved the highest AUC score, excluding the Ensemble models. I chose to add this model as a challenger in the Pipeline Comparison tab:
Figure 18: Pipeline Comparison tab after adding a challenger model.
For this use case, I selected the Random Forest model as the final one to score customers based on their churn likelihood. While its performance is comparable to that of the Ensemble models, the Random Forest model also provides native insights into the relative importance of input variables by considering them collectively. This differs from the approach used in Ensemble nodes, where importance is estimated separately using one-level decision trees for each variable.
The scored data from the final model was saved to the scored_data.sashdat file and loaded into memory as the scored_data table. This was accomplished by adding a Save Data node as a child of the model node.
Figure 19: Save Data node.
The Post-training Properties of the selected model node were configured to leverage SAS Viya's interpretability tools, including:
These model interpretability techniques are model-agnostic, meaning they can be applied to any model generated by a supervised learning node!
I decided to execute the model node twice to provide insights into both the five customers with the highest churn likelihood and the five with the lowest, using the Specify instances to explain section. Assigning the Key role to the id variable, as described earlier, made it easier to specify the instances for this analysis.
In the first run, data for Variable Importance, PD and ICE Overlay Plots, LIME Explanations and HyperSHAP values were downloaded from the Results pane. In the second run, the same results were downloaded, except for Variable Importance, as it provides global interpretability rather than local insights and remains unchanged when only the instances for computing local interpretability tools are modified.
Finally, this data was used in SAS Visual Analytics (VA) to build a dashboard, offering a comprehensive understanding of the selected model and its predictions.
Figure 20: Model Interpretability dashboard.
The dashboard features a bar chart on the left side that displays the relative variable importance. On the right, a scrolling container contains two distinct boxes: one showcasing the top 5 clients with the highest churn likelihood, and the other highlighting the top 5 clients with the lowest churn likelihood. Each box includes the following:
For a detailed explanation of how the HyperSHAP values and LIME Explanations bar charts have been configured to sort the bars by their absolute values, check out this article that walks you through the entire process.
Starting from the scored_data table, I simulated various retention campaign scenarios aimed at incentivizing high-risk customers to remain with the energy supplier by offering targeted discounts.
This simulation was conducted on the test set of the scored_data table to evaluate the model's behavior on unseen data. It was executed using the SAS macro %simulate_profits, which calculates profits across various combinations of discount rates, discount efficiencies, and probability thresholds. The profit represents the average annual profit and is calculated from the margin_net_pow_ele variable, which denotes the net margin on power subscriptions. This margin is assumed to be an average monthly margin and was therefore multiplied by 12 to derive the annual value.
Profit Without Campaign:
The baseline profit, if no retention campaign is conducted, is computed as:
Profit Without Campaign = Gains − Loss
where:
- Gains is the total potential profit if no customers churn, i.e. a perfect scenario in which every customer is retained.
- Loss is the total profit lost due to customers who churned, i.e. the revenue these customers no longer contribute.
This value represents the baseline scenario in which no effort is made to prevent churn, and the business simply absorbs the loss from churned customers. It serves as a reference point for evaluating the effectiveness of any retention campaign.
Prevented Loss and Discount Loss:
For each scenario, the retention campaign targets all customers with a predicted churn likelihood exceeding the probability threshold.
Prevented Loss = Sum of profits from targeted churners × Discount Efficiency
Discount Loss (Non-Churners) = Sum of profits from targeted non-churners × Discount Rate
Discount Loss (Churners) = Prevented Loss × Discount Rate
The total discount loss is the sum of these two components:
Discount Loss = Discount Loss (Non-Churners) + Discount Loss (Churners)
Final Profit for a Scenario:
For each combination of parameters (discount rate, discount efficiency, and probability threshold), the profit is calculated as:
Profit = Gains − Loss + Prevented Loss − Discount Loss
A lower probability threshold expands the retention campaign to more customers, leading to a higher Discount Loss. Conversely, a higher probability threshold excludes some likely churners from the discount offer, increasing the risk of losing them and, therefore, reducing the Prevented Loss term.
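To make these formulas concrete, here is a simplified sketch of the profit calculation for a single combination of parameters. It is not the %simulate_profits macro itself: the partition filter, the predicted-probability column p_churn, and the parameter values are assumptions for the example, and the annual margin is obtained as 12 × margin_net_pow_ele as described above.

```sas
/* Example parameter values for one scenario */
%let discount_rate       = 0.10;
%let discount_efficiency = 0.50;
%let prob_threshold      = 0.55;

data work.scored_test;
   set casuser.scored_data;
   where partition = "TEST";                 /* partition column name is an assumption */
   annual_margin = 12 * margin_net_pow_ele;  /* monthly net margin -> annual value */
   targeted = (p_churn > &prob_threshold);   /* p_churn: predicted churn probability (placeholder name) */
run;

proc sql;
   create table work.scenario_profit as
   select
      sum(annual_margin)                               as gains,           /* profit if nobody churned */
      sum(annual_margin * (churn = 1))                 as loss,            /* profit lost to actual churners */
      sum(annual_margin * (churn = 1) * targeted)
         * &discount_efficiency                        as prevented_loss,  /* saved among targeted churners */
      sum(annual_margin * (churn = 0) * targeted) * &discount_rate
         + calculated prevented_loss * &discount_rate  as discount_loss,   /* cost of the discounts */
      calculated gains - calculated loss
         + calculated prevented_loss
         - calculated discount_loss                    as profit           /* final profit for the scenario */
   from work.scored_test;
quit;
```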
The SAS macro evaluates all possible combinations of parameters to generate a comprehensive profit simulation. The results include:
These tables are integrated into a VA dashboard, which allows for interactive exploration and comparison of various retention campaign scenarios. The parameters for these scenarios can be configured using drop-down lists. The dashboard provides insights into:
Figure 21: Retention campaign scenarios dashboard.
If we had an ideal model and a probability threshold p* that perfectly separates churners from non-churners, the profit would increase as the threshold rises toward p* and then decrease beyond it, converging to the value of the no-retention-campaign scenario.
In a real-world scenario, such an ideal p* does not exist, as there is some overlap between the two client classes. Consequently, increasing the probability threshold p by 0.05 (i.e., the step size I used) could lead to two extreme cases:
These dynamics explain the fluctuations observed in the profit curve shown in Figure 21.
By leveraging this simulation and visualization, the analysis highlights the most profitable strategies for retaining high-risk customers, helping the energy supplier optimize its retention campaigns effectively. Below are some final comments on the retention campaign scenarios:
Thank you everyone for reading this article on churn prediction and prevention in the energy sector using the SAS Viya platform. I hope you found it useful! 😊
Feel free to comment on this article or reach out to me via email (Mattia.Guglielmelli@sas.com) for any feedback or questions, and I encourage you to experiment with SAS Viya to explore how its powerful tools can help solve similar challenges in your own industry!