
Gradient Boosting in Insurance Modeling


Introduction

 

In the insurance sector, precise pricing and effective risk modeling are essential for both profitability and long-term stability. Insurers need to meticulously evaluate numerous factors to set premiums that accurately reflect the risk level of each policyholder.

 

While traditional models like GLMs and GAMs are robust tools that have been used extensively in actuarial studies, they often face difficulties in identifying complex interactions among numerous, highly overlapping risk features. This limitation stems mainly from their linear or additive structures, which may not effectively capture the relationships between variables. Additionally, these models usually require explicit specification of interaction terms, which can be challenging when dealing with a large number of potential interactions or when the nature of these interactions is not predefined.

 

Recent advancements have seen a shift towards sophisticated machine learning methods such as gradient boosting, which offer enhanced predictive capabilities. Gradient boosting is a robust ensemble learning technique that merges multiple weak predictive models, usually decision trees, to form a strong overall model.

 

This sequential construction allows the procedure to concentrate increasingly on the harder-to-predict cases, learning from the cumulative mistakes. The final model is a weighted aggregate of all the weak learners, with each one striving to improve on the performance of the preceding models.

 

Ensembles

 

Gradient boosting models owe their moorings to the ensemble methodology. Ensembles grew out of the perennial quest to determine, in advance, which algorithm will perform best for a specific problem. One key insight gained from over three decades of experimentation is that, rather than meticulously selecting a single algorithm, a simpler and more effective way to enhance model accuracy is to combine multiple models into ensembles.

 

In the early 1990s, several researchers independently discovered that ensembles could improve classification performance. The most significant early advancements were made by Breiman in 1996 with Bagging (Bootstrap Aggregating), and by Freund and Schapire in 1996 with AdaBoost (Adaptive Boosting).

 

Creating an ensemble involves building diverse models and then combining their predictions. Component models can be generated by altering case weights, data values, guidance parameters, variable subsets, or partitions of the input space. While combination can be achieved through voting, it is mainly done by weighting the model estimates.

 

For instance, Bagging bootstraps the training dataset to create varied decision trees and then aggregates their predictions through majority voting or averaging. Random Forest introduces a random element to increase the diversity among the combined trees. AdaBoost iteratively builds models by adjusting case weights—giving more weight to cases with large errors and less to accurately predicted ones—and combines the models' predictions using a weighted sum. Gradient Boosting expanded the AdaBoost algorithm to accommodate a wide range of error functions for both regression and classification tasks.
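
To make the bagging idea concrete, here is a minimal sketch in Python (my own illustration, not code from the article), using scikit-learn's DecisionTreeRegressor as the component model: each tree is fit to a bootstrap resample of the training data, and the ensemble prediction is the average of the individual predictions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_bagged_trees(X, y, n_trees=100, seed=0):
    """Fit one decision tree per bootstrap resample of (X, y)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)               # sample cases with replacement
        trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))
    return trees

def bagged_predict(trees, X):
    """Average the trees' predictions (classification would use majority voting)."""
    return np.mean([tree.predict(X) for tree in trees], axis=0)
```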

 

Gradient Boosting

 

The concept of boosting is based on the hypothesis boosting problem posed by Kearns (1988) and Valiant (1989). In essence, the hypothesis boosting problem asks whether the existence of an efficient learning algorithm that generates a weak hypothesis, or weak learner, performing just a bit better than random guessing implies the existence of another efficient algorithm capable of producing a highly accurate hypothesis, or strong learner.

 

In simple words, hypothesis boosting involves the strategy of filtering observations, retaining those that the weak learner can manage, and then developing additional weak learners to tackle the more challenging observations that remain. Algorithms designed to achieve this transformation came to be referred to as "boosting" algorithms.

 

The first highly successful implementation of boosting was AdaBoost. AdaBoost and similar algorithms were first reinterpreted within a statistical framework by Breiman, who referred to them as Adaptive Reweighting and Combining or ARCing algorithms.

 

Friedman (1999) further advanced this framework, introducing it as Gradient Boosting Machines, which later became known popularly as Gradient Boosting. In this approach boosting was redefined as a numerical optimization problem, aiming to minimize a model's loss by incrementally adding weak learners through a gradient descent-like procedure. The algorithm is characterized as a stage-wise additive model, where one new weak learner is added at a time, while the existing weak learners in the model remain unchanged.

 

Mechanics of Gradient Boosting

 

Gradient boosting consists of three key components:

 

  1. A loss function that needs to be optimized. The choice of loss function depends on the specific problem being addressed. A significant advantage of the gradient boosting framework is its versatility; it does not require a new boosting algorithm for each different loss function. Instead, it is sufficiently general to accommodate any differentiable loss function.
  2. A weak learner or base-learner that makes predictions. In gradient boosting, decision trees are commonly employed as the weak learners. Specifically, regression trees are used because they produce continuous (real-valued) outputs, which can be added together; this lets each subsequent model adjust the residuals of the predictions by adding its output to the previous ones.
  3. An additive model that incorporates weak learners to reduce the loss function. Gradient boosting builds a strong learner through numerical optimization. The goal is to reduce the model's loss by incrementally adding weak learners through a gradient descent process. As each new weak learner is added sequentially, the existing weak learners in the model remain fixed and unaltered. A minimal sketch after this list shows how the three components fit together.
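
The following is a minimal sketch of these three components working together, written in Python for the squared-error loss (my own illustration, not the SAS implementation; scikit-learn's DecisionTreeRegressor plays the role of the weak learner).

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gradient_boosting(X, y, n_iter=100, learning_rate=0.1, max_depth=3):
    """Toy gradient boosting for the squared-error loss L(y, F) = (y - F)^2 / 2."""
    f0 = float(np.mean(y))                   # initial guess: the constant minimizing the loss
    pred = np.full(len(y), f0)
    learners = []
    for _ in range(n_iter):
        residuals = y - pred                 # pseudo-residuals = negative gradient of the loss
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)  # weak learner
        pred = pred + learning_rate * tree.predict(X)                        # additive update
        learners.append(tree)
    return f0, learners

def predict_gradient_boosting(f0, learners, X, learning_rate=0.1):
    """The final model: the initial guess plus the weighted sum of all the boosts."""
    pred = np.full(X.shape[0], f0)
    for tree in learners:
        pred = pred + learning_rate * tree.predict(X)
    return pred
```

Each pass fits the next tree to what the current ensemble still gets wrong, which is exactly the "concentrate on the harder cases" behavior described earlier.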

 

Loss Function

 

Let us assume that our training data consists of N cases, which we denote mathematically as:

 

$$\{(X_i, Y_i)\}, \qquad i = 1, \ldots, N$$

 

The goal of gradient boosting is to iteratively construct a collection of functions

 

$$\hat{F}_t(X), \qquad t = 0, 1, 2, \ldots$$

 

by minimizing the expected value of the loss function L(Y, F(X)).

 

The functions

 

$$\hat{F}_t(X)$$

 

are the iterative estimates of F(X), where the function F(X) associates each instance X, the vector of inputs, with its corresponding output value Y.

 

In other words,

 

$$Y = F(X), \qquad \hat{F}(X) = \underset{F(X)}{\arg\min}\; L\big(Y, F(X)\big)$$

 

We can reformulate the estimation problem using expectations and simplify estimation of F(X) by limiting the search space to a specific parametric family of functions such that:

 

$$\hat{F}(X) = F(X, \hat{\theta}), \qquad \hat{\theta} = \underset{\theta}{\arg\min}\; E_X\Big[\,E_Y\big(L(Y, F(X, \theta))\big)\,\big|\,X\,\Big]$$

 

The response variable Y can originate from various distributions, which necessitates the use of different loss functions L(.). In addition, since we do not assume any specific form for the true functional relationship F(X) or for its estimate, it is not possible to solve this algebraically. Rather, iterative numerical methods are used for estimation of F(X).
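
For example (two standard choices, listed here for illustration rather than taken from the article), a continuous response is commonly paired with the squared-error loss, while non-negative claim counts are often modeled with the Poisson deviance:

$$L\big(Y, F(X)\big) = \tfrac{1}{2}\,\big(Y - F(X)\big)^2 \qquad \text{(squared error)}$$

$$L\big(Y, F(X)\big) = 2\,\Big[\,Y \log\frac{Y}{F(X)} - \big(Y - F(X)\big)\Big] \qquad \text{(Poisson deviance)}$$

Both are differentiable in F(X), which, as discussed below, is all the gradient boosting framework requires.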

 

Optimization

 

If M is the number of iterations, then the estimate of F(X) is given as:

 

$$\hat{F}(X) = \sum_{t=0}^{M} \hat{f}_t(X)$$

 

where,

 

$$\hat{f}_0(X)$$

 

is the initial guess, and

 

$$\hat{f}_t(X), \qquad t = 1, \ldots, M$$

 

are the incremental values obtained at each iteration. These are also termed boosts.

 

In order to obtain a feasible solution, the subsequent iterations reduce to minimizing a weighted sum of base-learner or weak-learner functions h(X, θ), which are the component models of the ensemble (e.g., decision trees).

 

Therefore, at iteration t we can rewrite the estimate of F(X) as:

 

$$\hat{F}_t(X) = \hat{F}_{t-1}(X) + \rho_t\, h_t(X, \theta_t)$$

 

where ρt is the optimal step size at iteration t.

 

After the initial step, each subsequent model at iteration t attempts to minimize

 

$$\sum_{i=1}^{N} L\big(Y_i,\; \hat{F}_{t-1}(X_i) + \rho_t\, h_t(X_i, \theta_t)\big)$$

 

So, instead of solving the optimization problem directly, each weak-learner function ht(X, θt) can be viewed as a step in the direction of steepest descent, chosen to decrease the value of the empirical loss function L(.). The value of ρt is computed using the line search optimization approach, an iterative technique used to locate a local minimum of a multidimensional nonlinear function by utilizing its gradients.
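
In the notation above, once the weak learner ht has been fit, the line search reduces to a one-dimensional minimization over the step size (a standard formulation, stated here for completeness):

$$\rho_t = \underset{\rho}{\arg\min}\; \sum_{i=1}^{N} L\big(Y_i,\; \hat{F}_{t-1}(X_i) + \rho\, h_t(X_i, \theta_t)\big)$$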

 

Each model ht(X, θt) is trained on a new dataset {(Xi, ri,t)}, i = 1, …, N, where the pseudo-residuals ri,t are calculated as:

 

$$r_{i,t} = -\left[\frac{\partial\, L\big(Y_i, F(X_i)\big)}{\partial\, F(X_i)}\right]_{F(X) = \hat{F}_{t-1}(X)}$$

 

The above equation brings forth an important point: the chosen loss function must be continuous and differentiable in order to obtain the pseudo-residuals.
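
As a quick worked example (my own, not from the article), take the squared-error loss L(Y, F) = ½ (Y − F)²; differentiating with respect to F(Xi) and negating gives

$$r_{i,t} = -\left[\frac{\partial\, \tfrac{1}{2}\big(Y_i - F(X_i)\big)^2}{\partial\, F(X_i)}\right]_{F(X) = \hat{F}_{t-1}(X)} = Y_i - \hat{F}_{t-1}(X_i),$$

so for squared error the pseudo-residuals are simply the ordinary residuals of the current model.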

 

Furthermore, the incorporation of the residuals predicted by a weak model into an existing model's prediction helps steer the model closer to the correct target. Repeatedly adding these residuals enhances the overall accuracy of the model's predictions.

 

Regularization

 

The algorithm can be prone to overfitting if the iterative process is not adequately regularized. A common method to regularize gradient boosting is to apply shrinkage, which reduces the impact of each gradient descent step (ρt); penalty-based options, such as the L1 and L2 regularizations available in the Gradient Boosting node in SAS Viya, provide a related form of regularization. The rationale behind shrinkage is that making numerous small improvements to a model is more effective than making a few large adjustments.
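
In formula form (a standard presentation; the shrinkage parameter ν is notation introduced here, not taken from the article), shrinkage scales down the boost added at each iteration:

$$\hat{F}_t(X) = \hat{F}_{t-1}(X) + \nu\,\rho_t\, h_t(X, \theta_t), \qquad 0 < \nu \le 1.$$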

 

Moreover, additional regularization can be attained by restricting the complexity of the trained models. For decision trees, this can be done by limiting the depth of the trees or setting a minimum number of instances required to split a node.

 

Early stopping is another effective method for enhancing the generalization capabilities of the model being estimated. This technique employs validation data to identify the optimal number of iterations needed to construct a model that generalizes well to unseen data.

 

Lastly, gradient boosting variations often include parameters that introduce randomness into the base learners. This can enhance the ensemble's generalization by employing techniques like random subsampling without replacement.
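
To show how these regularization controls typically appear in practice, here is a hedged sketch using scikit-learn's GradientBoostingRegressor (an open-source implementation, not the SAS node; parameter values are arbitrary):

```python
from sklearn.ensemble import GradientBoostingRegressor

model = GradientBoostingRegressor(
    n_estimators=500,         # upper bound on the number of boosting iterations
    learning_rate=0.05,       # shrinkage: many small steps rather than a few large ones
    max_depth=3,              # limit the complexity of each tree
    min_samples_leaf=20,      # minimum number of cases required in a leaf
    subsample=0.8,            # random subsampling without replacement at each iteration
    validation_fraction=0.2,  # data held out to monitor generalization
    n_iter_no_change=10,      # early stopping once validation loss stops improving
    random_state=42,
)
# model.fit(X_train, y_train) would then train the regularized ensemble.
```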

 

Interpretation

 

For any resulting model, it is immensely advantageous to be able to interpret the results. A common task is to determine variable importance. In ensembles, performing feature selection can be challenging because we cannot distinguish the main effects from the interaction effects. Conceptually, the variable importance measure is based on the number of times a variable is selected for splitting, as well as the improvement in the model's performance that results from those splits.
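
As a small, hedged illustration (scikit-learn's impurity-based importances on synthetic data; the SAS node computes its own importance measure):

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic stand-in for policyholder features; real insurance data would replace this.
X, y = make_regression(n_samples=1000, n_features=8, random_state=0)
feature_names = [f"x{i}" for i in range(X.shape[1])]

model = GradientBoostingRegressor(random_state=0).fit(X, y)

# How much each variable's splits improved the fit, aggregated over all trees.
importance = pd.Series(model.feature_importances_, index=feature_names)
print(importance.sort_values(ascending=False))
```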

 

The partial dependence plot is another powerful interpretive tool in gradient boosting. Partial dependence illustrates the impact of a variable on the modeled response after averaging out the effects of all other explanatory variables. Although the precise method involves numerically integrating over the other variables on a suitable grid of values, this can be computationally intensive. Consequently, a simpler approach is often employed, where the marginalized variables are fixed at a constant value, typically their sample mean.
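
A minimal sketch of the averaging approach (my own illustration, assuming a fitted model with a predict method and a NumPy feature matrix X): for each grid value of the variable of interest, that variable is set to the grid value for every case and the predictions are averaged over all other variables.

```python
import numpy as np

def partial_dependence(model, X, feature_index, grid):
    """Average prediction as a function of one feature, marginalizing over the rest."""
    averages = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature_index] = value                  # fix the feature of interest
        averages.append(model.predict(X_mod).mean())     # average over the other features
    return np.array(averages)

# Example usage: dependence on feature 0 across its observed range.
# grid = np.linspace(X[:, 0].min(), X[:, 0].max(), num=20)
# pd_curve = partial_dependence(model, X, 0, grid)
```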

 

Suitability for Insurance Data

 

  1. Handling high-dimensional, heterogeneous data. Insurance data typically encompasses a diverse array of features, such as categorical variables (e.g., vehicle type, occupation), numerical variables (e.g., age, income), and potentially intricate interactions among these variables. Gradient boosting models are well-suited to manage this heterogeneous data and can automatically detect non-linear relationships and feature interactions, a task that is often challenging for traditional linear models.
  2. Robustness to outliers and missing data. Insurance datasets often include outliers (e.g., extremely high claims) and missing values, which can adversely affect the performance of certain modeling techniques. Gradient boosting models are typically resilient to these irregularities thanks to their ensemble structure and the inherent ability of decision trees to handle missing data (see the sketch after this list).
  3. Interpretability and feature importance. Although gradient boosting models are technically complex, they provide a degree of interpretability by allowing the calculation of feature importance scores. These scores quantify the relative contribution of each input feature to the final prediction, offering valuable insights into key risk factors. This information can be used to refine underwriting guidelines, adjust pricing strategies, and improve risk assessment processes.
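
As a rough illustration of the first two points, here is a hedged sketch using scikit-learn's HistGradientBoostingRegressor, which offers a Poisson loss, categorical handling, and native support for missing values (the data, column names, and settings below are invented for illustration, not taken from the article):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import HistGradientBoostingRegressor

# Invented toy policy data: mixed types and a missing value, as is typical of insurance data.
policies = pd.DataFrame({
    "driver_age":   [25, 43, 67, 31, np.nan],  # numeric, with a missing value
    "vehicle_type": [0, 1, 1, 2, 0],           # integer-coded categories (e.g., sedan/SUV/truck)
    "annual_miles": [12000, 8000, 5000, 15000, 9000],
})
claim_counts = np.array([0, 1, 0, 2, 0])

model = HistGradientBoostingRegressor(
    loss="poisson",              # suited to non-negative claim counts
    categorical_features=[1],    # treat vehicle_type as categorical
    learning_rate=0.05,
    max_iter=200,
)
model.fit(policies, claim_counts)   # the missing driver_age is handled without imputation
```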

 

Gradient Boosting Node in SAS Dynamic Actuarial Modeling

 

The Gradient Boosting node in the SAS Dynamic Actuarial Modeling solution builds predictive models using the gradient boosting methodology outlined above. The node constructs multiple decision trees, typically using independent samples of the data drawn without replacement, and improves its predictions by minimizing a chosen loss function.

 

[Image: SAS Dynamic Actuarial Modeling advanced template]


 

[Image: Gradient Boosting node basic options]

 

The basic options for the node allow us to specify the number of iterations, the subsample rate, and the L1 and L2 regularizations, among other settings. The number of iterations equals the number of trees specified, for both interval and binary target variables.

 

The subsample rate indicates the percentage of training observations used to train each tree. A distinct training sample is drawn for each iteration; within a single iteration, all trees are trained on the same training data.

 

[Image: Gradient Boosting node tree-splitting options]

 

Since the core of the algorithm relies on decision trees, various options related to tree splitting are also available. These include the maximum number of branches, the maximum tree depth, the minimum leaf size, the treatment of missing values, and so on.

 

[Image: Gradient Boosting node post-training properties]

For interpretability purposes, the Gradient Boosting node allows us to generate the variable importance of the input variables and partial dependence plots that illustrate the functional relationship between an input variable and the model's predictions. In addition, we can instruct the node to produce individual conditional expectation (ICE) plots, which can uncover intriguing subgroups and interactions among model variables.

 

LIME, or local interpretable model-agnostic explanations, is another algorithm that computes a more easily interpretable linear model around an individual observation. To determine the impact of an observed input variable value on the predicted probability or outcome of the target variable, we can request that the node generate HyperSHAP values. HyperSHAP calculates Shapley values, which represent the average contribution of an input variable to the model's prediction, considering all possible combinations of the input variables.
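
Outside SAS, the open-source shap package offers a comparable computation for tree ensembles. A minimal sketch (assuming shap is installed, and that `model` and `X` are a fitted tree-based gradient boosting model and its input matrix, as in the earlier sketches):

```python
import shap

# TreeExplainer computes Shapley values efficiently for tree-based models.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)   # one contribution per case and input variable

# For a single case, the contributions plus the expected value add up to the prediction.
print(explainer.expected_value, shap_values[0])
```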

 

LightGBM in Gradient Boosting Node

 

[Image: LightGBM options in the Gradient Boosting node]

The LightGBM option determines if the LightGBM modeling algorithm will be utilized. Enabling this option provides access to additional configuration settings related to basic parameters and tree-splitting techniques.

 

LightGBM is a high-performance gradient boosting framework designed to handle large datasets and high-dimensional data efficiently and effectively. Rather than processing all data at each iteration, LightGBM employs Gradient-Based One-Side Sampling (GOSS) to prioritize data instances with larger gradients. This approach reduces the number of data instances required, thereby accelerating training while largely preserving accuracy.

 

LightGBM also utilizes Exclusive Feature Bundling (EFB) that combines mutually exclusive features—those that rarely take non-zero values at the same time—into a single feature. This reduces the total number of features, thereby enhancing training speed and lowering memory usage.

 

Unlike traditional gradient boosting frameworks that use level-wise tree growth, LightGBM employs leaf-wise growth. This approach splits the leaf with the highest loss reduction, producing deeper trees and potentially improving accuracy. However, if not properly managed, this method can lead to overfitting.
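
For reference, the open-source LightGBM Python package (distinct from the SAS node; the parameter values below are arbitrary) exposes the leaf-wise behavior mainly through num_leaves, with additional controls to keep the deeper leaf-wise trees from overfitting:

```python
from lightgbm import LGBMRegressor

model = LGBMRegressor(
    n_estimators=500,       # number of boosting iterations
    learning_rate=0.05,     # shrinkage
    num_leaves=31,          # main lever for leaf-wise tree complexity
    max_depth=-1,           # -1 leaves the depth unrestricted
    min_child_samples=20,   # minimum cases per leaf, a guard against overfitting
    subsample=0.8,          # row subsampling ...
    subsample_freq=1,       # ... applied at every iteration
    colsample_bytree=0.8,   # feature subsampling per tree
)
# model.fit(X_train, y_train) would train the leaf-wise boosted ensemble.
```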

 

Thanks to innovations like GOSS and EFB, LightGBM frequently outpaces other gradient boosting frameworks in terms of speed. Its efficient tree-growing strategy often results in more accurate models.

 

For more information about the Gradient Boosting node, see Overview of Gradient Boosting in SAS Viya: Machine Learning Node Reference.

 

Additional Information

 

For more information on SAS Dynamic Actuarial Modeling visit the software information page here. For more information on curated learnings paths on SAS Solutions and SAS Viya, visit the SAS Training page. You can also browse the catalog of SAS courses here.

 

 

Find more articles from SAS Global Enablement and Learning here.
