In the insurance sector, precise pricing and effective risk modeling are essential for both profitability and long-term stability. Insurers need to meticulously evaluate numerous factors to set premiums that accurately reflect the risk level of each policyholder.
While traditional models like GLMs and GAMs are robust tools that have been used extensively in actuarial studies, they often face difficulties in identifying complex interactions among numerous, highly overlapping risk features. This limitation stems mainly from their linear or additive structures, which may not effectively capture the relationships between variables. Additionally, these models usually require explicit specification of interaction terms, which can be challenging when dealing with a large number of potential interactions or when the nature of these interactions is not predefined.
Recent advancements have seen a shift towards sophisticated machine learning methods such as gradient boosting, which offer enhanced predictive capabilities. Gradient boosting is a robust ensemble learning technique that merges multiple weak predictive models, usually decision trees, to form a strong overall model.
The method increasingly concentrates on the harder-to-predict cases, learning from the cumulative mistakes. The final model is a weighted aggregate of all the weak learners, with each one striving to improve on the performance of the preceding models.
Gradient boosting models owe their moorings to the ensemble methodology. The ensemble methodology was the direct outcome of the perennial quest to determine, in advance, which algorithm will perform best for a specific problem. One key insight gained after over three decades of experimentation is that, rather than meticulously selecting a single algorithm, a simpler and more effective way to enhance model accuracy is to combine multiple models into ensembles.
In the early 1990s, several researchers independently discovered that ensembles could improve classification performance. The most significant early advancements were made by Breiman in 1996 with Bagging (Bootstrap Aggregating), and by Freund and Schapire in 1996 with AdaBoost (Adaptive Boosting).
Creating an ensemble involves building diverse models and then combining their predictions. Component models can be generated by altering case weights, data values, tuning parameters, variable subsets, or partitions of the input space. While combination can be achieved through voting, it is more commonly done by weighting the model estimates.
For instance, Bagging bootstraps the training dataset to create varied decision trees and then aggregates their predictions through majority voting or averaging. Random Forest introduces a random element to increase the diversity among the combined trees. AdaBoost iteratively builds models by adjusting case weights—giving more weight to cases with large errors and less to accurately predicted ones—and combines the models' predictions using a weighted sum. Gradient Boosting expanded the AdaBoost algorithm to accommodate a wide range of error functions for both regression and classification tasks.
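To make the reweighting idea concrete, here is a minimal, self-contained sketch of one AdaBoost-style weight update for a binary (±1) target. The data and the single weak learner's predictions are invented for illustration; this is not a full AdaBoost implementation.

```python
import math

def update_weights(weights, y_true, y_pred):
    """One AdaBoost-style case-weight update: misclassified cases get
    heavier, correctly classified cases get lighter."""
    # Weighted error rate of the current weak learner.
    err = sum(w for w, yt, yp in zip(weights, y_true, y_pred) if yt != yp)
    err /= sum(weights)
    # The learner's vote weight in the final weighted-sum combination.
    alpha = 0.5 * math.log((1 - err) / err)
    new_w = [w * math.exp(-alpha if yt == yp else alpha)
             for w, yt, yp in zip(weights, y_true, y_pred)]
    total = sum(new_w)                      # renormalize to sum to 1
    return [w / total for w in new_w], alpha

weights = [0.25, 0.25, 0.25, 0.25]          # start with uniform weights
y_true = [1, 1, -1, -1]
y_pred = [1, -1, -1, -1]                    # the weak learner misses case 2
weights, alpha = update_weights(weights, y_true, y_pred)
```

After one update, the single misclassified case carries half of the total weight, so the next weak learner is strongly pushed to fix it.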
The concept of boosting is based on the hypothesis boosting problem posed by Kearns (1988) and Valiant (1989). In essence, the hypothesis boosting problem asks whether the existence of an efficient learning algorithm that generates a weak hypothesis, or weak learner, performing just slightly better than random guessing implies the existence of another efficient algorithm capable of producing a highly accurate hypothesis, or strong learner.
In simple words, hypothesis boosting involves filtering observations: retaining those that the weak learner can manage, and then developing additional weak learners to tackle the more challenging observations that remain. Algorithms designed to achieve this transformation came to be referred to as "boosting" algorithms.
The first highly successful implementation of boosting was AdaBoost. AdaBoost and similar algorithms were first reinterpreted within a statistical framework by Breiman, who referred to them as Adaptive Reweighting and Combining (ARCing) algorithms.
Friedman (1999) further advanced this framework, introducing it as Gradient Boosting Machines, which later became known popularly as Gradient Boosting. In this approach boosting was redefined as a numerical optimization problem, aiming to minimize a model's loss by incrementally adding weak learners through a gradient descent-like procedure. The algorithm is characterized as a stage-wise additive model, where one new weak learner is added at a time, while the existing weak learners in the model remain unchanged.
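The stage-wise procedure can be sketched in a few lines of plain Python, assuming squared-error loss and one-dimensional decision stumps as the weak learners. This is an illustrative toy, not the SAS implementation, and the data are synthetic.

```python
import random

def fit_stump(x, residuals):
    """Find the split threshold minimizing the squared error of the
    residuals; return the fitted stump as a prediction function."""
    best = None
    for thr in sorted(set(x)):
        left = [r for xi, r in zip(x, residuals) if xi <= thr]
        right = [r for xi, r in zip(x, residuals) if xi > thr]
        if not left or not right:
            continue
        lmean = sum(left) / len(left)
        rmean = sum(right) / len(right)
        sse = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, thr, lmean, rmean)
    _, thr, lmean, rmean = best
    return lambda xi: lmean if xi <= thr else rmean

def boost(x, y, n_iter=20, nu=0.3):
    preds = [sum(y) / len(y)] * len(y)   # initial guess: the sample mean
    for _ in range(n_iter):
        # For squared-error loss the pseudo-residuals are the residuals.
        resid = [yi - pi for yi, pi in zip(y, preds)]
        stump = fit_stump(x, resid)
        # Stage-wise update: previously added learners remain unchanged.
        preds = [pi + nu * stump(xi) for pi, xi in zip(preds, x)]
    return preds

random.seed(0)
x = [i / 10 for i in range(50)]
y = [xi ** 2 + random.gauss(0, 0.1) for xi in x]
preds = boost(x, y)
```

Each iteration fits one stump to the current residuals and adds a shrunken copy of it to the ensemble, so the training error falls step by step.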
Gradient boosting consists of three key components: a loss function to be optimized, a weak learner used to make predictions, and an additive model that accumulates weak learners to minimize the loss.
Let us assume that our training data consists of N cases, which we denote mathematically as {(Xi, Yi), i = 1, …, N}.
The goal of gradient boosting is to iteratively construct a collection of functions

F0(X), F1(X), …, FM(X)

by minimizing the expected value of the loss function L(Y, F(X)). The functions Ft(X) are the iterative estimates of F(X), where the function F(X) associates each instance X, the vector of inputs, with its corresponding output value Y. In other words,

F*(X) = arg minF E[L(Y, F(X))].
We can reformulate the estimation problem using expectations and simplify the estimation of F(X) by limiting the search space to a specific parametric family of functions F(X, θ), such that:

F*(X) = F(X, θ*), where θ* = arg minθ E[L(Y, F(X, θ))].
The response variable Y can originate from various distributions, which necessitates the use of different loss functions L(.). In addition, since we do not assume any specific form for the true functional relationship F(X) or for its estimate, it is not possible to solve this algebraically. Rather, iterative numerical methods are used for estimation of F(X).
If M is the number of iterations, then the estimate of F(X) is given as:

FM(X) = f0(X) + Σ (t = 1 to M) ft(X)

where f0(X) is the initial guess and ft(X), t = 1, …, M, are the incremental values obtained at each iteration. These are also termed boosts.
In order to obtain a feasible solution, the subsequent iterations resolve to minimizing a weighted sum of base-learner or weak-learner functions h(X, θ), which are the component models of the ensemble (e.g., decision trees).
Therefore, at iteration t we can rewrite the estimation of F(X) as:

Ft(X) = Ft−1(X) + ρt ht(X, θt)

where ρt is the optimal step size for each iteration.
After the initial step, all the subsequent models at iteration t attempt to minimize

Σ (i = 1 to N) L(Yi, Ft−1(Xi) + ρt ht(Xi, θt)).
So, instead of directly solving the optimization problem, each weak-learner function ht(X) can be viewed as a step of steepest gradient descent that decreases the value of the empirical loss function L(.). The value of ρt is computed using line search, an iterative optimization technique that locates a local minimum of a multidimensional nonlinear function along the direction given by the function's gradients.
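A grid-based line search over candidate step sizes might look like the following sketch; the grid bounds and the toy data are assumptions for illustration only.

```python
def line_search(y, current, h, grid=None):
    """Scan candidate step sizes rho and return the one minimizing the
    empirical squared-error loss of current + rho * h."""
    grid = grid or [i / 100 for i in range(0, 301)]   # rho in [0, 3]
    def loss(rho):
        return sum((yi - (fi + rho * hi)) ** 2
                   for yi, fi, hi in zip(y, current, h))
    return min(grid, key=loss)

y = [1.0, 2.0, 3.0]
current = [0.0, 0.0, 0.0]      # predictions before this iteration
h = [0.5, 1.0, 1.5]            # weak learner's outputs (proportional to y)
rho = line_search(y, current, h)
```

Because the weak learner's outputs here are exactly half the targets, the optimal step size is 2, which the search recovers.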
Each model is trained on a new dataset in which the pseudo-residuals ri,t are calculated as:

ri,t = −∂L(Yi, F(Xi)) / ∂F(Xi), evaluated at F(X) = Ft−1(X).
The above equation brings forth an important point: the chosen loss function must be continuous and differentiable in order to obtain the pseudo-residuals.
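For example, with the squared-error loss L = 0.5·(Y − F)², the negative gradient is simply Y − F, so the pseudo-residuals reduce to the ordinary residuals. A finite-difference check confirms this (the values of y and f below are arbitrary):

```python
def loss(y, f):
    # Halved squared-error loss, a common choice for regression.
    return 0.5 * (y - f) ** 2

def pseudo_residual(y, f, eps=1e-6):
    # Numerical negative gradient of the loss with respect to F:
    # -(dL/dF) via a central finite difference.
    return -(loss(y, f + eps) - loss(y, f - eps)) / (2 * eps)

y, f = 3.0, 1.2
r = pseudo_residual(y, f)      # should be close to y - f = 1.8
```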
Furthermore, the incorporation of the residuals predicted by a weak model into an existing model's prediction helps steer the model closer to the correct target. Repeatedly adding these residuals enhances the overall accuracy of the model's predictions.
The algorithm can be prone to overfitting if the iterative process is not adequately regularized. A common method to regularize gradient boosting is to apply shrinkage (like L1 and L2 regularizations available in the gradient boosting node in SAS Viya), which reduces the impact of each gradient descent step (ρt). The rationale behind this approach is that making numerous small improvements to a model is more effective than making a few large adjustments.
Moreover, additional regularization can be attained by restricting the complexity of the trained models. For decision trees, this can be done by limiting the depth of the trees or setting a minimum number of instances required to split a node.
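In code, a shrinkage step is just a scaling of each boost ρt·ht(X) by a learning rate ν in (0, 1]; the numbers below are arbitrary illustrations.

```python
def shrunken_update(current, h, rho, nu=0.1):
    """Apply one boosting update with shrinkage: each prediction moves by
    nu * rho * h instead of the full step rho * h."""
    return [fi + nu * rho * hi for fi, hi in zip(current, h)]

current = [0.0, 0.0]
h = [1.0, 2.0]
full = shrunken_update(current, h, rho=1.0, nu=1.0)    # unshrunk step
small = shrunken_update(current, h, rho=1.0, nu=0.1)   # shrunk step
```

With ν = 0.1 each step moves only a tenth as far, so many small improvements accumulate instead of a few large adjustments.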
Early stopping is another effective method for enhancing the generalization capabilities of the model being estimated. This technique employs a validation data to identify the optimal number of iterations needed to construct a model that generalizes well to unseen data.
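The early-stopping rule can be sketched as follows, assuming we already have a per-iteration validation loss history and a hypothetical `patience` parameter controlling how long to wait for an improvement.

```python
def early_stop(val_losses, patience=3):
    """Return the 1-based iteration at which validation loss was best,
    scanning forward and stopping after `patience` rounds without
    improvement."""
    best, best_iter = float("inf"), 0
    for t, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, best_iter = loss, t
        elif t - best_iter >= patience:
            break                       # no improvement for `patience` rounds
    return best_iter

# A typical pattern: validation loss falls, then rises as overfitting begins.
val_losses = [1.0, 0.7, 0.5, 0.45, 0.46, 0.48, 0.50, 0.55]
stop_at = early_stop(val_losses)
```

Here the rule selects iteration 4, just before the validation loss begins to climb.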
Lastly, gradient boosting variations often include parameters that introduce randomness into the base learners. This can enhance the ensemble's generalization by employing techniques like random subsampling without replacement.
For any resulting model, it is immensely advantageous to be able to interpret the results. A common task is to determine variable importance. In ensembles, performing feature selection can be challenging because we cannot distinguish the main effects from the interaction effects. Conceptually, the measure of variable importance is based on the number of times a variable is selected for splitting, as well as the improvement in the model's performance as a result of those splits.
The partial dependence plot is another valuable interpretive tool in gradient boosting. Partial dependence illustrates the impact of a variable on the modeled response after averaging out the effects of all other explanatory variables. The precise method involves numerically integrating over the other variables on a suitable grid of values, but this can be computationally intensive. Consequently, a simpler approach is often employed, where the marginalized variables are fixed at a constant value, typically their sample mean.
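Both approaches can be illustrated with a hypothetical two-variable model that is nonlinear in the marginalized variable, so the exact average and the fix-at-the-mean shortcut differ in level while agreeing on the trend in the variable of interest:

```python
def model(x1, x2):
    # Assumed toy model: linear in x1, nonlinear in x2.
    return 2 * x1 + x2 ** 2

x2_sample = [0.0, 1.0, 2.0, 3.0]      # observed values of the other variable

def pd_exact(x1):
    """Exact partial dependence: average the model over observed x2."""
    return sum(model(x1, v) for v in x2_sample) / len(x2_sample)

def pd_at_mean(x1):
    """Cheap approximation: fix x2 at its sample mean."""
    mean_x2 = sum(x2_sample) / len(x2_sample)
    return model(x1, mean_x2)
```

Both curves show the same slope of 2 in x1, but their levels differ because averaging x2² is not the same as squaring the mean of x2.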
The Gradient Boosting node in SAS Dynamic Actuarial Modeling solution builds predictive models using the gradient boosting methodology outlined above. The node constructs multiple decision trees, typically using independent samples of the data without replacement and improves its predictions by minimizing a chosen loss function.
The basic options for the node allow us to specify the number of iterations, the subsample rate, and L1 and L2 regularizations, amongst other things. The number of iterations is equal to the number of trees specified for both interval and binary target variables.
The subsample rate indicates the percentage of training observations used to train each tree. A distinct training sample is drawn for each iteration, but within a single iteration all trees are trained on the identical sample.
Since the core of the algorithm relies on decision trees, various tree-splitting options are also available. Some of these include the maximum number of branches, maximum tree depth, minimum leaf size, the treatment of missing values, and so on.
For interpretability purposes, the gradient boosting node allows us to generate the variable importance of input variables and partial dependency plots that illustrate the functional relationship between an input variable and the model's prediction. In addition, we can instruct the node to present individual conditional expectation (ICE) plots, which can uncover intriguing subgroups and interactions among model variables.
LIME, or locally interpretable model-agnostic explanation, is another algorithm, which fits a more easily interpretable linear model around an individual observation. To determine the impact of an observed input variable value on the predicted probability or outcome of the target variable, we can request the node to generate HyperSHAP values. HyperSHAP calculates Shapley values, which represent the average contribution of an input variable to the model's prediction, considering all possible combinations of the input variables.
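The Shapley idea itself can be illustrated on a tiny two-input model by brute force over all input orderings, replacing "absent" inputs with their sample means. This is a toy illustration of the concept, not the HyperSHAP algorithm, which uses far more efficient computations.

```python
from itertools import permutations

def value(model, x, present, means):
    """Evaluate the model with absent inputs fixed at their means."""
    filled = [xi if i in present else means[i] for i, xi in enumerate(x)]
    return model(*filled)

def shapley(model, x, means):
    """Average each input's marginal contribution over all orderings."""
    n = len(x)
    phi = [0.0] * n
    orders = list(permutations(range(n)))
    for order in orders:
        present = set()
        for i in order:
            before = value(model, x, present, means)
            present.add(i)
            after = value(model, x, present, means)
            phi[i] += (after - before) / len(orders)
    return phi

model = lambda a, b: 3 * a + 2 * b       # assumed additive toy model
phi = shapley(model, x=[2.0, 1.0], means=[0.0, 0.0])
```

For this additive model the Shapley values recover each term's contribution exactly, and they sum to the gap between the prediction and the baseline prediction at the means.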
The LightGBM option determines if the LightGBM modeling algorithm will be utilized. Enabling this option provides access to additional configuration settings related to basic parameters and tree-splitting techniques.
LightGBM is a high-performance gradient boosting framework designed to be efficient and effective in handling large datasets and high-dimensional data. Rather than processing all data at each iteration, LightGBM employs Gradient-Based One-Side Sampling (GOSS) to prioritize data instances with larger gradients. This approach reduces the number of data instances required, thereby accelerating training while largely preserving accuracy.
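The GOSS sampling step can be sketched in plain Python. The fractions a and b and the (1 − a)/b reweighting of the sampled small-gradient instances follow the published description; the gradient values below are invented for illustration.

```python
import random

def goss_sample(gradients, a=0.2, b=0.2, seed=0):
    """Keep the top-a fraction of instances by absolute gradient, randomly
    sample a fraction b of the rest, and up-weight the sampled
    small-gradient instances by (1 - a) / b to keep the estimate unbiased.
    Returns a dict mapping kept instance index -> weight."""
    rng = random.Random(seed)
    n = len(gradients)
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    top = order[:int(a * n)]                    # large-gradient instances
    rest = order[int(a * n):]                   # small-gradient instances
    sampled = rng.sample(rest, int(b * n))      # random subset of the rest
    weights = {i: 1.0 for i in top}
    weights.update({i: (1 - a) / b for i in sampled})
    return weights

grads = [0.9, -0.8, 0.1, 0.05, -0.02, 0.7, 0.03, -0.01, 0.2, 0.15]
w = goss_sample(grads)
```

Only 4 of the 10 instances are used for the next tree: the two largest-gradient cases at full weight, plus two sampled small-gradient cases up-weighted by a factor of 4.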
LightGBM also utilizes Exclusive Feature Bundling (EFB) that combines mutually exclusive features—those that rarely take non-zero values at the same time—into a single feature. This reduces the total number of features, thereby enhancing training speed and lowering memory usage.
Unlike traditional gradient boosting frameworks that use level-wise tree growth, LightGBM employs leaf-wise growth. This approach splits the leaf with the highest loss reduction, producing deeper trees and potentially improving accuracy. However, if not properly managed, this method can lead to overfitting.
Thanks to innovations like GOSS and EFB, LightGBM frequently outpaces other gradient boosting frameworks in terms of speed. Its efficient tree-growing strategy often results in more accurate models.
For more information about the Gradient Boosting node, see Overview of Gradient Boosting in SAS Viya: Machine Learning Node Reference.
For more information on SAS Dynamic Actuarial Modeling visit the software information page here. For more information on curated learnings paths on SAS Solutions and SAS Viya, visit the SAS Training page. You can also browse the catalog of SAS courses here.
Find more articles from SAS Global Enablement and Learning here.