
An Overview of Light Gradient Boosting Machine (LightGBM) Model in SAS Viya


A gradient boosting model consists of multiple decision trees that are built sequentially. Each tree is trained by splitting the (possibly subsampled) training data, then splitting each resulting segment, and so on recursively until a stopping constraint is met.

 

The light gradient boosting machine (LightGBM), proposed by Ke et al. (2017), is a high-performance gradient boosting framework based on decision tree algorithms that is used for regression, classification, and many other machine learning tasks. It extends the gradient boosting algorithm by adding a form of automatic feature selection and by focusing on examples with larger gradients.

 

This post describes the underlying algorithm in LightGBM and how it differs from the traditional gradient boosting algorithm.

 

Background

 

Gradient boosting is an ensemble model of decision trees that are trained in sequence. In each iteration, it learns a decision tree by fitting the negative gradients (also known as residual errors). The most time-consuming part of learning a decision tree is finding the best split points. One of the most popular algorithms for finding split points is the split search algorithm, which enumerates all possible split points on the pre-sorted feature values. This algorithm is simple and finds the optimal split points; however, it is inefficient in both training speed and memory consumption. Another popular approach is the histogram-based algorithm. Instead of finding split points on the sorted feature values, the histogram-based algorithm buckets continuous feature values into discrete bins and uses these bins to construct feature histograms during training. Because the number of bins is usually much smaller than the number of instances, the histogram-based algorithm is more efficient in both memory consumption and training speed. However, if we can further reduce the number of instances or the number of features, we can speed up the training process substantially.
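
To make the binning idea concrete, here is a minimal sketch in a SAS DATA step, assuming a hypothetical table work.train with a continuous variable x. It uses simple equal-width bins for illustration only; LightGBM's actual histogram construction is more sophisticated, but the principle of replacing raw feature values with a small number of bin indexes is the same.

/* Illustrative only: bucket the continuous variable x into 255 bins   */
/* so that split search scans bin boundaries instead of raw values.    */
proc sql noprint;
   select min(x), max(x) into :xmin, :xmax from work.train;
quit;

data work.binned;
   set work.train;
   /* bin index in 1..255; the small constant keeps x = max in bin 255 */
   bin = 1 + floor(255 * (x - &xmin) / (&xmax - &xmin + 1e-12));
run;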

 

LightGBM is a gradient boosting framework based on decision trees that is designed to be distributed and efficient, and it offers the following advantages:

 

  • higher training speed and greater efficiency
  • lower memory usage
  • better accuracy
  • support of parallel, distributed processing
  • ability to handle large-scale data

It uses two novel techniques: gradient-based one-side sampling (GOSS) and exclusive feature bundling (EFB).

 

Gradient-based One-Side Sampling Technique for LightGBM

 

GOSS uses a sampling scheme that strikes a good balance between reducing the number of data instances and preserving the accuracy of the learned decision trees. Recall that in the traditional gradient boosting algorithm, the gradient (residual error) of each observation provides useful information: if an observation has a small gradient, its training error is small and it is already well fit. So a straightforward way to reduce the number of instances is to discard observations with small gradients and focus on observations with large gradients. However, this changes the data distribution and can hurt the accuracy of the learned model. To avoid this issue, GOSS keeps all the observations with large gradients and down samples the observations with small gradients. To compensate for the effect on the data distribution, GOSS introduces a constant multiplier for the sampled small-gradient observations when it computes the information gain.
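
The sketch below illustrates the GOSS sampling step in a base SAS DATA step, assuming a hypothetical table work.train_grad that holds each observation's current gradient in a variable named gradient. The fractions and table names are illustrative, and this is a conceptual sketch rather than the internal implementation of LightGBM or PROC LIGHTGRADBOOST.

%let a = 0.2;   /* fraction of large-gradient observations to keep     */
%let b = 0.1;   /* sampling rate for the remaining observations        */

data work.grad;
   set work.train_grad;
   abs_gradient = abs(gradient);
run;

proc sort data=work.grad;
   by descending abs_gradient;
run;

data work.goss_sample;
   set work.grad nobs=n;
   if _n_ <= &a * n then do;              /* keep all large gradients  */
      weight = 1;
      output;
   end;
   else if rand('uniform') < &b then do;  /* down sample the rest      */
      weight = (1 - &a) / &b;             /* constant multiplier that  */
      output;                             /* offsets the sampling      */
   end;
run;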

 

Exclusive Feature Bundling Technique for LightGBM

 

In real applications, although there can be a large number of features, the feature space is usually quite sparse, which makes it possible to design a nearly lossless approach to reduce the number of effective features. Specifically, in a sparse feature space many features are (almost) exclusive; that is, they rarely take nonzero values simultaneously. EFB bundles such features into a single feature, called an exclusive feature bundle, reducing dimensionality to improve efficiency while maintaining a high level of accuracy. You can then build the same feature histograms from the feature bundles as from the individual features. Because the number of bundles is much smaller than the number of features, the complexity of histogram building is reduced, which speeds up the training process without hurting accuracy.
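
As a rough illustration of the bundling idea, suppose two hypothetical sparse features x1 (with values in 0 to 10) and x2 (with values in 0 to 20) are almost never nonzero at the same time. Offsetting x2 by the range of x1 lets both features share one bundled feature whose histogram bins remain distinguishable. This is a conceptual sketch, not LightGBM's actual bundling algorithm, and the table and variable names are made up.

data work.bundled;
   set work.sparse;                         /* hypothetical sparse table       */
   if x1 ne 0 then bundle = x1;             /* bundle values 0-10 come from x1 */
   else if x2 ne 0 then bundle = 10 + x2;   /* bundle values 10-30 from x2     */
   else bundle = 0;                         /* both features are zero          */
run;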

 

Architecture of LightGBM

 

LightGBM splits the tree leaf-wise, as opposed to other boosting algorithms that grow trees level-wise. It chooses the leaf to split that it believes will yield the largest decrease in the loss function. Because leaf-wise growth chooses splits based on their contribution to the global loss, not just the loss along a particular branch, it often (though not always) learns lower-error trees faster than level-wise growth.

 

Below is a diagrammatic representation that shows the difference in split order between a hypothetical binary leaf-wise tree and a hypothetical binary level-wise tree. Note that other orderings may be chosen for the leaf-wise tree, whereas the order is always the same in the level-wise tree.

 

Leaf-wise tree 

 

Edited_01_MS_Leaf-wiseTree.png


 

Level-wise tree 

 

Edited_02_MS_Level-wise.png

 

LIGHTGRADBOOST Procedure in SAS

 

The LIGHTGRADBOOST procedure trains a gradient boosting tree model by using the LightGBM method. Consider an example in which a national veterans' organization wants to better target its solicitations for donations. The PVA_Train table contains observations for 74,582 individuals. The variable Target Gift Flag (Target_B) is a class target with two levels, 1 (response) and 0 (no response). In addition, several categorical and continuous measurements are available, including demographic, promotion, and gift inputs that summarize the previous donation history.

 

For this example, it is assumed that the PVA_Train and PVA_Test data tables are already loaded in memory and are accessed through the Public caslib, but you can substitute any appropriately defined CAS engine libref.

 

PROC LIGHTGRADBOOST treats numeric variables as interval inputs unless you specify otherwise. Character variables are always treated as nominal inputs.

 

The BOOSTING= option in the PROC LIGHTGRADBOOST statement specifies the type of boosting to use. The default value is GBDT (gradient boosting decision tree); in this example, you use the gradient-based one-side sampling (GOSS) method. The OBJECTIVE= option specifies the objective function to use, and the DETERMINISTIC option ensures stable results when you use the same data and the same parameters. No additional parameters are specified in the PROC LIGHTGRADBOOST statement, so the procedure uses the default values. For example, the number of trees in the boosting model is 100, and the number of bins for interval input variables is 255. Note that the VALIDDATA= option lets you specify validation data to help avoid overfitting.

 

The INPUT and TARGET statements are required in order to run PROC LIGHTGRADBOOST. The INPUT statement indicates which variables to use to build the model, and the TARGET statement indicates which variable the procedure predicts. The SAVESTATE statement creates an analytic store for the model and saves it as a binary object in a data table. You can use the analytic store in the ASTORE procedure to score new data.

 

proc lightgradboost data=public.PVA_Train validdata=public.PVA_Test
                    boosting=GOSS objective=binary deterministic;
   input StatusCat96NK DemHomeOwner DemCluster / level=nominal;
   input GiftCnt36 GiftCntAll GiftCntCard36 GiftCntCardAll GiftAvgLast
         GiftAvg36 GiftAvgAll GiftAvgCard36 GiftTimeLast GiftTimeFirst
         DemMedIncome / level=interval;
   target Target_B / level=nominal;
   savestate rstore=public.lgbmStore;
   output out=public.PVA_out;
run;

 

Successful execution of the code produces results and output data. The Model Information table provides a brief description of the settings used to create the model, including the boosting method, objective function, and accuracy metric (binary log loss).

 

Edited_03_MS_LGBMResult1-212x300.png

 

The Iteration History table shows how the binary log loss function value changes as the number of trees in the model increases.

 Edited_04_MS_LGBM_Result2-300x212.png

Edited_04a_MS_LGBMRes3-300x178.png

 

For this model, the minimum objective function value for the VALIDATE partition is 0.5277 and occurs at 100 trees, so the validation objective function value is still decreasing at the last tree. Given that, an analyst may want to experiment with increasing the number of trees (along with other parameters and options) to further improve the performance of the model. The PVA_Out table can be accessed to display the scoring results for the training data.

 

Edited_05_MS_LBGM.png

 

The generated columns P_TARGET_B0 and P_TARGET_B1 contain the predicted probabilities for the corresponding levels of the target variable TARGET_B, and the generated column I_TARGET_B contains the predicted label.
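
Because the SAVESTATE statement saved the model as an analytic store in public.lgbmStore, you can score new data with the ASTORE procedure. The following is a minimal sketch, assuming a hypothetical in-memory table public.PVA_New that contains new individuals with the same input variables:

proc astore;
   score data=public.PVA_New rstore=public.lgbmStore
         out=public.PVA_New_Scored;
run;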

 

Not comfortable writing code to develop models? You can also fit a LightGBM model by using Model Studio. For more details, see this article: LIGHTGBM in SAS Model Studio.

 

 

Find more articles from SAS Global Enablement and Learning here.
