
LightGBM in SAS Model Studio


With the 2022.10 release (October 2022) of Model Studio, the very popular LightGBM gradient boosting framework is now available as a supervised learning algorithm in the Gradient Boosting node.  LightGBM is an open-source gradient boosting package developed by Microsoft and first released in 2016.

Because LightGBM is a variant of gradient boosting and shares many of its properties, the algorithm has been integrated into Model Studio's existing Gradient Boosting node.  The following image shows a pipeline with the Gradient Boosting node.  The Perform LightGBM checkbox in the node properties enables the LightGBM algorithm.  When you select Perform LightGBM, the node displays the available LightGBM properties.

[Image: Model Studio pipeline with the Gradient Boosting node selected, showing the Perform LightGBM property]

Clicking the Run pipeline button at the top executes the Gradient Boosting node.  Under the covers, the node runs the SAS LIGHTGRADBOOST procedure, which calls the lightGradBoost.lgbmTrain CAS action to run LightGBM.  This trains the LightGBM model with the options that you specified and produces training and assessment reports in the results.  When you right-click the Gradient Boosting node and select Results, the training reports are displayed on the Node tab.  One of those reports is the Iteration History report, a line plot illustrating how training and validation accuracy change as the boosting iterations (number of trees) increase.  Note that the right-hand pane provides an automated description to help you interpret the plot.

[Image: Iteration History report on the Node tab of the Gradient Boosting node results]
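As noted above, the node drives the lightGradBoost.lgbmTrain CAS action.  If you want to experiment with that action directly, the sketch below is a minimal, hypothetical example: the table and variable names are invented, and the parameter names (table=, target=, inputs=, nominals=) are assumptions based on the conventions of other SAS Viya training actions, so verify them against the lightGradBoost action set documentation before use.

   cas mySession;                                 /* start a CAS session */
   proc cas;
      /* Hypothetical call to the lgbmTrain action; parameter names are */
      /* assumptions modeled on other SAS Viya training actions.        */
      lightGradBoost.lgbmTrain /
         table={name="hmeq", caslib="casuser"},   /* training table     */
         target="bad",                            /* target variable    */
         inputs={"loan", "mortdue", "value", "reason", "job"},
         nominals={"bad", "reason", "job"};       /* nominal variables  */
   run;
   quit;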

An additional report is the Training Code report, which contains the PROC LIGHTGRADBOOST training code.  You can use this as example syntax for training your own LightGBM models in SAS Studio.

[Image: Training Code report showing the PROC LIGHTGRADBOOST training code]
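The sketch below gives a rough idea of the shape of that syntax.  It is not the node-generated code: the data set and variable names are hypothetical (borrowed from the familiar HMEQ example), and the SAVESTATE statement is an assumption based on other SAS Viya modeling procedures that save the trained model as an analytic store for scoring.

   /* Hypothetical PROC LIGHTGRADBOOST sketch, not the node-generated code. */
   proc lightgradboost data=mycas.hmeq;
      target bad / level=nominal;                  /* binary target         */
      input loan mortdue value / level=interval;   /* interval inputs       */
      input reason job / level=nominal;            /* nominal inputs        */
      /* assumed statement: save the model as an analytic store */
      savestate rstore=mycas.lgbm_astore;
   run;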

Clicking the Assessment tab in the results brings up a handful of model assessment reports.  These reports assess the LightGBM model against all available data partitions (Train, Validate, and Test) and are the standard assessment reports generated for any supervised learning node in Model Studio.

[Image: Assessment reports in the Gradient Boosting node results]
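Outside of Model Studio, comparable fit statistics, ROC, and lift measures can be produced with the ASSESS procedure.  The sketch below is a hedged example against a scored table: the table name, target (bad), and predicted probability columns (p_bad1, p_bad0) are all hypothetical, so substitute the names from your own scored output.

   /* Hypothetical assessment of a scored table; names are invented. */
   proc assess data=mycas.hmeq_scored nbins=20;
      target bad / level=nominal event="1";    /* event of interest       */
      input p_bad1;                            /* P(bad=1) from the model */
      fitstat pvar=p_bad0 / pevent="0";        /* P(bad=0) for fit stats  */
   run;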

If you selected post-training node properties to produce one or more Model Interpretability reports, they are displayed when you click the Model Interpretability tab in the results.  Gradient boosting models, while highly accurate, are not very interpretable, which makes these reports important for understanding the LightGBM model.  The reports displayed here include Surrogate Variable Importance, PD and ICE Plots (partial dependence and individual conditional expectation), LIME Explanations (local interpretable model-agnostic explanations), and HyperSHAP Values (Shapley values).

[Image: Model Interpretability reports in the Gradient Boosting node results]

After exiting the node results, you can view and compare model performance across pipelines by clicking the Pipeline Comparison tab.  Shown here are two LightGBM models that you can compare, with a flag identifying the champion model.  You can score new data by selecting your model and choosing Score holdout data from the Project pipeline menu (three vertical dots) at the top.

[Image: Pipeline Comparison tab with two LightGBM models and a champion flag]
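The Score holdout data action can also be reproduced programmatically.  Assuming the trained model was saved as an analytic store (for example, mycas.lgbm_astore, as in the earlier sketch), the ASTORE procedure scores a new table; all table names here are hypothetical.

   /* Score a holdout table with the saved analytic store (names invented). */
   proc astore;
      score data=mycas.hmeq_holdout        /* new data to score             */
            rstore=mycas.lgbm_astore       /* trained LightGBM model store  */
            out=mycas.hmeq_holdout_scored  /* scored output table           */
            copyvars=(bad);                /* keep the target for assessment*/
   run;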

You can also do a side-by-side assessment comparison by selecting both models and clicking Compare at the top, which produces assessment plots that include both models.  You can then register your model in SAS Model Manager by selecting Register models from the Project pipeline menu.  Once registered, you can maintain and track the performance of your model in Model Manager, in addition to publishing it for deployment (you can also publish your model directly from the Project pipeline menu).

Given its popularity and wide usage, providing LightGBM as a modeling algorithm within Model Studio increases the breadth of modeling options available to Model Studio users.  With the power of Model Studio, LightGBM users will appreciate the ease with which assessment and model interpretability reports can be generated, models can be compared, and models can be registered and published for deployment into production.


Appendix

Below are descriptions of the LightGBM-specific properties in the Gradient Boosting node, with the corresponding open-source parameters in parentheses.  A sketch that applies several of these properties at their default values follows the lists.

Basic Options

  • Boosting type (boosting) – A selector to choose the type of boosting algorithm to execute.
    • Gradient boosting decision tree (gbdt) – This is the traditional gradient boosting method.  Default.
    • Dropouts additive regression trees (dart) – Mutes the effect of, or drops, one or more trees from the ensemble of boosted trees.  This is effective in preventing over-specialization.
    • Gradient-based one-side sampling (goss) – Retains data instances with large gradients, or large training error, and down-samples instances with small gradients, or small training error.
  • Number of trees (num_iterations) – The number of boosting iterations.  The default value is 100.
  • Learning rate (learning_rate) – The rate at which the gradient descent method converges to the minimum of the loss function.  The default value is 0.1.
  • Bagging frequency rate (bagging_freq) – The iteration frequency at which the training data is sampled.  As an example, for a value of 5, the data is sampled before training begins, and then after every five iterations.  Sampling is enabled for a value greater than 0.  The default value is 0.
  • Bagging fraction rate (bagging_fraction) – The fraction of the training data that is sampled when sampling is enabled (Bagging frequency rate > 0).  This option is hidden until a Bagging frequency rate greater than 0 is entered.  A value less than 1 is required.  The default value is 0.5.
  • L1 regularization (lambda_l1) – In a regression model, a regularization parameter (lambda) which is applied to the absolute value of the coefficient in the penalty term to the loss function.  The default value is 0.
  • L2 regularization (lambda_l2) – In a regression model, a regularization parameter (lambda) which is applied to the squared value of the coefficient in the penalty term to the loss function.  The default value is 1.
  • Interval target objective function (objective) – A selector to choose the objective loss function for an interval target.
    • Fair loss (fair)
    • Gamma (gamma)
    • Huber loss (huber)
    • L1 regression (MAE) (regression_l1)
    • L2 regression (MSE) (regression) – Default
    • Mean absolute percentage error (mape)
    • Poisson (poisson)
    • Quantile (quantile)
    • Tweedie (tweedie)
  • Nominal target objective function (objective) – A selector to choose the objective loss function for a nominal target.  For a binary target, the binary log loss function is used.
    • Multinomial logistic regression (multiclass) – Default
    • One vs. rest classification (multiclassova)
  • Ensure deterministic results across job executions (deterministic) – A checkbox to enable deterministic results for the same data and parameters.  Not selected by default.
  • Seed (seed) – The value used to generate random numbers for data sampling.  The default value is 12345.

Tree-splitting Options

  • Maximum depth (max_depth) – The maximum number of generations of nodes, where generation 0 is the root node.  The default value is 4.
  • Minimum leaf size (min_data_in_leaf) – The minimum number of training observations in a leaf.  The default value is 5.
  • Use missing values (use_missing) – A checkbox to enable the handling of missing values.  Selected by default.
  • Number of interval bins (max_bin) – The maximum number of bins for an interval input.  The default value is 50.
  • Proportion of inputs to consider per tree (feature_fraction) – Proportion of inputs randomly sampled for use per tree.  The default value is 1.
  • Maximum class levels (max_cat_threshold) – The maximum number of levels for a class input.  The default value is 128.
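
To make the mapping concrete, the sketch below passes several of these properties at their documented defaults in a single training call.  The PROC option names are assumptions inferred from the property names above (the node's Training Code report shows the authoritative syntax), and the data set and variables are hypothetical.

   /* Hedged sketch: option names are assumptions inferred from the    */
   /* properties above; confirm syntax in the Training Code report or  */
   /* the LIGHTGRADBOOST procedure documentation.                      */
   proc lightgradboost data=mycas.hmeq
         boosting=gbdt          /* Boosting type (boosting)             */
         numIterations=100      /* Number of trees (num_iterations)     */
         learningRate=0.1       /* Learning rate (learning_rate)        */
         lambdaL1=0             /* L1 regularization (lambda_l1)        */
         lambdaL2=1             /* L2 regularization (lambda_l2)        */
         maxDepth=4             /* Maximum depth (max_depth)            */
         minDataInLeaf=5        /* Minimum leaf size (min_data_in_leaf) */
         seed=12345;            /* Seed (seed)                          */
      target bad / level=nominal;
      input loan mortdue value / level=interval;
      input reason job / level=nominal;
   run;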
