
Efficiently Modeling Interval Targets Using Bayesian Additive Regression Trees - Part 1


This article provides a quick summary of Bayesian Additive Regression Trees (BART) models. We discuss the mathematical model, the regularization priors, and the Bayesian backfitting Markov chain Monte Carlo (MCMC) algorithm.

 

Decision trees are a nonparametric supervised learning method used for both classification and regression tasks. A classification tree models a categorical response, and a regression tree models a continuous response. In recent years, tree ensemble methods have become a popular choice for both regression and classification problems; algorithms such as gradient boosting and random forests are widely used. Another ensemble method, Bayesian additive regression trees (BART) by Chipman et al. (2010), is gaining momentum due to its flexibility in dealing with interactions and nonlinear effects. It can be applied to both regression and classification problems and yields competitive results when compared to other predictive models.

 

Bayesian Additive Regression Trees (BART)

 

The BART model is a flexible Bayesian nonparametric approach to regression that consists of a sum of multiple decision trees. The model uses regularization to constrain the trees to be weak learners, which limits the contribution of each tree to the ensemble prediction.

 

The model is fitted using a Bayesian backfitting Markov chain Monte Carlo (MCMC) algorithm to generate posterior samples of the sum-of-trees ensemble. The BART model starts with a fixed number of trees that it iteratively updates to draw many samples of the ensemble. The final model makes predictions by averaging the predictions across all saved posterior samples.

 

The Model

 

Consider a prediction problem with a continuous response (target). Given data of size n, with a continuous target Y and p covariates X = (x1, x2, …, xp), BART estimates a function f(X) from models of the form Y = f(X) + ε, where ε ~ N(0, σ2).

 

To estimate f(X), a sum of regression trees is specified as

f(x) = ∑j=1,…,m g(x; Tj, Mj)     (Equation 1)

 

In Equation (1), Tj is the jth binary tree structure, and Mj = {µ1j, …, µkjj} is the vector of terminal node (leaf) parameters associated with Tj. The function g(x; Tj, Mj) assigns the leaf parameter µkj ∈ Mj to x according to the splitting rules in Tj, and the constant m represents the number of trees.

 

First, we quickly revisit the basic structure of a decision tree model. A decision tree is read from the top down, starting at the root node. Each internal node represents a split based on the values of one of the inputs. An input can appear in any number of splits throughout the tree. A case moves down the branch that contains its input value.

 

In a binary tree with interval inputs, each internal node is a simple inequality. A case moves left if the inequality is true and right otherwise. The terminal nodes of the tree are called leaves, and they hold the predicted target values. All cases reaching a particular leaf are given the same predicted value. The leaves of the decision tree partition the input space into rectilinear regions, so the fitted regression tree model is a multivariate step function. A step function is highly flexible and is capable of modeling nonlinear trends. To simplify the understanding of the sum-of-trees model, consider an example with 2 trees and 3 inputs, i.e., m = 2 and p = 3. Let the structure of the 2 trees be as shown in Figure 1.

 

02_MS_Sum-of-reg-300x140.png

 Figure 1

 


 

In this example, the final prediction is calculated by adding up the values of µkj from all m (= 2) trees. So, to predict the value of Y for the ith case, whose X1, X2, and X3 values are 0.59, 6.5, and 4.9 respectively, we apply the sum-of-trees model and allocate a sum of leaf parameters to case i. For these values of the X's, regression tree 1 assigns the leaf parameter µ31 and regression tree 2 assigns µ12. Hence E(Yi|Xi) = µ31 + µ12. Note that the predicted value allocated to the ith case is the sum of the leaf parameters, one from each tree, rather than the mean of the parameter values (µkj's). This is because BART calculates each posterior draw of the regression tree function g(X; Tj, Mj) using a leave-one-out (partial residual) concept, which is discussed shortly.
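
To make the additive prediction concrete, here is a minimal Python sketch. The split rules and leaf values are hypothetical stand-ins (Figure 1 is an image, so its exact rules are not reproduced here); the point is simply that a case's prediction is the sum of one leaf parameter per tree, not their average.

def tree1(x1, x2, x3):
    # Hypothetical regression tree 1: returns one of its leaf parameters
    if x2 < 5.0:
        return 0.20        # leaf mu_11
    elif x1 < 0.5:
        return 0.35        # leaf mu_21
    else:
        return 0.55        # leaf mu_31

def tree2(x1, x2, x3):
    # Hypothetical regression tree 2
    if x3 < 6.0:
        return 0.15        # leaf mu_12
    else:
        return 0.40        # leaf mu_22

def predict(x1, x2, x3):
    # BART's point prediction is the SUM of the leaf parameters assigned by
    # every tree in the ensemble, not their mean
    return tree1(x1, x2, x3) + tree2(x1, x2, x3)

# The case from the text: X1 = 0.59, X2 = 6.5, X3 = 4.9
print(predict(0.59, 6.5, 4.9))   # tree 1 -> mu_31 = 0.55, tree 2 -> mu_12 = 0.15, sum = 0.70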

 

Sampling the posterior: Bayesian Backfitting Markov Chain Monte Carlo (MCMC) algorithm

 

The Markov chain Monte Carlo (MCMC) method is a general simulation method for sampling from posterior distributions and computing posterior quantities of interest. MCMC methods sample successively from a target distribution; each sample depends on the previous one, hence the notion of a Markov chain. The backfitting procedure is a modular way of fitting an additive model: it cycles through the predictors, replacing each current function estimate with a new function obtained by smoothing the partial residual on that predictor.

Given a training data set, a Bayesian backfitting MCMC algorithm is used to draw samples from the posterior distribution (Chipman, George, and McCulloch 2010). This algorithm is a form of Gibbs sampling. Let T-j denote the tree structures of the m−1 trees in the ensemble, excluding the jth tree, and likewise let M-j denote the set of all leaf parameters, excluding those of the jth tree. As described by Chipman et al. (2010), the draws of the tree structure Tj and leaf parameters Mj are simplified by observing that their conditional distribution depends on T-j, M-j, and Y only through the partial residuals of the fit that excludes the jth tree. Let Rj = Y − ∑k≠j g(X; Tk, Mk) denote these partial residuals. You then obtain the samples for Tj and Mj by taking successive draws from

 

Tj|Rj, σ2     

 

Mj|Tj,Rj, σ2   
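
Below is a deliberately simplified, runnable Python sketch of this backfitting cycle. To keep it self-contained, every tree is held at a single root node (so drawing Mj reduces to a conjugate normal update), σ2 is held fixed, and no tree-structure proposals are made; the variable names and toy values are invented for illustration only.

import numpy as np

rng = np.random.default_rng(0)

# Toy data (made up for this sketch)
n, m = 50, 3
y = rng.normal(loc=2.0, scale=0.5, size=n)
sigma2 = 0.25                   # residual variance, held fixed here for brevity
sigma_mu2 = 0.1                 # prior variance of a leaf parameter
mu = np.full(m, y.mean() / m)   # every tree starts as one root node with mu = ybar / m

for iteration in range(4):
    for j in range(m):
        # Partial residual R_j: response minus the fit of every tree except tree j
        R_j = y - (mu.sum() - mu[j])
        # Draw the leaf parameter M_j | T_j, R_j, sigma2 (conjugate normal update)
        post_var = 1.0 / (n / sigma2 + 1.0 / sigma_mu2)
        post_mean = post_var * R_j.sum() / sigma2
        mu[j] = rng.normal(post_mean, np.sqrt(post_var))
    print(iteration + 1, mu.round(3), round(mu.sum(), 3))

Even in this stripped-down form you can see the defining mechanic: each tree is refit to the partial residual left over by the other m − 1 trees.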

 

The draws of the tree structure Tj are obtained using the Metropolis-Hastings (MH) sampling algorithm described by Chipman et al. (1998). This algorithm proposes one of four operations that modify the tree structure: splitting a terminal node, pruning a pair of terminal nodes, changing the splitting rule of a nonterminal node, or swapping a splitting rule between a parent node and a child node. The BART procedure (in SAS Viya) considers only the splitting (grow) and pruning operations. In a grow step, a terminal node is randomly selected and is split into two new nodes, with the splitting variable and splitting point drawn from uniform distributions. In a prune step, a parent of two leaf (terminal) nodes is randomly chosen and its child nodes are removed.
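
The following Python sketch illustrates only the proposal mechanics on a toy nested-dict tree; the representation and helper names are invented for illustration, and the MH acceptance probability that decides whether a proposal is kept is omitted.

import random
random.seed(1)

def leaves(node):
    # All terminal (leaf) nodes of a nested-dict tree
    if "split" not in node:
        return [node]
    return leaves(node["left"]) + leaves(node["right"])

def prunable(node):
    # Internal nodes whose children are both leaves (candidates for a prune step)
    if "split" not in node:
        return []
    if "split" not in node["left"] and "split" not in node["right"]:
        return [node]
    return prunable(node["left"]) + prunable(node["right"])

def propose(tree, p, ranges):
    # Propose a modified tree: grow a random leaf or prune a random pair of leaves
    if random.random() < 0.5 or not prunable(tree):
        node = random.choice(leaves(tree))              # GROW: pick a leaf uniformly
        var = random.randrange(p)                       # splitting variable ~ uniform
        lo, hi = ranges[var]
        node["split"] = (var, random.uniform(lo, hi))   # splitting point ~ uniform
        node["left"], node["right"] = {}, {}
    else:
        node = random.choice(prunable(tree))            # PRUNE: remove its two leaf children
        del node["split"], node["left"], node["right"]
    return tree

# Start from a single root node and propose a few structure changes
tree = {}
for _ in range(5):
    tree = propose(tree, p=3, ranges=[(0, 1), (0, 10), (0, 10)])
print(tree)

In the real sampler, each proposed tree is accepted or rejected according to an MH acceptance probability computed from the partial residuals, so a rejected proposal leaves the tree unchanged.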

 

How about taking a simple example to understand this mechanism? In this example we consider a continuous response variable (Y) and three inputs X = (x1, x2, x3). We run the algorithm with three regression trees (m = 3) for 4 iterations. The discussion that follows presents the regression tree structures as you go through each MCMC step. The process begins by initializing the three regression trees to single root nodes. The parameter initialized for each of these nodes is μ = ȳ/m = ȳ/3.

 

03_MS_Fig2_1-300x65.png

Figure 2

 

Now let us see how the tree structures are drawn for each regression tree in the first iteration, i.e., after the initialization step. Note that the order in which the regression trees are updated does not matter, so we may start with any tree. Here we start by determining the first regression tree (T1, M1). We calculate the partial residual R1 that excludes the first tree (T1), i.e., R1 = Y − ∑j≠1 g(X; Tj, Mj) = Y − [g(X; T2, M2) + g(X; T3, M3)] = Y − 2ȳ/3.

 

The MH algorithm proposes a new tree structure. Let T1* denote the sampled modification of the tree structure T1; an acceptance probability then determines whether T1* is accepted or rejected. If T1* is accepted, T1 is updated to T1*; otherwise nothing changes for T1. In our example, T1* was not accepted in the first iteration, so the tree structure remains a single root node (refer to Iteration 1, Tree j=1 below). Next, a draw is taken for the set of leaf parameters M1, and the algorithm updates M1 based on the current tree structure T1. This continues: the partial residual Rj+1 is then computed for sampling the (j+1)th tree structure Tj+1 and the (j+1)th set of leaf parameters Mj+1. To determine (T2, M2), the algorithm calculates

R2 = Y − ∑j≠2 g(X; Tj, Mj) = Y − [g(X; T1, M1) + g(X; T3, M3)] = Y − g(X; T1, M1) − ȳ/3, where g(X; T1, M1) is now the updated leaf parameter for T1 (still a single root node) and g(X; T3, M3) is still the initial value ȳ/3. Again, the MH algorithm is used to propose a new tree structure T2* for T2. The partial residual R2 is used to calculate the acceptance probability and decide whether T2* should be accepted or rejected. In our example, T2* was accepted, and therefore the new tree structure is used in iteration 1 (refer to Iteration 1, Tree j=2 below). The leaf parameters M2 are then updated based on the new tree structure T2*. Next, R3 is used to determine a new structure for T3.

 

R3 now takes the form R3 = Y − [g(X; T1, M1) + g(X; T2, M2)], where g(X; T1, M1) is the updated leaf parameter for T1 and g(X; T2, M2) uses the newly accepted tree structure T2* with its updated leaf parameters.

 

As seen in Figure 3, T3* was not accepted, hence the tree structure remains the same, i.e., a single root node.

 

04_MS_Fig3_1-300x92.png

Figure 3 

 

 

05_MS_Iteration2-300x145.png

Figure 4 

 

 

06_MS_Iteration3-300x167.png

Figure 5

 

 

07_MS_Iteration4-300x157.png

Figure 6 

 

Figures 2 through 6 depict the iterations from initialization through iteration 4 for the three regression trees. Note how the regression trees change at each iteration: each tree may increase or decrease its number of terminal nodes by one. This iterative process first runs for a burn-in period (by default, 100 iterations in SAS Model Studio) before the main simulation loop. Burn-in refers to the practice of discarding an initial portion of a Markov chain sample so that the effect of the initial values on the posterior inference is minimized. It is the samples from the main simulation loop that are saved for prediction and for computing posterior statistics.
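
In code, this post-processing amounts to dropping the burn-in draws and averaging the rest; here is a tiny sketch with made-up numbers standing in for the per-iteration ensemble predictions.

import numpy as np
rng = np.random.default_rng(0)

# Stand-in for per-iteration ensemble predictions: shape (n_iterations, n_cases)
draws = rng.normal(loc=2.0, scale=0.1, size=(1100, 5))

burn_in = 100                          # default burn-in length in SAS Model Studio
kept = draws[burn_in:]                 # only main-loop samples are kept
posterior_mean = kept.mean(axis=0)     # final point prediction for each case
print(posterior_mean)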

 

Regularization prior

 

The BART model requires you to specify a prior for the tree structure, the leaf (terminal node) parameters, and the variance parameter σ2. The use of a regularization prior enforces the so-called weak learner property on the individual trees. This approach aligns with the idea that many weak learners combined perform much better than a single strong model that requires careful tuning in order to perform well.
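
To make the weak-learner idea concrete, the sketch below evaluates the default regularization priors reported by Chipman et al. (2010): the probability that a node at depth d splits is α(1 + d)^(−β), and each leaf parameter gets a N(0, σµ2) prior with σµ = 0.5/(k√m) after the response is rescaled to [−0.5, 0.5]. The defaults shown (α = 0.95, β = 2, k = 2) are the ones reported in that paper; the defaults used by SAS may differ.

import math

alpha, beta = 0.95, 2.0
for d in range(4):
    # P(node at depth d splits) = alpha * (1 + d)**(-beta): deep trees are penalized
    print(d, round(alpha * (1 + d) ** (-beta), 3))

m, k = 50, 2.0
sigma_mu = 0.5 / (k * math.sqrt(m))    # leaf-parameter prior sd; more trees => stronger shrinkage
print(round(sigma_mu, 4))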

 

At this point you have a fair idea about the BART model and the underlying algorithm. Now, you must be excited about fitting and exploring a BART model. In the next post, I will discuss and demonstrate how to train a BART model in SAS Model Studio.

 

References:

 

  1. Bayesian additive regression trees and the General BART model
  2. SAS Documentation: Sampling the Posterior details
  3. Chipman, George, and McCulloch (1998)

 

 

Find more articles from SAS Global Enablement and Learning here.

