This article provides a quick summary of Bayesian Additive Regression Trees (BART) models. We discuss the mathematical model, the regularization priors, and the Bayesian backfitting Markov chain Monte Carlo (MCMC) algorithm.
Decision trees are a nonparametric supervised learning method used for both classification and regression tasks. A classification tree models a categorical response, and a regression tree models a continuous response. Ensemble-of-trees methods, such as gradient boosting and random forests, are now widely used for both regression and classification problems. Another ensemble method, Bayesian additive regression trees (BART) by Chipman et al. (2010), has gained momentum in recent years because of its flexibility in dealing with interactions and nonlinear effects. It can be applied to both regression and classification problems and yields competitive results compared to other predictive models.
Bayesian Additive Regression Trees (BART)
The BART model is a flexible Bayesian nonparametric approach to regression that consists of a sum of multiple decision trees. The model uses regularization to constrain each tree to be a weak learner, limiting its contribution to the ensemble prediction.
The model is fitted using a Bayesian backfitting Markov chain Monte Carlo (MCMC) algorithm that generates posterior samples of the sum-of-trees ensemble. The BART model starts with a fixed number of trees and iteratively updates them to draw many samples of the ensemble. The final model makes predictions by averaging the predictions across all saved posterior samples.
The Model
Consider a prediction problem with a continuous response (target). Given data of size n, with a continuous target Y and p covariates X = (x_1, x_2, ..., x_p), BART estimates a function f from a model of the form Y = f(X) + ε, where ε ~ N(0, σ²).
To estimate f(X), a sum of regression trees is specified as

f(x) = ∑_{j=1}^{m} g(x; T_j, M_j)    (Equation 1)

In Equation (1), T_j is the j-th binary tree structure and M_j = {µ_{1j}, . . ., µ_{k_j j}} is the vector of terminal-node parameters associated with T_j, where k_j is the number of leaves of tree j. The constant m represents the number of trees.
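To make the setup concrete, here is a minimal sketch that simulates data of this form in Python. The particular choice of f, the sample size, and the noise level are hypothetical and serve only to illustrate the model Y = f(X) + ε.

```python
import numpy as np

rng = np.random.default_rng(42)

n, p = 500, 3                            # sample size and number of covariates
X = rng.uniform(0, 10, size=(n, p))

# A hypothetical nonlinear f with an interaction, chosen only for illustration;
# BART would attempt to recover a function like this from (X, Y) alone.
f = np.sin(X[:, 0]) + 0.5 * X[:, 1] * (X[:, 2] > 5)

sigma = 1.0                              # noise standard deviation
Y = f + rng.normal(0.0, sigma, size=n)   # Y = f(X) + eps, eps ~ N(0, sigma^2)
```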
First, let us quickly revisit the basic structure of a decision tree model. A decision tree is read from the top down, starting at the root node. Each internal node represents a split based on the value of one of the inputs. An input can appear in any number of splits throughout the tree. A case moves down the branch that matches its input value.
In a binary tree with interval inputs, each internal node is a simple inequality. A case moves left if the inequality is true and right otherwise. The terminal nodes of the tree are called leaves, and they represent the predicted target. All cases reaching a particular leaf are given the same predicted value. The leaves of the decision tree partition the input space into rectilinear regions, so the fitted regression tree model is a multivariate step function. A step function is highly flexible and is capable of modeling nonlinear trends. To simplify the understanding of the sum-of-trees model, consider an example with 2 trees and 3 inputs, that is, m = 2 and p = 3. Let the structure of the 2 trees be as shown in Figure 1.
Figure 1
In this example, the final prediction is calculated by adding up the leaf values from all m (= 2) trees. So, to predict the value of Y for an i-th case whose X_1, X_2, and X_3 values are 0.59, 6.5, and 4.9, respectively, we apply the sum-of-trees model and allocate to case i a sum of leaf parameters. For the given values of the X's, regression tree 1 yields the predicted value µ_{31} and regression tree 2 yields µ_{12}. Hence E(Y_i | X_i) = µ_{31} + µ_{12}. Note that the predicted value allocated to the i-th case is the sum of the leaf parameters across the trees rather than their mean. This is because BART fits each regression tree function g(X; T_j, M_j) to a partial residual that leaves that tree out of the fit, a mechanism discussed shortly.
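The leaf-value bookkeeping can be sketched in a few lines of Python. Because Figure 1 is not reproduced in code here, the split rules below are hypothetical stand-ins; the point is only that the prediction for a case is the sum of one leaf parameter per tree. Under these made-up splits, the case x = (0.59, 6.5, 4.9) lands in leaf 3 of tree 1 and leaf 1 of tree 2, matching the narrative above.

```python
# Hypothetical leaf parameters mu_kj (k = leaf, j = tree), for illustration only
mu = {(3, 1): 2.3, (1, 1): 0.8, (2, 1): 1.1,    # tree 1 leaves
      (1, 2): -0.4, (2, 2): 0.9}                # tree 2 leaves

def tree1(x):
    # Hypothetical splits that route x = (0.59, 6.5, 4.9) to leaf mu_{31}
    if x[0] < 0.5:
        return mu[(1, 1)]
    return mu[(2, 1)] if x[1] < 6.0 else mu[(3, 1)]

def tree2(x):
    # Hypothetical split that routes the same case to leaf mu_{12}
    return mu[(1, 2)] if x[2] < 5.0 else mu[(2, 2)]

x_i = (0.59, 6.5, 4.9)
print(tree1(x_i) + tree2(x_i))   # E(Y_i | X_i) = mu_{31} + mu_{12} = 1.9
```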
Sampling the posterior: Bayesian Backfitting Markov Chain Monte Carlo (MCMC) algorithm
The Markov chain Monte Carlo (MCMC) method is a general simulation method for sampling from posterior distributions and computing posterior quantities of interest. MCMC methods sample successively from a target distribution, with each sample depending on the previous one, hence the notion of the Markov chain. The backfitting procedure is a modular way of fitting an additive model: it cycles through the predictors, replacing each current function estimate with a new function derived from smoothing a partial residual on each predictor.

Given a training data set, a Bayesian backfitting MCMC algorithm is used to draw samples from the posterior distribution (Chipman, George, and McCulloch 2010). This algorithm is a form of Gibbs sampling. Let T_{-j} denote the tree structures of the m − 1 trees in the ensemble, excluding the j-th tree, and let M_{-j} denote the corresponding sets of leaf parameters. As described by Chipman et al. (2010), the draws of the tree structure T_j and leaf parameters M_j are simplified by observing that their conditional distribution depends on T_{-j}, M_{-j}, and Y only through the partial residuals of the fit that excludes the j-th tree. Let R_j = Y − ∑_{k≠j} g(X; T_k, M_k) denote this partial residual. You then obtain the samples for T_j and M_j by taking successive draws from
T_j | R_j, σ²
M_j | T_j, R_j, σ²
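To see how these two conditionals drive the sampler, here is a deliberately simplified, runnable Python sketch of one backfitting sweep. It is not the SAS implementation: to keep it short, every tree is frozen as a single root node (so each M_j is one scalar and the T_j draw is skipped), and the hyperparameter values are arbitrary. The conjugate normal draw for the leaf mean and the scaled inverse chi-square draw for σ² follow the standard forms.

```python
import numpy as np

rng = np.random.default_rng(0)

n, m = 200, 3
Y = rng.normal(2.0, 1.0, size=n)   # toy response
mu = np.zeros(m)                   # leaf parameter of each (root-node) tree
sigma2 = 1.0                       # current draw of the error variance
sigma_mu2 = 0.5                    # prior variance of each leaf parameter (arbitrary)
nu, lam = 3.0, 1.0                 # hyperparameters of the sigma^2 prior (arbitrary)

for it in range(200):
    for j in range(m):
        # Partial residual R_j: remove the fit of every tree except tree j
        R_j = Y - (mu.sum() - mu[j])
        # Draw M_j | T_j, R_j, sigma2: conjugate normal update of the leaf mean.
        # (The draw of T_j | R_j, sigma2 via Metropolis-Hastings is omitted here.)
        prec = n / sigma2 + 1.0 / sigma_mu2
        mean = (R_j.sum() / sigma2) / prec
        mu[j] = rng.normal(mean, np.sqrt(1.0 / prec))
    # Draw sigma2 from its scaled inverse chi-square full conditional
    resid = Y - mu.sum()
    sigma2 = (nu * lam + (resid ** 2).sum()) / rng.chisquare(nu + n)

print(mu.sum(), sigma2)   # the summed leaf values should sit near mean(Y), about 2
```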
The draws of the tree structure T_j are obtained using the Metropolis-Hastings (MH) sampling algorithm described by Chipman et al. (1998). This algorithm considers four operations that modify the tree structure: splitting a terminal node, pruning a pair of terminal nodes, changing the splitting rule of a nonterminal node, and swapping a splitting rule between a parent node and a child node. The BART procedure in SAS Viya considers only the splitting (grow) and pruning operations. In the growing process, a terminal node is randomly selected and split into two new nodes, with the splitting variable and splitting point drawn uniformly. During a prune step, a parent of two terminal (leaf) nodes is randomly chosen, and its child nodes are removed.
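A sketch of the grow/prune proposal mechanism may help. This is an illustrative toy, not SAS code: a tree is a nested dict whose internal nodes hold a splitting variable and a cutpoint, both drawn uniformly, and the MH accept/reject step (which in practice would operate on a copy of the tree) is not shown.

```python
import random

def leaves(t, path=()):
    # Return the paths of all terminal nodes; a leaf is an empty dict
    if "var" not in t:
        return [path]
    return leaves(t["left"], path + ("left",)) + leaves(t["right"], path + ("right",))

def get(t, path):
    for step in path:
        t = t[step]
    return t

def propose(tree, p, rng=random):
    # Randomly pick one of the two moves considered by the BART procedure
    if rng.choice(["grow", "prune"]) == "grow":
        # Grow: split a random terminal node on a uniform variable/cutpoint
        node = get(tree, rng.choice(leaves(tree)))
        node.update(var=rng.randrange(p), cut=rng.random(), left={}, right={})
    else:
        # Prune: pick a parent whose children are both leaves, collapse it
        parents = [pth for pth in {q[:-1] for q in leaves(tree) if q}
                   if "var" not in get(tree, pth + ("left",))
                   and "var" not in get(tree, pth + ("right",))]
        if parents:
            get(tree, rng.choice(parents)).clear()
    return tree

tree = {}                 # start from a single root node, as BART does
for _ in range(10):
    propose(tree, p=3)    # a real sampler would accept/reject each proposal
```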
How about taking a simple example to understand this mechanism? In this example we consider a continuous response variable Y and three inputs X = (x_1, x_2, x_3), and we run the algorithm with three regression trees (m = 3) for 4 iterations. The discussion that follows presents the regression tree structures as you go through each MCMC step. The process begins by initializing the three regression trees to single root nodes, with the parameter of each root node initialized to µ = ȳ/m = ȳ/3.
Figure 2
Now let us see how the tree structures are drawn for each regression tree in the first iteration, that is, after the initialization step. Note that the order in which the trees are processed does not matter, so we may start with any tree. Here we start by determining the first regression tree (T_1, M_1). We calculate the partial residual R_1 that excludes the first tree: R_1 = Y − ∑_{j≠1} g(X; T_j, M_j) = Y − [g(X; T_2, M_2) + g(X; T_3, M_3)] = Y − 2ȳ/3.
The MH algorithm proposes a new tree structure. Let T_1* denote the sampled modification of the tree structure T_1; the algorithm then calculates the probability with which T_1* should be accepted or rejected. If T_1* is accepted, T_1 is updated to T_1*; otherwise nothing changes for T_1. In our example, T_1* was not accepted in the first iteration, so the tree structure remains a single root node (refer to Iteration 1, Tree j = 1 below). Next, a draw is taken for the set of leaf parameters M_1, which the algorithm updates based on the current tree structure T_1. The partial residual R_{j+1} is then updated for sampling the (j+1)-th tree structure T_{j+1} and the (j+1)-th set of leaf parameters M_{j+1}. To determine (T_2, M_2), the algorithm calculates
R_2 = Y − ∑_{j≠2} g(X; T_j, M_j) = Y − [g(X; T_1, M_1) + g(X; T_3, M_3)] = Y − µ̂_1 − ȳ/3, where µ̂_1 is the updated root-node parameter of T_1. Again, the MH algorithm is used to propose a new tree structure T_2* for T_2. The partial residual R_2 is used to calculate the acceptance probability and decide whether T_2* should be accepted or rejected. In our example, T_2* was accepted, so a new tree structure is used in iteration 1 (refer to Iteration 1, Tree j = 2 below), and the leaf parameters M_2 are then updated based on the new tree structure T_2*. Next, R_3 is used to determine a new structure for T_3.
R_3 now takes the form R_3 = Y − [g(X; T_1, M_1) + g(X; T_2, M_2)] = Y − µ̂_1 − g(X; T_2, M_2).
As seen in Figure 3, T_3* was not accepted, so the tree structure of T_3 remains a single root node.
Figure 3
Figure 4
Figure 5
Figure 6
Figures 2 through 6 depict the iterations from initialization through iteration 4 for the three regression trees. Note how the regression trees change at each iteration: each tree may increase or decrease its number of terminal nodes by one. This iterative process runs for a burn-in period (by default, 100 iterations in SAS Model Studio) before the main simulation loop. Burn-in refers to the practice of discarding an initial portion of a Markov chain sample so that the effect of the initial values on the posterior inference is minimized. Only the samples from the main simulation loop are saved for prediction and for computing posterior statistics.
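A final sketch shows how the saved samples are turned into predictions. Each kept draw after burn-in defines one sum-of-trees function; the BART prediction is simply the average of those functions at a new input. Here each draw is abstracted as a Python callable, which is a simplification of the stored tree structures.

```python
import numpy as np

def posterior_mean_prediction(x_new, saved_draws):
    # Average the sum-of-trees fit over all saved posterior draws
    return np.mean([f(x_new) for f in saved_draws])

# Toy usage: three "draws", each a constant ensemble for brevity
saved_draws = [lambda x: 1.9, lambda x: 2.1, lambda x: 2.0]
print(posterior_mean_prediction(0.5, saved_draws))   # -> 2.0
```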
Regularization prior
A BART model requires you to specify a prior for the tree structure, the leaf parameters, and the variance parameter σ². The use of a regularization prior enforces the so-called weak-learner property on the individual trees. This approach aligns with the idea that many weak learners combined often perform much better than a single strong model that requires careful tuning in order to perform well.
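As a concrete illustration of the regularization, Chipman et al. (2010) assign each node at depth d a prior probability α(1 + d)^(−β) of being split, with defaults α = 0.95 and β = 2, and shrink the leaf parameters with a N(0, σ_µ²) prior where σ_µ = 0.5/(k√m) after the response is rescaled. The short sketch below simply evaluates these formulas; the default values shown are the ones reported in that paper.

```python
import math

def split_prob(depth, alpha=0.95, beta=2.0):
    # Prior probability that a node at the given depth splits (Chipman et al. 2010)
    return alpha * (1.0 + depth) ** (-beta)

for d in range(4):
    print(d, round(split_prob(d), 3))   # 0.95, 0.238, 0.106, 0.059: deep trees are rare

# Leaf-parameter prior: mu ~ N(0, sigma_mu^2) with sigma_mu = 0.5 / (k * sqrt(m)),
# so each tree's contribution shrinks as the number of trees m grows (k = 2 default)
m, k = 50, 2.0
sigma_mu = 0.5 / (k * math.sqrt(m))
print(sigma_mu)
```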
At this point you should have a fair idea of the BART model and its underlying algorithm, and you may be eager to fit and explore one yourself. In the next post, I will discuss and demonstrate how to train a BART model in SAS Model Studio.
References:
Tan, Y. V., and Roy, J. (2019). "Bayesian Additive Regression Trees and the General BART Model." Statistics in Medicine.
SAS Documentation: "Sampling the Posterior."
Chipman, H. A., George, E. I., and McCulloch, R. E. (2010). "BART: Bayesian Additive Regression Trees." The Annals of Applied Statistics 4:266–298.
Chipman, H. A., George, E. I., and McCulloch, R. E. (1998). "Bayesian CART Model Search." Journal of the American Statistical Association 93:935–948.