I have a question regarding the gradient boosting tree algorithm. I understand that a simple tree is built on only a randomly selected subsample of the full data set (random, without replacement). Each consecutive tree is built on the prediction residuals (from all previous trees) of an independently drawn random sample. If an observation is incorrectly classified, then the observation receives more weight in the next iteration so that it gets properly classified.
My question: how are thousands of trees built if each tree is built on different observations altogether (random without replacement)? For example, I have a data set consisting of 20,000 records. If I take 0.5 as the fraction of the training set observations randomly selected to propose the next tree in the expansion, how would I be able to create more than 2 trees? My understanding: 50% of the data set is used in building the first tree and the remaining 50% in building the next tree, since samples are drawn without replacement. I apologize if this sounds lame. I am very confused about the tree-building process.
Hey Ujjawal,
To answer your specific question: at each step you use a random sample without replacement, but that does not mean the number of available observations decreases. In your example of 20,000 observations and a 0.5 training fraction, at each step you re-weight all 20,000 observations but use only 10,000 of them to train that step's tree. This number stays constant throughout the entire boosting exercise. You sample 10,000 observations without replacement every time, but you still have all 20,000 to sample from at each step.
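Here is a minimal sketch of just the sampling logic in Python (NumPy), to make the point concrete; the counts match your example, and the tree-fitting step is left as a comment since only the sampling is being illustrated:

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs = 20_000   # full training set, available at every boosting step
frac = 0.5       # training fraction used to grow each tree
n_trees = 1_000  # thousands of trees are no problem

for step in range(n_trees):
    # Draw 10,000 of the 20,000 indices without replacement.
    # The pool is NOT reduced between steps: every draw starts
    # again from the full 20,000 observations.
    idx = rng.choice(n_obs, size=int(frac * n_obs), replace=False)
    # ...fit the next tree to the residuals of the observations in idx...
```

The key point is that `replace=False` applies only within a single draw; nothing is removed from the pool between draws.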
Further discussion about the training sample
You have the right idea. At each step of boosting, a new tree is trained on a different sample of re-weighted observations (whose weights are based on the residuals). Jerome Friedman tried different training-fraction values, including no sampling at all, and found experimentally that "stochastic" boosting was more accurate than plain boosting. Instead of using the entire re-weighted data set to train each tree (the weak classifier), he tried several training fractions and found that both small and large data sets showed an improvement in error for training fractions in the range [0.4, 0.8].
Take a look at J. Friedman's paper, Stochastic Gradient Boosting. He goes into more detail about stochastic gradient boosting for regression. Look at figures 1 and 2 and their discussion.
Also try gradient boosting models in SAS Enterprise Miner with different values of the training sample. You might find a case where a training sample of 100% (no subsampling) gives you a better model, but those cases are rare! The default training sample of 0.6 usually works well with the default maximum depth of 2.
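If you want to run the same experiment outside of SAS Enterprise Miner, here is a rough sketch using scikit-learn's GradientBoostingClassifier (my choice of tool for illustration, not what Friedman or the SAS node uses); its `subsample` parameter plays the role of the training fraction, with 1.0 being plain boosting and anything below 1.0 giving stochastic gradient boosting:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# A synthetic stand-in for your 20,000-record data set.
X, y = make_classification(n_samples=20_000, n_features=20, random_state=0)

# subsample=1.0 is plain boosting; values below 1.0 turn on
# stochastic gradient boosting (sampling without replacement).
for subsample in (0.4, 0.6, 0.8, 1.0):
    model = GradientBoostingClassifier(
        n_estimators=200, max_depth=2, subsample=subsample, random_state=0
    )
    score = cross_val_score(model, X, y, cv=3).mean()
    print(f"subsample={subsample:.1f}: mean CV accuracy={score:.4f}")
```

On most data sets you should see the fractions in the [0.4, 0.8] range do at least as well as 1.0, mirroring the experiments in the paper.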
If I had to rewrite the paper Leveraging Ensemble Models in SAS® Enterprise Miner™, I would definitely beef up the discussion of boosting versus stochastic gradient boosting.
I hope this helps!
-Miguel