Three questions about Training and Validation Data Sets

WWD · Posted 08-11-2021 03:16 PM

My question pertains to the following subject matter:

Course = AI and Machine Learning Professional

Module= Machine Learning Specialist

Lesson = Lesson 3 Decision Trees and Ensembles of Trees.

The first question: When scoring data using an ensemble of trees, is the entire validation data set scored by each of the individual trees in the ensemble?

The second question is: If my training dataset has 1000 points, will each "bagging" sample (sampling done with replacement) used to build a tree in the ensemble contain 1000 data points, or is this a hyperparameter that the statistician can set within Model Studio? A follow-up question: if my original dataset contains 400 data points, may a bagging sample drawn from the original 400 contain more than 400 points and still be mathematically defensible?

The third question is: Is the only difference between bagging and boosting how the sample is selected for each tree in the ensemble? For bagging, the sampling is with replacement; for boosting, the sampling is based on weights. But in the end, will each method, when applied to the same original dataset of size 500, produce samples of size 500?

Thank you,

Bill Donaldson

AriZitin · Posted 08-11-2021 05:24 PM

These are great questions!

Q1: Yes, this is exactly correct. The data is scored using each tree, and then the predictions are ensembled (averaged for the forest models and combined in a weighted way in the gradient boosting model).
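
To make the Q1 answer concrete, here is a minimal Python sketch (not SAS Viya's implementation). The three toy "trees", the weights, and the function names are all hypothetical; the only point is that every tree scores every validation row before the per-row predictions are combined. The weighted version is a simplified stand-in for how a boosted model combines its trees.

```python
from statistics import mean

# Hypothetical stand-ins for fitted trees: each maps a row to a predicted probability.
def tree_a(row):
    return 0.9 if row["x"] > 5 else 0.2

def tree_b(row):
    return 0.8 if row["x"] > 3 else 0.1

def tree_c(row):
    return 0.7 if row["y"] > 0 else 0.3

def score_forest(trees, rows):
    """Score every row with every tree, then average the predictions per row."""
    return [mean(tree(row) for tree in trees) for row in rows]

def score_weighted(trees, weights, rows):
    """Score every row with every tree, then combine using per-tree weights."""
    return [sum(w * tree(row) for w, tree in zip(weights, trees)) for row in rows]

# A tiny made-up validation set; all three rows go through all three trees.
validation = [{"x": 6, "y": 1}, {"x": 4, "y": -1}, {"x": 1, "y": -2}]
trees = [tree_a, tree_b, tree_c]

forest_scores = score_forest(trees, validation)
weighted_scores = score_weighted(trees, [0.5, 0.3, 0.2], validation)  # illustrative weights
print(forest_scores)
print(weighted_scores)
```

Each list has one combined score per validation row, so no row is skipped by any tree.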

Q2: This is a hyperparameter you can control for the two different ensemble models in SAS Viya. For the Gradient Boosting model there is a hyperparameter named "Subsample rate" (the default value is 0.5) that determines the proportion of training observations to use to train each tree. For the Forest model there is a hyperparameter named "In-bag sample proportion" that does the same thing (with a default value of 0.6). The software requires that these numbers be less than 1, so you can't use this to 'up-sample' the data. I am not certain how defensible the 'up-sampling' would be, but it wouldn't add any new information to the model, so I don't think it would generally be helpful in training a better ensemble model. Oversampling can be useful if you have imbalanced classes, but it would be better to oversample the whole dataset and then build trees as usual instead of trying to build the oversampling into the tree-based ensemble. In general, if you want to 'up-sample' the data, it might be better to use a technique for generating synthetic data like SMOTE (Synthetic Minority Over-sampling Technique).
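
The point that up-sampling adds no new information can be checked with a small Python sketch. It assumes sampling with replacement, as in the question; the function name `in_bag_sample` and the proportions 1.0 and 2.0 are illustrative only and are not settings the SAS software is known to accept.

```python
import random

def in_bag_sample(n_rows, proportion, rng):
    """Draw round(proportion * n_rows) row indices with replacement."""
    size = round(proportion * n_rows)
    return [rng.randrange(n_rows) for _ in range(size)]

rng = random.Random(0)  # fixed seed so the run is repeatable
n_rows = 400

# Map each proportion to (sample size, number of distinct rows in the sample).
results = {}
for proportion in (0.6, 1.0, 2.0):
    sample = in_bag_sample(n_rows, proportion, rng)
    results[proportion] = (len(sample), len(set(sample)))
    print(proportion, results[proportion])
```

However large the sample is made, the number of distinct rows can never exceed the 400 in the original data; a proportion above 1 only produces more repeats of rows the tree could already see.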

Q3: In the bagging and boosting algorithms described in the eLearning videos, you are exactly correct: they are just two different ways to select samples for training the ensemble of trees. One major difference hiding behind this statement is that the boosting algorithm generates each subsequent sample based on the results of fitting the previous trees, so there is a notion of ordering for the trees (the first tree, the second tree, etc.), whereas the bagging algorithm could in theory generate all of the random samples before training any trees. One thing to note is that neither the boosting nor the bagging algorithm is implemented in SAS Viya; instead we have the more sophisticated Gradient Boosting and Forest algorithms. As for the sample size you generate with the two methods, you can definitely generate a sample of size 500 from a dataset of size 500, but in the software it will end up depending on the hyperparameters (subsample rate and in-bag sample proportion) I described in the answer to Q2.
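
The ordering difference can be sketched in Python as follows. This is a simplified illustration of the textbook-style bagging and weight-based boosting described in the course, not of the Gradient Boosting or Forest algorithms in SAS Viya. The reweighting factor of 2.0 and the set of "misclassified" rows are made-up assumptions; a real boosting algorithm derives both from the fitted tree.

```python
import random

def bagging_samples(n_rows, n_trees, rng):
    """All bagging samples can be drawn up front; none depends on a fitted tree."""
    return [rng.choices(range(n_rows), k=n_rows) for _ in range(n_trees)]

def reweight(weights, misclassified, factor=2.0):
    """Raise the weight of misclassified rows, then renormalise to sum to 1."""
    raw = [w * factor if i in misclassified else w for i, w in enumerate(weights)]
    total = sum(raw)
    return [w / total for w in raw]

def boosting_sample(weights, rng):
    """Draw a weighted sample (with replacement) the same size as the data."""
    n_rows = len(weights)
    return rng.choices(range(n_rows), weights=weights, k=n_rows)

rng = random.Random(0)
n_rows = 500

# Bagging: three samples of size 500, generated before any tree is trained.
bags = bagging_samples(n_rows, n_trees=3, rng=rng)

# Boosting: the second sample cannot be drawn until the first tree's errors are known.
weights = [1.0 / n_rows] * n_rows
first_sample = boosting_sample(weights, rng)
misclassified = set(range(50))  # hypothetical: pretend tree 1 got rows 0..49 wrong
weights = reweight(weights, misclassified)
second_sample = boosting_sample(weights, rng)

print(len(bags[0]), len(first_sample), len(second_sample))
```

Both approaches yield samples of size 500 here, but only the boosting samples have to be produced in sequence, because each set of weights depends on the previous tree.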

-Ari

WWD · Posted 08-11-2021 06:29 PM

Ari:

Thank you for answering these questions plus the previous questions that you answered.

Bill
