To predict a binary target variable, I trained a random forest with 84 explanatory variables (with 10 variables randomly selected at each split) on a training set of 8,500 observations.
For practical reasons, I had to test the performance of the algorithm on a test set of 100,000 observations.
Performance on the test set at a decision threshold of 0.6:
Recall: 50%
Precision: 70%
I then used the variable importance plot to select the most important variables and retrained the model with only the top 20.
The performance on the test set dropped dramatically:
Recall: 20%
Precision: 6%
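
For reference, here is a minimal sketch of the workflow I followed. The data below are synthetic stand-ins for my real dataset, so the code only illustrates the steps rather than reproducing the numbers above:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score

# Synthetic stand-in: 8,500 training and 100,000 test observations, 84 predictors.
X, y = make_classification(n_samples=108_500, n_features=84,
                           n_informative=20, random_state=0)
X_train, y_train = X[:8_500], y[:8_500]
X_test, y_test = X[8_500:], y[8_500:]

# Full model: 10 candidate variables considered at each split.
rf_full = RandomForestClassifier(n_estimators=500, max_features=10, random_state=0)
rf_full.fit(X_train, y_train)

# Apply the 0.6 decision threshold to the predicted probabilities.
p_full = rf_full.predict_proba(X_test)[:, 1]
pred_full = (p_full >= 0.6).astype(int)
print("full model   recall:", recall_score(y_test, pred_full),
      "precision:", precision_score(y_test, pred_full))

# Keep only the 20 most important variables and refit.
top20 = np.argsort(rf_full.feature_importances_)[::-1][:20]
rf_top20 = RandomForestClassifier(n_estimators=500, max_features=10, random_state=0)
rf_top20.fit(X_train[:, top20], y_train)

p_top20 = rf_top20.predict_proba(X_test[:, top20])[:, 1]
pred_top20 = (p_top20 >= 0.6).astype(int)
print("top-20 model recall:", recall_score(y_test, pred_top20),
      "precision:", precision_score(y_test, pred_top20))
```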
Does anyone know a scientific explanation for this counterintuitive phenomenon?
Thank you for your help,
Marco
Accepted Solutions
The standard approach to reducing variables is to run Forest several times, eliminating a small number (such as 4) of the least important variables after each run, until fit statistics such as recall and precision start getting worse. As Doug alluded, Forests can benefit from using many variables to create a complex model.
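
A rough sketch of that elimination loop, assuming scikit-learn, a separate validation set, and recall as the single fit statistic being monitored (the function and variable names here are hypothetical):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score

def prune_by_importance(X_train, y_train, X_valid, y_valid,
                        drop_per_round=4, threshold=0.6):
    kept = np.arange(X_train.shape[1])             # start with every variable
    best_recall, best_kept = -np.inf, kept.copy()
    while len(kept) > drop_per_round:
        rf = RandomForestClassifier(n_estimators=500, random_state=0)
        rf.fit(X_train[:, kept], y_train)
        pred = (rf.predict_proba(X_valid[:, kept])[:, 1] >= threshold).astype(int)
        recall = recall_score(y_valid, pred, zero_division=0)
        if recall < best_recall:                   # fit statistic got worse: stop
            break
        best_recall, best_kept = recall, kept.copy()
        # Drop the few least important variables and run the forest again.
        order = np.argsort(rf.feature_importances_)   # ascending importance
        kept = kept[np.sort(order[drop_per_round:])]
    return best_kept, best_recall
```

In practice you would track precision as well and keep the variable set from the best round rather than the last one fit.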
-Padraic
A few thoughts come to mind...
* Forests were designed to deal with massive numbers of variables and observations where the structure is unknown and investigating individual variables is temporally or computationally inefficient
* 8,500 observations and 84 variables --> not a lot of observations or variables for such a flexible modeling method, which can make it very easy to overfit the data, particularly with a random forest, which builds models on random subsets of observations using random subsets of variables as predictors
* Of the 84 variables used as predictors, keeping only the 20 most important discards three quarters of the variables, and those discarded variables appear to have helped the forest make full use of the most important ones.
* Precision & Recall are fine but depend in part on the threshold you are using. I would be interested in knowing the distribution of differences in the predicted probability of the event of interest between the two models (see the sketch after this list). It is possible that those differences in probability are relatively small even though the Precision & Recall differences seem dramatic
* There could be great variability if the event is rare, but either way you are not looking at a large number of observations for such a flexible modeling strategy.
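
One way to look at the probability differences mentioned above, sketched with hypothetical names, where `p_full` and `p_top20` hold each model's predicted probabilities of the event on the same test set (e.g. from `predict_proba(...)[:, 1]` in scikit-learn):

```python
import numpy as np

def compare_probabilities(p_full, p_top20, threshold=0.6):
    diff = p_full - p_top20
    # How far apart are the two models' probabilities, and how often does
    # the difference actually move a case across the decision threshold?
    print("mean |difference|:", np.mean(np.abs(diff)))
    print("quantiles of difference:",
          np.quantile(diff, [0.05, 0.25, 0.5, 0.75, 0.95]))
    print("fraction crossing threshold:",
          np.mean((p_full >= threshold) != (p_top20 >= threshold)))
```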
Hope this helps!
Doug