To predict a binary target variable, I trained a random forest with 84 explanatory variables (with 10 variables randomly selected at each split) on a training set of 8,500 observations.
For practical reasons, I had to test the performance of the algorithm on a test set of 100,000 observations.
Performance on the test set at a decision threshold of 0.6:
Recall: 50%
Precision: 70%
I then used the variable importance plot to select the most important variables and retrained the model with only the top 20.
The performance on the test set dropped dramatically:
Recall: 20%
Precision: 6%
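
For reference, here is a minimal sketch of the workflow I followed. The data below are synthetic stand-ins for my real dataset, so the code only illustrates the steps rather than reproducing the numbers above:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score

# Synthetic stand-in: 8,500 training and 100,000 test observations, 84 predictors.
X, y = make_classification(n_samples=108_500, n_features=84,
                           n_informative=20, random_state=0)
X_train, y_train = X[:8_500], y[:8_500]
X_test, y_test = X[8_500:], y[8_500:]

# Full model: 10 candidate variables considered at each split.
rf_full = RandomForestClassifier(n_estimators=500, max_features=10, random_state=0)
rf_full.fit(X_train, y_train)

# Apply the 0.6 decision threshold to the predicted probabilities.
p_full = rf_full.predict_proba(X_test)[:, 1]
pred_full = (p_full >= 0.6).astype(int)
print("full model   recall:", recall_score(y_test, pred_full),
      "precision:", precision_score(y_test, pred_full))

# Keep only the 20 most important variables and refit.
top20 = np.argsort(rf_full.feature_importances_)[::-1][:20]
rf_top20 = RandomForestClassifier(n_estimators=500, max_features=10, random_state=0)
rf_top20.fit(X_train[:, top20], y_train)

p_top20 = rf_top20.predict_proba(X_test[:, top20])[:, 1]
pred_top20 = (p_top20 >= 0.6).astype(int)
print("top-20 model recall:", recall_score(y_test, pred_top20),
      "precision:", precision_score(y_test, pred_top20))
```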
Does anyone know a scientific explanation for this counterintuitive phenomenon?
Thank you for your help,
Marco
Accepted Solutions
The standard approach to reducing variables is to run Forest several times, eliminating a small number (such as 4) of the least important variables after each run, until fit statistics such as recall and precision start getting worse. As Doug alluded, Forests can benefit from using many variables to create a complex model.
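
A rough sketch of that elimination loop, assuming scikit-learn, a separate validation set, and recall as the single fit statistic being monitored (the function and variable names here are hypothetical):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score

def prune_by_importance(X_train, y_train, X_valid, y_valid,
                        drop_per_round=4, threshold=0.6):
    kept = np.arange(X_train.shape[1])             # start with every variable
    best_recall, best_kept = -np.inf, kept.copy()
    while len(kept) > drop_per_round:
        rf = RandomForestClassifier(n_estimators=500, random_state=0)
        rf.fit(X_train[:, kept], y_train)
        pred = (rf.predict_proba(X_valid[:, kept])[:, 1] >= threshold).astype(int)
        recall = recall_score(y_valid, pred, zero_division=0)
        if recall < best_recall:                   # fit statistic got worse: stop
            break
        best_recall, best_kept = recall, kept.copy()
        # Drop the few least important variables and run the forest again.
        order = np.argsort(rf.feature_importances_)   # ascending importance
        kept = kept[np.sort(order[drop_per_round:])]
    return best_kept, best_recall
```

In practice you would track precision as well and keep the variable set from the best round rather than the last one fit.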
-Padraic
A few thoughts come to mind...
* Forests were designed to deal with massive numbers of variables and observations where the structure is unknown and investigating individual variables is temporally or computationally inefficient
* 8,500 observations and 84 variables --> not a lot of observations or variables for such a flexible modeling method, which can make it very easy to overfit the data, particularly with a random forest, which builds models on random subsets of observations using random subsets of variables as predictors
* Of the 84 variables used as predictors, keeping only the 20 most important discards three quarters of the variables, and those discarded variables appear to have helped the forest make full use of the most important ones.
* Precision & Recall are fine but depend in part on the threshold you are using. I would be interested in knowing the distribution of differences in the predicted probability of the event of interest between the two models (see the sketch after this list). It is possible that those differences in probability are relatively small even though the Precision & Recall differences seem dramatic
* There could be great variability if the event is rare, but either way you are not looking at a large number of observations for such a flexible modeling strategy.
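
One way to look at the probability differences mentioned above, sketched with hypothetical names, where `p_full` and `p_top20` hold each model's predicted probabilities of the event on the same test set (e.g. from `predict_proba(...)[:, 1]` in scikit-learn):

```python
import numpy as np

def compare_probabilities(p_full, p_top20, threshold=0.6):
    diff = p_full - p_top20
    # How far apart are the two models' probabilities, and how often does
    # the difference actually move a case across the decision threshold?
    print("mean |difference|:", np.mean(np.abs(diff)))
    print("quantiles of difference:",
          np.quantile(diff, [0.05, 0.25, 0.5, 0.75, 0.95]))
    print("fraction crossing threshold:",
          np.mean((p_full >= threshold) != (p_top20 >= threshold)))
```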
Hope this helps!
Doug