mmaccora
Obsidian | Level 7
Hi,

In order to predict a binary target variable, I trained a random forest with 84 explanatory variables (using 10 variables randomly selected in each split) on a training set composed of 8,500 observations.

For practical reasons, I had to test the performance of the algorithm on a test set of 100,000 observations.

Performance on the test set at a decision threshold of 0.6:

Recall: 50%
Precision: 70%

After that, I used the variable importance plot to select the most important variables and retrained the model using only the top 20.

The performance on the test set then decreased dramatically:

Recall: 20%
Precision: 6%

Does anyone know a scientific explanation for this counterintuitive phenomenon?

Thank you for your help,
Marco
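
For reference, the workflow described above can be sketched as follows. This is a minimal scikit-learn illustration on synthetic data, not the original SAS run; all names (X_train, rf_full, top20, and so on) are placeholders introduced here.

# Minimal scikit-learn sketch of the workflow described above. Synthetic data and
# placeholder names; the original analysis was presumably run in SAS.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 84 explanatory variables, binary target.
X, y = make_classification(n_samples=10000, n_features=84, n_informative=20,
                           weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

def fit_and_score(X_tr, y_tr, X_te, y_te, threshold=0.6):
    # Fit a forest with 10 candidate variables per split and report
    # precision/recall at the given decision threshold.
    rf = RandomForestClassifier(n_estimators=500, max_features=10, random_state=0)
    rf.fit(X_tr, y_tr)
    proba = rf.predict_proba(X_te)[:, 1]        # predicted probability of the event
    pred = (proba >= threshold).astype(int)     # apply the 0.6 decision threshold
    return rf, precision_score(y_te, pred), recall_score(y_te, pred)

# Full model on all 84 variables
rf_full, prec_full, rec_full = fit_and_score(X_train, y_train, X_test, y_test)

# Reduced model on the 20 most important variables from the full fit
top20 = np.argsort(rf_full.feature_importances_)[::-1][:20]
rf_red, prec_red, rec_red = fit_and_score(X_train[:, top20], y_train,
                                          X_test[:, top20], y_test)
print(prec_full, rec_full, prec_red, rec_red)

On synthetic data the numbers will not match those reported above; the sketch only shows the mechanics of fitting the full model, picking the top 20 variables by importance, and refitting.
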
1 ACCEPTED SOLUTION

Accepted Solutions
PadraicGNeville
SAS Employee

 

The standard approach to reducing variables is to run Forest several times, eliminating a small number (such as 4) of the least important variables after each run until fit statistics such as recall and precision start getting worse. As Doug alluded to, forests can benefit from using many variables to create a complex model.

-Padraic
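
To make that loop concrete, here is a hedged sketch of the backward elimination Padraic describes, written with scikit-learn rather than the SAS Forest node; the function name, the drop-4-per-round setting, and the use of a held-out validation set are illustrative choices, not anything specified in the thread.

# Hedged sketch of the backward-elimination loop described above, using scikit-learn
# rather than the SAS Forest node. Names and settings are illustrative.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score

def eliminate_backward(X_train, y_train, X_valid, y_valid,
                       drop_per_round=4, threshold=0.6, min_vars=10):
    # Repeatedly refit, dropping the few least important variables each round,
    # and stop as soon as the fit statistic on the validation set gets worse.
    keep = np.arange(X_train.shape[1])              # start with every variable
    best_keep, best_score = keep, -np.inf
    while len(keep) >= min_vars:
        rf = RandomForestClassifier(n_estimators=500, max_features="sqrt", random_state=0)
        rf.fit(X_train[:, keep], y_train)
        pred = (rf.predict_proba(X_valid[:, keep])[:, 1] >= threshold).astype(int)
        score = precision_score(y_valid, pred) + recall_score(y_valid, pred)
        if score < best_score:                      # started getting worse: stop
            break
        best_keep, best_score = keep, score
        order = np.argsort(rf.feature_importances_)     # ascending importance
        keep = keep[order[drop_per_round:]]             # drop the least important few
    return best_keep

# usage (placeholder arrays): kept = eliminate_backward(X_train, y_train, X_valid, y_valid)

In practice you would track precision and recall (or whichever fit statistic you care about) separately rather than summing them; the sum is used here only to keep the stopping rule to one line.
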


2 REPLIES 2
DougWielenga
SAS Employee

A few thoughts come to mind...

  * Forests were designed to deal with massive numbers of variables and observations where the structure is unknown and investigating individual variables is too time-consuming or computationally expensive.

  * With 8,500 observations and 84 variables, you do not have a lot of data for such a flexible modeling method, which might make it very easy to overfit, particularly with a random forest, which builds trees on random subsets of observations using random subsets of variables as predictors.

  * Of the 84 variables used as predictors, selecting the 20 most important ignores three quarters of the variables, which seem to have helped the model make full use of the most important ones.

  * Precision and recall are fine, but they rely in part on the threshold you are using. I would be interested in knowing the distribution of differences in the predicted probability of the event of interest between the two models. It is possible that the differences in predicted probabilities are relatively small even though the precision and recall differences seem dramatic (see the sketch after this list).

  * There could be great variability if the event is rare, but either way you are not looking at a large number of observations for such a flexible modeling strategy.
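
Picking up the third and fourth bullets, the sketch below continues the earlier scikit-learn illustration (rf_full, rf_red, top20, X_test and y_test are the placeholder objects defined in the first sketch) and shows one way to inspect the distribution of probability differences and how precision and recall move with the threshold.

# Sketch of the probability comparison suggested above, continuing the earlier
# illustration: rf_full, rf_red, top20, X_test and y_test are the placeholder
# objects defined in the first sketch.
import numpy as np
from sklearn.metrics import precision_recall_curve

p_full = rf_full.predict_proba(X_test)[:, 1]             # event probability, 84-variable model
p_red = rf_red.predict_proba(X_test[:, top20])[:, 1]     # event probability, 20-variable model

# Distribution of the differences in predicted event probability between the models
diff = p_full - p_red
print(np.percentile(diff, [5, 25, 50, 75, 95]))

# Precision and recall as a function of the decision threshold for each model
for name, p in [("84 variables", p_full), ("20 variables", p_red)]:
    precision, recall, thresholds = precision_recall_curve(y_test, p)
    i = np.searchsorted(thresholds, 0.6)                 # index of the 0.6 threshold
    print(name, "precision/recall near threshold 0.6:", precision[i], recall[i])
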

  

Hope this helps!
Doug


