Random Forest Overfitting

msf2021 — Tue, 24 May 2022 14:50:46 GMT

Hello!

I have built a random forest in SAS Miner for a classification task. My target variable is Target (1 = event, 0 = non-event), and I identified the 20 most important variables. I then kept only these 20 and ran the HPForest node again. All my metrics are consistent between train (80% split) and test (20% split), except that the cumulative % captured response differs significantly: about 30% in the 1st decile on train versus about 20% on test. I found that changing some parameters, such as mtry and the maximum number of trees, changes these results, but is there a way to find the optimal parameters? Trying different combinations by hand is not easy, and I have not been able to achieve good results.
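For readers unfamiliar with the metric, here is a minimal sketch of how the captured response in the 1st decile could be computed outside of SAS. It is plain Python rather than Enterprise Miner output; the function name and the toy data are hypothetical, and it assumes the decile is simply the top 10% of rows by predicted score.

```python
def captured_response_first_decile(y_true, scores):
    """Share of all events that fall in the 10% of rows with the highest scores.

    y_true: list of 0/1 labels (1 = event), scores: predicted event probabilities.
    Assumption: the 1st decile is the top 10% of rows, rounded down, at least 1 row.
    """
    if len(y_true) != len(scores):
        raise ValueError("y_true and scores must have the same length")
    total_events = sum(y_true)
    if total_events == 0:
        raise ValueError("no events in y_true, the metric is undefined")

    # Rank rows from the highest to the lowest predicted score.
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    top_n = max(1, len(ranked) // 10)

    # Count the events that land in the top decile.
    events_in_top = sum(label for _, label in ranked[:top_n])
    return events_in_top / total_events


# Toy data (hypothetical): 20 rows, 3 events, scores increasing with the row index.
labels = [0] * 16 + [1, 0, 1, 1]
preds = [i / 20 for i in range(20)]
print(captured_response_first_decile(labels, preds))
```

Running the same function on the train and test partitions gives the two numbers being compared in the question.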

I already tried this methodology: Tip: Getting the Most from your Random Forest - SAS Support Communities. However, it only considers interval inputs, whereas I have both interval and categorical ones, and I could not achieve better results with this approach either.

Thanks

Re: Random Forest Overfitting

sbxkoenk — Tue, 24 May 2022 20:39:02 GMT

Hello @msf2021,

What is the variable importance table / importance plot telling you?

Maybe the top 20 variables are only responsible for 50% of the total importance?
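A quick way to check this, once the importance table is exported, is to compute the cumulative share of the top variables. The sketch below is plain Python, not SAS; the function name, the variable names and the importance values are all made up for illustration.

```python
def cumulative_importance_share(importances, top_k):
    """Fraction of the total importance carried by the top_k most important variables.

    importances: dict mapping variable name -> non-negative importance value.
    """
    total = sum(importances.values())
    if total <= 0:
        raise ValueError("total importance must be positive")

    # Keep the top_k largest importance values and sum them.
    top_values = sorted(importances.values(), reverse=True)[:top_k]
    return sum(top_values) / total


# Hypothetical importance table with three variables.
table = {"var_a": 5.0, "var_b": 3.0, "var_c": 2.0}
print(cumulative_importance_share(table, top_k=2))
```

If the share for top_k=20 turns out to be low, the dropped variables still carry a lot of signal.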

You can also have a look here :

SAS Tutorial | How to train forest models in SAS?
https://www.youtube.com/watch?v=FWragzNF59U

SAS Tutorial | How to Pick Hyperparameters of Machine Learning Models?
https://www.youtube.com/watch?v=AOR7XnCB_JA
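As a rough illustration of replacing the manual trial and error with a systematic search, here is a plain grid search sketch in Python. It is not the SAS autotuning feature and not necessarily what the videos show. The function names, the parameter names (mtry, n_trees) and the selection rule (best holdout score among combinations whose train/holdout gap stays under a threshold) are assumptions. The evaluate function is a placeholder you would replace with your own train-and-score step.

```python
from itertools import product


def grid_search(param_grid, evaluate, max_gap=0.05):
    """Try every parameter combination and keep the best one that does not overfit.

    param_grid: dict mapping parameter name -> list of candidate values.
    evaluate:   caller-supplied function, params dict -> (train_score, holdout_score).
    max_gap:    largest accepted train minus holdout difference (assumed rule).
    Returns (best_params, best_holdout_score), or None if nothing passes the gap check.
    """
    names = list(param_grid)
    best = None
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        train_score, holdout_score = evaluate(params)

        # Skip combinations where train is much better than holdout.
        if train_score - holdout_score > max_gap:
            continue
        if best is None or holdout_score > best[1]:
            best = (params, holdout_score)
    return best


# Placeholder scorer (hypothetical numbers): holdout improves with mtry.
def fake_evaluate(params):
    return 0.30, 0.20 + 0.02 * params["mtry"]


print(grid_search({"mtry": [2, 4], "n_trees": [50, 100]}, fake_evaluate))
```

Ideally the holdout used for choosing parameters is a separate validation partition, so the test partition stays untouched for the final check.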

You can also select the most important variables upfront with other techniques.

I am not sure whether PROC VARREDUCE was already available in the Enterprise Miner days.

Thanks,

Koen
