PCKW
Calcite | Level 5

Hello,

 

I would appreciate it if someone could confirm my understanding of how the Random Forest and Decision Tree methodologies differ in the split search for a single node.  Let's assume that we are only interested in binary splits.

 

Decision Tree

In a decision tree, the split search for a single node maximises the worth by comparing all candidate binary splits across ALL input variables, including every candidate split point or grouping of values WITHIN each input variable.

For example, the following combinations will be compared to determine the split with highest information gain (IG):

- Age: less than 10 vs older than or equal to 10

- Age: less than 5 vs older than or equal to 5

- Sports: Swimming, Cricket vs Tennis, Running

- Sports: Swimming, Cricket, Tennis vs Running

etc.
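
To make the comparison concrete, here is a minimal sketch of that exhaustive search as I understand it. It is not how any SAS node is implemented; the toy data, the function names (`entropy`, `info_gain`, `candidate_splits`, `best_split`) and the use of entropy-based IG as the worth measure are all my own assumptions for illustration.

```python
from itertools import combinations
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return -sum((c / n) * log2(c / n) for c in counts.values())

def info_gain(labels, go_left):
    """Information gain of a binary split; go_left[i] is True if row i goes left."""
    left = [y for y, g in zip(labels, go_left) if g]
    right = [y for y, g in zip(labels, go_left) if not g]
    if not left or not right:
        return 0.0
    n = len(labels)
    return entropy(labels) - len(left) / n * entropy(left) - len(right) / n * entropy(right)

def candidate_splits(name, values):
    """Yield (description, go_left) for every binary split of one input."""
    distinct = sorted(set(values))
    if all(isinstance(v, (int, float)) for v in distinct):
        # Numeric input: "less than t" vs "t or more" for each observed value t.
        for t in distinct[1:]:
            yield (name, "<", t), [v < t for v in values]
    else:
        # Categorical input: each two-way grouping of levels, counted once
        # by always keeping the first level on the left side.
        first, rest = distinct[0], distinct[1:]
        for k in range(len(rest)):
            for extra in combinations(rest, k):
                left = {first, *extra}
                yield (name, "in", tuple(sorted(left))), [v in left for v in values]

def best_split(rows, labels, inputs):
    """Compare every candidate split of every input and keep the highest IG."""
    best = None
    for name in inputs:
        values = [r[name] for r in rows]
        for desc, go_left in candidate_splits(name, values):
            gain = info_gain(labels, go_left)
            if best is None or gain > best[0]:
                best = (gain, desc)
    return best

# Hypothetical toy data, made up for this post.
rows = [
    {"Age": 4, "Sports": "Swimming"},
    {"Age": 8, "Sports": "Cricket"},
    {"Age": 12, "Sports": "Tennis"},
    {"Age": 15, "Sports": "Running"},
    {"Age": 7, "Sports": "Tennis"},
    {"Age": 20, "Sports": "Swimming"},
]
labels = [1, 1, 0, 0, 1, 0]

gain, desc = best_split(rows, labels, ["Age", "Sports"])
print(gain, desc)
```

In this sketch, splits on Age and splits on Sports compete in the same loop, so the winner can come from either input.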

 

In a random forest, however, the same split search is conducted on only ONE input variable, which is chosen in the following steps:

- first, randomly select X variables (the number X is specified by the user), and

- from these variables, only one input variable is selected based on an association test

So if Age is selected based on the association test, then only the candidate splits for Age will be considered to determine which has the highest IG:

- Age: less than 10 vs older than or equal to 10

- Age: less than 5 vs older than or equal to 5

- Age: less than 30 vs older than or equal to 30

etc.
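
Here is a rough sketch of the forest-node procedure as I have described it above, so that it is clear what I am asking to have confirmed. It is only my reading of the steps, not the HP Forest implementation. The function names, the toy data and especially the `association` score are assumptions: I use the absolute Pearson correlation with a 0/1 target as a simple stand-in, which is not the p-value based test mentioned in the help text, and I only handle numeric inputs to keep it short.

```python
import random
from math import log2, sqrt

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return -sum((c / n) * log2(c / n) for c in counts.values())

def info_gain(labels, go_left):
    """Information gain of a binary split; go_left[i] is True if row i goes left."""
    left = [y for y, g in zip(labels, go_left) if g]
    right = [y for y, g in zip(labels, go_left) if not g]
    if not left or not right:
        return 0.0
    n = len(labels)
    return entropy(labels) - len(left) / n * entropy(left) - len(right) / n * entropy(right)

def association(values, labels):
    """Stand-in association score (assumption): |Pearson r| with a 0/1 target."""
    n = len(values)
    mx, my = sum(values) / n, sum(labels) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(values, labels))
    sxx = sum((x - mx) ** 2 for x in values)
    syy = sum((y - my) ** 2 for y in labels)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy) / sqrt(sxx * syy)

def forest_node_split(rows, labels, inputs, n_candidates, rng):
    # Step 1: randomly select n_candidates inputs (user-specified number).
    sampled = rng.sample(inputs, n_candidates)
    # Step 2: preselect ONE input using the association score.
    chosen = max(sampled, key=lambda name: association([r[name] for r in rows], labels))
    # Step 3: search split points within the chosen input only.
    values = [r[chosen] for r in rows]
    best = None
    for t in sorted(set(values))[1:]:
        gain = info_gain(labels, [v < t for v in values])
        if best is None or gain > best[0]:
            best = (gain, chosen, t)
    return sampled, best

# Hypothetical toy data, made up for this post.
rows = [
    {"Age": 4, "Income": 30, "Height": 120},
    {"Age": 8, "Income": 52, "Height": 150},
    {"Age": 12, "Income": 41, "Height": 130},
    {"Age": 15, "Income": 38, "Height": 160},
    {"Age": 7, "Income": 60, "Height": 140},
    {"Age": 20, "Income": 45, "Height": 155},
]
labels = [1, 1, 0, 0, 1, 0]
inputs = ["Age", "Income", "Height"]

sampled, best = forest_node_split(rows, labels, inputs, 2, random.Random(0))
print(sampled, best)
```

The point of the sketch is step 3: once one input has been preselected, split points on the other sampled inputs are never compared.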

 

And one final question: the help menu says that "The HP Forest node preselects the input with the largest p-value of an asymptotic permutation distribution of an association statistic".  Why does the model select the input with the largest p-value?  What null hypothesis is being tested?

 

Thank you!
