
# Random Forest vs Decision Tree Node Split Search

Hello,

I'd appreciate it if someone could confirm my current understanding of how Random Forests and Decision Trees differ in their split search for a single node.  Let's assume that we are only interested in binary splits.

## Decision Tree

In a decision tree, the split search for a single node is conducted by maximising the split worth (e.g. information gain), evaluating all candidate binary splits across ALL input variables, as well as all candidate split points and category groupings WITHIN each input variable.

For example, the following candidate splits would be compared to determine the one with the highest information gain (IG):

- Age: less than 10 vs 10 or older

- Age: less than 5 vs 5 or older

- Sports: Swimming, Cricket vs Tennis, Running

- Sports: Swimming, Cricket, Tennis vs Running

etc.
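To make this concrete, here is a minimal sketch of an exhaustive node split search using entropy-based information gain. All names and the approach are illustrative, not taken from any particular library: numeric variables get one candidate threshold per distinct value, and categorical variables get every non-trivial subset.

```python
from collections import Counter
from itertools import combinations
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(labels, left_idx):
    """IG of splitting `labels` into rows in `left_idx` vs the rest."""
    left = [y for i, y in enumerate(labels) if i in left_idx]
    right = [y for i, y in enumerate(labels) if i not in left_idx]
    n = len(labels)
    return (entropy(labels)
            - (len(left) / n) * entropy(left)
            - (len(right) / n) * entropy(right))

def best_split(rows, labels, variables):
    """Exhaustive search: every binary split of EVERY variable."""
    best = (None, None, -1.0)  # (variable, split description, gain)
    for var in variables:
        values = sorted({r[var] for r in rows})
        if all(isinstance(v, (int, float)) for v in values):
            # Numeric variable: one candidate threshold per distinct value.
            candidates = [("< %s" % t, lambda r, t=t: r[var] < t)
                          for t in values[1:]]
        else:
            # Categorical variable: every non-trivial subset on the left.
            candidates = [("in %s" % set(s), lambda r, s=set(s): r[var] in s)
                          for k in range(1, len(values))
                          for s in combinations(values, k)]
        for desc, goes_left in candidates:
            left_idx = {i for i, r in enumerate(rows) if goes_left(r)}
            if 0 < len(left_idx) < len(rows):
                gain = information_gain(labels, left_idx)
                if gain > best[2]:
                    best = (var, desc, gain)
    return best
```

Note the combinatorial cost: a categorical variable with k levels contributes on the order of 2^k candidate subsets, which is why real implementations often use shortcuts.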

## Random Forest

In a random forest, however, the split search at a node is conducted on only ONE input variable, selected via the following steps:

- first, randomly select a subset of X variables (X is specified by the user), and

- then, from these candidates, select a single input variable based on an association test with the target

So if Age is selected by the association test, then only splits on Age will be considered to determine which has the highest IG:

- Age: less than 10 vs 10 or older

- Age: less than 5 vs 5 or older

- Age: less than 30 vs 30 or older

etc.
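The three steps above can be sketched like this. This is a toy illustration assuming numeric inputs and a Gini-based split search; the names `mtry`, `associate`, and `split_one_variable` are placeholders of mine, not SAS HP Forest internals:

```python
import random

def gini(labels):
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def split_one_variable(values, labels):
    """Exhaustive threshold search on a SINGLE numeric variable."""
    best_t, best_gain = None, -1.0
    parent = gini(labels)
    n = len(labels)
    for t in sorted(set(values))[1:]:
        left = [y for x, y in zip(values, labels) if x < t]
        right = [y for x, y in zip(values, labels) if x >= t]
        gain = (parent
                - (len(left) / n) * gini(left)
                - (len(right) / n) * gini(right))
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain

def forest_node_split(columns, labels, mtry, associate, rng=random):
    # Step 1: randomly draw `mtry` candidate variables (user-specified).
    candidates = rng.sample(list(columns), mtry)
    # Step 2: keep only the candidate most associated with the target
    # (`associate` stands in here for the permutation association test).
    chosen = max(candidates, key=lambda v: associate(columns[v], labels))
    # Step 3: the split search then runs on that ONE variable only.
    threshold, gain = split_one_variable(columns[chosen], labels)
    return chosen, threshold, gain
```

The key contrast with the decision-tree case: the exhaustive comparison of split points happens only inside the single pre-selected variable, never across variables.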

And, one final question - the help documentation says that "The HP Forest node preselects the input with the largest p-value of an asymptotic permutation distribution of an association statistic".  Why does the model select the input with the largest p-value?  What is the null hypothesis being tested?
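For context on what a permutation p-value is (this is not a claim about how HP Forest computes it - it uses an asymptotic approximation rather than explicit resampling), here is a generic Monte-Carlo permutation test for association; the statistic and function names are made up for illustration:

```python
import random

def mean_diff(xs, ys):
    """Toy association statistic: |mean(x | y=1) - mean(x | y=0)|."""
    a = [x for x, y in zip(xs, ys) if y == 1]
    b = [x for x, y in zip(xs, ys) if y == 0]
    return abs(sum(a) / len(a) - sum(b) / len(b))

def permutation_p_value(xs, ys, n_perm=999, rng=random.Random(0)):
    observed = mean_diff(xs, ys)
    shuffled = list(ys)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # shuffling breaks any real x-y association
        if mean_diff(xs, shuffled) >= observed:
            hits += 1
    # In a permutation test the null hypothesis is that the input and
    # target are independent: the p-value is the fraction of shuffled
    # datasets whose statistic is at least as extreme as the observed one.
    return (hits + 1) / (n_perm + 1)
```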

Thank you!
