Random Forest vs Decision Tree Node Split Search

01-13-2017 06:58 AM

Hello,

I would appreciate it if someone could confirm my current understanding of how the Random Forest and Decision Tree methodologies differ in the split search for a single node. Let's assume that we are only interested in binary splits.

__Decision Tree__

In a decision tree, the split search for a single node maximises the worth of the split by comparing all candidate binary splits across ALL input variables and across all candidate values WITHIN each input variable.

For example, the following candidate splits will be compared to determine the one with the highest information gain (IG):

- Age: less than 10 vs greater than or equal to 10

- Age: less than 5 vs greater than or equal to 5

- Sports: Swimming, Cricket vs Tennis, Running

- Sports: Swimming, Cricket, Tennis vs Running

etc.
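The exhaustive search described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the actual SAS implementation: the toy data, the function names (`entropy`, `candidate_splits`, `best_split`) and the choice of entropy-based information gain as the worth measure are all assumptions made for the example.

```python
from collections import Counter
from itertools import combinations
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(parent, left, right):
    """Entropy reduction obtained by splitting parent into left and right."""
    n = len(parent)
    return (entropy(parent)
            - len(left) / n * entropy(left)
            - len(right) / n * entropy(right))


def candidate_splits(values):
    """Yield (description, goes_left) for every binary split of one input."""
    distinct = sorted(set(values))
    if all(isinstance(v, (int, float)) for v in distinct):
        # Interval input: one threshold per distinct value except the smallest.
        for t in distinct[1:]:
            yield f"< {t} vs >= {t}", (lambda v, t=t: v < t)
    else:
        # Nominal input: every two-way grouping of the levels. Keeping the
        # first level on the left avoids counting mirror-image splits twice.
        first, rest = distinct[0], distinct[1:]
        for k in range(len(rest)):
            for extra in combinations(rest, k):
                group = {first, *extra}
                yield f"{sorted(group)} vs rest", (lambda v, g=group: v in g)


def best_split(rows, target, inputs):
    """Compare all binary splits of ALL inputs; return (gain, input, split)."""
    parent = [r[target] for r in rows]
    best = (0.0, None, None)
    for name in inputs:
        for desc, goes_left in candidate_splits([r[name] for r in rows]):
            left = [r[target] for r in rows if goes_left(r[name])]
            right = [r[target] for r in rows if not goes_left(r[name])]
            gain = information_gain(parent, left, right)
            if gain > best[0]:
                best = (gain, name, desc)
    return best


# Hypothetical toy data: the target y is 1 exactly when age is below 12.
rows = [
    {"age": 4, "sport": "Swimming", "y": 1},
    {"age": 7, "sport": "Cricket", "y": 1},
    {"age": 9, "sport": "Tennis", "y": 1},
    {"age": 12, "sport": "Running", "y": 0},
    {"age": 15, "sport": "Swimming", "y": 0},
    {"age": 30, "sport": "Tennis", "y": 0},
]

gain, name, desc = best_split(rows, "y", ["age", "sport"])
print(name, desc, round(gain, 3))
```

On this toy data the winning split is on `age` at the threshold 12, because it is the only candidate that separates the two classes perfectly. Note that a nominal input with m levels contributes 2^(m-1) - 1 candidate splits, so the exhaustive search grows quickly with the number of levels.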

__Random Forest__

In a random forest, however, the same split search is conducted on only ONE input variable, which is chosen by the following steps:

- first, randomly select X variables (X is specified by the user), and

- then, from these variables, select only one input variable based on an association test

So if Age is selected based on the association test, then only splits on Age will be considered to determine which has the highest IG:

- Age: less than 10 vs greater than or equal to 10

- Age: less than 5 vs greater than or equal to 5

- Age: less than 30 vs greater than or equal to 30

etc.
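The three steps as understood above (random draw of inputs, preselection of one input, split search within that input only) can be sketched as follows. This is a sketch of the procedure as described in this post, not of what HP Forest actually does internally. In particular, the absolute Pearson correlation used here is a hypothetical stand-in for the association test, and the toy data and the names `abs_correlation` and `forest_node_split` are made up for the example.

```python
import random
from collections import Counter
from math import log2, sqrt


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def abs_correlation(xs, ys):
    """Stand-in association statistic: |Pearson r| of input vs 0/1 target."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return abs(sxy) / sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0


def forest_node_split(rows, target, inputs, n_candidates, rng):
    """Return (input, threshold, gain) for one node, searching ONE input only."""
    ys = [r[target] for r in rows]
    # Step 1: randomly draw the user-specified number of candidate inputs.
    candidates = rng.sample(inputs, n_candidates)
    # Step 2: preselect the single candidate most associated with the target.
    chosen = max(
        candidates,
        key=lambda name: abs_correlation([r[name] for r in rows], ys),
    )
    # Step 3: search binary splits within the chosen input only.
    best_gain, best_threshold = 0.0, None
    for t in sorted({r[chosen] for r in rows})[1:]:
        left = [r[target] for r in rows if r[chosen] < t]
        right = [r[target] for r in rows if r[chosen] >= t]
        gain = (entropy(ys)
                - len(left) / len(ys) * entropy(left)
                - len(right) / len(ys) * entropy(right))
        if gain > best_gain:
            best_gain, best_threshold = gain, t
    return chosen, best_threshold, best_gain


# Hypothetical toy data with three interval inputs; y is 1 when age < 12.
rows = [
    {"age": 4, "income": 20, "height": 150, "y": 1},
    {"age": 7, "income": 35, "height": 160, "y": 1},
    {"age": 9, "income": 50, "height": 170, "y": 1},
    {"age": 12, "income": 25, "height": 155, "y": 0},
    {"age": 15, "income": 40, "height": 165, "y": 0},
    {"age": 30, "income": 55, "height": 175, "y": 0},
]
inputs = ["age", "income", "height"]

# With 2 of 3 inputs drawn at random, age may or may not be a candidate,
# so the node can end up split on a weaker input.
print(forest_node_split(rows, "y", inputs, n_candidates=2, rng=random.Random(0)))
```

The point of the sketch is the difference in search scope: the split search in step 3 never looks at the inputs that were not drawn in step 1 or that lost the preselection in step 2, even if one of them would have produced a better split.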

And one final question: the help menu says that "The HP Forest node preselects the input with the largest p-value of an asymptotic permutation distribution of an association statistic". Why does the model select the input with the largest p-value? What null hypothesis is being tested?

Thank you!