11-23-2015 04:49 AM
In the interactive decision tree window in SAS Enterprise Miner, older versions of SAS EM (5.1) made it possible to set different tree properties for different nodes. This was particularly useful for developing custom sub-trees for various business segments.
With EM 6.1 and later, this option has been withdrawn. Consequently, the user has to create multiple tree nodes in the diagram and open a separate interactive window for each one.
Additionally, if I am unsatisfied with the 4-way split on variable X1 and want to look at the best 3-way split, I need to close the interactive window and start afresh.
Lastly, for large datasets, these workarounds lead to increased run time and delays.
PS: Please refer to the attached screenshots for details.
11-24-2015 09:12 AM
Thanks for the detailed screenshots. They surely helped.
I am not sure why that "Model" tab went away in EM 6.1 and later. It seemed useful, although I wonder whether, behind the scenes, it was slow because it had to relaunch proc arbor with different options every time you set new properties. I do not know if that was the case; it is just a wild guess.
I totally share your pain if you had to compare 2-way, 3-way, n-way splits and visually assess their logworths. Would it be a good alternative to run several trees non-interactively and just compare their fit statistics on an unbiased test partition? Would that reduce the need to manually inspect all the splits for each of your n-way options?
Something that should definitely help is that someone has demonstrated that trees with only 2-way splits are less likely to overfit. I can dig up that reference for you.
I hope this helps!
11-25-2015 05:59 AM
Many thanks for your post.
To provide some background: suppose that for each split I need to check whether some business rules/tests are satisfied before moving on. This is typically done in a separate SAS session. An example would be a Basel PD model, where the modeler would like to check whether the (n-way) split maintains the same ordering over 5 years of historical data. If the test result is OK, you move on to the next node. If not, you try the next most significant variable. However, the first variable with an (n-1)-way split may be better in terms of logworth than this second variable and may also pass the offline test.
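To make the workflow above concrete, here is a minimal Python sketch of the idea, not of anything EM does internally. It assumes the offline test simply means "the branches rank in the same order by default rate in every year". The function names, the data layout and all the numbers are hypothetical and for illustration only; in practice the logworth values would come from the tree tool and the yearly rates from the separate SAS session.

```python
from typing import Dict, List, Optional, Sequence

def ordering_is_stable(rates_by_year: Dict[int, Sequence[float]]) -> bool:
    """Return True if the branches rank in the same order (by default rate)
    in every year. Ties are broken by branch position (an assumption)."""
    rankings = set()
    for rates in rates_by_year.values():
        # Branch indices sorted from lowest to highest default rate.
        rankings.add(tuple(sorted(range(len(rates)), key=lambda i: rates[i])))
    return len(rankings) == 1

def pick_split(candidates: List[dict]) -> Optional[dict]:
    """Walk the candidate splits from highest to lowest logworth and return
    the first one that passes the ordering test, or None if none does."""
    for cand in sorted(candidates, key=lambda c: c["logworth"], reverse=True):
        if ordering_is_stable(cand["rates_by_year"]):
            return cand
    return None

# Hypothetical candidates for one node; every number here is made up.
candidates = [
    {"variable": "X1", "branches": 4, "logworth": 12.0,
     "rates_by_year": {2010: [0.01, 0.02, 0.04, 0.08],
                       2011: [0.01, 0.03, 0.02, 0.07]}},  # order flips in 2011
    {"variable": "X1", "branches": 3, "logworth": 10.5,
     "rates_by_year": {2010: [0.01, 0.03, 0.08],
                       2011: [0.01, 0.025, 0.07]}},
    {"variable": "X2", "branches": 4, "logworth": 9.8,
     "rates_by_year": {2010: [0.02, 0.03, 0.05, 0.06],
                       2011: [0.02, 0.03, 0.04, 0.07]}},
]

chosen = pick_split(candidates)
print(chosen["variable"], chosen["branches"])
```

With these made-up numbers, the 4-way split on X1 fails the ordering test, and the next candidate by logworth is the 3-way split on X1 rather than the 4-way split on X2. That is exactly the case that requires trying a different number of branches at a single node.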
Please see below the response to the two options you shared.
Option 1 : Is it a good alternative to run several trees non-interactively, and just compare their fit statistics on an unbiased test partition?
Response : I am afraid not. As noted above, the objective is not to compare 3-way vs 2-way trees. Rather, we need to compare 3-way vs 2-way splits for a specific node or subtree (or business segment).
Option 2 : Something that should definitely help is that someone demonstrated that trees that only have 2-way splits are less likely to overfit.
Response : Please share the document. Overfitting can be addressed by meaningfully pruning the tree using a hold-out sample. The challenge with 2-way trees is that they are typically much deeper (i.e., they have more levels), and the lower levels are difficult to interpret. Also, would you know of any literature on how stable a tree with many levels is from one sample to another?
Broadly, when building an interactive tree, it is critical that the modeler has the option to build custom sub-trees against business rules/tests or business segments. If this feature is resource intensive, the user can always decide not to use it for a short project.