
Comments
by SAS Employee JasonXin
on 07-08-2014 04:34 PM

I suggest that this idea NOT be considered, with or without reference to SAS EM node functionality. Generally, this idea reflects a fundamental misunderstanding of the relationship between a single decision tree and a forest of such trees, and of the motivation for needing a forest in the first place. The idea of visualizing 'trees under the forest' carries the same intellectual and technical shortcoming. This is not about SAS products and functions. This is about what the machine should do and what the human should do.

The forest is designed to compensate for the snapshot nature of a single decision tree by way of repeated sub-sampling, in the hope that random sub-samples will capture, bit by bit, more and more of the future, so that elements of the future have a SAY at the snapshot moment when the model is built. Mechanically, a random forest typically consists of hundreds of component models. My rule of thumb: if I have one million observations with 200 variables (which is nowhere near so-called BIG DATA today), I will ask a random forest to build at least 2,000 sub-trees and will not take less as an answer. Generally, any random forest, unless you are conducting a 'gourmet' type of analysis with it, should use more than 200 trees. As a practical matter, you don't have time to 'study' (pruning or not) any one of them. If you do dive in and zoom in on a specific tree under the forest, I am afraid whatever you come up with will NOT have the analytical foundation to be elevated to 'knock out' the others in the forest. If, unfortunately, you argue that you do have such a foundation, I bet your goal or motivation may not be entirely analytical. You may be in the business of being a detective rather than being predictive. You may be 'fishing' for what you want to see instead of respecting what the forest has to tell you. Yet even that is not easy to accomplish under the forest. I would recommend you just stick with a regular decision tree and start by defining and refining the sample, the modeling universe. In the best interest of your sanity: if you dive into the forest to study or prune a tree, the chances are it is next to impossible to characterize what the modeling sample/universe even is.
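
For scale, asking for a large forest is a one-line setting. A minimal sketch, assuming a hypothetical training table work.train with a binary target bad and interval inputs x1-x200 (all names here are placeholders, not from the original idea):

   /* Sketch only: request at least 2000 component trees.           */
   /* work.train, bad, and x1-x200 are placeholder names.           */
   proc hpforest data=work.train maxtrees=2000;
      target bad / level=binary;       /* binary response           */
      input x1-x200 / level=interval;  /* 200 interval inputs       */
   run;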

If a user cares to prune a lot of trees manually, the user can easily start with one regular Decision Tree node, copy/paste it out, and modify the copies. The modification, including pruning, can happen very efficiently. The user can alter pruning any way desired, and can still use the Metadata node to alter the input sample scope. The Model Comparison node should be heavily leveraged for this crafting exercise. I am not against manual pruning of many trees; I have done it myself many times over the past two decades. I am simply saying that the wish to enable a similar extension under HPFOREST defeats the very goal and spirit of running a RANDOM forest. Above and beyond, the key word in random forest modeling is RANDOM: if you really dive into the forest to prune a tree, is it RANDOM at all?
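
If the appetite is truly for hands-on pruning, a single tree is the right venue. A hedged sketch, assuming the same hypothetical work.train data, with PROC HPSPLIT's cost-complexity pruning standing in for the interactive pruning Enterprise Miner offers:

   /* Sketch only: grow ONE regular tree and prune it deliberately, */
   /* rather than reaching into the forest. Names are placeholders. */
   proc hpsplit data=work.train;
      class bad;                       /* categorical target        */
      model bad = x1-x200;             /* candidate inputs          */
      prune costcomplexity;            /* prune the single tree     */
   run;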

by SAS Employee JasonXin
on 07-08-2014 04:51 PM

On adding logistic regression and neural networks into HPFOREST:

If the intention or interest is to do variable selection for random forest modeling, I am afraid the best and most technically compatible approach is another random forest.
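
A hedged sketch of that approach: run one forest, capture its variable-importance table, and grow the final forest on the strongest inputs. The ODS table name VariableImportance, and all dataset/variable names, are assumptions for illustration, not verified against a specific SAS release:

   /* Sketch only: forest-driven variable selection for a forest.   */
   ods output VariableImportance=work.vi;  /* assumed ODS table name */
   proc hpforest data=work.train maxtrees=500;
      target bad / level=binary;
      input x1-x200 / level=interval;
   run;
   /* Inspect work.vi, keep the strongest inputs, then rerun        */
   /* HPFOREST on the reduced variable list.                        */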

If the desire is to say, "Look, I like the random subsampling ability; I just don't like the tree being the elementary method composing the forest" — well, if superior model lift is the evil you are trying to conquer, rest assured that the incremental lift, if any (there is no guarantee a random forest will outperform a single tree, a single NN, or a single logistic regression), typically comes from the sampling, not from the elementary method. The reason the decision tree was historically picked to serve the random forest exercise is its apparent noise friendliness compared with logistic regression or NN. The selection is based on the elementary method's potency in drawing reliable information value from a random sample, a sample drawn on the fly that tends to defy cleansing.

If the user is keen on building many logistic or NN models on repeated samples, a great first step is to look at the Start Groups and End Groups processing facility inside Enterprise Miner. Model building in Enterprise Miner is about engaging the full package, not expanding one specific node too far. It is just like a degree program: if you want to go beyond your current major, it is perfectly fine to go interdisciplinary, instead of asking your department head to expand the program. A Base SAS sketch of the same pattern follows.
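
Outside Enterprise Miner, the pattern of many logistic models over repeated samples can be sketched in Base SAS with PROC SURVEYSELECT plus BY-group processing; dataset and variable names below are hypothetical:

   /* Sketch only: 50 bootstrap replicates, one logistic model each. */
   proc surveyselect data=work.train out=work.boot
                     method=urs             /* sample with replacement */
                     samprate=1 reps=50     /* 50 full-size replicates */
                     seed=20140708 outhits;
   run;

   proc logistic data=work.boot;
      by replicate;                         /* one model per replicate */
      model bad(event='1') = x1-x10;
   run;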
