09-09-2015 03:23 PM
I am using Enterprise Miner to build some models to predict a dichotomous criterion. I am using the HP Forest node as a way of sort of simulating multiple decision tree runs.
First, I am looking for an easy way to summarize the results of the decision trees. Any ideas or references here?
I also adjust my cutoff to the point where Event Precision equals Recall. That is at 0.21, even though 10% of my data has the criterion (the outcome). With this new cutoff my hit rate (or true positive rate) is about 50%, with a false positive rate of 5%. That hit rate is not where I would expect it to be; I was thinking it would be in the 70s. If I instead search for a cutoff a little closer to 0.10 (down from 0.21, and matching the rate of my original criterion), my hit rate is 95% but my false positive rate goes to 41%!
Another way to look at it is to specify the cutoff based on my expectation of a decent hit rate. Under those circumstances I can get to around 70% using a 0.15 cutoff, though the false positive rate climbs to about 14%. So if I had to guess, my cutoff should be somewhere between 0.10 and 0.21. Should the cutoff be chosen based on the hit rate, false positive rate, and classification rate?
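To make the comparison above concrete, here is a minimal Python sketch of how the hit rate, false positive rate, and precision move as the cutoff changes. The probabilities and outcomes are made up for illustration and are not from my data; the function name rates_at_cutoff is hypothetical and is not an Enterprise Miner API.

```python
def rates_at_cutoff(probs, labels, cutoff):
    """Return (hit rate, false positive rate, precision) at a given cutoff."""
    tp = fp = fn = tn = 0
    for p, y in zip(probs, labels):
        predicted_event = p >= cutoff
        if predicted_event and y == 1:
            tp += 1
        elif predicted_event and y == 0:
            fp += 1
        elif y == 1:
            fn += 1
        else:
            tn += 1
    hit_rate = tp / (tp + fn) if tp + fn else 0.0
    false_positive_rate = fp / (fp + tn) if fp + tn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return hit_rate, false_positive_rate, precision


# Made-up posterior probabilities and outcomes (1 = event), illustration only.
probs = [0.05, 0.08, 0.12, 0.16, 0.18, 0.22, 0.30, 0.40, 0.07, 0.11]
labels = [0, 0, 0, 1, 0, 1, 1, 1, 0, 0]

for cutoff in (0.10, 0.15, 0.21):
    hit, fpr, prec = rates_at_cutoff(probs, labels, cutoff)
    print(f"cutoff={cutoff:.2f} hit={hit:.2f} fpr={fpr:.2f} precision={prec:.2f}")
```

Lowering the cutoff can only keep or raise both the hit rate and the false positive rate, which is the tradeoff I am trying to settle.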
WWSCMD [What Would SAS Community members Do]? Thank you very much in advance.
09-09-2015 04:14 PM
HPForest is not just multiple decision tree runs. HPForest is a very specific type of decision tree ensemble. For each decision tree that you train you are not using every observation, and not all variables are candidates for splits. It does not sound intuitive at first, but Breiman and other authors have demonstrated that this approach works best for a robust model.
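As a rough illustration of those two sources of randomness, here is a minimal Python sketch. It is not HPForest's implementation: the function names, the 60% row fraction, and the number of candidate variables are assumptions chosen for the example, and the rows are drawn without replacement purely for simplicity.

```python
import random


def sample_rows_for_tree(n_rows, row_fraction, rng):
    """Pick the subset of observations that one tree is trained on."""
    k = int(n_rows * row_fraction)
    return sorted(rng.sample(range(n_rows), k))


def candidate_variables_for_split(variables, vars_to_try, rng):
    """Pick the subset of variables considered as split candidates."""
    return rng.sample(variables, vars_to_try)


rng = random.Random(42)  # fixed seed so the draws can be repeated
variables = ["x1", "x2", "x3", "x4", "x5", "x6"]  # hypothetical inputs

for tree_id in range(3):
    rows = sample_rows_for_tree(10, 0.6, rng)
    candidates = candidate_variables_for_split(variables, 2, rng)
    print(tree_id, rows, candidates)
```

Each tree sees a different slice of the data and a different set of candidate variables, which is what makes the trees in the ensemble differ from one another.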
Once you decide to use a model with low interpretability, such as gradient boosting, a random forest, an SVM, or a neural network, you have traded off interpretability for better prediction. One useful trick to better understand the variables driving your model for a binary target:
1. Add a Model Comparison node, a Score node, and a Reporter node after your model.
2. For the reporter node set the Nodes option to SUMMARY. Run this flow and open the results.
3. Notice that the pdf report ran the Rapid Predictive Modeler reports for your model. This report includes the Selected Variable Importance chart based on a decision tree of your predicted event. You can use this chart to explain the main drivers of your model. I find it easier to use this report even for a model like HPForest that already outputs variable importance. I think this chart is easier to explain than the out-of-bag error reduction, and the results usually match.
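To show the idea of explaining a model's predicted event with a tree, here is a minimal Python sketch. It is not the Reporter node's algorithm, which I have not spelled out here: it is a simplified stand-in that ranks each input by the Gini impurity reduction of its best single split against the model's predicted event. All names and data are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_split_gain(values, predicted):
    """Largest impurity reduction from a single threshold on one input."""
    n = len(values)
    parent = gini(predicted)
    best = 0.0
    for threshold in sorted(set(values))[:-1]:
        left = [y for v, y in zip(values, predicted) if v <= threshold]
        right = [y for v, y in zip(values, predicted) if v > threshold]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / n
        best = max(best, parent - weighted)
    return best


def rank_drivers(columns, predicted):
    """Sort inputs by how well one split reproduces the predicted event."""
    gains = {name: best_split_gain(vals, predicted) for name, vals in columns.items()}
    return sorted(gains.items(), key=lambda item: item[1], reverse=True)


# Made-up inputs and the predicted event (1/0) from some black-box model.
columns = {
    "age": [25, 30, 45, 50, 60, 65],
    "visits": [1, 5, 2, 6, 1, 5],
}
predicted = [0, 0, 0, 1, 1, 1]

for name, gain in rank_drivers(columns, predicted):
    print(f"{name}: {gain:.3f}")
```

The target here is the model's prediction, not the actual outcome, so the ranking describes what drives the model rather than what drives the data.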
Before trying to make a recommendation for WWSCMD, please share some info and charts:
-proportion of events to non-events in your target variable (is it a rare event?)
-iteration plot for your HPForest
-plots from your Cutoff node results including ROC, positive rates, and precision recall cutoff
I hope this helps!
09-10-2015 11:10 AM
09-10-2015 11:59 AM
One more question...
I am intrigued by how you say it is a tradeoff between interpretability and other factors.
Can records within the HP Random Forest model be scored? I ask because our IT department will ultimately need to build some algorithms outside of SAS to score each respective model. Without that capability I sort of cannot use the HP Forest model.
09-11-2015 12:17 AM
Thanks for including the screenshot and the log. That sure helped!
In general, you don't want to use Model Comparison node to compare the fit statistics of models that you trained on different data sets. There might be some special cases when you do want to combine the posterior probabilities of models trained on different data sets, for example when you are building a special type of ensemble model. But that's another conversation.
Quick fix: Copy-paste the subflow Model Comparison->Score->Reporter two more times. Connect each of your HPForest Models to one of those subflows, run it, and you will have a Reporter that explains the Variable Importance of each of your HPForest models.
Remember, this report is using a decision tree to explain the main drivers of a model.
Why did you get this error? From your log, it looks like the Reporter node knows which variables in the metadata are used as inputs. It errored out when one of the data sets did not have two of those inputs (ULTIMATE_LITIGATION and ULTIMATE_RTW). I am not sure whether those two input variables were missing from one of the data sets from the get-go or whether they were not passed along. Anyway, the suggested quick fix should get you what you need. The one exception is if the decision tree finds no rules, but that would only happen if no inputs drive your predicted target. As long as you see the variable importance chart on your pdf report and the log says something like "NOTE: The data set WORK.RULES has XXX observations and YYY variables.", everything is good.
About Scoring your HPForest model
In short, the good news is that the Score node writes the SAS code you need to score new observations with your HPForest model.
Open the Score node that you ran in your subflow and you will see the scoring code. HPForest is a special case that uses a specific proc called hp4score to score new observations. The reason for doing this through a proc is that traditional SAS code would take a long time to read and write, and it would be a really big file (remember that your hpforest combines hundreds of trees of a special type).
Let me elaborate on the tradeoff of predictability vs explainability. For example, let's compare a single decision tree with an hpforest. A single decision tree is really easy to explain. From the tree diagram or from the score code you can come up with the set of rules that classify an observation as a predicted event (e.g. if X; or if X and Y; or if X, Y, and Z). But you cannot do the same for an hpforest. Even if you came up with the huge list of rules, you would still need to average the predictions of all the trees. Interpreting a forest is really hard unless you use a workaround like the Reporter node, which uses a single decision tree to explain the predicted outcome using the inputs of your hpforest model.
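Here is a minimal Python sketch of that contrast, with made-up rules and probabilities. The variable names, thresholds, and the three tiny trees are assumptions for illustration; a real hpforest would hold hundreds of much deeper trees.

```python
def single_tree(obs):
    """One tree: the rules can be read straight off the code."""
    if obs["x"] > 5 and obs["y"] > 2:
        return 0.8
    if obs["x"] > 5:
        return 0.4
    return 0.1


def make_stump(var, threshold, p_high, p_low):
    """Build a one-split tree that returns an event probability."""
    def tree(obs):
        return p_high if obs[var] > threshold else p_low
    return tree


# A toy "forest" of three one-split trees.
forest = [
    make_stump("x", 5, 0.7, 0.1),
    make_stump("y", 2, 0.6, 0.2),
    make_stump("z", 0, 0.9, 0.3),
]


def forest_probability(trees, obs):
    """The forest's prediction is the average over all of its trees."""
    return sum(tree(obs) for tree in trees) / len(trees)


obs = {"x": 6, "y": 3, "z": -1}
print("single tree:", single_tree(obs))
print("forest average:", forest_probability(forest, obs))
```

The single tree's answer traces back to one rule, while the forest's answer is a blend of every tree's rule, so no single rule explains it.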
I hope this helps!
09-11-2015 09:58 AM
Thank you very much, again. Below is a listing of the various issues we are up against, but first I thought I would talk a little about the error. Regarding ULTIMATE_LITIGATION & ULTIMATE_RTW, these were originally set in the data to Drop. What I did was make sure they were explicitly Rejected as well. This prevented the error from happening. I would think Drop & Rejected would be synonymous, but...
Regarding the HP Forest Node itself... Does it use some form of bootstrapping to get the varying results? I am a little worried that my cutoff results may be different after a second time running it. Then again, I have it creating a max of 100 trees. Theoretically it should converge.
The Report.pdf file:
Am I correct in assuming that the scoring is developed solely based on the Training data?
Now ultimately I need to come up with a set of scoring rules to submit to our IT department, to score a model outside of SAS. Originally I had 1,117 variables. The file says it selected 1,093. That is probably just too many. Honestly I am trying to:
(1) Come up with a decent model.
(2) Maybe select a subset of the predictors that comes close to converging on the final model. An analogy would be using discriminant analyses to predict segments that I generated on a much fuller set of data. 1,000+ variables is just too many for my IT department to work with.
The Selected Variable Importance does not list out the names of all the variables. But I think their details occur in an alphabetized way below. I can use the bottom information, but how does it rank them in terms of their importance?
It would be ideal if I could say maybe the top-20 or 30 variables are "good enough" for estimating the overall HP Forest model. I hope that is making sense. Is there enough information contained within here to converge to the solution?
Also note that I am going to experiment with limiting my variable splits to 2 rather than the larger number that happens by default. Am I correct that this is what the Max Categories in Split Search property controls? I can set it to 2, down from a default of 30, but it says it only applies to Nominal variables.
Lastly, I am assuming that the Scorecard Points provides the information that our programmers need to program all of this into our system? What do these specifically mean? Can you provide perhaps an example of how this works? How did the reporter node come up with this single tree?
I apologize for all of the questions, but I guess I am back to being a little scared about the utility of the HP Forests. I like the stability, and the solution is good in terms of my hit rates & false positives. Now I just need to see how it will be implemented in reality, and whether working with a basic subset of variables will be close enough to get us to where we need to go.
Thank you, again, & as usual.