Hi all,
I have a large data set that I have oversampled, and I have a few models (decision tree, regression, neural network) in a workflow. In addition, there is an ensemble model of the decision tree and regression. I have these 4 models going to a Model Comparison node. My question is where to put the Decisions node(s) to adjust the priors. The SAS EM 14.1 Help for the Ensemble node states: "When you create a process flow diagram that contains an Ensemble node, do not specify prior probabilities in the diagram before the modeling nodes. In order to obtain the correct fit statistics for the combined unadjusted posterior probabilities, follow the Ensemble node in your diagram with a Decisions node, and use the Decisions node to specify probabilities." Should I use multiple Decisions nodes, one for each model before the Model Comparison node, or just one after the Model Comparison node? Should the priors be adjusted for the model assessment or for the scoring?
Thanks for the help
Larry
Larry,
First of all, kudos on reading the documentation! I will confess that I missed that detail in the documentation. As a general rule, altering the posterior probabilities to be centered closer to the population values does not change the sort order of the observations. Also, the probability estimated by the model would likely be optimistic even if the data set had not been oversampled, since the model is typically optimized on the data used to build/validate it. As a result, the best assessment of model performance comes from putting the model into use. SAS Model Manager is a product designed to monitor model performance over time and can perform retraining when the performance declines. Since models do not tend to perform as well in practice as they have on the training/validation data (e.g., because time has passed, market penetration has changed, economic pressures might be different, etc.), I would have had no issue in assigning prior probabilities and decision weights in the Input Data Source node and then including the Ensemble node later. The probabilities themselves are not as much of a concern to me as the sort order of the resulting scored data.
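The point about sort order can be checked with a small sketch. It assumes the standard prior-correction formula for a binary target (rescale the event and non-event probabilities by the ratio of population prior to sample prior, then renormalize); this is an illustration, not necessarily the exact computation Enterprise Miner performs. The function name and the 50%/5% priors are hypothetical.

```python
def adjust_posterior(p, sample_prior, population_prior):
    """Rescale an event probability from an oversampled sample to population priors.

    Assumes a binary target; sample_prior and population_prior are event proportions.
    """
    event = p * population_prior / sample_prior
    non_event = (1.0 - p) * (1.0 - population_prior) / (1.0 - sample_prior)
    return event / (event + non_event)


# Hypothetical values: 50% events after oversampling, 5% events in the population.
scores = [0.9, 0.2, 0.6, 0.75]
adjusted = [adjust_posterior(p, 0.5, 0.05) for p in scores]

# Rank observations from highest to lowest probability, before and after adjustment.
rank_before = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
rank_after = sorted(range(len(adjusted)), key=lambda i: adjusted[i], reverse=True)
```

Under this formula the adjusted probabilities are pulled down toward the population rate, but the adjustment is monotone in the original probability, so `rank_before` and `rank_after` come out the same.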
I have talked with one customer for whom the predicted probabilities themselves were quite important, but it is important to note that each observation in the data will either have the event or not in a binary target scenario. Probability only makes sense when looking at subgroups of observations. Since the adjustment for priors really impacts where the probabilities are centered, it is possible that some groups might end up with probabilities higher than the adjustment suggests while other groups have probabilities that are lower. The Decisions node allows you to assign weights which can then be multiplied by the probability of each outcome to determine which decision is the most profitable (or least costly). In the end, these calculations are attempting to represent possible business goals. I always recommend a more direct approach, however, where you set up the priors and decision weights in the Input Data Source node so that they are available to the modeling nodes, but then focus on the sort order of the results, paying less attention to the computed probabilities or the 'Decision' unless the decision weights completely represent the business objective.
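To make the decision-weight idea concrete, here is a minimal sketch of the calculation described above: multiply each outcome's probability by the weight for that decision/outcome pair, sum, and pick the decision with the highest expected profit. The decision names and profit values are made up for illustration, and this is not a reproduction of what the Decisions node does internally.

```python
def best_decision(p_event, profit_matrix):
    """Return the decision with the highest expected profit, plus all expected profits."""
    expected = {
        decision: p_event * w["event"] + (1.0 - p_event) * w["non_event"]
        for decision, w in profit_matrix.items()
    }
    return max(expected, key=expected.get), expected


# Hypothetical weights: contacting a responder earns 10, contacting a non-responder costs 1.
profit_matrix = {
    "contact": {"event": 10.0, "non_event": -1.0},
    "ignore": {"event": 0.0, "non_event": 0.0},
}

decision_high, _ = best_decision(0.20, profit_matrix)  # higher probability of the event
decision_low, _ = best_decision(0.05, profit_matrix)   # lower probability of the event
```

With these made-up weights, the chosen decision flips from "contact" to "ignore" somewhere between the two probabilities, which is why the level of the probabilities (and therefore the prior adjustment) matters for the 'Decision' even though it does not matter for the sort order.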
When all is said and done, your mileage might differ, in which case you might consider trying both approaches (one specifying decision weights and priors in the Input Data Source node, the other not specifying them at all prior to modeling) and then choosing the approach which seems to perform best on your data. I am doubtful that going through the extra work of setting up a Decisions node after each modeling node, which the documentation could be interpreted to suggest, will be as good a use of your time as investigating more models. If you are intent on getting probabilities that have been adjusted overall to be more like the population, using the Decisions node after the modeling node is the only way to do that.
Let me know what you think.
Cordially,
Doug