02-08-2016 03:40 AM
I recently built a scorecard model using SAS E-Miner's Credit Scoring node. The scorecard proved to be very good, with excellent Gini values, and the rank ordering of the scores in terms of events/non-events was more than satisfactory. However, the event modelled is extremely rare. In the sample, the events (1's) account for roughly 30%, whereas the true population proportion is 0.15%. The person who will be using the model is completely focused on wanting a cut-off below which all cases will be classified as 1's when predicting. The problem I face now is that, because of the rare event, the model DOES accurately capture a large percentage of events below certain cut-offs when run on out-of-time data, but it also has an extremely large false positive rate, due to the large number of non-events in the population. I have adjusted the regression intercept for oversampling, but this does not seem to change much in terms of cut-offs. Are there perhaps any techniques that you would recommend?
02-08-2016 07:11 PM
There will always be a trade-off of course between:
- sensitivity = true positive rate (TPR) = hit rate = recall on the one hand AND
- precision = positive predictive value (PPV) on the other hand
Moving the cut-off (decision threshold) in a given direction will improve one of these two metrics, but it will inevitably worsen the other.
Also: at a fixed event prevalence, a higher false positive rate means lower precision.
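To see why the rare event hurts precision so badly, note that precision can be written directly in terms of the true positive rate, the false positive rate, and the population event rate. A quick Python sketch (purely illustrative, not E-Miner output):

```python
# Precision (PPV) from TPR, FPR and the population event rate p:
#   PPV = TPR * p / (TPR * p + FPR * (1 - p))
def precision(tpr, fpr, prevalence):
    tp = tpr * prevalence            # true positives per scored case
    fp = fpr * (1.0 - prevalence)    # false positives per scored case
    return tp / (tp + fp)

p = 0.0015  # the 0.15% population proportion from the question
print(round(precision(0.80, 0.05, p), 4))  # ~0.023: even a 5% FPR swamps the events
print(round(precision(0.80, 0.01, p), 4))  # ~0.107: cutting the FPR helps, but slowly
```

So with a 0.15% event rate, even a model that catches 80% of events at a 5% false positive rate yields roughly 2% precision; that is the arithmetic behind the flood of false positives you are seeing.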
Do you already use the Gains table and the Trade-off Plots in the Scorecard node?
The trade-off plots display the approval rate and the bad rate against cut-off scores: they show how the approval rate and the bad rate among the accepted applicants depend on the cut-off score. A good scorecard lets you choose a cut-off score that combines a relatively high approval rate with a relatively low bad rate.
The Gains table shows you "Average Marginal Profit" and "Average Total Profit" per score bucket, using "Revenue Accepted Good" and "Cost Accepted Bad" (specified by you in the properties). I think the online documentation (accessible from within E-Miner) provides you with all the formulas.
If you don't want to rely on the Scorecard node for choosing your cut-off, you can always consider using the Cutoff node. It will choose the "best" cut-off probability according to the criterion of your choice (you can easily derive which score maps to it), for example the Kolmogorov–Smirnov statistic.
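If you go the KS route, the underlying idea can also be sketched outside E-Miner. A minimal Python illustration (hypothetical inputs, not the node's actual implementation): pick the score at which the cumulative distributions of events and non-events are furthest apart.

```python
# Kolmogorov-Smirnov cut-off: the score maximising the separation
# between the cumulative distributions of the two classes.
# `scores_event` / `scores_nonevent` are assumed lists of scores.
def ks_cutoff(scores_event, scores_nonevent):
    candidates = sorted(set(scores_event) | set(scores_nonevent))
    n_e, n_n = len(scores_event), len(scores_nonevent)
    best_cut, best_ks = None, -1.0
    for c in candidates:
        # Cumulative share of each class at or below score c
        cdf_e = sum(s <= c for s in scores_event) / n_e
        cdf_n = sum(s <= c for s in scores_nonevent) / n_n
        ks = abs(cdf_e - cdf_n)
        if ks > best_ks:
            best_cut, best_ks = c, ks
    return best_cut, best_ks

# Toy example: events score low, non-events score high
cut, ks = ks_cutoff([150, 180, 200, 220], [240, 260, 280, 300])
print(cut, ks)  # 220 1.0 here, since the toy classes separate completely
```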
Also important: Enterprise Miner supports decision processing.
SAS® Enterprise Miner™ 14.1 Extension Nodes: Developer's Guide.
Decision Thresholds and Profit Charts (p. 178)
The final classification of a new applicant in the class of good or the class of bad risks will be based on profit considerations.
Some people choose to optimize the F1-score as the best balance between sensitivity and precision.
The F1-score is the harmonic mean of precision and sensitivity.
In case you want to maximize the F1-score, you can write an optimization to find the best cut-off, or simply run a simulation (let the cut-off vary between a start and a stop value by an increment, and calculate the quality metrics that go with each particular cut-off). Then make a choice.
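That simulation is easy to script. A small Python sketch (toy data and hypothetical helper names, not E-Miner code) that sweeps a grid of cut-offs and keeps the one with the highest F1:

```python
# F1 at a given probability cut-off; F1 = 2*TP / (2*TP + FP + FN)
def f1_at(y_true, p_hat, cut):
    tp = sum(1 for y, p in zip(y_true, p_hat) if y == 1 and p >= cut)
    fp = sum(1 for y, p in zip(y_true, p_hat) if y == 0 and p >= cut)
    fn = sum(1 for y, p in zip(y_true, p_hat) if y == 1 and p < cut)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def best_f1_cutoff(y_true, p_hat, grid_size=20):
    # Try cut-offs 1/20, 2/20, ..., 19/20 and keep the best F1
    cuts = [i / grid_size for i in range(1, grid_size)]
    return max(cuts, key=lambda c: f1_at(y_true, p_hat, c))

# Toy scored sample: three events, five non-events
y = [1, 1, 1, 0, 0, 0, 0, 0]
p = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1, 0.05]
cut = best_f1_cutoff(y, p)
print(cut, round(f1_at(y, p, cut), 3))  # 0.35 0.857
```

The same sweep works on scorecard points instead of probabilities; just replace the probability grid with a range of scores.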
02-09-2016 12:37 AM
Thanks for the help Koen! I will have a look at the different techniques you mentioned and see which one shows the best performance/classification.