I am currently building a logistic regression model as a prediction model, and I need to perform internal validation to assess how well the model performs.
During this process, I am stuck on the misclassification error. In the SCORE statement of PROC LOGISTIC, one can request the computation of the misclassification error by adding the FITSTAT option to the SCORE statement.
I took a closer look at how the misclassification error is computed. As the SAS Help shows, the formula of the misclassification rate is

   misclassification rate = (number of misclassified observations) / (total number of observations).

Simply speaking, the proportion of observations that were misclassified is reported as the misclassification rate. The SAS Help also states that an observation is classified into the level with the largest posterior probability. So SAS effectively uses 0.5 as the default cut-off when the dependent variable is binary: if the posterior probability of "success" for a given observation is larger than 0.5, the posterior probability of "failure" must be less than 0.5, and the observation is therefore classified as "success".
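To make the computation concrete, here is a minimal sketch of the misclassification rate under the default 0.5 cut-off. It is in Python rather than SAS, purely for illustration, and the posterior probabilities and outcomes are made-up values, not output from any real model:

```python
# Illustrative only: misclassification rate with the default 0.5 cut-off.
# The posterior probabilities and true labels below are made-up example data.
p_success = [0.9, 0.7, 0.4, 0.2, 0.6]   # posterior P(success) per observation
y_true    = [1,   0,   0,   1,   1]     # observed outcomes (1 = success)

# Classify each observation into the level with the largest probability,
# which for a binary outcome is equivalent to a 0.5 cut-off.
y_pred = [1 if p > 0.5 else 0 for p in p_success]

misclassified = sum(yp != yt for yp, yt in zip(y_pred, y_true))
misclassification_rate = misclassified / len(y_true)
print(misclassification_rate)  # 2 of 5 observations misclassified -> 0.4
```

Here two observations end up on the wrong side of the cut-off, giving a rate of 2/5 = 0.4.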
Of course, 0.5 is not always the "best" cut-off, in the sense of the cut-off that maximizes the Youden index (J = sensitivity + specificity - 1). However, I have read a few papers on validating prediction models built with logistic regression, and a posterior probability of 0.5 has indeed been used as the misclassification cut-off in internal validation. Gong's work is an example: in that article, the author compared the ability of the bootstrap, the jackknife, and cross-validation to correct bias, and 0.5 was set as the misclassification cut-off.
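For contrast, choosing a Youden-optimal cut-off means maximizing J = sensitivity + specificity - 1 over candidate cut-offs. A minimal sketch of that selection (illustrative Python, made-up probabilities and outcomes, not any author's actual method):

```python
# Illustrative only: pick the cut-off that maximizes the Youden index
# J = sensitivity + specificity - 1. Data below are made-up example values.
p_success = [0.9, 0.85, 0.8, 0.45, 0.4, 0.2]
y_true    = [1,   0,    1,   1,    0,   0]

def youden(cutoff, probs, labels):
    """Youden index J when classifying 'success' if P(success) > cutoff."""
    pred = [1 if p > cutoff else 0 for p in probs]
    tp = sum(1 for yp, yt in zip(pred, labels) if yp == 1 and yt == 1)
    tn = sum(1 for yp, yt in zip(pred, labels) if yp == 0 and yt == 0)
    fp = sum(1 for yp, yt in zip(pred, labels) if yp == 1 and yt == 0)
    fn = sum(1 for yp, yt in zip(pred, labels) if yp == 0 and yt == 1)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity + specificity - 1

# Use the observed probabilities themselves as candidate cut-offs and keep
# the one with the largest J (ties broken toward the smallest cut-off).
best = max(sorted(p_success), key=lambda c: youden(c, p_success, y_true))
print(best, youden(best, p_success, y_true))
print(0.5, youden(0.5, p_success, y_true))  # the fixed 0.5 cut-off does worse here
```

In this toy data the Youden-optimal cut-off is 0.4, with a larger J than the fixed 0.5 cut-off, which is exactly why the choice of cut-off matters.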
So here is my question: in the setting of validating a logistic prediction model, where multiple models (usually more than 100) are trained and tested via the bootstrap, the jackknife, or cross-validation, is a posterior probability of 0.5 an acknowledged, universal cut-off for the misclassification error? Or should the cut-off vary from model to model, with the posterior probability that maximizes the Youden index used as the cut-off?
Many thanks!
@Season wrote:
So here is my question: in the setting of validating a logistic prediction model, where multiple models (usually more than 100) are trained and tested via the bootstrap, the jackknife, or cross-validation, is a posterior probability of 0.5 an acknowledged, universal cut-off for the misclassification error? Or should the cut-off vary from model to model, with the posterior probability that maximizes the Youden index used as the cut-off?
I would think the latter. But I will give it a second thought.
Anyway, you can deviate from the 0.5 cut-off by using the PPROB= suboption of the CTABLE option:
proc logistic data=train;
   model target = w h a / ctable
                          pprob=(0.3, 0.5 to 0.8 by 0.1);
   score data=valid out=score;
run;

proc tabulate data=score;
   class f_target i_target;
   table f_target, i_target;
run;
/* end of program */
Cheers,
Koen
Thank you, Koen, for your reply! It seems that this problem is ubiquitous in resampling, where multiple samples are created, yet I have not found any research addressing it. I previously consulted a statistician at my institution, who responded that the misclassification error rates obtained in both manners can be reported simultaneously.
SAS® Enterprise Miner: Cutoff Node
https://go.documentation.sas.com/doc/en/emref/15.2/n1qmjdusj37md5n1as50qvl0tram.htm
Tip: Use the Cutoff Node in SAS® Enterprise Miner™ to Consume the Posterior Probabilities of Your Models Efficiently
Tip: How to build a scorecard using Credit Scoring for SAS® Enterprise Miner™
Paper 127-2012
Use of Cutoff and SAS Code Nodes in SAS® Enterprise Miner™ to Determine Appropriate Probability Cutoff Point for Decision Making with Binary Target Models
Yogen Shah, Oklahoma State University, Stillwater, OK
https://support.sas.com/resources/papers/proceedings12/127-2012.pdf
BR,
Koen
Wow! 😀 Thank you so much, Koen, for your wonderful reply! I never expected to receive a solution to that problem! I will study the references you provided in depth.
Thank you again for bearing my question in mind for such a long time!