Season
Lapis Lazuli | Level 10

I am currently building a logistic regression model as a prediction model. I need to perform internal validation to test how well the model performs.

During the process, I got stuck on the problem of the misclassification error. In the SCORE statement of PROC LOGISTIC, one can request the computation of the misclassification error by adding the FITSTAT option.
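For reference, this is roughly the kind of step I am running (the data set and variable names below are just placeholders):

proc logistic data=train;
   model target(event='1') = x1 x2 x3;
   /* FITSTAT prints fit statistics for the scored data set,
      including the misclassification rate */
   score data=valid out=scored fitstat;
run;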

I took a closer look at the computation formula of the misclassification error. As SAS Help shows, the misclassification rate is

misclassification rate = (number of misclassified observations) / (total number of observations)

In other words, the proportion of observations that were misclassified is reported as the misclassification rate. SAS Help also states that an observation is classified into the level with the largest posterior probability. This means that, by default, SAS effectively uses 0.5 as the cut-off when the dependent variable follows a binomial distribution: if the posterior probability of "success" for a given observation is larger than 0.5, then the posterior probability of "failure" must be smaller than 0.5, so the observation is classified as "success", in line with the method described in SAS Help.
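To make sure I read the formula correctly, here is how I would recompute the rate by hand from the scored data set (assuming the OUT= data set from the SCORE statement above, where F_target is the observed level and I_target is the level with the largest posterior probability):

/* flag each observation whose observed level differs from the
   assigned level, then average the flags */
data check;
   set scored;
   misclassified = (F_target ne I_target);
run;

proc means data=check mean;
   var misclassified;
run;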

It is easy to see that 0.5 is not always the "best" cut-off in the sense of maximizing the Youden index. However, I have read a few papers on the validation of prediction models built with logistic regression, and a posterior probability of 0.5 has indeed been used as the cut-off for the misclassification error in the internal validation of logistic regression prediction models. Gong's work is an example: in that article, the author compared the ability of the bootstrap, the jackknife, and cross-validation to correct bias, and 0.5 was used as the misclassification cut-off.

So here is my question: in the setting of logistic prediction model validation, where many (usually more than 100) models are trained and tested via the bootstrap, the jackknife, or cross-validation, is a posterior probability of 0.5 an acknowledged and universal cut-off for the misclassification error? Or should the cut-off vary from model to model, with the posterior probability that maximizes the Youden index used as the cut-off?

Many thanks!

sbxkoenk
SAS Super FREQ

@Season wrote:

So here is my question: in the setting of logistic prediction model validation, where many (usually more than 100) models are trained and tested via the bootstrap, the jackknife, or cross-validation, is a posterior probability of 0.5 an acknowledged and universal cut-off for the misclassification error? Or should the cut-off vary from model to model, with the posterior probability that maximizes the Youden index used as the cut-off?


I would think the latter. But I will give it a second thought.
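If you do want a Youden-based cut-off for a given model, one rough sketch (using the OUTROC= data set, in which _PROB_, _SENSIT_ and _1MSPEC_ give each candidate cut-off with its sensitivity and 1 - specificity) could look like this:

proc logistic data=train;
   model target = w h a / outroc=roc;
run;

/* Youden index = sensitivity + specificity - 1 = _SENSIT_ - _1MSPEC_ */
data youden;
   set roc;
   youden = _SENSIT_ - _1MSPEC_;
run;

/* the cut-off with the largest Youden index comes out on top */
proc sort data=youden;
   by descending youden;
run;

proc print data=youden(obs=1);
   var _PROB_ _SENSIT_ _1MSPEC_ youden;
run;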

Anyway, you can deviate from the 0.5 cut-off by using the PPROB= option.

 

proc logistic data=train;
 /* CTABLE requests classification tables; PPROB= lists the cut-off
    probabilities at which the classifications are computed */
 model target = w h a / ctable
                        pprob = (0.3, 0.5 to 0.8 by 0.1);
 /* the OUT= data set contains F_target (observed level) and
    I_target (level with the largest posterior probability) */
 score data=valid out=score;
run;

/* cross-tabulate observed versus classified levels */
proc tabulate data=score;
 class f_target i_target;
 table f_target,i_target;
run;
/* end of program */

Cheers,

Koen

 
Season
Lapis Lazuli | Level 10

Thank you, Koen, for your reply! It seems that this problem is ubiquitous in resampling, where multiple samples are created, yet I have not found any research addressing it. I previously consulted a statistician at my institution, who responded that the misclassification error rates obtained in both manners can be reported simultaneously.

sbxkoenk
SAS Super FREQ

SAS® Enterprise Miner: Cutoff Node

 

  • SAS® Enterprise Miner™ 15.2: Reference Help

Cutoff Node

https://go.documentation.sas.com/doc/en/emref/15.2/n1qmjdusj37md5n1as50qvl0tram.htm

 

  • SAS Communities Library Article

Tip: Use the Cutoff Node in SAS® Enterprise Miner™ to Consume the Posterior Probabilities of Your Models Efficiently

Started 05-14-2014 | Modified 01-06-2016

https://communities.sas.com/t5/SAS-Communities-Library/Tip-Use-the-Cutoff-Node-in-SAS-Enterprise-Min...

 

  • SAS Communities Library Article

Tip: How to build a scorecard using Credit Scoring for SAS® Enterprise Miner™

Started 05-26-2015 | Modified 01-06-2016

https://communities.sas.com/t5/SAS-Communities-Library/Tip-How-to-build-a-scorecard-using-Credit-Sco...

 

  • SAS Global Forum 2012 -- Data Mining and Text Analytics

Paper 127-2012
Use of Cutoff and SAS Code Nodes in SAS® Enterprise Miner™ to Determine Appropriate Probability Cutoff Point for Decision Making with Binary Target Models

Yogen Shah, Oklahoma State University, Stillwater, OK

https://support.sas.com/resources/papers/proceedings12/127-2012.pdf

 

BR,

Koen

Season
Lapis Lazuli | Level 10

Wow! 😀 Thank you so much, Koen, for your wonderful reply! I never expected to receive a solution to that problem! I will study the literature you referenced in depth.

Thank you again for bearing my question in mind for such a long time!
