blund
Obsidian | Level 7

PROC LOGISTIC: The "Error Rate" reported by FITSTAT does not agree with the "Correct" percentage in the classification table from CTABLE (with PPROB=.5). In the example below, the PROC FREQ output shows that the error rate is 0.333, which matches FITSTAT.

(An observation is correctly classified if P_1 >= 0.5 and Y=1, or if P_1 < 0.5 and Y=0.)

In the classification table, "Correct" is 0.417, which gives an error rate of 0.583.

What rules does the classification table use that lead to a "Correct" value of 0.417?

 

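/* Simulate 12 observations: alternating binary response Y, three binary predictors */
DATA TEST1;
  DO ID = 1 to 12;
    Y = MOD(ID,2);
    X1 = (ranuni(1) < .6);
    X2 = (ranuni(1) < .4);
    X3 = (ranuni(1) < .5);
    OUTPUT;
  END;
run;
/* Fit the model; CTABLE prints the classification table at cutpoint 0.5 */
PROC LOGISTIC DATA = TEST1 desc;
  MODEL Y = X1-X3 / CTABLE PPROB=0.5;
  /* Score the same data; FITSTAT reports the error rate */
  SCORE DATA = TEST1 OUT=SCORED FITSTAT;
run;
/* Cross-tabulate the scored probability of Y=1 against the observed Y */
PROC FREQ DATA = scored;
  TABLES P_1*Y /norow nocol nopercent;
run;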
Accepted Solution
StatDave
SAS Super FREQ

See the "Details: Classification Table" section of the LOGISTIC documentation. As described toward the end of that section, the CTABLE results are based on cross validated ("leave one out") predicted probabilities, not on the probabilities obtained by directly applying the fitted model to the individual observations. Cross validation is used to reduce the optimistic overestimation of model fit that results from evaluating the model on the same data used to fit it. The following code compares the ordinary predicted probabilities with the cross validated predicted probabilities. In the OUT= data set, _FROM_ is the observed response; _INTO_ is the predicted response using the ordinary predicted probabilities; CV_predY is the predicted response using the cross validated predicted probabilities. Note that with the ordinary probabilities the proportion correctly classified is (5+3)/12=.67, and with the cross validated probabilities it is (2+3)/12=.417.

PROC LOGISTIC DATA = TEST1 desc;
  MODEL Y = X1-X3 / CTABLE PPROB=0.5;
  /* X = cross validated probabilities (XP_1), I = ordinary individual probabilities (IP_1) */
  output out=out predprob=(x i);
run;
data out;
  set out;
  CV_predY=(xp_1>=.5); /* classify with the cross validated probability */
run;
PROC FREQ DATA = out;
  TABLES _from_*_into_; *ordinary predicted probabilities;
  TABLES _from_*CV_predY; *cross validated predicted probabilities;
run;
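
As a quick check on the arithmetic, using only the counts quoted above, the two error rates line up with the numbers in the question. The subscripted labels are just shorthand for this sketch, not names from the SAS output:

```latex
\begin{align*}
\text{Correct}_{\text{ordinary}} &= \frac{5+3}{12} \approx 0.667,
  & \text{Error}_{\text{ordinary}} &= 1 - 0.667 = 0.333 \quad (\text{FITSTAT}) \\
\text{Correct}_{\text{CV}} &= \frac{2+3}{12} \approx 0.417,
  & \text{Error}_{\text{CV}} &= 1 - 0.417 = 0.583 \quad (\text{CTABLE})
\end{align*}
```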

 

blund
Obsidian | Level 7

CTABLE with PPROB=(list) uses the leave-one-out probabilities to decide whether an observation is correctly classified. With PROC LOGISTIC there appears to be no way to classify observations using the "normal" probabilities at multiple PPROB= values without some DATA step coding. If the sample is large, the leave-one-out probabilities are essentially the "normal" probabilities, so the issue is moot. HPLOGISTIC supports a "cutpoint" option in the MODEL statement, which lets the Partition fit statistics error rate be based on a single specified cutpoint (and the "normal" probabilities).
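
A minimal sketch of the DATA step coding mentioned above, assuming the TEST1 data from the original post. The data set names ORD and ORD_CUT, the variables cutpoint, pred_y and correct, and the particular list of cutpoints are hypothetical choices for illustration; IP_1 is the ordinary predicted probability of Y=1 written by PREDPROBS=(I).

```sas
/* Write the ordinary (not cross validated) probabilities to ORD */
PROC LOGISTIC DATA = TEST1 desc;
  MODEL Y = X1-X3;
  output out=ord predprob=(i);   /* creates IP_1 and IP_0 */
run;

/* One output row per observation per cutpoint (list is illustrative) */
data ord_cut;
  set ord;
  do cutpoint = 0.3, 0.4, 0.5, 0.6, 0.7;
    pred_y  = (IP_1 >= cutpoint);  /* predicted response at this cutpoint */
    correct = (pred_y = Y);        /* 1 if correctly classified, else 0 */
    output;
  end;
run;

/* The mean of CORRECT within each cutpoint is the proportion correctly classified */
proc means data=ord_cut mean;
  class cutpoint;
  var correct;
run;
```

At cutpoint 0.5 this should reproduce the proportion correct obtained from the ordinary probabilities, not the CTABLE value.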

Thanks to the SAS Community for the rapid response to my post of yesterday.
