Solved: Re: PROC LOGISTIC: Error Rate from FITSTAT disagrees with Incorrect Pe...

blund · Posted 07-29-2023 09:17 PM

In PROC LOGISTIC, the "Error Rate" from FITSTAT does not agree with the "Correct" percent from the CTABLE (with PPROB=.5). In the example below, the PROC FREQ output shows that the error rate is 0.333, which matches the value given by FITSTAT.

(An OBS is correctly classified if P_1 >= 0.5 and Y=1 OR P_1 < 0.5 and Y=0.)

From the classification table, "Correct" is 0.417, which gives an error rate of 0.583.

What are the rules for the Classification Table that lead to a "Correct" of 0.417?

DATA TEST1;
  DO ID = 1 to 12;
    Y = MOD(ID,2);
    X1 = (ranuni(1) < .6);
    X2 = (ranuni(1) < .4);
    X3 = (ranuni(1) < .5);
    OUTPUT;
  END;
run;

PROC LOGISTIC DATA = TEST1 desc;
  MODEL Y = X1-X3 / CTABLE PPROB=0.5;
  SCORE DATA = TEST1 OUT=SCORED FITSTAT;
run;

PROC FREQ DATA = scored;
  TABLES P_1*Y /norow nocol nopercent;
run;
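
The classification rule stated in the question (P_1 >= 0.5 predicts Y=1) can also be applied row by row to the SCORED data set. The following is a minimal sketch, assuming the SCORE statement has produced SCORED with the variables P_1 and Y as in the post. The data set name CHECK and the variables PRED and MISS are illustrative names, not part of the original code.

```sas
/* Sketch: apply the 0.5 cutoff to the ordinary predicted probabilities. */
data check;
  set scored;
  pred = (P_1 >= 0.5);   /* predicted response from the fitted model    */
  miss = (pred ne Y);    /* 1 if the observation is misclassified       */
run;

/* The mean of MISS is the proportion misclassified. */
proc means data=check n mean;
  var miss;
run;
```

If the rule in the question is what FITSTAT uses, the mean of MISS should agree with the reported "Error Rate".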

StatDave · Posted 07-29-2023 10:52 PM

See the "Details: Classification Table" section of the LOGISTIC documentation. As described toward the end of that section, the CTABLE results are based on cross validated ("leave one out") predicted probabilities, not on the probabilities obtained by directly applying the fitted model to the individual observations. Cross validation is used to reduce the optimistic overestimation of model fit that results from evaluating the model on the same data used to fit it.

The following code compares the ordinary predicted probabilities with the cross validated predicted probabilities. In the OUT= data set, _FROM_ is the observed response; _INTO_ is the predicted response using the ordinary predicted probabilities; CV_predY is the predicted response using the cross validated predicted probabilities. Note that with the ordinary probabilities the proportion correctly classified is (5+3)/12=.667, and with the cross validated probabilities it is (2+3)/12=.417.

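PROC LOGISTIC DATA = TEST1 desc;
MODEL Y = X1-X3 / CTABLE PPROB=0.5;
* x = cross validated probabilities (XP_0, XP_1), i = individual probabilities (IP_0, IP_1);
output out=out predprobs=(x i);
run;
data out; set out;
* classify with the cross validated probability of Y=1;
CV_predY=(xp_1>=.5);
run;
PROC FREQ DATA = out;
TABLES _from_*_into_; *ordinary predicted probabilities;
TABLES _from_*CV_predY; *cross validated predicted probabilities;
run;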
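
The two proportions quoted above can also be computed directly instead of being read off the PROC FREQ tables. This is a minimal sketch, assuming the OUT data set created above (with _FROM_, _INTO_, Y and CV_predY). The data set name ACC and the variables ORD_CORRECT and CV_CORRECT are illustrative names.

```sas
/* Sketch: flag correct classifications under each set of probabilities. */
data acc;
  set out;
  ord_correct = (_from_ = _into_);  /* ordinary predicted probabilities  */
  cv_correct  = (Y = CV_predY);     /* cross validated probabilities     */
run;

/* The means are the proportions correctly classified. */
proc means data=acc n mean;
  var ord_correct cv_correct;
run;
```

The mean of CV_CORRECT is the quantity to compare with the "Correct" column of the CTABLE output at PPROB=0.5.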

blund · Posted 07-30-2023 02:51 PM

The CTABLE option with PPROB=(list) uses the leave-one-out probabilities to decide whether an observation is correctly classified. With PROC LOGISTIC there appears to be no way to classify observations using the "normal" probabilities at multiple PPROB cutpoints without some DATA step coding. If the sample is large, the leave-one-out probabilities are essentially the same as the "normal" probabilities, so the issue is moot. HPLOGISTIC supports a "cutpoint" option in the MODEL statement, which lets the Partition fit statistics error rate be based on a single specified cutpoint (and "normal" probabilities).
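
The DATA step coding mentioned above could look something like the following. This is only a sketch, assuming the TEST1 data set from the original post. The data set names PRED and CTAB, the variable names PHAT, CUT and CORRECT, and the choice of cutpoints 0.3 to 0.7 are all illustrative.

```sas
/* Sketch: ordinary predicted probabilities of Y=1 from the fitted model. */
proc logistic data=test1 desc;
  model y = x1-x3;
  output out=pred p=phat;
run;

/* One row per observation per cutpoint. */
data ctab;
  set pred;
  do i = 3 to 7;                     /* integer loop avoids rounding drift */
    cut = i / 10;
    correct = ((phat >= cut) = y);   /* 1 if classified correctly          */
    output;
  end;
  drop i;
run;

/* Proportion correctly classified at each cutpoint. */
proc means data=ctab n mean;
  class cut;
  var correct;
run;
```

At CUT=0.5 the result should correspond to one minus the FITSTAT error rate, not to the CTABLE "Correct" value, because PHAT is not a leave-one-out probability.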

Thanks to the SAS Community for the rapid response to my post of yesterday.
