Solved: Proc logistic rocoptions: problem with contingency table

Elisa97 · Posted 11-14-2023 08:54 AM

Hi, I would like to find a threshold using a ROC curve.

I have a quantitative variable var1 and a qualitative variable resp_q.

data tab;
input var1 resp_q$13.;
datalines;
5 Non-répondeur
7 Non-répondeur
7 Répondeur
7 Répondeur
8 Non-répondeur
8 Répondeur
8 Non-répondeur
8 Répondeur
10 Non-répondeur
11 Répondeur
11 Non-répondeur
12 Non-répondeur
13 Répondeur
13 Non-répondeur
13 Répondeur
13 Non-répondeur
14 Répondeur
14 Non-répondeur
16 Non-répondeur
16 Non-répondeur
18 Non-répondeur
;
run;

This is the distribution:

proc freq data=tab; table resp_q*var1; run;

Then I do this:

proc logistic data=tab rocoptions(optimal=youden);
model resp_q(event='Répondeur')=var1 / outroc=roc_var1 ;
run;

And this is the output ROC_VAR1:

I take the row where optyouden=1, which is the 3rd row from the bottom, so I conclude that the threshold is var1=8.

If I calculate my contingency table with the threshold 8, I don't get the same counts as in the output.

My contingency table:

	Répondeur	Non-répondeur	Total
T+	6	11	17
T-	2	2	4
Total	8	13	21

The output ROC_VAR1:

	Répondeur	Non-répondeur	Total
T+	8	10	18
T-	0	3	3
Total	8	13	21

I don't understand why...

Have you encountered this problem before?

Thank you.

ballardw · Posted 11-14-2023 10:42 AM

You do not show us anything related to how you generated

I take the row where optyouden=1. This is the 3rd row from the bottom. So, the value corresponding to the threshold is var1=8.

If I calculate my contingency table with the threshold 8, I don't get the same counts as in the output.

My contingency table:

	Répondeur	Non-répondeur	Total
T+	6	11	17
T-	2	2	4
Total	8	13	21

The output ROC_VAR1:

	Répondeur	Non-répondeur	Total
T+	8	10	18
T-	0	3	3
Total	8	13	21

So it is pretty hard to say why/why not.
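
For reference, one way to show how an OUTROC= row relates to a VAR1 value is to save the fitted probabilities with the OUTPUT statement and list one row per distinct VAR1. This is only a minimal sketch, assuming the data set TAB from the question; PRED_TAB, VAR1_PROB, and P_HAT are illustrative names that do not appear in the thread.

```sas
/* Save the fitted event probability for every observation */
proc logistic data=tab;
  model resp_q(event='Répondeur') = var1;
  output out=pred_tab predicted=p_hat;   /* P_HAT is a hypothetical name */
run;

/* Keep one row per distinct VAR1 value */
proc sort data=pred_tab out=var1_prob(keep=var1 p_hat) nodupkey;
  by var1;
run;

proc print data=var1_prob;
run;
```

Matching the _PROB_ value of the chosen OUTROC= row against P_HAT should then identify the corresponding VAR1 value, assuming the OUTROC= cut-points are the fitted probabilities. The listing also shows whether P_HAT rises or falls as VAR1 increases.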

StatDave · Posted 11-14-2023 11:44 AM

The value of your predictor (VAR1) that corresponds to the optimal threshold can be displayed by specifying VAR1 in the ID statement and using ID=ID in ROCOPTIONS:

proc logistic data=tab rocoptions(optimal=youden id=id);
id var1;
model resp_q(event='Répondeur')=var1 / outroc=roc_var1 ;
run;

If you do that, you will see in the ROC plot that the optimal threshold corresponds to VAR1=14. You can use that to create a variable of predicted response levels and produce the 2x2 predicted-by-actual table:

data x; set tab; pred=(var1<=14); run;
proc freq data=x; table pred*resp_q; run;

The resulting table agrees with the _POS_, _NEG_, _FALPOS_, and _FALNEG_ values in the OUTROC= table.
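
As a cross-check that does not depend on how the optimal row is flagged, Youden's index can be recomputed from the OUTROC= columns, since J = sensitivity + specificity - 1 = _SENSIT_ - _1MSPEC_. A minimal sketch, assuming ROC_VAR1 was created by the PROC LOGISTIC step above; ROC_YOUDEN and YOUDEN are illustrative names that do not appear in the thread.

```sas
/* Youden's J = sensitivity - (1 - specificity) for each cut-point */
data roc_youden;
  set roc_var1;
  youden = _SENSIT_ - _1MSPEC_;   /* YOUDEN is a hypothetical name */
run;

/* Put the cut-point with the largest J first */
proc sort data=roc_youden;
  by descending youden;
run;

/* Print the best row with its classification counts */
proc print data=roc_youden(obs=1);
  var _PROB_ _POS_ _NEG_ _FALPOS_ _FALNEG_ _SENSIT_ _1MSPEC_ youden;
run;
```

For the first data set, the OUTROC= counts quoted in the question (8, 10, 0, 3) give a sensitivity of 8/8 and a specificity of 3/13, so the printed row should show a YOUDEN value of about 0.23.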

Elisa97 · Posted 11-15-2023 02:33 AM

Thank you for your helpful reply.

I did the same thing with another example:

data tab2;
input var1 resp_q$13.;
datalines;
3 Non-répondeur
3 Non-répondeur
3 Non-répondeur
3 Répondeur
4 Non-répondeur
4 Non-répondeur
4 Répondeur
5 Non-répondeur
5 Non-répondeur
5 Répondeur
6 Répondeur
6 Répondeur
7 Non-répondeur
7 Non-répondeur
7 Non-répondeur
7 Non-répondeur
7 Non-répondeur
8 Non-répondeur
8 Répondeur
9 Répondeur
11 Répondeur
;
run;

proc logistic data=tab2 rocoptions(optimal=youden id=id);
id var1;
model resp_q(event='Répondeur')=var1 / outroc=roc_var1 ;
run;

ROC curve of proc logistic:

The threshold of maximum Youden's index is 8.

Output ROC_VAR1:

If I take the row with the maximum Youden's index: _POS_=3, _NEG_=12, _FALPOS_=1 and _FALNEG_=5.

Then I do this to verify:

data x; set tab2; pred=(var1<=8); run;
proc freq data=x; table pred*resp_q; run;

The resulting table doesn't agree with the _POS_, _NEG_, _FALPOS_, and _FALNEG_ values in the OUTROC= data set ROC_VAR1.

But if I use ">=8" instead of "<=8", the table agrees:

data x2; set tab2; pred=(var1>=8); run;
proc freq data=x2; table pred*resp_q; run;

Why do I have to use "<=" in the first example and ">=" in the second?

Thank you.

StatDave · Posted 11-15-2023 10:20 AM

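That is because the parameter estimate on VAR1 is positive in this example and negative in the previous one. With a positive estimate, the predicted probability of the event increases with VAR1, so observations at or above the threshold are the predicted events; with a negative estimate, it is the observations at or below the threshold.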
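
The direction can also be derived from the fitted model instead of being chosen by hand. A minimal sketch, assuming the ODS table ParameterEstimates from PROC LOGISTIC with its Variable and Estimate columns; PE, SLOPE, CUT, and X_DIR are illustrative names that do not appear in the thread, and the cut-point 8 is the value read from the ROC plot above.

```sas
/* Capture the parameter estimates of the fitted model */
ods output ParameterEstimates=pe;
proc logistic data=tab2;
  model resp_q(event='Répondeur') = var1;
run;

/* Store the VAR1 estimate in a macro variable (SLOPE is a hypothetical name) */
data _null_;
  set pe;
  where upcase(Variable) = 'VAR1';
  call symputx('slope', Estimate);
run;

%let cut = 8;  /* threshold read from the ROC plot */

/* Positive slope: high VAR1 predicts the event; negative slope: low VAR1 does */
data x_dir;
  set tab2;
  if &slope > 0 then pred = (var1 >= &cut);
  else pred = (var1 <= &cut);
run;

proc freq data=x_dir;
  table pred*resp_q;
run;
```

The same steps applied to TAB with the cut-point 14 should take the "<=" branch, because the estimate is negative there.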

Elisa97 · Posted 11-22-2023 09:50 AM

Thank you very much!
