High standard errors on estimates of p in logistic regression using categorical predictors

Kojema — Tue, 06 Nov 2018 01:01:20 GMT

Hello,

I am performing logistic regression in PROC GLIMMIX (binomial distribution, logit link, SAS version 9.4) using categorical predictors and am coming across a strange phenomenon:

If the response for a certain level of a categorical predictor comprises mostly or all zero values (my data are coded as 0s and 1s, and I am modelling the probability of occurrence of 1s), then when I request LSMEANS estimates of the probability of occurrence of 1 (p), the standard error and the width of the 95% CI for the estimate of p for that level are high; if all values for that level are zero, the SE is very high and the 95% CI ranges from 0 to 1. As a result, pairwise post-hoc comparisons involving this level are rarely significant (or never, if all values are 0), even though I might expect them to be.

I find this odd because I thought that the standard error of a binomial proportion becomes smaller when, for a given n, p approaches 0 (or 1).

I have posted some example data and code for a simple example with one independent categorical variable (Cat1).

For level "RE", all values but one of the response variable "Y1" are 0. For the response variable "Y2", all values for level "RE" are 0. If you run the code below, you can see that the SE for RE is high in the former case, and very high (with the 95% CI of p ranging from 0 to 1) in the latter case.

Note: I am using GLIMMIX because I will ultimately include random factors in this model, but I did a bit of experimenting in PROC GENMOD and PROC LOGISTIC, and I think more or less the same thing occurs.

Can anyone explain what causes this phenomenon or, if it is an error in my coding, what I may be doing wrong?

Thanks for any help!

*case where Y1 has only one non-zero value in Cat1="RE";

PROC GLIMMIX data=Fake_Data2;
CLASS Cat1 ;
MODEL Y1 = Cat1 / DIST=bin LINK=logit SOLUTION;
LSMEANS Cat1 / CL ADJUST= tukey ilink;
RUN;

*case where Y2 has all zero values in Cat1="RE";

PROC GLIMMIX data=Fake_Data2;
CLASS Cat1 ;
MODEL Y2 = Cat1 / DIST=bin LINK=logit SOLUTION ;
LSMEANS Cat1 / CL ADJUST= tukey ilink;
RUN;

Re: High standard errors on estimates of p in logistic regression using categorical predictors

PaigeMiller — Tue, 06 Nov 2018 12:45:34 GMT

The binomial variance that you mention accounts for only a portion of the variability of the estimate of p. The standard error also depends on how well the Y variable is predicted by the predictor variable, and it could be that one level of Cat1 predicts well while another level of Cat1 predicts poorly. All of these things can inflate the standard errors.
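
One further point that may help, offered as a rough sketch rather than a full diagnosis: with LINK=logit, the LSMEANS estimates and their confidence limits are computed on the logit scale, and ILINK only transforms them back to the probability scale afterwards. For a single group with y ones out of n observations, the usual Wald standard error of the logit is sqrt(1/y + 1/(n-y)), which grows as y approaches 0 even though sqrt(p(1-p)/n) shrinks. When y = 0 the logit estimate is infinite (quasi-complete separation), which is consistent with the huge SE and the 0-to-1 interval seen for Y2. The data step below illustrates this with hypothetical counts (n = 20 per level and the level names A, B, C are assumptions, not values from Fake_Data2), and it uses 1.96 rather than the t quantile and Tukey adjustment that GLIMMIX would apply.

```sas
/* Hypothetical counts: n observations per level, y of them equal to 1 */
data logit_se_sketch;
   input level $ n y;
   p    = y / n;                     /* observed proportion of 1s        */
   se_p = sqrt(p * (1 - p) / n);     /* SE on the probability scale      */
   if 0 < y < n then do;
      logit    = log(p / (1 - p));             /* estimate on logit scale */
      se_logit = sqrt(1 / y + 1 / (n - y));    /* Wald SE on logit scale  */
      /* approximate 95% limits on the logit scale, back-transformed to p */
      lower_p  = 1 / (1 + exp(-(logit - 1.96 * se_logit)));
      upper_p  = 1 / (1 + exp(-(logit + 1.96 * se_logit)));
   end;
   /* y = 0 or y = n: the logit is infinite, so these variables stay missing */
   datalines;
A 20 10
B 20 1
C 20 0
;
run;

proc print data=logit_se_sketch noobs;
run;
```

In this sketch the logit-scale SE is about 0.45 for level A and about 1.03 for level B, while the probability-scale SE goes the other way (about 0.11 versus about 0.05). Comparisons between levels are made on the logit scale, so a level with almost no 1s contributes a large SE to every difference it is part of.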