Kojema
Fluorite | Level 6

Hello,

I am performing logistic regression in PROC GLIMMIX (binomial distribution, logit link, SAS version 9.4) using categorical predictors and am coming across a strange phenomenon:

If the response for a certain level of a categorical predictor comprises mostly or all zero values (my data are coded as 0s and 1s, and I am modelling the probability of occurrence of 1s), then when I request LS-mean estimates of the probability of occurrence of 1 (p), the standard error and the width of the 95% CI for the estimate of p for that level are large; if all values for that level are zero, the SE is very large and the 95% CI ranges from 0 to 1. As a result, pairwise post-hoc comparisons made with this level are rarely significant (or never, if all values are 0), even though I might expect them to be.

I find this odd because I thought that standard error for a binomially distributed variable becomes smaller when, for a given n, p approaches 0 (or 1).

 

I have posted some example data and code for a simple example with one independent categorical variable (Cat1).

For level "RE", all values but one of the response variable "Y1" are 0. For the response variable "Y2", all values for level "RE" are 0. If you run the code below, you can see that the SE for RE is high in the former case, and very high (with the 95% CI of p ranging from 0 to 1) in the latter case.

 

Note: I am using GLIMMIX because I will ultimately include random factors in this model, but I did a bit of experimenting in PROC GENMOD and PROC LOGISTIC, and I think more or less the same thing occurs.

Can anyone explain what causes this phenomenon or, if it is an error in my coding, what I may be doing wrong?

 

Thanks for any help! 

 

*case where Y1 has only one non-zero value in Cat1="RE";
PROC GLIMMIX data=Fake_Data2;
  CLASS Cat1;
  MODEL Y1 = Cat1 / DIST=bin LINK=logit SOLUTION;
  LSMEANS Cat1 / CL ADJUST=tukey ILINK;
RUN;

*case where Y2 has all zero values in Cat1="RE";
PROC GLIMMIX data=Fake_Data2;
  CLASS Cat1;
  MODEL Y2 = Cat1 / DIST=bin LINK=logit SOLUTION;
  LSMEANS Cat1 / CL ADJUST=tukey ILINK;
RUN;

1 REPLY 1
PaigeMiller
Diamond | Level 26

The binomial variance that you mention accounts for only a portion of the variability of the estimate of p. That variability also depends on how well the Y variables are predicted by the predictor variable, and it could be that one level of Cat1 predicts well while another level predicts poorly. All of these things can inflate the standard errors.
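
One way to see the inflation concretely is the sketch below. It assumes that, in this one-factor model, the LS-mean for a level is simply the logit of that level's observed proportion, written here as \hat p = y/n (y successes out of n observations; this notation is introduced for illustration and is not from the original post). The confidence limits and pairwise tests are built on the logit scale and only back-transformed by ILINK, so the relevant standard error is that of the logit \hat\eta, not of \hat p itself. The delta method gives:

```latex
\begin{align*}
\operatorname{Var}(\hat p) &= \frac{p(1-p)}{n}, \qquad
\eta = \log\frac{p}{1-p}, \qquad
\frac{d\eta}{dp} = \frac{1}{p(1-p)} \\[4pt]
\operatorname{Var}(\hat\eta) &\approx \left(\frac{d\eta}{dp}\right)^{2}\operatorname{Var}(\hat p)
  = \frac{1}{n\,p(1-p)} \\[4pt]
\widehat{\operatorname{SE}}(\hat\eta) &= \sqrt{\frac{1}{n\,\hat p(1-\hat p)}}
  = \sqrt{\frac{1}{y} + \frac{1}{n-y}}
\end{align*}
```

So while the standard error of \hat p shrinks as p approaches 0, the standard error of the logit grows. With a single 1 in a level (y = 1) it is already at least 1 on the logit scale, and as y goes to 0 the 1/y term is unbounded, which is consistent with back-transformed limits that run from 0 to 1.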

--
Paige Miller
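
For the fixed-effects version of the model, one commonly used option when a level has all (or nearly all) zeros is Firth's penalized likelihood in PROC LOGISTIC. The following is only a minimal sketch under stated assumptions: the data set Fake_Demo and its level names 'AA' and 'BB' are hypothetical stand-ins (the original Fake_Data2 is not reproduced in this thread), and this does not carry over to the random-effects model planned in GLIMMIX.

```sas
/* Hypothetical stand-in for Fake_Data2: three levels of Cat1, 20 rows each. */
/* Level 'RE' has all zeros for Y2 and a single 1 for Y1.                    */
data Fake_Demo;
  length Cat1 $2;
  do Cat1 = 'AA', 'BB', 'RE';
    do i = 1 to 20;
      if Cat1 = 'AA' then Y2 = (mod(i, 2) = 0);       /* 10 ones out of 20 */
      else if Cat1 = 'BB' then Y2 = (mod(i, 3) = 0);  /* 6 ones out of 20  */
      else Y2 = 0;                                    /* RE: all zeros     */
      Y1 = Y2;
      if Cat1 = 'RE' and i = 1 then Y1 = 1;           /* RE: one 1 for Y1  */
      output;
    end;
  end;
  drop i;
run;

/* Firth's penalized likelihood keeps the RE estimate finite;               */
/* CLPARM=PL requests profile-likelihood limits instead of Wald limits.     */
proc logistic data=Fake_Demo;
  class Cat1 (ref='RE') / param=ref;
  model Y2(event='1') = Cat1 / firth clparm=pl;
run;
```

With 'RE' as the reference level, the two Cat1 parameters are the log-odds differences of the other levels against RE, so their profile-likelihood limits speak directly to the comparisons of interest.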
