High standard errors on estimates of p in logistic regression using ca...

Posted 11-05-2018 08:01 PM

Hello,

I am performing logistic regression in PROC GLIMMIX (binomial distribution, logit link, SAS version 9.4) using categorical predictors and am coming across a strange phenomenon:

If the response for a certain level of a categorical predictor comprises mostly or all zero values (my data are coded as 0s and 1s, and I am modelling the probability of occurrence of 1s), then when I request LSMEANS estimates of the probability of occurrence of 1 (p), the standard error and the width of the 95% CI for the estimate of p for that level are large. If all values for that level are zero, the SE is very large and the 95% CI ranges from 0 to 1. As a result, pairwise post-hoc comparisons made with this level are rarely significant (or never, if all values are 0), even though I might expect them to be.

I find this odd because I thought that standard error for a binomially distributed variable becomes smaller when, for a given n, p approaches 0 (or 1).

I have posted some example data and code for a simple example with one independent categorical variable (Cat1).

For level "RE", all values but one of the response variable "Y1" are 0. For the response variable "Y2", all values for level "RE" are 0. If you run the code below, you can see that the SE for RE is high in the former case, and very high (with the 95% CI of p ranging from 0 to 1) in the latter case.

Note: I am using GLIMMIX because I will ultimately include random factors in this model, but I did a bit of experimenting in PROC GENMOD and PROC LOGISTIC and I think more or less the same thing occurs.

Can anyone explain what causes this phenomenon or, if it is an error in my coding, what I may be doing wrong?

Thanks for any help!

*case where Y1 has only one non-zero value in Cat1="RE";
*the EVENT option states explicitly that the probability of Y1 = 1 is modelled;
PROC GLIMMIX data=Fake_Data2;
  CLASS Cat1;
  MODEL Y1(EVENT='1') = Cat1 / DIST=bin LINK=logit SOLUTION;
  LSMEANS Cat1 / CL ADJUST=tukey ILINK;
RUN;

*case where Y2 has all zero values in Cat1="RE";
*the EVENT option states explicitly that the probability of Y2 = 1 is modelled;
PROC GLIMMIX data=Fake_Data2;
  CLASS Cat1;
  MODEL Y2(EVENT='1') = Cat1 / DIST=bin LINK=logit SOLUTION;
  LSMEANS Cat1 / CL ADJUST=tukey ILINK;
RUN;

1 REPLY 1


@Kojema wrote: (question quoted in full above)

The binomial variance that you mention accounts for only a portion of the variability of the estimate of p. The standard error also depends on how well the Y variables are predicted by the predictor variable, and it could be that one level of Cat1 predicts well while another level of Cat1 predicts poorly. All of these things can inflate the standard errors.
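A rough sketch of the arithmetic behind this, assuming the one-way model from the question with y ones among n observations in a given level of Cat1 (the symbols y, n, and the linear predictor η are introduced here for illustration and are not from the original post). The model is estimated on the logit scale, so the relevant standard error is that of the logit rather than that of the proportion:

```latex
\begin{align*}
\hat p &= \frac{y}{n}, &
\operatorname{SE}(\hat p) &\approx \sqrt{\frac{\hat p\,(1-\hat p)}{n}}
  \quad\text{(shrinks as } \hat p \to 0\text{)},\\[4pt]
\hat\eta &= \log\frac{\hat p}{1-\hat p}, &
\operatorname{SE}(\hat\eta) &\approx \frac{1}{\sqrt{n\,\hat p\,(1-\hat p)}}
  = \sqrt{\frac{1}{y}+\frac{1}{n-y}}
  \quad\text{(grows without bound as } y \to 0\text{)}.
\end{align*}
```

With y = 1 the logit-scale SE is already at least 1, and with y = 0 the maximum likelihood estimate of η drifts toward minus infinity with an essentially unbounded SE (quasi-complete separation). Confidence limits are formed on the logit scale and then back-transformed by ILINK, which would explain limits close to 0 and 1. Pairwise comparisons are also carried out on the logit scale, so a difference involving such a level inherits the inflated SE and will rarely appear significant.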

--

Paige Miller
