niam
Quartz | Level 8

I have a small data set with only 20 observations.

N = number of trials

Y = number of successes

Here is the data set:

data temp;
input Y N;
cards;
97870 270000
12890 55000
120071 313000
43446 150000
1405102 1903000
125402 254000
79192 109000
14087 29000
10714 9000
983775 1587000
316543 654000
8592 29000
76061 130000
217492 501000
132423 354000
29163 127000
57013 161000
82747 192000
101778 344000
44258 77000
;
run;

If I create a bunch of totally irrelevant random variables, such that:

data temp;
set temp;
x1=rand('BETA',3,0.1);
x2=rand('CAUCHY');
x3=rand('CHISQ',22);
x4=rand('ERLANG',7);
x5=rand('EXPO');
x6=rand('F',12,322);
run;

And then run a LOGISTIC regression such as:

proc logistic data=temp;
model y/n=x1 x2 x3 x4 x5 x6;
run;

All of the independent variables are statistically significant at the p < 0.0001 level, despite the fact that they are all random and none of them should logically be significant! I tried it with many other variables, but it is almost impossible to get an INSIGNIFICANT result from PROC LOGISTIC with the events/trials syntax!

Do you know why this is happening?


4 REPLIES
SteveDenham
Jade | Level 19

You have an overpowered analysis.  With over 7 million trials, the Wald standard error is going to be very small, and as a result, the Wald chi-square very large.
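
As a rough back-of-envelope check (my own sketch, not from the original thread; the totals are summed from the posted Y and N columns):

data _null_;
p = 3958619/7248000;   /* pooled event rate from the posted data, about 0.546 */
n = 7248000;           /* total number of trials across all 20 observations */
se = sqrt(p*(1-p)/n);  /* Wald standard error of the pooled proportion */
put se=;               /* prints roughly 0.000185 */
run;

With a standard error that tiny, even a trivially small coefficient produces an enormous Wald chi-square, so everything looks "significant."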

Try the following:

data temp;
call streaminit(123);
set temp;
x1=1-rand('BETA',3,0.1);
x2=rand('CAUCHY');
x3=rand('CHISQ',22);
x4=rand('ERLANG',7);
x5=rand('EXPO');
x6=rand('F',12,322);
trial=_n_;
run;

proc glimmix data=temp method=laplace;
class trial;
model y/n=x1 x2 x3 x4 x5/solution chisquare;
random intercept/subject=trial;
run;

You will see that none of the Type III tests, using the chi-square values, are significant.  Be careful about just summing across trials.

Steve Denham

niam
Quartz | Level 8

Thank you very much for the great help;

  • Would you please elaborate on the random intercept option of the GLIMMIX procedure?  Are you assuming that each of the observations has unobserved characteristics, and that you would like to adjust for this by including a random intercept?
  • How about dividing both the number of trials and events by, say, 100,000?  This would keep the proportions the same but reduce the total number of trials.  Would it solve the problem with PROC LOGISTIC?

Thanks

SteveDenham
Jade | Level 19

Q1: I am assuming that each observation is a random sample from a universe of possible observations.  Ordinarily, I would just state:

random trial;

but in order to use method=laplace, a subject must be specified, which in this case means rewriting it as

random intercept/subject=trial;

In GLIMMIX, these two statements are equivalent, so nothing additional is implied about the intercept.

Q2: Dividing by 100,000 would reduce the total number of trials to about 72, so instead I divided by 10,000 and used the FLOOR function to get integer values.  The p-values I obtained were x1 (0.0689), x2 (0.0233), x3 (0.0372), x4 (0.4644), and x5 (0.1134).  This strongly implies that the very small p-values obtained in the first analysis are due to overpowering the analysis, as LOGISTIC "consolidates" all values when calculating standard errors.
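
A minimal sketch of the rescaling Steve describes (the data set name temp10k and the variable names y10k/n10k are my own; the exact code he ran is not shown in the thread):

data temp10k;
set temp;
y10k = floor(y/10000);  /* scale events down by 10,000 and force integers */
n10k = floor(n/10000);  /* scale trials the same way */
run;

proc logistic data=temp10k;
model y10k/n10k = x1 x2 x3 x4 x5;
run;

This keeps the event proportions roughly the same while cutting the total number of trials to about 720, which is why the p-values move back toward non-significance.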

Steve Denham

niam
Quartz | Level 8

Thank you very much Steve.
