niam
Quartz | Level 8

I have a small data set with only 20 observations.

N= number of trials

Y= number of successes

Here is the data set:

data temp;
    input Y N;
    cards;
97870     270000
12890     55000
120071    313000
43446     150000
1405102   1903000
125402    254000
79192     109000
14087     29000
10714     9000
983775    1587000
316543    654000
8592      29000
76061     130000
217492    501000
132423    354000
29163     127000
57013     161000
82747     192000
101778    344000
44258     77000
;
run;

If I create a bunch of totally irrelevant random variables, like this:

data temp;
    set temp;
    x1 = rand('BETA', 3, 0.1);
    x2 = rand('CAUCHY');
    x3 = rand('CHISQ', 22);
    x4 = rand('ERLANG', 7);
    x5 = rand('EXPO');
    x6 = rand('F', 12, 322);
run;

And then run a LOGISTIC regression such as:

proc logistic data=temp;
    model y/n = x1 x2 x3 x4 x5 x6;
run;

all of the independent variables are statistically significant at the p < 0.0001 level, despite the fact that they are all random and none of them should logically be significant! I tried it with many other variables, but it is almost impossible to get an INSIGNIFICANT result from PROC LOGISTIC with the events/trials syntax!

Do you know why this is happening?


4 REPLIES
SteveDenham
Jade | Level 19

You have an overpowered analysis. With over 7 million trials, the Wald standard error is going to be very small, and as a result, the Wald chi-square very large.
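As a rough sketch of why that happens (standard Wald-test algebra, not something quoted from the thread): the Wald statistic divides the estimate by its standard error, and for binomial data the standard error shrinks with the total information,

\[
z = \frac{\hat\beta}{\widehat{SE}(\hat\beta)}, \qquad \widehat{SE}(\hat\beta) \;\propto\; \frac{1}{\sqrt{\sum_i n_i \, \hat p_i (1 - \hat p_i)}},
\]

so with roughly 7.2 million total trials here, even a coefficient that is essentially zero gets divided by a minuscule standard error, and the Wald chi-square blows up.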

Try the following:

data temp;
    call streaminit(123);          /* seed the RNG so results are reproducible */
    set temp;
    x1 = 1 - rand('BETA', 3, 0.1);
    x2 = rand('CAUCHY');
    x3 = rand('CHISQ', 22);
    x4 = rand('ERLANG', 7);
    x5 = rand('EXPO');
    x6 = rand('F', 12, 322);
    trial = _n_;                   /* row index, used as SUBJECT= below */
run;

proc glimmix data=temp method=laplace;
    class trial;
    model y/n = x1 x2 x3 x4 x5 / solution chisq;
    random intercept / subject=trial;
run;

You will see that none of the Type III tests, using the chi-square values, are significant. Be careful about just summing across trials, since that treats every one of the millions of underlying trials as independent and discards the observation-to-observation variability.

Steve Denham

niam
Quartz | Level 8

Thank you very much for the great help!

  • Would you please elaborate on the random intercept option of the GLIMMIX procedure? Are you assuming that each observation has unobserved characteristics, and adjusting for this by including a random intercept?
  • How about dividing both the number of trials and the number of events by, say, 100,000? This keeps the proportions the same but reduces the total number of trials. Would that solve the problem with PROC LOGISTIC?

Thanks

SteveDenham
Jade | Level 19

Q1: I am assuming that each observation is a random sample from a universe of possible observations.  Ordinarily, I would just state:

random trial;

but in order to use METHOD=LAPLACE, a subject must be specified, which in this case means rewriting it as

random intercept/subject=trial;

In GLIMMIX, the two statements are equivalent; putting the random effect on the intercept is just a syntactic requirement of METHOD=LAPLACE, not an extra assumption about the intercept.

Q2: Dividing by 100,000 would reduce the total number of trials to about 72, so instead I divided by 10,000 and used the FLOOR function to get integer values. The p-values I obtained were x1 (0.0689), x2 (0.0233), x3 (0.0372), x4 (0.4644), and x5 (0.1134). This strongly implies that the very small p-values obtained in the first analysis are due to overpowering the analysis, as LOGISTIC "consolidates" all of the trials when calculating standard errors.
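A minimal sketch of that rescaling, in case it helps (the dataset and variable names follow the earlier steps in this thread; Steve's exact code isn't shown, so treat this as an illustration rather than his actual program):

data temp_scaled;              /* hypothetical name for the rescaled copy */
    set temp;                  /* temp already holds x1-x6 from the step above */
    y = floor(y / 10000);      /* shrink events to integer counts */
    n = floor(n / 10000);      /* shrink trials the same way, keeping proportions */
run;

proc logistic data=temp_scaled;
    model y/n = x1 x2 x3 x4 x5;
run;

With the trial totals reduced from millions to hundreds, the standard errors are no longer artificially tiny, which is why the p-values above move away from zero.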

Steve Denham

niam
Quartz | Level 8

Thank you very much Steve.
