Proc Logistics gone CRAZY!

Frequent Contributor
Posts: 84

I have a small data set with only 20 observations.

N = number of trials

Y = number of successes

Here is the data set:

data temp;
   input Y N;
   cards;
97870      270000
12890      55000
120071     313000
43446      150000
1405102    1903000
125402     254000
79192      109000
14087      29000
10714      9000
983775     1587000
316543     654000
8592       29000
76061      130000
217492     501000
132423     354000
29163      127000
57013      161000
82747      192000
101778     344000
44258      77000
;
run;

If I create a bunch of totally irrelevant random variables like this:

data temp;
   set temp;
   /* six predictors that are pure random noise */
   x1 = rand('BETA',3,0.1);
   x2 = rand('CAUCHY');
   x3 = rand('CHISQ',22);
   x4 = rand('ERLANG',7);
   x5 = rand('EXPO');
   x6 = rand('F',12,322);
run;

And then run a LOGISTIC regression such as:

proc logistic data=temp;
   model y/n = x1 x2 x3 x4 x5 x6;
run;

all of the independent variables come out statistically significant at the p < 0.0001 level, despite the fact that they are all random and none of them should logically be significant! I have tried it with many other variables, but it is almost impossible to get an INSIGNIFICANT result from PROC LOGISTIC with the events/trials syntax!

Do you know why this is happening?


All Replies
Respected Advisor
Posts: 2,655

Re: Proc Logistics gone CRAZY!

You have an overpowered analysis. With over 7 million trials, the Wald standard error is going to be very small, and as a result the Wald chi-square very large.
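
To see the scale effect in numbers, here is a small illustrative sketch (not part of the original replies; the proportions 0.36 and 0.30 are arbitrary). With the event proportion held fixed, the approximate standard error sqrt(p*(1-p)/N) collapses as N grows, so even a modest difference yields a huge Wald statistic:

data se_demo;
   /* Illustrative only: arbitrary proportions, not taken from the thread */
   p0 = 0.30;                        /* reference value                  */
   p  = 0.36;                        /* observed proportion, held fixed  */
   do n = 1000, 100000, 7000000;     /* small, large, and "7 million"    */
      se     = sqrt(p*(1 - p)/n);    /* approximate SE of the proportion */
      wald_z = (p - p0)/se;          /* Wald z for the same 6-point gap  */
      output;
   end;
   keep n se wald_z;
run;

proc print data=se_demo noobs;
run;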

Try the following:

data temp;
   call streaminit(123);            /* reproducible random numbers */
   set temp;
   x1 = 1 - rand('BETA',3,0.1);
   x2 = rand('CAUCHY');
   x3 = rand('CHISQ',22);
   x4 = rand('ERLANG',7);
   x5 = rand('EXPO');
   x6 = rand('F',12,322);
   trial = _n_;                     /* identifier for each observation */
run;

proc glimmix data=temp method=laplace;
   class trial;
   model y/n = x1 x2 x3 x4 x5 / solution chisq;
   random intercept / subject=trial;   /* observation-level random effect */
run;

You will see that none of the Type III tests, using the chi-square values, are significant. Be careful about just summing across trials.

Steve Denham

Frequent Contributor
Posts: 84

Re: Proc Logistics gone CRAZY!

Thank you very much for the great help.

  • Would you please elaborate on the random intercept option of the GLIMMIX procedure? Are you assuming that each of the observations has unobserved characteristics, and adjusting for this by including a random intercept?
  • How about dividing both the number of trials and events by, say, 100,000? This keeps the proportions the same but reduces the total number of trials. Would it solve the problem with PROC LOGISTIC?

Thanks

Solution
06-19-2014 09:44 AM
Respected Advisor
Posts: 2,655

Re: Proc Logistics gone CRAZY!

Q1: I am assuming that each observation is a random sample from a universe of possible observations.  Ordinarily, I would just state:

random trial;

but in order to use method=laplace, a subject must be specified, which in this case means rewriting it as:

random intercept/subject=trial;

In GLIMMIX, these two specifications are equivalent, and nothing extra is implied about the intercept.
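
For reference, a minimal sketch showing the two specifications side by side (the CLASS variable and model options follow the code in the earlier reply; this is just a restatement, not additional output from the thread):

/* Default pseudo-likelihood estimation: the plain RANDOM effect is fine */
proc glimmix data=temp;
   class trial;
   model y/n = x1 x2 x3 x4 x5 / solution chisq;
   random trial;
run;

/* METHOD=LAPLACE needs subject processing, hence the SUBJECT= form */
proc glimmix data=temp method=laplace;
   class trial;
   model y/n = x1 x2 x3 x4 x5 / solution chisq;
   random intercept / subject=trial;
run;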

Q2: Dividing by 100,000 would reduce the total number of trials to about 72, so instead I divided by 10,000 and used the FLOOR function to give integer values. The p-values I obtained were x1 (0.0689), x2 (0.0233), x3 (0.0372), x4 (0.4644), and x5 (0.1134). This strongly implies that the very small p-values obtained in the first analysis are due to overpowering the analysis, as LOGISTIC "consolidates" all values when calculating standard errors.
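
A sketch of that rescaling step (the 10,000 divisor and FLOOR function are as described above; the data set name temp_scaled and the choice to refit with PROC LOGISTIC on the five reported predictors are assumptions for illustration):

/* Scale events and trials down by 10,000, keeping integer counts */
data temp_scaled;
   set temp;
   y_s = floor(y/10000);   /* scaled number of events */
   n_s = floor(n/10000);   /* scaled number of trials */
run;

proc logistic data=temp_scaled;
   model y_s/n_s = x1 x2 x3 x4 x5;
run;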

Steve Denham

Frequent Contributor
Posts: 84

Re: Proc Logistics gone CRAZY!

Thank you very much, Steve.
