I have a small data set with only 20 observations, where:
N = number of trials
Y = number of successes
Here is the data set:
data temp;
input Y N;
cards;
97870 270000
12890 55000
120071 313000
43446 150000
1405102 1903000
125402 254000
79192 109000
14087 29000
10714 9000
983775 1587000
316543 654000
8592 29000
76061 130000
217492 501000
132423 354000
29163 127000
57013 161000
82747 192000
101778 344000
44258 77000
;
run;
If I create a bunch of totally irrelevant random variables like so:
data temp;
set temp;
x1=rand('BETA',3,0.1);
x2=rand('CAUCHY');
x3=rand('CHISQ',22);
x4=rand('ERLANG', 7);
x5=rand('EXPO');
x6=rand('F',12,322);
run;
And then run a LOGISTIC regression such as:
proc logistic data=temp;
model y/n=x1 x2 x3 x4 x5 x6;
run;
all of the independent variables are statistically significant at the p<0.0001 level, despite the fact that they are all random and none of them should logically be significant! I have tried it with many other variables, but it is almost impossible to get an INSIGNIFICANT result from PROC LOGISTIC with the events/trials syntax!
Do you know why this is happening?
You have an overpowered analysis. With over 7 million trials, the Wald standard error is going to be very small, and as a result the Wald chi-square very large.
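To put a number on that, here is a back-of-the-envelope sketch (my own illustration, not part of the reply, assuming a success probability near 0.5) of how the Wald standard error of a proportion shrinks like 1/sqrt(n):
data wald_demo;
p=0.5; /* illustrative success probability */
do n=20, 20000, 7248000; /* 7,248,000 is the grand total of N in the data above */
se=sqrt(p*(1-p)/n); /* Wald standard error of the estimated proportion */
output;
end;
run;
proc print data=wald_demo noobs;
run;
At the full 7.2 million trials the standard error comes out around 0.0002, so even a trivially small effect sits many standard errors away from zero.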
Try the following:
data temp;
call streaminit(123); /* seed the generator so the random draws are reproducible */
set temp;
x1=1-rand('BETA',3,0.1);
x2=rand('CAUCHY');
x3=rand('CHISQ',22);
x4=rand('ERLANG', 7);
x5=rand('EXPO');
x6=rand('F',12,322);
trial=_n_; /* row number, used below as the subject for the random intercept */
run;
proc glimmix data=temp method=laplace;
class trial;
model y/n=x1 x2 x3 x4 x5/solution chisquare;
random intercept/subject=trial;
run;
You will see that none of the Type III tests, based on the chi-square values, is significant. Be careful about just summing across trials.
Steve Denham
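To illustrate the caution about summing across trials (again my own sketch, not part of the reply above): pooling all 20 rows gives a naive binomial standard error of roughly 0.0002, while the row-to-row proportions spread far more widely than that; this extra variability is exactly what the random intercept absorbs.
proc sql;
select sum(y)/sum(n) as p_pooled format=8.4,
sqrt((sum(y)/sum(n))*(1-sum(y)/sum(n))/sum(n)) as se_pooled format=10.6
from temp;
quit;
data props;
set temp;
p_row=y/n; /* per-row success proportion */
run;
proc means data=props n min max std;
var p_row;
run;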
Thank you very much for the great help!
Q1: I am assuming that each observation is a random sample from a universe of possible observations. Ordinarily, I would just state:
random trial;
but in order to use method=laplace, a subject must be specified, which in this case means rewriting as
random intercept/subject=trial;
In GLIMMIX these two forms are equivalent, so writing it with INTERCEPT does not change what is being modeled.
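Spelled out, the two equivalent specifications look like this (a sketch; only the subject= form appears in the reply above):
proc glimmix data=temp;
class trial;
model y/n=x1 x2 x3 x4 x5/solution chisquare;
random trial; /* simple G-side random effect, fine with the default method */
run;
proc glimmix data=temp method=laplace;
class trial;
model y/n=x1 x2 x3 x4 x5/solution chisquare;
random intercept/subject=trial; /* equivalent, but subject= is what method=laplace requires */
run;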
Q2: Dividing by 100,000 would reduce the total number of trials to about 72, so instead I divided by 10,000 and used the FLOOR function to keep integer values. The p-values I obtained were x1 (0.0689), x2 (0.0233), x3 (0.0372), x4 (0.4644), and x5 (0.1134). This strongly implies that the very small p-values obtained in the first analysis are due to overpowering the analysis, as LOGISTIC "consolidates" all values when calculating standard errors.
Steve Denham
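Since the exact code behind Q2 was not posted, a reconstruction of the rescaling described might look like the following (y_s and n_s are my own names, and whether LOGISTIC or GLIMMIX was rerun is not stated; LOGISTIC is shown):
data temp_scaled;
set temp;
y_s=floor(y/10000); /* scaled event count, floored to an integer */
n_s=floor(n/10000); /* scaled trial count */
run;
proc logistic data=temp_scaled;
model y_s/n_s=x1 x2 x3 x4 x5;
run;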
Thank you very much, Steve.