I used the following LOGISTIC procedure to predict the variable group, but cannot understand the output. The output shows that both levels age='4' and age='X' have DF=0 and an estimate of 0. While age='X' is understandably the reference group, the age='4' result is questionable. I checked the data using PROC FREQ and found there is no such problem as quasi-complete separation. Age and sex are not perfectly related. The only problem may come from the fact that:
IF sex is gay (sex='G'), then age is definitely unknown (age='X').
Apart from this situation, there is no relationship between sex and age.
I tried again by combining age='4' into the group age='3' (i.e. the levels of age are 1, 2, 3 and X), and ran the model again. This time, the level age='3' has the same problem, with DF=0 and Estimate=0.
PROC LOGISTIC DATA=work.data DESC;
CLASS sex age;
MODEL group=sex age /AGGREGATE SCALE=NONE; /*To model group=0 or 1*/
RUN;
Analysis of Maximum Likelihood Estimates

                            Standard        Wald
Parameter     DF  Estimate     Error  Chi-Square  Pr > ChiSq
Intercept      1   -8.5254    0.0925   8502.1673      <.0001
sex        F   1    0.0686    0.0760      0.8144      0.3668
sex        G   1    0.1554    0.1115      1.9435      0.1633
sex        M   0         0         .           .           .
age        1   1    0.2709    0.1690      2.5692      0.1090
age        2   1    0.0275    0.1461      0.0354      0.8507
age        3   1   -0.2961    0.1285      5.3097      0.0212
age        4   0         0         .           .           .
age        X   0         0         .           .           .
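One way to check the relationship described above is a cross-tabulation of the two predictors. This is only a sketch that reuses the dataset and variable names from the code above. If the sex='G' row and the age='X' column turn out to have all of their counts in one shared cell, the two indicator columns are identical and one of the age parameters cannot be estimated.

```sas
proc freq data=work.data;
    /* Cross-tabulate the two CLASS variables; show cell counts only */
    tables sex*age / norow nocol nopercent;
run;
```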
While I'm not versed in logistic regression, I would presume that it follows the same logic as linear regression.
I found it interesting that you had 3 levels of sex .. what is in the 3rd level? Is it male, female and missing?
Regardless, if logistic regression follows the same rules as linear regression then I would expect that you can only have degrees of freedom for n-1 levels. Thus, if you do really have three levels of sex, you can only obtain parameters for 2 of them.
Similarly, if you only have 4 levels of age, you can only obtain parameters for 3 of them, as that would already account for all of the variance.
Hi art297, thanks for your reply.
Age has 5 levels: 1, 2, 3, 4, and X. So I expect 4 estimates (the 5th is 0). But now there are only 3.
Art is correct. I don't know why you even have estimates with DF=0 in your output; with options similar to the code you provided, SAS does not give me results like that...
Also, Sex=Gay? makes me question your entire model, but for now I'll assume what you're doing makes sense. I'd assume sexuality would be a separate variable, gay vs straight.
You may also want to specify your reference levels because having a reference level of Age='X' or unknown also makes no sense to me.
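Reference levels can be set explicitly in the CLASS statement. A minimal sketch, assuming 'M' and '1' are the baselines of interest (that choice is an illustration, not something stated in the original post) and reusing the variable names from the question:

```sas
proc logistic data=work.data descending;
    /* ref= picks the baseline level; param=ref requests reference-cell coding */
    class sex(ref='M') age(ref='1') / param=ref;
    model group = sex age;
run;
```

Choosing a different reference level changes which level serves as the baseline, but it does not remove a linear dependency between predictors if one exists.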
Hi Reeza,
It is correct to have DF=0 for a category of a variable in logistic regression. This indicates that this category is the base level (reference group), and other levels are contrasting with this level.
I understand there is definitely a problem here. I need a statistician.
I understand what DF=0 means, but the output you're displaying isn't the default output for the code provided. We could be using different SAS versions, I suppose. V9.2.3 here.
If you have access to a statistician at your site who can view the data then I'd definitely recommend doing that.
I'd be interested in hearing what the issue is, though, in case I ever run into it as a statistician.
I just did an experiment. It showed that one level of a variable cannot be estimated if this level is perfectly correlated with a level of another variable. Let's look at the data below:
data work.data;
infile datalines;
input y x1 x2 @@;
datalines;
0 1 2 0 1 2 0 1 1 0 2 1 0 2 1 0 2 1 0 2 1 0 2 1 0 2 1 0 2 1
0 2 2 0 2 1 0 2 2 0 2 2 0 2 1 0 2 1 0 2 1 0 2 1 0 2 1 0 2 2
0 2 2 0 2 2 0 2 2 0 3 3 0 3 3 0 3 2 0 3 2 0 3 1 0 3 1 0 3 1
1 1 1 1 1 1 1 3 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 2 1 1 2 1 1 2 1
1 3 3 1 3 2 1 3 3 1 3 2 1 3 2 1 3 3 0 4 4 0 4 4 1 4 4 1 4 4
0 3 2 0 1 2 0 1 1 0 2 1 0 2 1 0 2 1 0 2 1 0 2 1 0 2 1 0 2 1
0 2 2 0 2 1 0 2 2 0 2 2 0 2 1 0 2 1 0 2 1 0 2 1 0 2 1 0 2 2
0 2 2 0 2 2 0 2 2 0 3 3 0 3 3 0 3 2 0 3 2 0 3 1 0 3 1 0 3 1
1 3 1 1 3 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 2 1 1 2 1 1 2 1
1 3 3 1 3 2 1 3 3 1 3 2 1 3 1 1 3 3 0 4 4 0 4 4 1 4 4 1 4 4
run;
You can see that whenever x1=4, x2 also equals 4.
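The overlap can be confirmed with a cross-tabulation. In this sketch, the x1=4 row and the x2=4 column should have all of their counts in the single (4,4) cell, since x2=4 never occurs with any other value of x1 in the data above.

```sas
proc freq data=work.data;
    /* Cell counts only: look for a row and a column that share one cell */
    tables x1*x2 / norow nocol nopercent;
run;
```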
Let's run some logistic regression:
proc logistic data=work.data;
class x1 x2 /param=glm;
model y=x1 x2 /aggregate scale=none;
run;
Results:
                            Standard        Wald
Parameter     DF  Estimate     Error  Chi-Square  Pr > ChiSq
Intercept      1  4.82E-15    0.7071      0.0000      1.0000
x1         1   1   -1.6953    1.1763      2.0770      0.1495
x1         2   1    1.4020    1.1598      1.4612      0.2267
x1         3   1   -0.4055    0.9574      0.1793      0.6719
x1         4   0         0         .           .           .
x2         1   1   -0.0799    0.8535      0.0088      0.9254
x2         2   1    1.3451    0.8710      2.3847      0.1225
x2         3   0         0         .           .           .
x2         4   0         0         .           .           .
You can see the results: x2=4 is the reference group, so it has DF=0. But x2=3 also has DF=0. Problem! This is because x2=4 is 100% correlated with x1=4. You could say that in this case x2=3 acts as the reference group, and the effect of x2=4 has already been absorbed by level x1=4 of the other variable.
Let's run a linear regression model using PROC GLM:
proc glm data=work.data;
class x1 x2;
model y=x1 x2/solution ss3;
run;
quit;
The results, shown below, follow the same pattern: there is no estimate for x2=3.
                                 Standard
Parameter          Estimate         Error  t Value  Pr > |t|
Intercept      0.5000000000 B  0.14821413     3.37    0.0010
x1         1   0.3879750199 B  0.23428608     1.66    0.1005
x1         2   -.2299495084 B  0.22891720    -1.00    0.3173
x1         3   0.1000000000 B  0.19885012     0.50    0.6160
x1         4   0.0000000000 B           .        .         .
x2         1   -.0381211799 B  0.16953676    -0.22    0.8225
x2         2   -.2618788201 B  0.16953676    -1.54    0.1252
x2         3   0.0000000000 B           .        .         .
x2         4   0.0000000000 B           .        .         .
If you change the data so that x1=4 and x2=4 are not perfectly related, then the problem disappears.
I think the explanation is that 'the effect of a level can only be estimated if it makes a unique contribution, i.e. it is not perfectly explained by other variables'.
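The last step of the experiment, breaking the perfect relation between x1=4 and x2=4, could be sketched as follows. The dataset name work.data2, the helper variable changed, and the choice of which observation to recode are all hypothetical; any change that puts at least one x1=4 observation into another x2 level should serve the same purpose.

```sas
data work.data2;
    set work.data;
    retain changed 0;
    /* Hypothetical edit: recode the first x1=4, x2=4 observation to x2=3 */
    if x1 = 4 and x2 = 4 and changed = 0 then do;
        x2 = 3;
        changed = 1;
    end;
    drop changed;
run;

/* Refit the same model on the modified data and compare the DF column */
proc logistic data=work.data2;
    class x1 x2 / param=glm;
    model y = x1 x2 / aggregate scale=none;
run;
```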
Your data are overparameterized. See my response in
http://communities.sas.com/thread/30685?tstart=0
Doc Muhlbaier
Duke