Please look at a simple data set and a tricky question

bncoxuk · Posted 08-17-2011 03:21 PM

I used the following LOGISTIC procedure to predict the variable group, but I cannot understand the output. It shows that both age='4' and age='X' have DF=0 and an estimate of 0. While age='X' is understandably the reference group, the DF=0 for age='4' is puzzling. I checked the data using PROC FREQ and found no such problem as quasi-complete separation. Age and sex are not perfectly related. The only problem may come from the fact that:

If sex is gay (sex='G'), then age is always unknown (age='X').

Apart from this situation, there is no relationship between sex and age.

I tried again by combining age='4' into the age='3' group (i.e. the levels of age are 1, 2, 3 and X), and ran the model again. This time, the level age='3' has the same problem, with DF=0 and Estimate=0.

PROC LOGISTIC DATA=work.data DESC;

CLASS sex age;

MODEL group=sex age /AGGREGATE SCALE=NONE; /*To model group=0 or 1*/

RUN;

Analysis of Maximum Likelihood Estimates

                             Standard         Wald
Parameter    DF   Estimate      Error   Chi-Square   Pr > ChiSq
Intercept     1    -8.5254     0.0925    8502.1673       <.0001
sex     F     1     0.0686     0.0760       0.8144       0.3668
sex     G     1     0.1554     0.1115       1.9435       0.1633
sex     M     0          0          .            .            .
age     1     1     0.2709     0.1690       2.5692       0.1090
age     2     1     0.0275     0.1461       0.0354       0.8507
age     3     1    -0.2961     0.1285       5.3097       0.0212
age     4     0          0          .            .            .
age     X     0          0          .            .            .

art297 · Posted 08-17-2011 05:10 PM

While I'm not versed in logistic regression, I would presume that it follows the same logic as linear regression.

I found it interesting that you had 3 levels of sex .. what is in the 3rd level? Is it male, female and missing?

Regardless, if logistic regression follows the same rules as linear regression then I would expect that you can only have degrees of freedom for n-1 levels. Thus, if you do really have three levels of sex, you can only obtain parameters for 2 of them.

Similarly, if you only have 4 levels of age, you can only obtain parameters for 3 of them, as that would already account for all of the variance.

bncoxuk · Posted 08-17-2011 05:20 PM

Hi art297, thanks for your reply.

Age has 5 levels: 1, 2, 3, 4, and X. So I expected 4 estimates (the 5th being 0). But now there are only 3.

Reeza · Posted 08-17-2011 05:19 PM

Art is correct. I don't know why you even have estimates with DF=0 in your output; with options similar to the code you provided, SAS does not give me results like that...

Also, Sex=Gay? makes me question your entire model, but for now I'll assume what you're doing makes sense. I'd assume sexuality would be a separate variable, gay vs straight.

You may also want to specify your reference levels because having a reference level of Age='X' or unknown also makes no sense to me.
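As a rough sketch of what specifying reference levels could look like (the choice of 'M' and '1' as baselines is only an assumption for illustration; use whatever is meaningful for your data):

```sas
/* Sketch only: the REF= values 'M' and '1' are illustrative assumptions. */
proc logistic data=work.data descending;
  class sex(ref='M') age(ref='1') / param=ref;
  model group = sex age;
run;
```

With reference coding the baseline level should not get a row of its own, so any row that still shows DF=0 would point to a redundant parameter rather than to the baseline.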

bncoxuk · Posted 08-17-2011 05:22 PM

Hi Reeza,

It is correct to have DF=0 for a category of a variable in logistic regression. This indicates that this category is the base level (reference group), and other levels are contrasting with this level.

I understand there is definitely a problem here. I need a statistician.

Reeza · Posted 08-17-2011 05:33 PM

I understand what DF=0 means, but the output you're displaying isn't the default output for the code provided. We could be using different SAS versions, I suppose. V9.2.3 here.

If you have access to a statistician at your site who can view the data then I'd definitely recommend doing that.

I'd be interested in hearing what the issue is, though, in case I ever run into it as a statistician.

bncoxuk · Posted 08-17-2011 06:53 PM

I just did an experiment. It showed that one level of a variable cannot be estimated if that level is perfectly correlated with a level of another variable. Consider the data below:

data work.data;

infile datalines;

input y x1 x2 @@;

datalines;

0 1 2 0 1 2 0 1 1 0 2 1 0 2 1 0 2 1 0 2 1 0 2 1 0 2 1 0 2 1

0 2 2 0 2 1 0 2 2 0 2 2 0 2 1 0 2 1 0 2 1 0 2 1 0 2 1 0 2 2

0 2 2 0 2 2 0 2 2 0 3 3 0 3 3 0 3 2 0 3 2 0 3 1 0 3 1 0 3 1

1 1 1 1 1 1 1 3 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 2 1 1 2 1 1 2 1

1 3 3 1 3 2 1 3 3 1 3 2 1 3 2 1 3 3 0 4 4 0 4 4 1 4 4 1 4 4

0 3 2 0 1 2 0 1 1 0 2 1 0 2 1 0 2 1 0 2 1 0 2 1 0 2 1 0 2 1

0 2 2 0 2 1 0 2 2 0 2 2 0 2 1 0 2 1 0 2 1 0 2 1 0 2 1 0 2 2

0 2 2 0 2 2 0 2 2 0 3 3 0 3 3 0 3 2 0 3 2 0 3 1 0 3 1 0 3 1

1 3 1 1 3 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 2 1 1 2 1 1 2 1

1 3 3 1 3 2 1 3 3 1 3 2 1 3 1 1 3 3 0 4 4 0 4 4 1 4 4 1 4 4

run;

You can see that whenever x1=4, x2 also equals 4.
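One way to confirm this overlap directly, sketched here with the x1 and x2 variables from the data step above, is a plain cross-tabulation:

```sas
/* Cross-tabulate the two CLASS variables. A row and a column whose only
   non-empty cell is their intersection (here x1=4 with x2=4) mark a pair
   of levels that always occur together. */
proc freq data=work.data;
  tables x1*x2 / norow nocol nopercent;
run;
```

The NOROW, NOCOL and NOPERCENT options only suppress percentages so that the raw cell counts are easier to scan.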

Let's run some logistic regression:

proc logistic data=work.data;

class x1 x2 /param=glm;

model y=x1 x2 /aggregate scale=none;

run;

Results:

                             Standard         Wald
Parameter    DF   Estimate      Error   Chi-Square   Pr > ChiSq
Intercept     1   4.82E-15     0.7071       0.0000       1.0000
x1      1     1    -1.6953     1.1763       2.0770       0.1495
x1      2     1     1.4020     1.1598       1.4612       0.2267
x1      3     1    -0.4055     0.9574       0.1793       0.6719
x1      4     0          0          .            .            .
x2      1     1    -0.0799     0.8535       0.0088       0.9254
x2      2     1     1.3451     0.8710       2.3847       0.1225
x2      3     0          0          .            .            .
x2      4     0          0          .            .            .

You can see in the results that x2=4 is the reference group, so it has DF=0. But x2=3 also has DF=0. Problem! This is because x2=4 is 100% correlated with x1=4. You could say that in this case x2=3 acts as the reference group, and the effect of x2=4 is already accounted for by x1=4 in the other variable.

Let's run a linear regression model using PROC GLM:

proc glm data=work.data;

class x1 x2;

model y=x1 x2/solution ss3;

run;

quit;

The results, shown below, have the same pattern. There is no estimate for x2=3.

                                    Standard
Parameter        Estimate              Error   t Value   Pr > |t|
Intercept    0.5000000000 B       0.14821413      3.37     0.0010
x1      1    0.3879750199 B       0.23428608      1.66     0.1005
x1      2    -.2299495084 B       0.22891720     -1.00     0.3173
x1      3    0.1000000000 B       0.19885012      0.50     0.6160
x1      4    0.0000000000 B                .         .          .
x2      1    -.0381211799 B       0.16953676     -0.22     0.8225
x2      2    -.2618788201 B       0.16953676     -1.54     0.1252
x2      3    0.0000000000 B                .         .          .
x2      4    0.0000000000 B                .         .          .

If you change the data so that x1=4 and x2=4 are not perfectly related, the problem disappears.
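A minimal sketch of one such change, assuming a single made-up record with x1=4 and x2=3 is appended (both the record and the data set name work.data2 are hypothetical and not part of the original data):

```sas
/* Hypothetical: copy work.data and append one invented record that
   pairs x1=4 with x2=3, so x1=4 and x2=4 no longer always coincide. */
data work.data2;
  set work.data end=last;
  output;                      /* keep every original record */
  if last then do;
    y = 0; x1 = 4; x2 = 3;     /* invented values, for illustration only */
    output;
  end;
run;

proc logistic data=work.data2;
  class x1 x2 / param=glm;
  model y = x1 x2 / aggregate scale=none;
run;
```

With that record present, only the last level of each variable should be left with DF=0.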

I think the explanation is that the effect of a level can only be estimated if it makes a unique contribution, i.e. it is not perfectly explained by other variables.
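For anyone who wants to look at this more formally, PROC GLM can print the general form of the estimable functions through the E option on the MODEL statement. A sketch, reusing the same model as above:

```sas
/* E requests the general form of estimable functions; parameters that
   cannot be separated from one another share coefficients there. */
proc glm data=work.data;
  class x1 x2;
  model y = x1 x2 / solution e;
run;
quit;
```

This is only meant as a diagnostic aid; it does not change the fit or the parameter estimates.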

Doc_Duke · Posted 08-17-2011 11:20 PM

Your data are overparameterized. See my response in

http://communities.sas.com/thread/30685?tstart=0

Doc Muhlbaier

Duke