## Please look at a simple data and a tricky question

Frequent Contributor
Posts: 131

I used the following LOGISTIC procedure to predict the variable group, but I cannot understand the output. Both age='4' and age='X' are reported with DF=0 and an estimate of 0. While age='X' is understandably the reference group, the result for age='4' is questionable. I checked the data with PROC FREQ and found no such problem as quasi-complete separation, and age and sex are not perfectly related. The only problem may come from the fact that:

If sex is gay (sex='G'), then age is always unknown (age='X').

Apart from this situation, there is no relationship between sex and age.
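A cross-tabulation makes this kind of dependency easy to spot. The sketch below assumes the dataset and variable names from the post (`work.data`, `sex`, `age`); the `MISSING` option is my own addition so that missing values show up as a level instead of being dropped. If the sex='G' row has counts only in the age='X' column, and the age='X' column has counts only in the sex='G' row, then the two indicator variables are identical and one of them is redundant in the model.

```sas
/* Cross-tabulate the two CLASS variables to look for levels that
   always occur together (names taken from the original post). */
proc freq data=work.data;
    /* Counts only; MISSING (an assumption) keeps missing values visible */
    tables sex*age / norow nocol nopercent missing;
run;
```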

I tried again after merging age='4' into age='3' (so that the levels of age are 1, 2, 3 and X) and reran the model. This time, the level age='3' has the same problem: DF=0, Estimate=0.

    PROC LOGISTIC DATA=work.data DESC;
        CLASS sex age;
        MODEL group=sex age / AGGREGATE SCALE=NONE; /*To model group=0 or 1*/
    RUN;

Analysis of Maximum Likelihood Estimates

                                    Standard         Wald
    Parameter       DF   Estimate      Error   Chi-Square   Pr > ChiSq

    Intercept        1    -8.5254     0.0925    8502.1673       <.0001
    sex       F      1     0.0686     0.0760       0.8144       0.3668
    sex       G      1     0.1554     0.1115       1.9435       0.1633
    sex       M      0          0          .            .            .
    age       1      1     0.2709     0.1690       2.5692       0.1090
    age       2      1     0.0275     0.1461       0.0354       0.8507
    age       3      1    -0.2961     0.1285       5.3097       0.0212
    age       4      0          0          .            .            .
    age       X      0          0          .            .            .

PROC Star
Posts: 7,664

While I'm not versed in logistic regression, I would presume that it follows the same logic as linear regression.

I found it interesting that you had 3 levels of sex .. what is in the 3rd level?  Is it male, female and missing?

Regardless, if logistic regression follows the same rules as linear regression then I would expect that you can only have degrees of freedom for n-1 levels.  Thus, if you do really have three levels of sex, you can only obtain parameters for 2 of them.

Similarly, if you only have 4 levels of age, you can only obtain parameters for 3 of them, as that would already account for all of the variance.

Frequent Contributor
Posts: 131

Age has 5 levels: 1, 2, 3, 4, and X. So I expect 4 estimates (the 5th is 0). But now there are only 3.

Super User
Posts: 20,735

Art is correct. I don't know why you even have estimates with DF=0 in your output; with options similar to the code you provided, SAS does not give me results like that.

Also, Sex=Gay? makes me question your entire model, but for now I'll assume what you're doing makes sense. I'd assume sexuality would be a separate variable, gay vs straight.

You may also want to specify your reference levels because having a reference level of Age='X' or unknown also makes no sense to me.
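As a rough sketch of what specifying reference levels could look like, using the variable names from the original post: the choice of 'M' and '1' as reference levels is only an assumption for illustration, so pick whatever baseline makes sense for your data.

```sas
/* Sketch: same model as in the original post, with explicit
   reference levels. ref='M' and ref='1' are illustrative choices. */
proc logistic data=work.data descending;
    class sex(ref='M') age(ref='1') / param=ref;
    model group = sex age / aggregate scale=none;
run;
```

Note that this only changes which level serves as the baseline. If two levels are redundant with each other, I would still expect one parameter to be reported with DF=0.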

Frequent Contributor
Posts: 131

Hi Reeza,

It is correct to have DF=0 for a category of a variable in logistic regression. This indicates that this category is the base level (reference group), and other levels are contrasting with this level.

I understand there is definitely a problem here. I need a statistician.

Super User
Posts: 20,735

I understand what DF=0 means, but the output you're displaying isn't the default output for the code provided. We could be using different SAS versions, I suppose. V9.2.3 here.

If you have access to a statistician at your site who can view the data then I'd definitely recommend doing that.

As a statistician, I'd be interested in hearing what the issue is, in case I ever run into it.

Frequent Contributor
Posts: 131

I just did an experiment. It showed that one level of a variable cannot be estimated if that level is perfectly correlated with a level of another variable. Consider the data below:

    data work.data;
    infile datalines;
    input y x1 x2 @@;
    datalines;
    0 1 2 0 1 2 0 1 1 0 2 1 0 2 1 0 2 1 0 2 1 0 2 1 0 2 1 0 2 1
    0 2 2 0 2 1 0 2 2 0 2 2 0 2 1 0 2 1 0 2 1 0 2 1 0 2 1 0 2 2
    0 2 2 0 2 2 0 2 2 0 3 3 0 3 3 0 3 2 0 3 2 0 3 1 0 3 1 0 3 1
    1 1 1 1 1 1 1 3 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
    1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 2 1 1 2 1 1 2 1
    1 3 3 1 3 2 1 3 3 1 3 2 1 3 2 1 3 3 0 4 4 0 4 4 1 4 4 1 4 4
    0 3 2 0 1 2 0 1 1 0 2 1 0 2 1 0 2 1 0 2 1 0 2 1 0 2 1 0 2 1
    0 2 2 0 2 1 0 2 2 0 2 2 0 2 1 0 2 1 0 2 1 0 2 1 0 2 1 0 2 2
    0 2 2 0 2 2 0 2 2 0 3 3 0 3 3 0 3 2 0 3 2 0 3 1 0 3 1 0 3 1
    1 3 1 1 3 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
    1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 2 1 1 2 1 1 2 1
    1 3 3 1 3 2 1 3 3 1 3 2 1 3 1 1 3 3 0 4 4 0 4 4 1 4 4 1 4 4
    run;

You can see that whenever x1=4, x2 also equals 4.

Let's run some logistic regression:

    proc logistic data=work.data;
        class x1 x2 / param=glm;
        model y=x1 x2 / aggregate scale=none;
    run;

Results:

                                    Standard         Wald
    Parameter       DF   Estimate      Error   Chi-Square   Pr > ChiSq

    Intercept        1   4.82E-15     0.7071       0.0000       1.0000
    x1        1      1    -1.6953     1.1763       2.0770       0.1495
    x1        2      1     1.4020     1.1598       1.4612       0.2267
    x1        3      1    -0.4055     0.9574       0.1793       0.6719
    x1        4      0          0          .            .            .
    x2        1      1    -0.0799     0.8535       0.0088       0.9254
    x2        2      1     1.3451     0.8710       2.3847       0.1225
    x2        3      0          0          .            .            .
    x2        4      0          0          .            .            .

As you can see, x2=4 is the reference group, so it has DF=0. But x2=3 also has DF=0, which is the problem. It arises because x2=4 is perfectly correlated with x1=4: the two levels always occur together. In effect, x2=3 acts as the reference group, and the effect of x2=4 is already explained by x1=4 in the other variable.
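The redundancy can be written out with indicator variables. Here $I(\cdot)$ is 1 when the condition holds and 0 otherwise (my notation, not SAS output), and I am assuming the GLM parameterization zeroes out any column that is a linear combination of the columns before it:

$$
\begin{aligned}
I(x_2{=}3) &= 1 - I(x_2{=}1) - I(x_2{=}2) - I(x_2{=}4),\\
I(x_2{=}4) &= I(x_1{=}4) = 1 - I(x_1{=}1) - I(x_1{=}2) - I(x_1{=}3),\\
\Rightarrow\; I(x_2{=}3) &= I(x_1{=}1) + I(x_1{=}2) + I(x_1{=}3) - I(x_2{=}1) - I(x_2{=}2).
\end{aligned}
$$

So the x2=3 column is already determined by columns that entered the model earlier, which would explain why it gets DF=0 alongside x2=4.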

Let's run a linear regression model using PROC GLM:

    proc glm data=work.data;
        class x1 x2;
        model y=x1 x2 / solution ss3;
    run;
    quit;

The results, shown below, follow the same pattern: there is no estimate for x2=3.

                                        Standard
    Parameter            Estimate          Error   t Value  Pr > |t|

    Intercept        0.5000000000 B   0.14821413      3.37    0.0010
    x1        1      0.3879750199 B   0.23428608      1.66    0.1005
    x1        2      -.2299495084 B   0.22891720     -1.00    0.3173
    x1        3      0.1000000000 B   0.19885012      0.50    0.6160
    x1        4      0.0000000000 B            .         .         .
    x2        1      -.0381211799 B   0.16953676     -0.22    0.8225
    x2        2      -.2618788201 B   0.16953676     -1.54    0.1252
    x2        3      0.0000000000 B            .         .         .
    x2        4      0.0000000000 B            .         .         .

If you change the data so that x1=4 and x2=4 are no longer perfectly related, the problem disappears.
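One way to try this, as a sketch only: append a single made-up observation in which x1=4 occurs with a different level of x2, then refit. The dataset name `work.data2` and the added record (y=0, x1=4, x2=3) are hypothetical and exist purely to break the one-to-one link between x1=4 and x2=4.

```sas
/* Copy the toy data and append one hypothetical observation
   (y=0, x1=4, x2=3) so that x1=4 no longer implies x2=4. */
data work.data2;
    set work.data end=last;
    output;                      /* keep every original record */
    if last then do;
        y = 0; x1 = 4; x2 = 3;   /* made-up record, for illustration only */
        output;
    end;
run;

/* Refit the same model on the modified data */
proc logistic data=work.data2;
    class x1 x2 / param=glm;
    model y = x1 x2 / aggregate scale=none;
run;
```

With the link broken, x2=3 should get its own estimate and only the reference levels should keep DF=0.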

I think the explanation is that the effect of a level can be estimated only if it makes a unique contribution, i.e. it is not perfectly explained by the other variables.

Posts: 2,116

Your data are overparameterized.  See my response in