## The effect of one variable's level cannot be estimated if...

Frequent Contributor
Posts: 131

I just ran an experiment. It shows that the effect of one level of a categorical variable cannot be estimated if that level is perfectly correlated with a level of another variable. Consider the data below:

```sas
/* Each observation is a (y, x1, x2) triple; the trailing @@ reads several triples per line */
data work.data;
  infile datalines;
  input y x1 x2 @@;
  datalines;
0 1 2 0 1 2 0 1 1 0 2 1 0 2 1 0 2 1 0 2 1 0 2 1 0 2 1 0 2 1
0 2 2 0 2 1 0 2 2 0 2 2 0 2 1 0 2 1 0 2 1 0 2 1 0 2 1 0 2 2
0 2 2 0 2 2 0 2 2 0 3 3 0 3 3 0 3 2 0 3 2 0 3 1 0 3 1 0 3 1
1 1 1 1 1 1 1 3 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 2 1 1 2 1 1 2 1
1 3 3 1 3 2 1 3 3 1 3 2 1 3 2 1 3 3 0 4 4 0 4 4 1 4 4 1 4 4
0 3 2 0 1 2 0 1 1 0 2 1 0 2 1 0 2 1 0 2 1 0 2 1 0 2 1 0 2 1
0 2 2 0 2 1 0 2 2 0 2 2 0 2 1 0 2 1 0 2 1 0 2 1 0 2 1 0 2 2
0 2 2 0 2 2 0 2 2 0 3 3 0 3 3 0 3 2 0 3 2 0 3 1 0 3 1 0 3 1
1 3 1 1 3 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 2 1 1 2 1 1 2 1
1 3 3 1 3 2 1 3 3 1 3 2 1 3 1 1 3 3 0 4 4 0 4 4 1 4 4 1 4 4
;
run;
```

You can see that whenever x1=4, x2 also equals 4, and x2=4 never occurs with any other value of x1.
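A quick way to check this claim is a cross-tabulation of the two variables. This is only a minimal sketch; it assumes the data set `work.data` created above and uses standard PROC FREQ options to suppress the percentages:

```sas
/* Cross-tabulate x1 by x2, showing cell counts only */
proc freq data=work.data;
  tables x1*x2 / norow nocol nopercent;
run;
```

If the two levels coincide exactly, the x1=4 row and the x2=4 column should each contain a single nonzero cell, the one where they intersect.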

Let's run a logistic regression:

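```sas
/* GLM (less-than-full-rank) coding: one dummy column per level of each CLASS variable */
proc logistic data=work.data;
  class x1 x2 / param=glm;
  model y = x1 x2 / aggregate scale=none;
run;
```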

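Results:

| Parameter | Level | DF | Estimate | Standard Error | Wald Chi-Square | Pr > ChiSq |
|-----------|-------|----|----------|----------------|-----------------|------------|
| Intercept |       | 1  | 4.82E-15 | 0.7071         | 0.0000          | 1.0000     |
| x1        | 1     | 1  | -1.6953  | 1.1763         | 2.0770          | 0.1495     |
| x1        | 2     | 1  | 1.4020   | 1.1598         | 1.4612          | 0.2267     |
| x1        | 3     | 1  | -0.4055  | 0.9574         | 0.1793          | 0.6719     |
| x1        | 4     | 0  | 0        | .              | .               | .          |
| x2        | 1     | 1  | -0.0799  | 0.8535         | 0.0088          | 0.9254     |
| x2        | 2     | 1  | 1.3451   | 0.8710         | 2.3847          | 0.1225     |
| x2        | 3     | 0  | 0        | .              | .               | .          |
| x2        | 4     | 0  | 0        | .              | .               | .          |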

In the results, x2=4 is the reference level, so DF=0 is expected there. But x2=3 also has DF=0, and that is the problem. It happens because x2=4 coincides exactly with x1=4. In effect, x2=3 ends up acting as the reference level for x2, and the effect of x2=4 cannot be separated from the effect of x1=4.
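One way to see why x2=3 in particular loses its estimate is to write out the indicator (dummy) columns of the design matrix. The sketch below assumes the usual behavior of the GLM parameterization: columns are processed in the order intercept, x1 levels, x2 levels, and any column that is a linear combination of earlier columns gets a zero estimate with DF=0. The symbols $a_j$, $b_k$, and $\mathbf{1}$ are notation introduced here for illustration, not taken from the SAS output.

```latex
% a_j = indicator column for x1 = j, b_k = indicator column for x2 = k,
% \mathbf{1} = intercept column (all ones). Requires amsmath.
\begin{align*}
\text{Constraints:}\quad
  & a_1 + a_2 + a_3 + a_4 = \mathbf{1}, \qquad
    b_1 + b_2 + b_3 + b_4 = \mathbf{1}, \qquad
    a_4 = b_4 \ \text{(from the data)} \\[4pt]
a_4 &= \mathbf{1} - a_1 - a_2 - a_3
      && \text{(redundant: reference level of x1)} \\
b_3 &= \mathbf{1} - b_1 - b_2 - b_4
     = \mathbf{1} - b_1 - b_2 - a_4 \\
    &= a_1 + a_2 + a_3 - b_1 - b_2
      && \text{(redundant: built from earlier columns)} \\
b_4 &= a_4 = \mathbf{1} - a_1 - a_2 - a_3
      && \text{(redundant: reference level of x2)}
\end{align*}
```

Under these assumptions, three of the nine columns are redundant, which leaves six estimable parameters. That matches the six rows with DF=1 in the PROC LOGISTIC output.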

Let's run a linear regression model using PROC GLM:

```sas
/* Same model fitted by least squares; SOLUTION prints the parameter estimates */
proc glm data=work.data;
  class x1 x2;
  model y = x1 x2 / solution ss3;
run;
quit;
```

The pattern is the same, as shown below: there is no estimate for x2=3.

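| Parameter | Level | Estimate        | Standard Error | t Value | Pr > \|t\| |
|-----------|-------|-----------------|----------------|---------|------------|
| Intercept |       | 0.5000000000 B  | 0.14821413     | 3.37    | 0.0010     |
| x1        | 1     | 0.3879750199 B  | 0.23428608     | 1.66    | 0.1005     |
| x1        | 2     | -.2299495084 B  | 0.22891720     | -1.00   | 0.3173     |
| x1        | 3     | 0.1000000000 B  | 0.19885012     | 0.50    | 0.6160     |
| x1        | 4     | 0.0000000000 B  | .              | .       | .          |
| x2        | 1     | -.0381211799 B  | 0.16953676     | -0.22   | 0.8225     |
| x2        | 2     | -.2618788201 B  | 0.16953676     | -1.54   | 0.1252     |
| x2        | 3     | 0.0000000000 B  | .              | .       | .          |
| x2        | 4     | 0.0000000000 B  | .              | .       | .          |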
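To have SAS spell out which columns are redundant, one option is to build the 0/1 indicator columns by hand and fit them with PROC REG, which normally prints a note listing each zeroed variable as a linear combination of the others when the model is not of full rank. This is a sketch only; the data set name `work.dummies` and the variable names `x1_1` through `x2_4` are made up for illustration.

```sas
/* Build explicit 0/1 indicator columns for every level of x1 and x2.
   work.dummies and the x1_k / x2_k names are illustrative assumptions. */
data work.dummies;
  set work.data;
  array a[4] x1_1-x1_4;
  array b[4] x2_1-x2_4;
  do k = 1 to 4;
    a[k] = (x1 = k);   /* 1 if x1 equals k, otherwise 0 */
    b[k] = (x2 = k);   /* 1 if x2 equals k, otherwise 0 */
  end;
  drop k;
run;

/* Fit all eight indicators plus the intercept; the model is deliberately rank deficient */
proc reg data=work.dummies;
  model y = x1_1 x1_2 x1_3 x1_4 x2_1 x2_2 x2_3 x2_4;
run;
quit;
```

The variables flagged as linear combinations should be the same three levels that show no standard error in the PROC GLM table: x1=4, x2=3, and x2=4.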

If you change the data so that x1=4 and x2=4 are no longer perfectly related, the problem disappears.

I think the explanation is that the effect of a level can be estimated only if it makes a unique contribution, i.e. it is not perfectly explained by the other variables.

Any better thoughts, statistically?

Posts: 2,124

## The effect of one variable's level cannot be estimated if...

Correct. You can see it more clearly with continuous variables. If you have a model with, say, systolic blood pressure, diastolic blood pressure, and pulse pressure (= systolic - diastolic), then any two of the variables determine the third, so the third carries no new information. You can actually write it out:

    y = int + b1*systolic + b2*diastolic + b3*pulse
      = int + b1*systolic + b2*diastolic + b3*(systolic - diastolic)
      = int + b1*systolic + b2*diastolic + b3*systolic - b3*diastolic
      = int + (b1+b3)*systolic + (b2-b3)*diastolic

So only the combinations b1+b3 and b2-b3 can be estimated, not b1, b2, and b3 separately.

Frequent Contributor
Posts: 131

## The effect of one variable's level cannot be estimated if...

Hi Duke, in your example pulse pressure is perfectly explained by systolic and diastolic: pulse pressure = systolic - diastolic.

But in my data the relationship is not that tight: it is not the whole variable that is perfectly explained, only one of its 4 levels.

I assume the principle is the same.

Super User
Posts: 23,662

## The effect of one variable's level cannot be estimated if...

For regression with categorical variables, dummy variables are essentially created for n-1 levels, with the nth level coded as all 0s.

So if one level correlates perfectly with a level of a different categorical variable, the two dummy variables are linear combinations of each other, the same as in Doc@Duke's example.

Typically you can change the way you categorize the variables to help solve this, but in your case, with all values missing for a particular population, it might be tricky. You also have systematic missing data, which is a problem in and of itself.
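As one possible illustration of re-categorizing, the sketch below folds x2=4 into x2=3 so that no level of x2 coincides exactly with a level of x1. The data set name `work.recoded` and the variable `x2c` are hypothetical, and whether merging these two levels makes sense depends on what they actually mean.

```sas
/* Hypothetical recoding: combine x2 levels 3 and 4 into a single level.
   work.recoded and x2c are illustrative names, not from the original post. */
data work.recoded;
  set work.data;
  if x2 = 4 then x2c = 3;   /* level 3 now means "3 or 4" */
  else x2c = x2;            /* levels 1 and 2 (and missing values) are kept as is */
run;

proc logistic data=work.recoded;
  class x1 x2c / param=glm;
  model y = x1 x2c / aggregate scale=none;
run;
```

With this coding, only the reference levels should show DF=0. It does not separate the effects of x1=4 and the original x2=4; those remain confounded in the data, and the recoding only stops the model from trying to estimate them separately.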
