The effect of one variable's level cannot be estimated if...

bncoxuk · Posted 08-17-2011 07:37 PM

I just ran an experiment. It showed that the effect of one level of a variable cannot be estimated if that level is perfectly correlated with a level of another variable. Here is the data:

data work.data;

infile datalines;

input y x1 x2 @@;

datalines;

0 1 2 0 1 2 0 1 1 0 2 1 0 2 1 0 2 1 0 2 1 0 2 1 0 2 1 0 2 1

0 2 2 0 2 1 0 2 2 0 2 2 0 2 1 0 2 1 0 2 1 0 2 1 0 2 1 0 2 2

0 2 2 0 2 2 0 2 2 0 3 3 0 3 3 0 3 2 0 3 2 0 3 1 0 3 1 0 3 1

1 1 1 1 1 1 1 3 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 2 1 1 2 1 1 2 1

1 3 3 1 3 2 1 3 3 1 3 2 1 3 2 1 3 3 0 4 4 0 4 4 1 4 4 1 4 4

0 3 2 0 1 2 0 1 1 0 2 1 0 2 1 0 2 1 0 2 1 0 2 1 0 2 1 0 2 1

0 2 2 0 2 1 0 2 2 0 2 2 0 2 1 0 2 1 0 2 1 0 2 1 0 2 1 0 2 2

0 2 2 0 2 2 0 2 2 0 3 3 0 3 3 0 3 2 0 3 2 0 3 1 0 3 1 0 3 1

1 3 1 1 3 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 2 1 1 2 1 1 2 1

1 3 3 1 3 2 1 3 3 1 3 2 1 3 1 1 3 3 0 4 4 0 4 4 1 4 4 1 4 4

run;

You can see that whenever x1=4, x2 also equals 4, and x2=4 never occurs with any other value of x1.
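A quick way to check that overlap, rather than reading it off the raw data lines, is a cross-tabulation. This is only a sketch against the work.data data set created above; the table options just suppress the percentages so the cell counts are easier to read.

```sas
/* Cross-tabulate x1 by x2 in the posted data set work.data. */
/* If the two levels overlap perfectly, the x1=4 row and the  */
/* x2=4 column should each have a single nonzero cell: (4,4). */
proc freq data=work.data;
  tables x1*x2 / norow nocol nopercent;
run;
```

Any count outside the (4,4) cell in that row or column would mean the two levels are not perfectly tied.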

Let's run a logistic regression:

proc logistic data=work.data;

class x1 x2 /param=glm;

model y=x1 x2 /aggregate scale=none;

run;

Results:

                         Standard        Wald
Parameter  DF  Estimate     Error  Chi-Square  Pr > ChiSq
Intercept   1  4.82E-15    0.7071      0.0000      1.0000
x1 1        1   -1.6953    1.1763      2.0770      0.1495
x1 2        1    1.4020    1.1598      1.4612      0.2267
x1 3        1   -0.4055    0.9574      0.1793      0.6719
x1 4        0         0         .           .           .
x2 1        1   -0.0799    0.8535      0.0088      0.9254
x2 2        1    1.3451    0.8710      2.3847      0.1225
x2 3        0         0         .           .           .
x2 4        0         0         .           .           .

In the results, x2=4 is the reference level, so it has DF=0 as expected. But x2=3 also has DF=0. That is the problem! It happens because x2=4 is 100% correlated with x1=4. In effect, x2=3 acts as the reference level for x2 here, and the effect of x2=4 is already accounted for by the x1=4 level of the other variable.

Let's run a linear regression model using PROC GLM:

proc glm data=work.data;

class x1 x2;

model y=x1 x2/solution ss3;

run;

quit;

The results show the same pattern (below): x2=3 is again set to 0 with no standard error.

                             Standard
Parameter      Estimate         Error  t Value  Pr > |t|
Intercept  0.5000000000 B  0.14821413     3.37    0.0010
x1 1       0.3879750199 B  0.23428608     1.66    0.1005
x1 2       -.2299495084 B  0.22891720    -1.00    0.3173
x1 3       0.1000000000 B  0.19885012     0.50    0.6160
x1 4       0.0000000000 B           .        .         .
x2 1       -.0381211799 B  0.16953676    -0.22    0.8225
x2 2       -.2618788201 B  0.16953676    -1.54    0.1252
x2 3       0.0000000000 B           .        .         .
x2 4       0.0000000000 B           .        .         .

If you change the data so that x1=4 and x2=4 are not perfectly related, the problem disappears.

I think the explanation is that 'the effect of a level can be estimated only if it makes a unique contribution, i.e. it is not perfectly explained by the other variables'.

Any better thoughts, statistically?

Doc_Duke · Posted 08-17-2011 11:17 PM

Correct. You can see it more clearly with continuous variables. If you have a model with, say, systolic blood pressure, diastolic blood pressure and pulse pressure (= systolic - diastolic), then any two of the variables carry the same information as all three. You can actually write it out:

y = int + b1*systolic + b2*diastolic + b3*pulse

= int + b1*systolic + b2*diastolic + b3*(systolic - diastolic)

= int + b1*systolic + b2*diastolic + b3*systolic - b3*diastolic

= int + (b1+b3)*systolic + (b2-b3)*diastolic

So only the combinations (b1+b3) and (b2-b3) can be estimated; b1, b2 and b3 cannot be separated.
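To see what this looks like in SAS output, here is a minimal sketch using simulated data. The data set name work.bp, the seed, the coefficients and every number in the DATA step are made up for illustration and are not from this thread. Because pulse is an exact linear combination of the other two regressors, PROC REG should note that the model is not full rank and report one of the three estimates as 0 with DF=0.

```sas
/* Hypothetical simulated data: all values below are invented. */
data work.bp;
  call streaminit(12345);                   /* arbitrary seed */
  do id = 1 to 100;
    systolic  = rand('normal', 120, 15);
    diastolic = rand('normal', 80, 10);
    pulse     = systolic - diastolic;       /* exact linear combination */
    y = 10 + 0.5*systolic - 0.2*diastolic + rand('normal', 0, 5);
    output;
  end;
run;

/* Three regressors, but only two independent columns. */
proc reg data=work.bp;
  model y = systolic diastolic pulse;
run;
quit;
```

Dropping any one of the three regressors from the MODEL statement should give a full-rank fit with the same predicted values.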

bncoxuk · Posted 08-18-2011 03:45 AM

Hi Duke, your example has pulse pressure being perfectly explained by systolic and diastolic: pulse pressure = systolic - diastolic.

But in my data, the relationship is not that tight: not the whole variable is perfectly explained, but only one of its 4 levels.

I assume the principle is the same.

Reeza · Posted 08-19-2011 12:33 PM

For regression with categorical variables, dummy variables are essentially created for n-1 levels, with the nth level coded as all 0s.

So if one level correlates perfectly with a level of a different categorical variable, the two are linear combinations of each other, the same as in Doc@Duke's example.
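The linear combinations can be checked directly on the posted data. The sketch below builds one 0/1 indicator per level by hand (the names x1_1 ... x2_4, dep4, dep3 and work.check are hypothetical, chosen only for this illustration) and tests two identities: the x2=4 column equals the x1=4 column, and, as a consequence, the x2=3 column can be written in terms of columns that come before it in the model. The second identity is my reading of why x2=3 in particular is zeroed out, assuming the procedure sets to 0 any parameter whose column depends linearly on earlier columns.

```sas
data work.check;
  set work.data;
  /* One indicator column per level of each CLASS variable */
  x1_1 = (x1=1); x1_2 = (x1=2); x1_3 = (x1=3); x1_4 = (x1=4);
  x2_1 = (x2=1); x2_2 = (x2=2); x2_3 = (x2=3); x2_4 = (x2=4);
  /* 1 if the identity holds on this row, 0 otherwise */
  dep4 = (x2_4 = x1_4);
  dep3 = (x2_3 = x1_1 + x1_2 + x1_3 - x2_1 - x2_2);
run;

/* A minimum of 1 means the identity holds on every row. */
proc means data=work.check min;
  var dep4 dep3;
run;
```

If both minimums come out as 1, the x2=3 and x2=4 columns add no new information once the intercept, x1, x2=1 and x2=2 are in the model, which matches the two DF=0 rows for x2.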

Typically you can change the way you categorize the variables to help solve this, but in your case, with all missing for a particular population, it might be tricky. You also have systematic missing data, which is a problem in and of itself.