I just ran an experiment. It showed that one level of a variable cannot be estimated if that level is perfectly correlated with a level of another variable. Consider the data below:
data work.data;
infile datalines;
input y x1 x2 @@;
datalines;
0 1 2 0 1 2 0 1 1 0 2 1 0 2 1 0 2 1 0 2 1 0 2 1 0 2 1 0 2 1
0 2 2 0 2 1 0 2 2 0 2 2 0 2 1 0 2 1 0 2 1 0 2 1 0 2 1 0 2 2
0 2 2 0 2 2 0 2 2 0 3 3 0 3 3 0 3 2 0 3 2 0 3 1 0 3 1 0 3 1
1 1 1 1 1 1 1 3 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 2 1 1 2 1 1 2 1
1 3 3 1 3 2 1 3 3 1 3 2 1 3 2 1 3 3 0 4 4 0 4 4 1 4 4 1 4 4
0 3 2 0 1 2 0 1 1 0 2 1 0 2 1 0 2 1 0 2 1 0 2 1 0 2 1 0 2 1
0 2 2 0 2 1 0 2 2 0 2 2 0 2 1 0 2 1 0 2 1 0 2 1 0 2 1 0 2 2
0 2 2 0 2 2 0 2 2 0 3 3 0 3 3 0 3 2 0 3 2 0 3 1 0 3 1 0 3 1
1 3 1 1 3 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 2 1 1 2 1 1 2 1
1 3 3 1 3 2 1 3 3 1 3 2 1 3 1 1 3 3 0 4 4 0 4 4 1 4 4 1 4 4
run;
You can see that whenever x1=4, x2 also equals 4, and x2=4 never occurs with any other value of x1.
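One quick way to confirm that pattern is to cross-tabulate the two variables. This is only a sketch, assuming the data set work.data created above; the options just suppress the percentages so the cell counts are easier to read.

```sas
/* Cross-tabulate the two CLASS variables to spot levels that always occur together */
proc freq data=work.data;
  tables x1*x2 / norow nocol nopercent;
run;
```

If the levels are perfectly tied, the x1=4 row and the x2=4 column should each have only one nonzero cell, the one where they meet.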
Let's run a logistic regression:
proc logistic data=work.data;
class x1 x2 /param=glm;
model y=x1 x2 /aggregate scale=none;
run;
Results:
                              Standard        Wald
Parameter      DF  Estimate      Error  Chi-Square  Pr > ChiSq
Intercept       1  4.82E-15     0.7071      0.0000      1.0000
x1        1     1   -1.6953     1.1763      2.0770      0.1495
x1        2     1    1.4020     1.1598      1.4612      0.2267
x1        3     1   -0.4055     0.9574      0.1793      0.6719
x1        4     0         0          .           .           .
x2        1     1   -0.0799     0.8535      0.0088      0.9254
x2        2     1    1.3451     0.8710      2.3847      0.1225
x2        3     0         0          .           .           .
x2        4     0         0          .           .           .
You can see in the results that x2=4 is the reference level, so its DF is 0. But x2=3 also has DF=0, and that is the problem. It happens because x2=4 is 100% correlated with x1=4: the two levels always occur together. You could say that in this case x2=3 acts as the reference level, and the effect of x2=4 is already accounted for by the x1=4 level of the other variable.
Let's run a linear regression model using PROC GLM:
proc glm data=work.data;
class x1 x2;
model y=x1 x2/solution ss3;
run;
quit;
The results, shown below, follow the same pattern: there is no estimate for x2=3.
                                  Standard
Parameter          Estimate          Error   t Value   Pr > |t|
Intercept      0.5000000000 B   0.14821413      3.37     0.0010
x1        1    0.3879750199 B   0.23428608      1.66     0.1005
x1        2    -.2299495084 B   0.22891720     -1.00     0.3173
x1        3    0.1000000000 B   0.19885012      0.50     0.6160
x1        4    0.0000000000 B            .         .          .
x2        1    -.0381211799 B   0.16953676     -0.22     0.8225
x2        2    -.2618788201 B   0.16953676     -1.54     0.1252
x2        3    0.0000000000 B            .         .          .
x2        4    0.0000000000 B            .         .          .
If you change the data so that x1=4 and x2=4 are no longer perfectly related, the problem disappears.
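A minimal sketch of one such change, for illustration only: it recodes a single x1=4 observation to x2=3 and refits the same logistic model. The data set name work.data2, the helper variable changed, and the choice of which observation to recode are all hypothetical, not the modification originally tried.

```sas
/* Hypothetical modification: break the perfect tie between x1=4 and x2=4 */
data work.data2;
  set work.data;
  retain changed 0;                 /* flag so that only one row is recoded */
  if x1=4 and x2=4 and changed=0 then do;
    x2=3;                           /* this x1=4 row now falls in x2=3 */
    changed=1;
  end;
  drop changed;
run;

/* Same model as before, fitted to the modified data */
proc logistic data=work.data2;
  class x1 x2 / param=glm;
  model y=x1 x2 / aggregate scale=none;
run;
```

With the tie broken, x2=3 should get its own estimate and only the reference levels (x1=4 and x2=4) should show DF=0.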
I think the explanation is that 'the effects of the levels can only be estimated if they make a unique contribution, i.e. they are not perfectly explained by other variables'.
Any better thoughts, statistically?
Correct. You can see it more clearly with continuous variables. If you have a model with, say, systolic blood pressure, diastolic blood pressure and pulse pressure (= systolic - diastolic), then any two of the variables carry the same information as the third. You can actually write it out:
y = int + b1*systolic + b2*diastolic + b3*pulse
  = int + b1*systolic + b2*diastolic + b3*(systolic - diastolic)
  = int + b1*systolic + b2*diastolic + b3*systolic - b3*diastolic
  = int + (b1+b3)*systolic + (b2-b3)*diastolic
So only the combinations (b1+b3) and (b2-b3) are determined by the data; b1, b2 and b3 cannot be estimated separately.
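The blood pressure example can be tried directly. The sketch below simulates data (the data set name work.bp, the seed, the means, the spreads and the coefficients are all made-up assumptions, not real measurements) and fits all three predictors in PROC REG.

```sas
/* Simulated, purely illustrative blood pressure data */
data work.bp;
  call streaminit(2024);
  do i = 1 to 100;
    systolic  = 120 + 15*rand('normal');
    diastolic =  80 + 10*rand('normal');
    pulse     = systolic - diastolic;   /* exact linear combination of the other two */
    y = 0.02*systolic + 0.01*diastolic + rand('normal');
    output;
  end;
  drop i;
run;

/* All three predictors together: the design matrix is not full rank */
proc reg data=work.bp;
  model y = systolic diastolic pulse;
run;
quit;
```

PROC REG should note that the model is not full rank and report one of the three parameters with DF 0, which is the continuous-variable version of the x2=3 row above.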
Hi Duke, your example has pulse pressure perfectly explained by systolic and diastolic: pulse pressure = systolic - diastolic.
But in my data the relationship is not that tight: the whole variable is not perfectly explained, only one of its 4 levels is.
I assume the principle is the same.
For regression with categorical variables, dummy variables are essentially created for n-1 levels, with the nth level being all 0.
So if one level correlates perfectly with a level of a different categorical variable, the two dummy variables are linear combinations of each other, the same as in Doc@Duke's example.
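To make the dummy-variable view concrete, here is a small sketch that builds the two indicators by hand and correlates them. The names work.dummies, x1_4 and x2_4 are hypothetical, and the data set work.data from the original post is assumed.

```sas
/* Build 0/1 indicators for the two levels in question */
data work.dummies;
  set work.data;
  x1_4 = (x1 = 4);   /* 1 when x1=4, otherwise 0 */
  x2_4 = (x2 = 4);   /* 1 when x2=4, otherwise 0 */
run;

/* Identical indicators have a Pearson correlation of 1 */
proc corr data=work.dummies;
  var x1_4 x2_4;
run;
```

Since the two columns are identical in this data, PROC CORR should report a correlation of 1, and a model cannot separate their effects.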
Typically you can change the way you categorize the variables to help solve this, but in your case, with all values missing for a particular population, it might be tricky. You also have systematic missing data, which is a problem in and of itself.
