
The effect of one variable's level cannot be estimated if...

Frequent Contributor
Posts: 131

I just ran an experiment. It shows that one level of a variable cannot be estimated if that level is perfectly correlated with a level of another variable. Consider the data below:

data work.data;
  infile datalines;
  /* the trailing @@ holds the line so several (y, x1, x2) triples are read per line */
  input y x1 x2 @@;
  datalines;
0 1 2 0 1 2 0 1 1 0 2 1 0 2 1 0 2 1 0 2 1 0 2 1 0 2 1 0 2 1
0 2 2 0 2 1 0 2 2 0 2 2 0 2 1 0 2 1 0 2 1 0 2 1 0 2 1 0 2 2
0 2 2 0 2 2 0 2 2 0 3 3 0 3 3 0 3 2 0 3 2 0 3 1 0 3 1 0 3 1
1 1 1 1 1 1 1 3 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 2 1 1 2 1 1 2 1
1 3 3 1 3 2 1 3 3 1 3 2 1 3 2 1 3 3 0 4 4 0 4 4 1 4 4 1 4 4
0 3 2 0 1 2 0 1 1 0 2 1 0 2 1 0 2 1 0 2 1 0 2 1 0 2 1 0 2 1
0 2 2 0 2 1 0 2 2 0 2 2 0 2 1 0 2 1 0 2 1 0 2 1 0 2 1 0 2 2
0 2 2 0 2 2 0 2 2 0 3 3 0 3 3 0 3 2 0 3 2 0 3 1 0 3 1 0 3 1
1 3 1 1 3 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 2 1 1 2 1 1 2 1
1 3 3 1 3 2 1 3 3 1 3 2 1 3 1 1 3 3 0 4 4 0 4 4 1 4 4 1 4 4
;
run;

You can see that whenever x1=4, x2 also equals 4, and x2=4 never occurs with any other value of x1.
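One quick way to check this pattern, rather than reading it off the raw data lines, is a cross-tabulation. This is only a sketch and assumes the data set was created as work.data, as above:

```sas
/* Cross-tabulate x1 by x2, showing cell counts only */
proc freq data=work.data;
  tables x1*x2 / norow nocol nopercent;
run;
```

If the two levels coincide, the only non-empty cell in the x1=4 row is the x2=4 column, and the only non-empty cell in the x2=4 column is the x1=4 row.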

Let's run some logistic regression:

proc logistic data=work.data;
  class x1 x2 / param=glm;
  model y = x1 x2 / aggregate scale=none;
run;

Results:

                                   Standard        Wald
Parameter       DF    Estimate        Error    Chi-Square    Pr > ChiSq
Intercept        1    4.82E-15       0.7071        0.0000        1.0000
x1        1      1     -1.6953       1.1763        2.0770        0.1495
x1        2      1      1.4020       1.1598        1.4612        0.2267
x1        3      1     -0.4055       0.9574        0.1793        0.6719
x1        4      0           0            .             .             .
x2        1      1     -0.0799       0.8535        0.0088        0.9254
x2        2      1      1.3451       0.8710        2.3847        0.1225
x2        3      0           0            .             .             .
x2        4      0           0            .             .             .

In the results, x2=4 is the reference level, so it has DF=0 as expected. But x2=3 also has DF=0, and that is the problem. It happens because x2=4 is 100% correlated with x1=4: the two levels always occur together, so the effect of x2=4 is already explained by x1=4 of the other variable. In effect, x2=3 then acts as the reference level for x2.
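To make the dependence explicit, here is a sketch of the linear relations among the indicator (dummy) columns under the GLM parameterization. The notation $I(\cdot)$ for a 0/1 indicator column is introduced here for illustration and is not part of the SAS output:

```latex
\begin{align*}
I(x_1{=}4) &= 1 - I(x_1{=}1) - I(x_1{=}2) - I(x_1{=}3) \\
I(x_2{=}4) &= I(x_1{=}4) \\
I(x_2{=}3) &= 1 - I(x_2{=}1) - I(x_2{=}2) - I(x_2{=}4)
\end{align*}
```

The intercept plus the eight indicators make nine columns. These three relations remove three of them, which matches the six parameters reported with DF=1 and the three reported with DF=0.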

Let's run a linear regression model using PROC GLM:

proc glm data=work.data;
  class x1 x2;
  model y = x1 x2 / solution ss3;
run;
quit;

The results, shown below, have the same pattern: x2=3 is again set to zero, with no standard error.

                                        Standard
Parameter          Estimate                Error    t Value    Pr > |t|
Intercept      0.5000000000 B         0.14821413       3.37      0.0010
x1        1    0.3879750199 B         0.23428608       1.66      0.1005
x1        2    -.2299495084 B         0.22891720      -1.00      0.3173
x1        3    0.1000000000 B         0.19885012       0.50      0.6160
x1        4    0.0000000000 B          .                .         .
x2        1    -.0381211799 B         0.16953676      -0.22      0.8225
x2        2    -.2618788201 B         0.16953676      -1.54      0.1252
x2        3    0.0000000000 B          .                .         .
x2        4    0.0000000000 B          .                .         .

If you change the data so that x1=4 and x2=4 are no longer perfectly related, the problem disappears.
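A minimal sketch of such a change, assuming work.data from above. The appended observation (y=0, x1=4, x2=3) and the data set name work.data2 are made up purely for illustration:

```sas
/* Copy the data and append one hypothetical observation with x1=4 but x2=3 */
data work.data2;
  set work.data end=last;
  output;
  if last then do;
    y = 0; x1 = 4; x2 = 3;   /* made-up row that breaks the x1=4 / x2=4 tie */
    output;
  end;
run;

/* Refit the same model on the modified data */
proc logistic data=work.data2;
  class x1 x2 / param=glm;
  model y = x1 x2 / aggregate scale=none;
run;
```

With the tie broken, I would expect x2=3 to get an estimate with DF=1, leaving only the reference levels x1=4 and x2=4 at DF=0.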

I think the explanation is that 'the effect of a level can be estimated only if it makes a unique contribution, i.e. it is not perfectly explained by the other variables'.

Any better thoughts, statistically?

Trusted Advisor
Posts: 2,113

Correct. You can see it more clearly with continuous variables. If you have a model with, say, systolic blood pressure, diastolic blood pressure and pulse pressure (= systolic - diastolic), then any two of the variables carry the same information as all three. You can actually write it out:

y = int + b1*systolic + b2*diastolic + b3*pulse
  = int + b1*systolic + b2*diastolic + b3*(systolic - diastolic)
  = int + b1*systolic + b2*diastolic + b3*systolic - b3*diastolic
  = int + (b1+b3)*systolic + (b2-b3)*diastolic

So only the combinations (b1+b3) and (b2-b3) can be estimated; b3 cannot be separated from b1 and b2.
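A small simulation along these lines can be sketched in SAS. Everything here is invented for illustration: the data set name work.bp, the seed, the sample size, and the means, standard deviations and coefficients used to generate the data.

```sas
/* Simulate blood pressure data in which pulse is an exact linear
   combination of systolic and diastolic (all values are made up) */
data work.bp;
  call streaminit(2024);
  do id = 1 to 100;
    systolic  = rand('normal', 120, 15);
    diastolic = rand('normal', 80, 10);
    pulse     = systolic - diastolic;
    y = 1 + 0.02*systolic + 0.01*diastolic + rand('normal', 0, 1);
    output;
  end;
run;

/* Fit all three predictors; the model is not of full rank */
proc reg data=work.bp;
  model y = systolic diastolic pulse;
run;
quit;
```

PROC REG should note that the model is not full rank and report one of the three predictors with DF=0, just as one level did in the categorical example.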

Frequent Contributor
Posts: 131

Hi Duke, in your example pulse pressure is perfectly explained by systolic and diastolic: pulse pressure = systolic - diastolic.

But in my data the relationship is not that tight: it is not the whole variable that is perfectly explained, only one of its 4 levels.

I assume the principle is the same.

Super User
Posts: 17,912

For regression with categorical variables, dummy variables essentially get created for n-1 levels, with the nth level being all 0.

So if one level correlates perfectly with a level of a different categorical variable, the two dummy variables are linear combinations of each other, the same as in Doc@Duke's example.
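As a rough check of this point on the original poster's data, one could build the two dummy variables by hand and compare them. The names work.dummies, d_x1_4, d_x2_4 and diff are made up for this sketch:

```sas
/* Build 0/1 indicators for x1=4 and x2=4 and take their difference */
data work.dummies;
  set work.data;
  d_x1_4 = (x1 = 4);   /* 1 when x1 is 4, otherwise 0 */
  d_x2_4 = (x2 = 4);   /* 1 when x2 is 4, otherwise 0 */
  diff   = d_x1_4 - d_x2_4;
run;

/* If the two indicators are identical, min and max of diff are both 0 */
proc means data=work.dummies min max;
  var diff;
run;
```

A minimum and maximum of 0 for diff means the two columns are the same in every row, so the model cannot tell their effects apart.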

Typically you can change the way you categorize the variables to help solve this, but in your case, with everything missing for a particular population, it might be tricky. You also have systematic missing data, which is a problem in and of itself.
