bncoxuk
Obsidian | Level 7

I just did an experiment. It showed that the effect of one level of a class variable cannot be estimated if that level is perfectly correlated with a level of another variable. Let's look at the data below:

data work.data;

          infile datalines;

          input y x1 x2 @@;

          datalines;

          0 1 2 0 1 2 0 1 1 0 2 1 0 2 1 0 2 1 0 2 1 0 2 1 0 2 1 0 2 1

          0 2 2 0 2 1 0 2 2 0 2 2 0 2 1 0 2 1 0 2 1 0 2 1 0 2 1 0 2 2

          0 2 2 0 2 2 0 2 2 0 3 3 0 3 3 0 3 2 0 3 2 0 3 1 0 3 1 0 3 1

          1 1 1 1 1 1 1 3 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

          1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 2 1 1 2 1 1 2 1

          1 3 3 1 3 2 1 3 3 1 3 2 1 3 2 1 3 3 0 4 4 0 4 4 1 4 4 1 4 4

          0 3 2 0 1 2 0 1 1 0 2 1 0 2 1 0 2 1 0 2 1 0 2 1 0 2 1 0 2 1

          0 2 2 0 2 1 0 2 2 0 2 2 0 2 1 0 2 1 0 2 1 0 2 1 0 2 1 0 2 2

          0 2 2 0 2 2 0 2 2 0 3 3 0 3 3 0 3 2 0 3 2 0 3 1 0 3 1 0 3 1

          1 3 1 1 3 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

          1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 2 1 1 2 1 1 2 1

          1 3 3 1 3 2 1 3 3 1 3 2 1 3 1 1 3 3 0 4 4 0 4 4 1 4 4 1 4 4

run;

You can see that whenever x1=4, x2 also equals 4.
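A quick way to confirm this overlap without reading the raw triples is a cross-tabulation of the two class variables. This is only a minimal sketch that reuses the work.data set created above; the options just suppress the percentages so the cell counts are easier to read.

```sas
/* Cross-tabulate x1 by x2 to see which level combinations occur. */
/* If every x1=4 row falls in the x2=4 column (and that column    */
/* has no other rows), the two indicator columns are identical.   */
proc freq data=work.data;
    tables x1*x2 / norow nocol nopercent;
run;
```

An empty row and column everywhere except the (4, 4) cell is the pattern to look for.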

Let's run some logistic regression:

proc logistic data=work.data;

class x1 x2 /param=glm;

model y=x1 x2 /aggregate scale=none;

run;

Results:

                                                       Standard          Wald

               Parameter   DF    Estimate       Error    Chi-Square    Pr > ChiSq

               Intercept      1    4.82E-15     0.7071        0.0000       1.0000

               x1        1     1     -1.6953      1.1763        2.0770        0.1495

               x1        2     1      1.4020      1.1598        1.4612        0.2267

               x1        3     1     -0.4055      0.9574        0.1793        0.6719

               x1        4     0           0           .         .             .

               x2        1     1     -0.0799      0.8535        0.0088        0.9254

               x2        2     1      1.3451      0.8710        2.3847        0.1225

               x2        3     0           0           .         .             .

               x2        4     0           0           .         .             .

You can see in the results that x2=4 is the reference group, so DF=0. But x2=3 also has DF=0. Problem! This is because x2=4 is 100% correlated with x1=4. You could say that in this case x2=3 acts as the reference group, and the effect of x2=4 has already been explained by level 4 of the other variable, x1.
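The reason x2=3, and not only x2=4, ends up with DF=0 can be written out with indicator columns. This is a sketch under two assumptions: that x1=4 and x2=4 occur on exactly the same observations (as in the data above), and that the GLM parameterization zeroes any column that is a linear combination of the columns listed before it. The symbols below are notation introduced here: 1 is the intercept column, A_k is the indicator for x1=k and B_k is the indicator for x2=k.

```latex
\begin{align*}
A_1 + A_2 + A_3 + A_4 &= \mathbf{1}, \qquad
B_1 + B_2 + B_3 + B_4 = \mathbf{1}, \qquad
B_4 = A_4 \\[4pt]
B_3 &= \mathbf{1} - B_1 - B_2 - B_4 \\
    &= \mathbf{1} - B_1 - B_2 - A_4 \\
    &= (A_1 + A_2 + A_3 + A_4) - B_1 - B_2 - A_4 \\
    &= A_1 + A_2 + A_3 - B_1 - B_2
\end{align*}
```

So by the time the x2=3 column is reached, it is already determined by the x1 columns and the first two x2 columns, and x2=4 is in turn just a copy of x1=4. Both therefore add nothing new to the model.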

Let's run a linear regression model using PROC GLM:

proc glm data=work.data;

  class x1 x2;

  model y=x1 x2/solution ss3;

run;

quit;

The results show the same pattern, as below. Again there is no real estimate for x2=3: it is set to zero with no standard error.

                                                       Standard

               Parameter           Estimate             Error    t Value    Pr > |t|

               Intercept       0.5000000000 B      0.14821413       3.37      0.0010

               x1        1      0.3879750199 B      0.23428608       1.66      0.1005

               x1        2       -.2299495084 B      0.22891720      -1.00      0.3173

               x1        3      0.1000000000 B      0.19885012       0.50      0.6160

               x1        4      0.0000000000 B       .                .         .

               x2        1      -.0381211799 B      0.16953676      -0.22      0.8225

               x2        2      -.2618788201 B      0.16953676      -1.54      0.1252

               x2        3     0.0000000000 B       .                .         .

               x2        4     0.0000000000 B       .                .         .

If you change the data so that x1=4 and x2=4 are not perfectly related, then the problem disappears.
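One way to try this, as a minimal sketch: append a single made-up observation that breaks the overlap and refit the same model. The data set name work.data2 and the added record (y=0, x1=4, x2=3) are hypothetical and chosen only for illustration; they are not part of the original experiment.

```sas
/* Copy the original data and append one HYPOTHETICAL observation */
/* (y=0, x1=4, x2=3) so that x1=4 no longer implies x2=4.         */
data work.data2;
    set work.data end=last;
    output;                      /* keep every original record    */
    if last then do;
        y = 0; x1 = 4; x2 = 3;   /* illustrative values only      */
        output;
    end;
run;

/* Same model as before, fitted to the modified data. */
proc logistic data=work.data2;
    class x1 x2 / param=glm;
    model y = x1 x2 / aggregate scale=none;
run;
```

With the overlap broken, only the last level of each class variable should be left as a reference level.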

I think the explanation could be that 'the effects of the levels can only be estimated if they make a unique contribution, i.e. they are not perfectly explained by other variables'.

Any better thoughts, statistically?

3 REPLIES
Doc_Duke
Rhodochrosite | Level 12

Correct.  You can see it more clearly with continuous variables.  If you have a model with, say, systolic blood pressure, diastolic blood pressure and pulse pressure (= systolic - diastolic), then any two of the variables determine the third, so the third carries no new information.  You can actually write it out:

y = int + b1*systolic + b2*diastolic + b3*pulse

  = int + b1*systolic + b2*diastolic + b3*(systolic - diastolic)

  = int + b1*systolic + b2*diastolic + b3*systolic - b3*diastolic

  = int + (b1+b3)*systolic + (b2-b3)*diastolic

bncoxuk
Obsidian | Level 7

Hi Duke, in your example pulse pressure is perfectly explained by systolic and diastolic: pulse pressure = systolic - diastolic.

But in my data, the relationship is not that tight: it is not the whole variable that is perfectly explained, but only one of its 4 levels.

I assume the principle is the same.

Reeza
Super User

For regression with categorical variables, dummy variables essentially get created for n-1 levels, with the nth level being all 0.

So if one level correlates perfectly with a level of a different categorical variable, the two are linear combinations of each other, same as in Doc@Duke's example.
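To make the dummy-variable view concrete, here is a minimal sketch that builds the 0/1 indicators by hand and passes all of them to PROC REG. The data set name work.dummies and the variable names x1_1-x1_4 and x2_1-x2_4 are made up for this illustration. When the model is not full rank, PROC REG normally reports which variables it set to zero and how they depend on the others, which should mirror the DF=0 rows in the output above.

```sas
/* Build explicit 0/1 indicators for every level of x1 and x2. */
data work.dummies;
    set work.data;
    array a{4} x1_1-x1_4;
    array b{4} x2_1-x2_4;
    do i = 1 to 4;
        a{i} = (x1 = i);   /* 1 if x1 equals level i, else 0 */
        b{i} = (x2 = i);   /* 1 if x2 equals level i, else 0 */
    end;
    drop i;
run;

/* Fit with ALL indicators so the redundant columns show up */
/* in the log and output as linear dependencies.            */
proc reg data=work.dummies;
    model y = x1_1-x1_4 x2_1-x2_4;
run;
quit;
```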

Typically you can change the way you categorize the variables to help solve this one, but in your case, with everything missing for a particular population, it might be tricky. You also have systematic missing data, which is a problem in and of itself.
