Programming the statistical procedures from SAS

Please look at a simple data and a tricky question

Frequent Contributor
Posts: 131

I used the following LOGISTIC procedure to predict the variable group, but I cannot understand the output. It shows that both age='4' and age='X' have DF=0 and an estimate of 0. While age='X' is understandably the reference group, the age='4' result is questionable. I checked the data using PROC FREQ and found no such problem as quasi-complete separation. Age and sex are not perfectly related. The only problem may come from the fact that:

If sex is gay (sex='G'), then age is always unknown (age='X').

Apart from this situation, there is no relationship between sex and age.

I tried again after combining age='4' into age='3' (so that the age levels are 1, 2, 3 and X) and reran the model. This time, the level age='3' has the same problem: DF=0 and Estimate=0.
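One way to see exactly how the two variables overlap is to cross-tabulate them. This is only a sketch, assuming the dataset and variable names used in the PROC LOGISTIC step below; the MISSING option is included in case blank values are present. If age='X' turned out to occur only together with sex='G' (and not just the other way round), the two levels would carry the same information in the model.

```sas
/* Sketch: cell counts for every sex-by-age combination. */
proc freq data=work.data;
  tables sex*age / norow nocol nopercent missing;
run;
```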

PROC LOGISTIC DATA=work.data DESC;
       CLASS sex age;
       MODEL group=sex age /AGGREGATE SCALE=NONE; /*To model group=0 or 1*/
RUN;

             Analysis of Maximum Likelihood Estimates

                               Standard          Wald
Parameter      DF    Estimate     Error    Chi-Square    Pr > ChiSq
Intercept       1     -8.5254    0.0925     8502.1673        <.0001
sex     F       1      0.0686    0.0760        0.8144        0.3668
sex     G       1      0.1554    0.1115        1.9435        0.1633
sex     M       0           0         .             .             .
age     1       1      0.2709    0.1690        2.5692        0.1090
age     2       1      0.0275    0.1461        0.0354        0.8507
age     3       1     -0.2961    0.1285        5.3097        0.0212
age     4       0           0         .             .             .
age     X       0           0         .             .             .

PROC Star
Posts: 7,416

While I'm not versed in logistic regression, I would presume that it follows the same logic as linear regression.

I found it interesting that you had 3 levels of sex .. what is in the 3rd level?  Is it male, female and missing?

Regardless, if logistic regression follows the same rules as linear regression then I would expect that you can only have degrees of freedom for n-1 levels.  Thus, if you do really have three levels of sex, you can only obtain parameters for 2 of them.

Similarly, if you only have 4 levels of age, you can only obtain parameters for 3 of them, as that would already account for all of the variance.

Frequent Contributor
Posts: 131

Hi art297, thanks for your reply.

Age has 5 levels: 1, 2, 3, 4, and X. So I expect 4 estimates (the 5th being 0). But now there are only 3.

Super User
Posts: 18,528

Art is correct. I don't know why you even have estimates with DF=0 in your output; with options similar to the code you provided, SAS does not give me results like that.

Also, Sex=Gay? makes me question your entire model, but for now I'll assume what you're doing makes sense. I'd assume sexuality would be a separate variable, gay vs straight.

You may also want to specify your reference levels because having a reference level of Age='X' or unknown also makes no sense to me.
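A minimal sketch of how the reference levels could be set explicitly, assuming the character values shown in the output above ('M' for sex, '1' for age) are acceptable baselines. Which levels make sense as references is a judgment call for your data.

```sas
proc logistic data=work.data desc;
  /* ref= names the baseline level; param=ref requests reference-cell coding */
  class sex(ref='M') age(ref='1') / param=ref;
  model group = sex age / aggregate scale=none;
run;
```

This changes which level is the baseline, but it would not by itself remove any redundancy between the two variables.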

Frequent Contributor
Posts: 131

Hi Reeza,

It is correct to have DF=0 for a category of a variable in logistic regression. This indicates that this category is the base level (reference group), and other levels are contrasting with this level.

I understand there is definitely a problem here. I need a statistician.

Super User
Posts: 18,528

I understand what DF=0 means, but the output you're displaying isn't the default output for the code provided. We could be using different SAS versions, I suppose; I'm on V9.2.3 here.

If you have access to a statistician at your site who can view the data then I'd definitely recommend doing that.

As a statistician, though, I'd be interested in hearing what the issue is, in case I ever run into it.

Frequent Contributor
Posts: 131

I just did an experiment. It showed that one level of a variable cannot be estimated if that level is perfectly correlated with a level of another variable. Consider the data below:

data work.data;
          infile datalines;
          input y x1 x2 @@;
          datalines;
          0 1 2 0 1 2 0 1 1 0 2 1 0 2 1 0 2 1 0 2 1 0 2 1 0 2 1 0 2 1
          0 2 2 0 2 1 0 2 2 0 2 2 0 2 1 0 2 1 0 2 1 0 2 1 0 2 1 0 2 2
          0 2 2 0 2 2 0 2 2 0 3 3 0 3 3 0 3 2 0 3 2 0 3 1 0 3 1 0 3 1
          1 1 1 1 1 1 1 3 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
          1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 2 1 1 2 1 1 2 1
          1 3 3 1 3 2 1 3 3 1 3 2 1 3 2 1 3 3 0 4 4 0 4 4 1 4 4 1 4 4
          0 3 2 0 1 2 0 1 1 0 2 1 0 2 1 0 2 1 0 2 1 0 2 1 0 2 1 0 2 1
          0 2 2 0 2 1 0 2 2 0 2 2 0 2 1 0 2 1 0 2 1 0 2 1 0 2 1 0 2 2
          0 2 2 0 2 2 0 2 2 0 3 3 0 3 3 0 3 2 0 3 2 0 3 1 0 3 1 0 3 1
          1 3 1 1 3 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
          1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 2 1 1 2 1 1 2 1
          1 3 3 1 3 2 1 3 3 1 3 2 1 3 1 1 3 3 0 4 4 0 4 4 1 4 4 1 4 4
run;

You can see that whenever x1=4, x2 also equals 4.
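A cross-tabulation makes the pattern easy to check. This is a sketch using the dataset created above; only the cell counts matter here, so the percentages are suppressed.

```sas
/* Sketch: count the observations in each x1-by-x2 cell. */
proc freq data=work.data;
  tables x1*x2 / norow nocol nopercent;
run;
```

The x1=4 row and the x2=4 column should each have a single non-empty cell, the one where they meet.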

Let's run some logistic regression:

proc logistic data=work.data;
  class x1 x2 /param=glm;
  model y=x1 x2 /aggregate scale=none;
run;

Results:

                                 Standard          Wald
Parameter      DF    Estimate       Error    Chi-Square    Pr > ChiSq
Intercept       1    4.82E-15      0.7071        0.0000        1.0000
x1      1       1     -1.6953      1.1763        2.0770        0.1495
x1      2       1      1.4020      1.1598        1.4612        0.2267
x1      3       1     -0.4055      0.9574        0.1793        0.6719
x1      4       0           0           .             .             .
x2      1       1     -0.0799      0.8535        0.0088        0.9254
x2      2       1      1.3451      0.8710        2.3847        0.1225
x2      3       0           0           .             .             .
x2      4       0           0           .             .             .

As the results show, x2=4 is the reference level, so it has DF=0. But x2=3 also has DF=0, which is the problem. This happens because x2=4 occurs exactly when x1=4, so the two levels carry the same information. In effect, x2=3 acts as the reference level for x2, and the effect of x2=4 is already accounted for by x1=4.

Let's run a linear regression model using PROC GLM:

proc glm data=work.data;
  class x1 x2;
  model y=x1 x2/solution ss3;
run;
quit;

The results, shown below, follow the same pattern. There is no estimate for x2=3.

                                       Standard
Parameter            Estimate             Error    t Value    Pr > |t|
Intercept        0.5000000000 B      0.14821413       3.37      0.0010
x1      1        0.3879750199 B      0.23428608       1.66      0.1005
x1      2        -.2299495084 B      0.22891720      -1.00      0.3173
x1      3        0.1000000000 B      0.19885012       0.50      0.6160
x1      4        0.0000000000 B       .                .         .
x2      1        -.0381211799 B      0.16953676      -0.22      0.8225
x2      2        -.2618788201 B      0.16953676      -1.54      0.1252
x2      3        0.0000000000 B       .                .         .
x2      4        0.0000000000 B       .                .         .

If you change the data so that x1=4 and x2=4 are no longer perfectly related, the problem disappears.
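For example, moving a single observation into a cell that links x2=4 to another level of x1 is enough. This is a sketch; the choice of the first observation (y=0, x1=1, x2=2) and the dataset name work.data2 are arbitrary.

```sas
data work.data2;
  set work.data;
  /* break the one-to-one match between x1=4 and x2=4 */
  if _n_ = 1 then x2 = 4;
run;

proc logistic data=work.data2;
  class x1 x2 / param=glm;
  model y = x1 x2 / aggregate scale=none;
run;
```

With this change, x2=3 should get its own estimate, and only the reference levels x1=4 and x2=4 should remain at DF=0.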

I think the explanation is that the effect of a level can be estimated only if it makes a unique contribution, i.e. it is not perfectly explained by the other variables.

Trusted Advisor
Posts: 2,114

Your data are overparameterized.  See my response in

http://communities.sas.com/thread/30685?tstart=0


Doc Muhlbaier

Duke
