kurofufu
Calcite | Level 5

I created an indicator variable X1, coded 0 for group A and 1 for group B, and ran PROC REG on X1 along with other continuous X variables. Both the intercept and X1's coefficient are significant. But if I code it as 1 for group A and 0 for group B, the intercept is no longer significant. What is happening here, and how can I explain it?

PaigeMiller
Diamond | Level 26

When parameterization of your model changes, the meaning of your parameters changes, and thus the statistical significance can change as well.

So in the first case, the intercept has the meaning "what is the Y when all your continuous variables are 0 at group A" and in the second case, the intercept has the meaning "what is the Y when all your continuous variables are zero at group B".

--
Paige Miller
kurofufu
Calcite | Level 5

Let me use regression equations to explain my question.

First case:  Y= b0 + b1 * X1 + ....

2nd case:   Y = c0 + c1 * X1 + ...

b0, b1 are significant

c0 is not significant, c1 is significant

So which one should I use? Which one is correct?

PaigeMiller
Diamond | Level 26

They are both correct! As I already explained, the intercepts b0 and c0 are not measuring the same thing.

Your equations leave out the term that accounts for the main effect of changing from group A to group B or vice versa.

So the first case is really Y = b0 + b1*(group=B) + b2*x1 + b3*x2 + ...

and c0 = b0 + b1 (the intercept for group B), while b0 = c0 + c1 (the intercept for group A) <=== c0 is not equal to b0; they are to be interpreted differently because they measure different things
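
To make the relationship between the two parameterizations concrete, here is a minimal sketch in plain Python (not SAS) that fits both codings by ordinary least squares. The data values, the `ols` helper, and all variable names are made up for illustration and do not come from the original poster's data; any least-squares fit of the same data should show the same pattern.

```python
def ols(X, y):
    """Least-squares coefficients via the normal equations (X'X) b = X'y."""
    k = len(X[0])
    # Build the augmented matrix [X'X | X'y].
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * yi for row, yi in zip(X, y))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for i in range(k):
        p = max(range(i, k), key=lambda r: abs(A[r][i]))
        A[i], A[p] = A[p], A[i]
        A[i] = [v / A[i][i] for v in A[i]]
        for r in range(k):
            if r != i:
                A[r] = [vr - A[r][i] * vi for vr, vi in zip(A[r], A[i])]
    return [A[i][k] for i in range(k)]

# Hypothetical data: two groups and one continuous predictor.
group = ["A", "A", "A", "A", "B", "B", "B", "B"]
x = [1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0, 4.0]
y = [2.1, 2.9, 4.2, 4.8, 0.2, 0.9, 2.1, 3.0]

# First coding: indicator is 0 for group A and 1 for group B.
X_first = [[1.0, 1.0 if g == "B" else 0.0, xi] for g, xi in zip(group, x)]
# Second coding: indicator is 1 for group A and 0 for group B.
X_second = [[1.0, 1.0 if g == "A" else 0.0, xi] for g, xi in zip(group, x)]

b0, b1, b2 = ols(X_first, y)
c0, c1, c2 = ols(X_second, y)

print("first coding :", b0, b1, b2)
print("second coding:", c0, c1, c2)
print("c0 - (b0 + b1) =", c0 - (b0 + b1))  # zero up to rounding error
print("c1 + b1        =", c1 + b1)         # zero up to rounding error
```

The slope on the continuous predictor is the same in both fits; only the intercept and the sign of the indicator coefficient move, which is why the test on the intercept can change while the model itself does not.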

--
Paige Miller
kurofufu
Calcite | Level 5

I understand that c0 is not equal to b0 and that the two equations are equivalent algebraically. But since c0 is not significant, how can we adopt the second equation?

kurofufu
Calcite | Level 5

Don't you need to report significance information when presenting a regression equation?

PaigeMiller
Diamond | Level 26

Yes, the equations are equivalent. The parts of the equation are not equivalent.

But since c0 is not significant, how can we adopt the second equation?

It's just as valid as the first equation. You continue to confuse the validity of the equation with the meaning of its individual terms.

It is up to you to understand how to interpret it properly. Perhaps instead of reporting intercepts, which is causing this confusion, you should be reporting the value, and the statistical significance, of the delta between group A and group B, which I think is simply c0-b0. That seems like a better quantity to report.
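
As a rough illustration of why the group delta is the more stable quantity to report, the sketch below (plain Python, made-up data, hypothetical helper names `invert` and `fit_with_t`) computes t statistics under both codings. The delta c0 - b0 equals b1 in the first coding and -c1 in the second, so its t statistic only changes sign, while the t statistic of the intercept changes because the intercept refers to a different group.

```python
import math

def invert(M):
    """Invert a small square matrix by Gauss-Jordan elimination."""
    k = len(M)
    A = [list(row) + [1.0 if i == j else 0.0 for j in range(k)]
         for i, row in enumerate(M)]
    for i in range(k):
        p = max(range(i, k), key=lambda r: abs(A[r][i]))  # partial pivoting
        A[i], A[p] = A[p], A[i]
        A[i] = [v / A[i][i] for v in A[i]]
        for r in range(k):
            if r != i:
                A[r] = [vr - A[r][i] * vi for vr, vi in zip(A[r], A[i])]
    return [row[k:] for row in A]

def fit_with_t(X, y):
    """Return OLS coefficients and their t statistics."""
    n, k = len(X), len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    inv = invert(xtx)
    beta = [sum(inv[i][j] * xty[j] for j in range(k)) for i in range(k)]
    resid = [yi - sum(b * v for b, v in zip(beta, r)) for r, yi in zip(X, y)]
    s2 = sum(e * e for e in resid) / (n - k)  # residual variance estimate
    se = [math.sqrt(s2 * inv[i][i]) for i in range(k)]
    return beta, [b / s for b, s in zip(beta, se)]

# Hypothetical data: two groups and one continuous predictor.
group = ["A", "A", "A", "A", "B", "B", "B", "B"]
x = [1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0, 4.0]
y = [2.1, 2.9, 4.2, 4.8, 0.2, 0.9, 2.1, 3.0]

# Columns are: intercept, group indicator, continuous predictor.
X_first = [[1.0, 1.0 if g == "B" else 0.0, xi] for g, xi in zip(group, x)]
X_second = [[1.0, 1.0 if g == "A" else 0.0, xi] for g, xi in zip(group, x)]

beta_first, t_first = fit_with_t(X_first, y)
beta_second, t_second = fit_with_t(X_second, y)

print("intercept t, first vs second coding:", t_first[0], t_second[0])
print("indicator t, first vs second coding:", t_first[1], t_second[1])
```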

--
Paige Miller
kurofufu
Calcite | Level 5

Sir, we are creating a regression equation for prediction, not for comparison of two groups.

SteveDenham
Jade | Level 19

The way you have stated your problem, you are very much comparing two groups.  I believe a clearer statement of your objectives is needed, as it is very obvious you are missing PaigeMiller's point, which seems perfectly obvious to me.  Your parameterization of the indicator variables means that the two groups will have different intercept-like terms (overall intercept plus intercept due to group).  Consequently, it is not at all surprising that the results are significant in one case, and not in the other. See PaigeMiller's response:

They are both correct! As I already explained, the intercepts b0 and c0 are not measuring the same thing.

Your equations leave out the term that accounts for the main effect of changing from group A to group B or vice versa.

So the first case is really Y = b0 + b1*(group=B) + b2*x1 + b3*x2 + ...

and c0 = b0 + b1 (the intercept for group B), while b0 = c0 + c1 (the intercept for group A) <=== c0 is not equal to b0; they are to be interpreted differently because they measure different things

So, you need to think along the following lines: Are the responses in the two groups parallel, so that the equations for the two groups differ only in the intercept? Or is there an interaction between group and the other predictor variables? In this case, I would strongly recommend using one of the SAS procedures that has a CLASS statement for your regression, such as GLM, MIXED, GENMOD, or GLIMMIX, rather than hand-coded indicator variables.

Steve Denham

kurofufu
Calcite | Level 5

GLM also produces the same result as REG.

kurofufu
Calcite | Level 5

Ok, let's focus on predictive modeling for this question.

When we create a regression model for prediction, don't all coefficients included in the model need to be significant?

PaigeMiller
Diamond | Level 26

With regard to the intercept(s), I would say "No". Leave them in the model, even if they are not statistically significant. (I expect others to disagree with this, but that is my position on the matter.)

You might want to read "Analysis of Messy Data, Volume 1, Designed Experiments" by Milliken and Johnson. Even though yours is not a designed experiment, they talk about relevant issues in Chapter 9. In fact, they speak of the "Means Model", which is a distinctly different parameterization than the model you get through SAS. In the "Means Model", all these issues go away. There is a distinct coefficient for the intercept of Group A, and a distinct coefficient for the intercept of Group B. And then, it doesn't matter whether you set A to be 0 and B to be 1, or the other way around.
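
One possible reading of the means-model idea, sketched in plain Python on made-up data, is a fit with no overall intercept and one indicator column per group, plus a common slope. This is an illustrative assumption about how it applies here, not a reproduction of the book's treatment; the `ols` helper and all names are hypothetical. Each group gets its own intercept directly, and those values match what either 0/1 coding implies.

```python
def ols(X, y):
    """Least-squares coefficients via the normal equations (X'X) b = X'y."""
    k = len(X[0])
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * yi for row, yi in zip(X, y))]
         for i in range(k)]
    for i in range(k):
        p = max(range(i, k), key=lambda r: abs(A[r][i]))  # partial pivoting
        A[i], A[p] = A[p], A[i]
        A[i] = [v / A[i][i] for v in A[i]]
        for r in range(k):
            if r != i:
                A[r] = [vr - A[r][i] * vi for vr, vi in zip(A[r], A[i])]
    return [A[i][k] for i in range(k)]

# Hypothetical data: two groups and one continuous predictor.
group = ["A", "A", "A", "A", "B", "B", "B", "B"]
x = [1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0, 4.0]
y = [2.1, 2.9, 4.2, 4.8, 0.2, 0.9, 2.1, 3.0]

# Means-style design: one column per group, no overall intercept column.
X_means = [[1.0 if g == "A" else 0.0, 1.0 if g == "B" else 0.0, xi]
           for g, xi in zip(group, x)]
int_a, int_b, slope = ols(X_means, y)

# The two indicator codings from the thread, for comparison.
X_first = [[1.0, 1.0 if g == "B" else 0.0, xi] for g, xi in zip(group, x)]
X_second = [[1.0, 1.0 if g == "A" else 0.0, xi] for g, xi in zip(group, x)]
b0, b1, b2 = ols(X_first, y)
c0, c1, c2 = ols(X_second, y)

print("group A intercept:", int_a, "| b0 =", b0, "| c0 + c1 =", c0 + c1)
print("group B intercept:", int_b, "| b0 + b1 =", b0 + b1, "| c0 =", c0)
```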

--
Paige Miller
kurofufu
Calcite | Level 5

Thanks for the great answer, PaigeMiller.

PaigeMiller
Diamond | Level 26

Sir, we are creating a regression equation for prediction, not for comparison of two groups.

Okay, then why the concern about the different intercepts? As you said, the models are equivalent. Either will give you the same predicted values.
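
A small sketch of that equivalence in plain Python: the coefficient values below are hypothetical, and the second set is derived from the first using the relationship given earlier in the thread (c0 = b0 + b1, hence c1 = -b1, with the slope unchanged). Both equations return the same prediction for every row.

```python
# Hypothetical coefficients for the first coding (indicator = 1 for group B).
b0, b1, b2 = 1.125, -1.95, 0.95

# Implied coefficients for the second coding (indicator = 1 for group A).
c0, c1, c2 = b0 + b1, -b1, b2

def predict_first(group, x):
    """Prediction when the indicator is 0 for group A and 1 for group B."""
    return b0 + b1 * (1.0 if group == "B" else 0.0) + b2 * x

def predict_second(group, x):
    """Prediction when the indicator is 1 for group A and 0 for group B."""
    return c0 + c1 * (1.0 if group == "A" else 0.0) + c2 * x

# Hypothetical rows to score.
rows = [("A", 1.0), ("A", 3.5), ("B", 1.0), ("B", 3.5)]
diffs = []
for g, xv in rows:
    p1, p2 = predict_first(g, xv), predict_second(g, xv)
    diffs.append(abs(p1 - p2))
    print(g, xv, p1, p2)

print("largest difference:", max(diffs))  # zero up to rounding error
```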

As I have pointed out, and now as Steve seems to be pointing out, you can create models for prediction, or you can create models for understanding the individual terms (or both). Do NOT confuse the two. If you want a predictive model, then you choose either, and you report the Overall F as its level of significance. If you want to understand the individual terms, you report the tests of the individual model coefficients, with appropriate interpretation. (and of course you can do both)

You keep wandering back and forth between obtaining a predictive model and obtaining an understanding of the individual terms.

--
Paige Miller
