Interpreting Multivariate Linear Regression with Categorical Variables

Posted 09-24-2019 12:07 PM
(2262 views)

Hi everyone,

I am running a multivariate linear regression with two categorical variables and five continuous variables included in the model. I understand there may be an issue with having too many parameters in the model with a small sample size (n~60) but I am working on removing some variables.

That being said, I have two questions regarding the output below.

1) Statistically, why are all the reference groups equal to the intercept? I know they are not statistically significant in this model but if they were to be and if I were to interpret them, how do I explain that cat_variable1 and cat_variable2 reference groups have the same mean # for y?

2) Looking at the output for categorical variable 1, cat_variable1-1 is statistically significant (p-value=0.0243). When interpreting this we would say: on average, the difference between cat_variable 1-1 and the reference group is -30.7 units when controlling for... or even the average y for cat_variable1-1 is (23.25-30.7) when controlling for... However, clinically a negative average is not possible for cat_variable1-1. Can anyone explain this to me? Or am I interpreting this wrong?

Thank you in advance!

**Code:**

    proc glm data=final;
      class cat_variable1 (ref="0") cat_variable2 (ref="0");
      model y = cat_variable1 x1 x2 x3 cat_variable2 x4 x5 / solution;
    run;

**Output:**

| Parameter | Estimate | | Standard Error | t Value | Pr > \|t\| |
|---|---|---|---|---|---|
| Intercept | 23.25202936 | B | 52.02075379 | 0.45 | 0.6568 |
| categorical variable 1-1 | -30.73950086 | B | 13.23227919 | -2.32 | 0.0243 |
| categorical variable 1-0 (reference group) | 0.00000000 | B | . | . | . |
| continuous variable 1 (x1) | 0.22038763 | | 0.66338254 | 0.33 | 0.7411 |
| continuous variable 2 (x2) | -0.04452906 | | 0.91766027 | -0.05 | 0.9615 |
| continuous variable 3 (x3) | 10.07999861 | | 13.84172795 | 0.73 | 0.4699 |
| categorical variable 2-1 | 7.29589112 | B | 17.81543974 | 0.41 | 0.6839 |
| categorical variable 2-2 | -11.89832386 | B | 27.73216991 | -0.43 | 0.6697 |
| categorical variable 2-3 | -37.00121469 | B | 31.06097733 | -1.19 | 0.2392 |
| categorical variable 2-0 (reference group) | 0.00000000 | B | . | . | . |
| continuous variable 4 (x4) | 0.18614679 | | 0.06868578 | 2.71 | 0.0092 |
| continuous variable 5 (x5) | -2.83648236 | | 4.03778253 | -0.70 | 0.4856 |

4 REPLIES

The rule of thumb is 25 obs per variable, so you should have 2 to 3 variables. You have 11 at this point, so roughly 6 per parameter. That will not be reliable.

In PROC GLM you did not specify the parameterization type, so the GLM parameterization was used, which changes how you can interpret the coefficients. From your statements, you intend to use reference coding but haven't implemented it. PROC GLM doesn't support REF coding unless you do it manually, but PROC GLMSELECT does support it.
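A minimal sketch of the PROC GLMSELECT route, assuming the data set and variable names from the question. The `param=ref` option requests reference coding, and `selection=none` is an assumption added here so that the full model is fit without dropping any variables. With reference coding, the estimates for the non-reference levels should still be differences from the reference level.

```sas
/* Sketch only: same model as in the question, with explicit reference coding. */
/* Data set and variable names come from the question; SELECTION=NONE is an    */
/* assumption added here so that no variable selection is performed.           */
proc glmselect data=final;
  class cat_variable1 (ref="0") cat_variable2 (ref="0") / param=ref;
  model y = cat_variable1 x1 x2 x3 cat_variable2 x4 x5 / selection=none;
run;
```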

By default, when you have a categorical variable your reference level becomes part of the intercept. This is part of the design of the models.
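To make this concrete, here is a sketch of the fitted equation implied by the output in the question. The estimates are rounded, and the indicator notation is introduced here only for illustration.

```latex
\begin{aligned}
\hat{y} ={} & 23.25
  - 30.74\,\mathbf{1}[\text{cat\_variable1}=1] \\
 & + 7.30\,\mathbf{1}[\text{cat\_variable2}=1]
   - 11.90\,\mathbf{1}[\text{cat\_variable2}=2]
   - 37.00\,\mathbf{1}[\text{cat\_variable2}=3] \\
 & + 0.220\,x_1 - 0.045\,x_2 + 10.08\,x_3 + 0.186\,x_4 - 2.836\,x_5
\end{aligned}
```

When both categorical variables are at their reference levels, every indicator is zero, so the intercept 23.25 is the predicted y for an observation in both reference groups with x1 through x5 all equal to 0. The two reference groups do not each have their own mean; they share this one baseline. Likewise, 23.25 - 30.74 = -7.49 is the predicted y for cat_variable1 = 1 with cat_variable2 = 0 and all five continuous variables at 0, which may lie outside the range of the data, so it is not the average y for that group.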


Hi Reeza,

Thank you for replying so quickly! I've always heard different answers for the rule of thumb, but I'll keep 25 obs per variable in mind! And I've never used GLMSELECT before, but it can definitely come in handy when selecting which variables to keep in the model.


If you conduct a small experiment and measure the heights of males and females in a certain population, and, let's say, the average height of males is 70 inches and the average height of females is 65 inches, then each of these is a correct representation of the results.

- Intercept 65, males +5, females 0
- Intercept 70, males 0, females -5
- Intercept 67.5, males +2.5, females -2.5
- Intercept 83, males -13, females -18
- and an infinite number of other representations are correct

So, they are all correct and SAS arbitrarily picks one, where one level is zero, and the remaining levels are not zero.
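As a quick check of the representations listed above, each one reproduces the same two group means. The symbols below are introduced here for illustration: β₀ is the intercept and α is the coefficient for a level.

```latex
\begin{aligned}
\mu_{\text{male}} &= \beta_0 + \alpha_{\text{male}}, &
\mu_{\text{female}} &= \beta_0 + \alpha_{\text{female}} \\
65 + 5 &= 70, & 65 + 0 &= 65 \\
70 + 0 &= 70, & 70 - 5 &= 65 \\
67.5 + 2.5 &= 70, & 67.5 - 2.5 &= 65 \\
83 - 13 &= 70, & 83 - 18 &= 65
\end{aligned}
```

In every case the difference between the male and female coefficients is 5, so only the group means and their difference are determined by the data, not the individual coefficients.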

This is why you see zeros in your output. But what you really really really really really really really really ought to be using is the LSMEANS statement, not the coefficients from the SOLUTION option. The coefficients from the SOLUTION option, as shown above, are not unique, will be zero for at least one level of each categorical variable, and are unintuitive and of limited usefulness, as your questions indicate. What does LSMEANS do? It produces the following result in the males/females example:

- Males 70, Females 65

which incorporates the intercept and the coefficients from the SOLUTION vector to give you a sensible and easily interpretable result. Further, LSMEANS will allow you to easily compare the LSmean at one level to the LSmean at another level, if desired.

Now, for a more complicated experiment such as yours, the LSMEANS statement computes "least-squares means", which are effectively (in layman's terms) the means at each level of the categorical variables, adjusted for the other variables in the model and any possible imbalance in the design.
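A minimal sketch of how this could look for the model in the question, assuming the same data set and variable names. The options chosen here (PDIFF, CL, STDERR) are illustrative and not the only reasonable choice.

```sas
/* Sketch only: the model from the question with an LSMEANS statement added. */
/* PDIFF requests p-values for pairwise differences between levels,          */
/* CL requests confidence limits, and STDERR prints standard errors.         */
proc glm data=final;
  class cat_variable1 (ref="0") cat_variable2 (ref="0");
  model y = cat_variable1 x1 x2 x3 cat_variable2 x4 x5 / solution;
  lsmeans cat_variable1 cat_variable2 / pdiff cl stderr;
run;
quit;
```

By default, PROC GLM evaluates the least-squares means with the continuous covariates held at their sample means, rather than at 0 as the intercept does, which is why the LSMEANS output is usually easier to interpret clinically. The AT option on the LSMEANS statement can be used to pick other covariate values.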

--

Paige Miller
