Hi everyone,
I am running a multiple linear regression with two categorical variables and five continuous variables in the model. I understand that having this many parameters can be a problem with a small sample size (n~60), and I am working on removing some variables.
That being said, I have two questions regarding the output below.
1) Statistically, why are all the reference groups equal to the intercept? I know they are not statistically significant in this model, but if they were and I had to interpret them, how would I explain that the cat_variable1 and cat_variable2 reference groups have the same mean value of y?
2) Looking at the output for categorical variable 1, cat_variable1-1 is statistically significant (p-value = 0.0243). To interpret this we would say: on average, the difference between cat_variable1-1 and the reference group is -30.7 units when controlling for... or even: the average y for cat_variable1-1 is (23.25 - 30.7) when controlling for... However, clinically a negative average is not possible for cat_variable1-1. Can anyone explain this to me? Or am I interpreting this wrong?
Thank you in advance!
Code:
proc glm data=final;
   class cat_variable1 (ref="0") cat_variable2 (ref="0");
   model y = cat_variable1 x1 x2 x3 cat_variable2 x4 x5 / solution;
run;
Output:
Parameter | Estimate | | Standard Error | t Value | Pr > |t|
Intercept | 23.25202936 | B | 52.02075379 | 0.45 | 0.6568
categorical variable 1-1 | -30.73950086 | B | 13.23227919 | -2.32 | 0.0243
categorical variable 1-0 (reference group) | 0.00000000 | B | . | . | .
continuous variable 1 (x1) | 0.22038763 | | 0.66338254 | 0.33 | 0.7411
continuous variable 2 (x2) | -0.04452906 | | 0.91766027 | -0.05 | 0.9615
continuous variable 3 (x3) | 10.07999861 | | 13.84172795 | 0.73 | 0.4699
categorical variable 2-1 | 7.29589112 | B | 17.81543974 | 0.41 | 0.6839
categorical variable 2-2 | -11.89832386 | B | 27.73216991 | -0.43 | 0.6697
categorical variable 2-3 | -37.00121469 | B | 31.06097733 | -1.19 | 0.2392
categorical variable 2-0 (reference group) | 0.00000000 | B | . | . | .
continuous variable 4 (x4) | 0.18614679 | | 0.06868578 | 2.71 | 0.0092
continuous variable 5 (x5) | -2.83648236 | | 4.03778253 | -0.70 | 0.4856
If you conduct a small experiment and measure the heights of males and females in a certain population, and the average height of males is 70 inches and the average height of females is 65 inches, then there are several ways to write those results as an intercept plus a coefficient for each level, and each of them is a correct representation of the results.
So, they are all correct, and SAS arbitrarily picks one in which the coefficient for one level (the reference level) is zero and the coefficients for the remaining levels are not.
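To make this concrete, here is one way to write out three such representations for the heights example. The particular coefficients are an illustration worked out from the 70 and 65 inch averages, not values taken from the original post. In every row, the intercept plus a level's coefficient reproduces that level's mean.

```latex
\[
\begin{array}{lrrr}
           & \text{Intercept} & \text{Male} & \text{Female} \\
\text{(a)} & 65 & 5  & 0  \\
\text{(b)} & 70 & 0  & -5 \\
\text{(c)} & 0  & 70 & 65
\end{array}
\qquad
\text{Intercept} + \text{Male} = 70, \quad \text{Intercept} + \text{Female} = 65
\]
```

Rows (a) and (b) are the forms with one level fixed at zero. Which one you get depends only on which level is treated as the reference; PROC GLM uses the last level in sort order unless REF= says otherwise.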
This is why you see zeros in your output. But what you really, really ought to be using is the LSMEANS statement, not the coefficients from the SOLUTION option. The coefficients from the SOLUTION option are not unique, will have a zero for at least one level of each categorical variable, and are not very intuitive and of limited usefulness, as your questions indicate.
What does LSMEANS do? In the males/females example it reports a mean of 70 for males and 65 for females, which incorporates the intercept and the coefficients from the SOLUTION vector to give you a sensible and easily interpretable result. Further, LSMEANS lets you easily compare the LS-mean at one level to the LS-mean at another level, if desired.
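A minimal sketch of the heights example, if you want to see both kinds of output side by side. The data set name `heights`, the variable names, and the four data values are all made up for illustration (chosen so the group means are 70 and 65); they do not come from the original post.

```sas
/* Hypothetical toy data, not from the original post:   */
/* two males averaging 70 and two females averaging 65. */
data heights;
   input sex $ height;
   datalines;
M 69
M 71
F 64
F 66
;
run;

proc glm data=heights;
   class sex;
   /* SOLUTION prints the coefficients; one level of sex is set to 0 */
   model height = sex / solution;
   /* LSMEANS prints one mean per level of sex */
   lsmeans sex;
run;
quit;
```

With the default ordering, M sorts last and becomes the reference level, so the SOLUTION table should show the zero on the M row, while the LSMEANS table simply lists a mean for F and a mean for M.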
Now for a more complicated experiment such as yours, the LSMEANS statement computes "least-squares means", which are effectively (in layman's terms) the means at each level of the categorical variables, adjusted for the other variables in the model and any possible imbalance in the design.
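Applied to the model in the question, that could look like the sketch below. It reuses the data set and variable names from the original code; the PDIFF and CL options are my own suggestion rather than something the original post specified.

```sas
proc glm data=final;
   class cat_variable1 (ref="0") cat_variable2 (ref="0");
   model y = cat_variable1 x1 x2 x3 cat_variable2 x4 x5 / solution;
   /* Adjusted mean of y at each level of the two categorical variables. */
   /* PDIFF adds p-values for pairwise differences between levels;       */
   /* CL adds confidence limits.                                         */
   lsmeans cat_variable1 cat_variable2 / pdiff cl;
run;
quit;
```

By default the LS-means are evaluated with the continuous covariates held at their sample means. By contrast, Intercept + coefficient (23.25 - 30.7) is the prediction when x1 through x5 are all zero and cat_variable2 is at its reference level. That point can lie far outside the observed data, which is how it can come out negative even though a negative average is not clinically possible.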
Thank you so much, Paige Miller! I appreciate you providing me with an example and a detailed explanation for why we see zeros in the output. I completely forgot about the LSMEANS option and will make sure to use it in my experiment! Thank you again!!