Interpreting Multivariate Linear Regression with Categorical Variables

sassy_lm · Posted 09-24-2019 12:07 PM

Hi everyone,

I am running a multivariate linear regression with two categorical variables and five continuous variables included in the model. I understand there may be an issue with having too many parameters in the model with a small sample size (n~60) but I am working on removing some variables.

That being said, I have two questions regarding the output below.

1) Statistically, why are all the reference groups equal to the intercept? I know they are not statistically significant in this model, but if they were and I had to interpret them, how would I explain that the cat_variable1 and cat_variable2 reference groups have the same mean value of y?

2) Looking at the output for categorical variable 1, cat_variable1-1 is statistically significant (p-value=0.0243). When interpreting this we would say: on average, the difference between cat_variable1-1 and the reference group is -30.7 units when controlling for... or even that the average y for cat_variable1-1 is (23.25-30.7) when controlling for... However, clinically a negative average is not possible for cat_variable1-1. Can anyone explain this to me? Or am I interpreting this wrong?

Thank you in advance!

Code:

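proc glm data=final;
   /* REF= chooses the level that gets the zero estimate in the output */
   class cat_variable1 (ref="0") cat_variable2 (ref="0");
   /* SOLUTION prints the parameter estimates */
   model y = cat_variable1 x1 x2 x3 cat_variable2 x4 x5 / solution;
run;
quit;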

Output:

Parameter                                       Estimate    Standard Error  t Value  Pr > |t|
Intercept                                    23.25202936 B     52.02075379     0.45    0.6568
categorical variable 1-1                    -30.73950086 B     13.23227919    -2.32    0.0243
categorical variable 1-0 (reference group)    0.00000000 B               .        .         .
continuous variable 1 (x1)                    0.22038763        0.66338254     0.33    0.7411
continuous variable 2 (x2)                   -0.04452906        0.91766027    -0.05    0.9615
continuous variable 3 (x3)                   10.07999861       13.84172795     0.73    0.4699
categorical variable 2-1                      7.29589112 B     17.81543974     0.41    0.6839
categorical variable 2-2                    -11.89832386 B     27.73216991    -0.43    0.6697
categorical variable 2-3                    -37.00121469 B     31.06097733    -1.19    0.2392
categorical variable 2-0 (reference group)    0.00000000 B               .        .         .
continuous variable 4 (x4)                    0.18614679        0.06868578     2.71    0.0092
continuous variable 5 (x5)                   -2.83648236        4.03778253    -0.70    0.4856

Reeza · Posted 09-24-2019 12:17 PM

A common rule of thumb is 25 observations per variable, so with n~60 you should have 2 to 3 variables. Your model estimates 10 parameters (counting the intercept), so you have roughly 6 observations per parameter. That will not be reliable.

PROC GLM has no option for the parameterization type, so the GLM (less-than-full-rank) parameterization was used, which changes how you can interpret the coefficients. From your statements it looks like you intended reference coding, but REF= in the CLASS statement only chooses which level gets the zero estimate; it does not switch the parameterization. PROC GLM doesn't support reference (PARAM=REF) coding unless you build the dummy variables manually, but PROC GLMSELECT does support it.
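For illustration, here is a minimal sketch of requesting reference coding in PROC GLMSELECT, reusing the data set and variable names from the original post. SELECTION=NONE is an assumption on my part, added so that the full model is fit with no variable selection; nothing in the thread specifies it.

```sas
proc glmselect data=final;
   /* PARAM=REF requests reference coding; REF= names the reference level */
   class cat_variable1 (ref='0') cat_variable2 (ref='0') / param=ref;
   /* SELECTION=NONE (assumed here) fits the full model as written */
   model y = cat_variable1 x1 x2 x3 cat_variable2 x4 x5 / selection=none;
run;
```

The estimates for the non-reference levels should match the PROC GLM output; the main visible difference is that the reference levels no longer appear as rows of zeros.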

By default, when you have a categorical variable your reference level becomes part of the intercept. This is part of the design of the models.
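To make this concrete, the fitted model can be written out as below. The symbols b, c, and d are just labels chosen here for the estimates in the posted output, and the bracket [·] is 1 when the condition holds and 0 otherwise.

```latex
\hat{y} = b_0 + b_1\,[\text{cat1}=1]
        + c_1 x_1 + c_2 x_2 + c_3 x_3
        + d_1\,[\text{cat2}=1] + d_2\,[\text{cat2}=2] + d_3\,[\text{cat2}=3]
        + c_4 x_4 + c_5 x_5

% Reference levels of both categorical variables and all x_j = 0:
\hat{y} = b_0 = 23.25

% cat1 = 1, cat2 at its reference level, and all x_j = 0:
\hat{y} = b_0 + b_1 = 23.25 - 30.74 = -7.49
```

So the intercept is the predicted y only when both categorical variables are at their reference levels and every continuous variable equals zero. The value 23.25 - 30.74 is a prediction at that same point, not the average y for cat_variable1-1 over the observed data, which is why it can be negative if zero is outside the realistic range of x1 to x5.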

sassy_lm · Posted 09-25-2019 02:01 PM

Hi Reeza,

Thank you for replying so quickly! I've always heard different answers for the rule of thumb, but I'll keep 25 obs per variable in mind! I've never used GLMSELECT before, but it will definitely come in handy when selecting which variables to keep in the model.

PaigeMiller · Posted 09-24-2019 01:09 PM

If you conduct a small experiment and measure the heights of males and females in a certain population, and, let's say, the average height of males is 70 inches and the average height of females is 65 inches, then each of the following is a correct representation of the results.

Intercept 65, males +5, females 0
Intercept 70, males 0, females -5
Intercept 67.5, males +2.5, females -2.5
Intercept 83, males -13, females -18
and an infinite number of other representations are correct

So, they are all correct and SAS arbitrarily picks one, where one level is zero, and the remaining levels are not zero.
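The reason there are infinitely many representations can be sketched with two equations. The symbols below (beta_0 for the intercept, alpha for the level effects, c for an arbitrary constant) are notation introduced here for illustration.

```latex
\mu_{\text{male}}   = \beta_0 + \alpha_{\text{male}}   = 70
\qquad
\mu_{\text{female}} = \beta_0 + \alpha_{\text{female}} = 65

% Two equations, three unknowns: for any constant c,
(\beta_0 + c,\; \alpha_{\text{male}} - c,\; \alpha_{\text{female}} - c)
% reproduces the same two means.

% Fixing \alpha_{\text{female}} = 0 picks one solution:
\beta_0 = 65, \quad \alpha_{\text{male}} = 5, \quad \alpha_{\text{female}} = 0
```

Only the two means, and the difference of 5 between them, are pinned down by the data; the individual coefficients depend on which constraint is imposed.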

This is why you see zeros in your output. But what you really really really really really really really really ought to be using is the LSMEANS statement, and not the coefficients from the SOLUTION option. The coefficients from the SOLUTION option, as shown above, are not unique, will have zeros for at least one level of your categorical variable, are not very intuitive, and are of limited usefulness, as your questions indicate. What does LSMEANS do? It produces the following result in the males/females example:

Males 70, Females 65

which incorporates the intercept and the coefficients from the SOLUTION vector to give you a sensible and easily interpretable result. Further, LSMEANS will allow you to easily compare the LSmean at one level to the LSmean at another level, if desired.

Now for a more complicated experiment such as yours, the LSMEANS statement computes "least-squares means", which are effectively (in layman's terms) the means at each level of the categorical variables, adjusted for the other variables in the model and any possible imbalance in the design.
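As a minimal sketch, assuming the same data set and model as in the original post, the LSMEANS statement could be added like this. The PDIFF and CL options are illustrative choices, not something required.

```sas
proc glm data=final;
   class cat_variable1 (ref="0") cat_variable2 (ref="0");
   model y = cat_variable1 x1 x2 x3 cat_variable2 x4 x5 / solution;
   /* Adjusted mean of y at each level of the categorical variables;  */
   /* PDIFF adds p-values for pairwise differences, CL adds           */
   /* confidence limits                                               */
   lsmeans cat_variable1 cat_variable2 / pdiff cl;
run;
quit;
```

By default the continuous covariates are held at their sample means when the LS-means are computed, so the results refer to a typical subject in the data rather than to one with x1 through x5 all equal to zero.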

--
Paige Miller

sassy_lm · Posted 09-25-2019 02:25 PM

Thank you so much, Paige Miller! I appreciate you providing me with an example and a detailed explanation for why we see zeros in the output. I completely forgot about the LSMEANS option and will make sure to use it in my experiment! Thank you again!!
