sassy_lm
Fluorite | Level 6

Hi everyone,

 

I am running a multiple linear regression with two categorical variables and five continuous variables in the model. I understand that having this many parameters with a small sample size (n~60) may be a problem, and I am working on removing some variables.

 

That being said, I have two questions regarding the output below.

1) Statistically, why are all the reference groups equal to the intercept? I know they are not statistically significant in this model but if they were to be and if I were to interpret them, how do I explain that cat_variable1 and cat_variable2 reference groups have the same mean # for y?

 

2) Looking at the output for categorical variable 1, cat_variable1-1 is statistically significant (p-value=0.0243). When interpreting this we would say: on average, the difference between cat_variable 1-1 and the reference group is -30.7 units when controlling for... or even the average y for cat_variable1-1 is (23.25-30.7) when controlling for... However, clinically a negative average is not possible for cat_variable1-1. Can anyone explain this to me? Or am I interpreting this wrong?

 

Thank you in advance!

 

Code: 

proc glm data=final;
    class cat_variable1 (ref="0") cat_variable2 (ref="0");
    model y = cat_variable1 x1 x2 x3 cat_variable2 x4 x5 / solution;
run;
quit;

 

Output:

Parameter                                     Estimate           Standard Error   t Value   Pr > |t|
Intercept                                      23.25202936  B      52.02075379       0.45     0.6568
categorical variable 1-1                      -30.73950086  B      13.23227919      -2.32     0.0243
categorical variable 1-0 (reference group)      0.00000000  B       .                 .        .
continuous variable 1 (x1)                      0.22038763          0.66338254       0.33     0.7411
continuous variable 2 (x2)                     -0.04452906          0.91766027      -0.05     0.9615
continuous variable 3 (x3)                     10.07999861         13.84172795       0.73     0.4699
categorical variable 2-1                        7.29589112  B      17.81543974       0.41     0.6839
categorical variable 2-2                      -11.89832386  B      27.73216991      -0.43     0.6697
categorical variable 2-3                      -37.00121469  B      31.06097733      -1.19     0.2392
categorical variable 2-0 (reference group)      0.00000000  B       .                 .        .
continuous variable 4 (x4)                      0.18614679          0.06868578       2.71     0.0092
continuous variable 5 (x5)                     -2.83648236          4.03778253      -0.70     0.4856

Reeza
Super User
A common rule of thumb is about 25 observations per variable, so with n~60 you should have 2 to 3 variables. You have 11 parameters at this point, or roughly 6 observations per parameter. That will not be reliable.

PROC GLM always uses the GLM (less-than-full-rank) parameterization, which changes how you can interpret the coefficients. From your statements you're expecting reference coding, but that is not what is implemented here: the REF= option in the CLASS statement only determines which level gets the zero estimate. PROC GLM doesn't support reference (PARAM=REF) coding unless you build the dummy variables manually, but PROC GLMSELECT does support it.

By default, when you have a categorical variable your reference level becomes part of the intercept. This is part of the design of the models.
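
To make that concrete, here is a minimal sketch that plugs a few of the estimates from the posted output into the fitted equation. The intercept is the predicted y for an observation at the reference level of both categorical variables with every continuous variable equal to 0, and 0 may be far outside the observed range of x1 to x5. That is how 23.25 - 30.74 can come out negative even if no real subject would get a negative prediction. The value x4 = 200 is purely hypothetical, chosen only for illustration; the data set and variable names are also made up.

```sas
/* Predictions built by hand from the SOLUTION estimates in the question. */
data predicted;
    b0       =  23.25202936;   /* Intercept                         */
    b_cat1_1 = -30.73950086;   /* categorical variable 1, level 1   */
    b_x4     =   0.18614679;   /* continuous variable 4 (x4)        */

    /* Both class variables at reference level 0, all covariates = 0 */
    ref_at_zero    = b0;

    /* cat_variable1 = 1, cat_variable2 = 0, all covariates = 0 (about -7.49) */
    cat1_1_at_zero = b0 + b_cat1_1;

    /* Same group, but with a hypothetical x4 = 200 and the other covariates still 0 */
    cat1_1_x4_200  = b0 + b_cat1_1 + b_x4 * 200;
run;

proc print data=predicted noobs;
    var ref_at_zero cat1_1_at_zero cat1_1_x4_200;
run;
```

The coefficient -30.74 is still the adjusted difference between level 1 and the reference level; only the "mean at all covariates equal to 0" reading is extrapolation.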

sassy_lm
Fluorite | Level 6
Hi Reeza,

Thank you for replying so quickly! I've always heard different answers for the rule of thumb, but I'll keep 25 obs per variable in mind! I've never used GLMSELECT before, but it could definitely come in handy when selecting which variables to keep in the model.
PaigeMiller
Diamond | Level 26

If you conduct a small experiment, and you measure the heights of males and females in a certain population, let's say the average height of males in inches is 70 and the average height of females is 65, then each of these is a correct representation of the results.

 

  • Intercept 65, males +5, females 0
  • Intercept 70, males 0, females -5
  • Intercept 67.5, males +2.5, females -2.5
  • Intercept 83, males -13, females -18
  • and an infinite number of other representations are correct

So, they are all correct and SAS arbitrarily picks one, where one level is zero, and the remaining levels are not zero.
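
A quick way to check this is to add the intercept to each effect for every representation in the list above. The sketch below does that in a DATA step (the data set and variable names are made up for illustration); every row should give the same two group means, 70 for males and 65 for females.

```sas
/* One row per (intercept, male effect, female effect) triple from the list. */
data representations;
    input intercept male_effect female_effect;
    male_mean   = intercept + male_effect;     /* implied mean height, males   */
    female_mean = intercept + female_effect;   /* implied mean height, females */
    datalines;
65 5 0
70 0 -5
67.5 2.5 -2.5
83 -13 -18
;
run;

proc print data=representations noobs;
run;
```

Only the sums (and the male minus female difference of 5) are determined by the data; the individual coefficients depend on which representation is chosen.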

 

This is why you see zeros in your output. But what you really really really really really really really really ought to be using is the LSMEANS statement, and not the coefficients from the SOLUTION option. The coefficients from the SOLUTION option, as shown above, are not unique, will have zeros for at least one level of your categorical variable, are not very intuitive, and are of limited usefulness, as your questions indicate. What does LSMEANS do? It produces the following result in the males/females example:

 

  • Males 70, Females 65

which incorporates the intercept and the coefficients from the SOLUTION vector to give you a sensible and easily interpretable result. Further, LSMEANS will allow you to easily compare the LS-mean at one level to the LS-mean at another level, if desired.

 

Now for a more complicated experiment such as yours, the LSMEANS statement computes "least-squares means", which are effectively (in layman's terms) the means at each level of the categorical variables, adjusted for the other variables in the model and any possible imbalance in the design.
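
For the model in the question, a minimal sketch might look like the following. It reuses the data set and variable names from the original post; the choice of the PDIFF, CL and STDERR options is an assumption about what would be useful here, not something taken from the thread. By default, PROC GLM evaluates LS-means with the continuous covariates held at their sample means, so the adjusted means are computed at typical covariate values rather than at 0.

```sas
proc glm data=final;
    class cat_variable1 (ref="0") cat_variable2 (ref="0");
    model y = cat_variable1 x1 x2 x3 cat_variable2 x4 x5 / solution;

    /* Adjusted mean of y at each level of the two class variables.   */
    /* PDIFF  : p-values for pairwise differences between LS-means    */
    /* CL     : confidence limits for the LS-means and differences    */
    /* STDERR : standard error of each LS-mean                        */
    lsmeans cat_variable1 cat_variable2 / pdiff cl stderr;
run;
quit;
```

The difference between the two cat_variable1 LS-means should match the -30.74 coefficient from the SOLUTION output, while the LS-means themselves replace the hard-to-interpret "intercept plus coefficient" sums.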

 

--
Paige Miller
sassy_lm
Fluorite | Level 6

Thank you so much, Paige Miller! I appreciate you providing me with an example and a detailed explanation for why we see zeros in the output. I completely forgot about the LSMEANS option and will make sure to use it in my experiment! Thank you again!!
