Phil_NZ
Barite | Level 11

Hi all SAS Experts,

 

When regressing non-experimental data, I ran into Simpson's paradox. Here is a simple description from this post:

Imagine

  • You run a linear regression with four numeric predictors (IV1, ..., IV4)
  • When only IV1 is included as a predictor, the standardised beta is +.20
  • When you also include IV2 to IV4, the sign of the standardised regression coefficient of IV1 flips to -.25 (i.e., it has become negative)

In my case, when I run the linear regression:

y = b1x1 + b2x2 + b3x3.

b1 takes a positive value. However, when I add the variable x4 to the regression, b1 becomes negative.

I looked at a comment on this topic; one possible reason may be multicollinearity among the variables.

 


I am wondering how to check whether Simpson's paradox in my case is caused by multicollinearity. How can I generate a multicollinearity table for my variables using SAS?

Many thanks and warm regards.

Thank you for your help, have a fabulous and productive day! I am a novice today, but someday when I accumulate enough knowledge, I can help others in my capacity.
1 ACCEPTED SOLUTION

Accepted Solutions
PaigeMiller
Diamond | Level 26

ABSORB allows the model to be fit. You cannot get predicted values if you use the ABSORB statement. You cannot get least squares means of the ABSORBed variables. ABSORB does not help you understand multicollinearity, and it does not allow you to reduce multicollinearity.

 

With regard to the fact that you have >39,000 companies, I am still skeptical that having a variable with >39,000 levels in your model is truly valuable. But I don't know your exact problem as well as you know it. Only you can make that decision. If it was me at this point, I would try to combine companies by some criteria, such as Manufacturing, Financial, Retail, ... (you can determine what set of categories makes best sense to you) or by the first three digits of the SIC code, and see if that produces a good fit, and see if the multicollinearity is low. I suspect with >39,000 levels, you will always have high multicollinearity, but again, I don't have your data.

--
Paige Miller


12 REPLIES
PaigeMiller
Diamond | Level 26

Multicollinearity can be checked via the VIF calculations in PROC REG; high values of VIF indicate that multicollinearity is a serious problem. (What counts as a "high" VIF isn't settled: some people say >10, others say >5, and I have also seen >3.)

 

Example: https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.4/statug/statug_reg_details24.htm
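
For reference, a minimal sketch of requesting these diagnostics (TOL, VIF, and COLLIN are MODEL statement options in PROC REG; sashelp.cars is used here only as a stand-in dataset, not the poster's data):

proc reg data=sashelp.cars;
	/* request tolerance, variance inflation factors, and collinearity diagnostics */
	model msrp = horsepower enginesize weight / tol vif collin;
run;
quit;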

--
Paige Miller
Phil_NZ
Barite | Level 11

Hi @PaigeMiller 

Thank you for your reply and the SAS documentation. While reading the documentation, two things popped into my head:

1. I am wondering why they show the Tolerance when they already have the VIF.

[screenshot: Parameter Estimates table from the documentation, with Tolerance and Variance Inflation columns]

2. So, the VIF shows the multicollinearity between a given variable and all the other regressors, am I correct? For example, in the table above, 1.59 measures the multicollinearity between RunTime and all the other independent variables?

 

Many thanks and warm regards.

 

Thank you for your help, have a fabulous and productive day! I am a novice today, but someday when I accumulate enough knowledge, I can help others in my capacity.
PaigeMiller
Diamond | Level 26
  1. Some people prefer to look at Tolerance, others look at VIF. It doesn't really matter, since Tolerance is simply 1/VIF.
  2. Yes, the 1.59 for RunTime indicates that multicollinearity is not having a major impact on the estimate of the RunTime coefficient. The opposite is true for RunPulse: a big effect of multicollinearity.
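
For intuition, VIF_j = 1/(1 - R²_j), where R²_j comes from regressing predictor j on all the other predictors. A hedged sketch of reproducing a VIF by hand (sashelp.cars is a stand-in dataset; the EDF option writes the model R-square to the OUTEST= data set as _RSQ_):

proc reg data=sashelp.cars outest=aux edf;
	/* auxiliary regression: one predictor on all the others */
	model horsepower = enginesize weight;
run;
quit;

data vif_check;
	set aux;
	vif = 1 / (1 - _RSQ_);   /* should match PROC REG's VIF for horsepower */
run;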
--
Paige Miller
Ksharp
Super User

That is normal and not really due to multicollinearity. When you add another variable to the model, the coefficient of v1 will change; it all depends on the data and the GLM. So there is nothing to worry about. The MODEL is built on the data. If your data are not really good, the paradox can happen.

I also talked to @Rick_SAS about this topic (coefficients changing). He has the same opinion as me.

PaigeMiller
Diamond | Level 26

@Ksharp wrote:

That is normal and not really due to multicollinearity. When you add another variable to the model, the coefficient of v1 will change; it all depends on the data and the GLM. So there is nothing to worry about. The MODEL is built on the data. If your data are not really good, the paradox can happen.

I also talked to @Rick_SAS about this topic (coefficients changing). He has the same opinion as me.


If there is no multicollinearity, and all variables are orthogonal to each other, the coefficient of x1 will not change when another variable is added to the model. So the only explanation that I know of for coefficients changing is (partial) multicollinearity. What is another explanation for coefficients changing (and signs of the coefficient changing as well)?


I think it is something to worry about, especially if the coefficients change sign when variables are added.
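
A quick simulation illustrates both claims (simulated data, not the poster's): when x1 and x2 are independent, the coefficient of x1 barely moves when x2 enters the model; when they are strongly correlated, it can even flip sign.

data ortho corr;
	call streaminit(123);
	do i = 1 to 5000;
		x1 = rand('normal');
		x2 = rand('normal');                 /* independent of x1 */
		y  = 1*x1 - 2*x2 + rand('normal');
		output ortho;
		x2 = 0.95*x1 + 0.1*rand('normal');   /* strongly correlated with x1 */
		y  = 1*x1 - 2*x2 + rand('normal');
		output corr;
	end;
run;

proc reg data=ortho;
	m1: model y = x1;        /* slope of x1 is about +1 */
	m2: model y = x1 x2;     /* slope of x1 is still about +1 */
run;
quit;

proc reg data=corr;
	m1: model y = x1;        /* slope of x1 is about -0.9 */
	m2: model y = x1 x2;     /* slope of x1 flips back to about +1 */
run;
quit;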

--
Paige Miller
Ksharp
Super User
"and all variables are orthogonal to each other, "
In real world , any two variables have correlation . You will NOT be able to find orthogonal variables .

"if the coefficients change sign when variables are added."
It is all about model and data.
For any kind of statistic model , when you add one more variable, the variance of model would be assign to this variable (more or less). it is inevitable .
PaigeMiller
Diamond | Level 26

@Ksharp wrote:
"and all variables are orthogonal to each other, "
In real world , any two variables have correlation . You will NOT be able to find orthogonal variables .


If we are discussing the math, you can have orthogonal variables and understand the effect of adding another variable. If the variables are orthogonal, adding a variable does not cause the other coefficients to change. If the variables are not orthogonal, adding another variable causes coefficients to change, and the only reason I know of is multicollinearity.

 

In the real world, you can have orthogonal variables in a designed experiment.

 


"if the coefficients change sign when variables are added."
It is all about model and data.
For any kind of statistic model , when you add one more variable, the variance of model would be assign to this variable (more or less). it is inevitable .

 

Okay, I agree with the above if you don't have orthogonal variables. But earlier you said it is "nothing to worry about" and "not really due to multicollinearity"; those are the parts I disagree with. In my opinion, it is multicollinearity, and it is something to worry about.

--
Paige Miller
Ksharp
Super User
So how do you judge multicollinearity? rho = 0.3 or rho = 0.8? Even with rho = 0.1, the coefficient can change to another sign. Sometimes, with good data, the coefficient does not change; sometimes it does. It depends on your data; it is not the fault of the GLM.
Phil_NZ
Barite | Level 11

Hi @PaigeMiller  and @Ksharp 

 

Thank you so much for the discussion so far. I really and truly appreciate it. I tried my best to dig into my data to see what is really going on.

First, I ran the regression adding one variable at a time to see which variable causes the sign reversal (I ran it in Stata because Stata lets me arrange a tidy table quickly using the est sto and esttab commands).

[screenshot: regression results adding predictors one at a time; the coefficient of pt changes sign at column 8]

 

As can be seen, the coefficient of my variable of interest pt changes sign right at column 8. So it seems that wCAP_INT and wGRO_MAR are the two variables causing the sign reversal. Following @PaigeMiller's suggestion, I get the VIF with the code below:

proc reg data=merge_treat_con_copy;
	model wINV_DAY= pt wFIRM_SIZE LNGDP UNEMPLOYMENT  INFLATION 
		wTANGIBILITY FCF wLOG_MVE wCAP_INT wGRO_MAR/ tol vif collin;
run;
quit;

And the result is as below:

[screenshot: Tolerance and Variance Inflation output from PROC REG]

However, in my regression above I control for firm (variable Type) and year (variable yr) fixed effects, but I do not know how to add those fixed effects to the PROC REG above. I know we can put them in PROC GLM:

proc glm data=merge_treat_con_copy;
    class type yr;
    model wINV_DAY= pt wFIRM_SIZE LNGDP UNEMPLOYMENT  INFLATION 
		wTANGIBILITY FCF wLOG_MVE wCAP_INT wGRO_MAR type yr/solution ss3;
run;
quit;

But I cannot run this regression, because the log shows:

ERROR: Number of levels for some effects > 32767.

Following this discussion, @Reeza said that it may be because my Type variable has more than 32K values (and that is true). But I do not know about ABSORB yet.

 

 

So, can you please tell me how to get the VIF and related diagnostics in PROC GLM?

 

Many thanks and warm regards,

Thank you for your help, have a fabulous and productive day! I am a novice today, but someday when I accumulate enough knowledge, I can help others in my capacity.
PaigeMiller
Diamond | Level 26

So, can you please tell me how to get the VIF and related diagnostics in PROC GLM?

I doubt there is real value in having >32,767 levels of a variable. So my advice is to re-think your design, or combine some of the levels so that there are far fewer than 32,767 distinct levels.

--
Paige Miller
Phil_NZ
Barite | Level 11

Hi @PaigeMiller 

Thank you very much for your help so far.

I think having >32,767 distinct companies is reasonable in my case, because I am dealing with all firms in 64 countries from 1990 to 2019 (I also compared with some international studies). I summed the total number of distinct firms in my sample and it comes to 39,049 firms. So when controlling for the firm fixed effect there would be 39,049 - 1 = 39,048 levels, which is more than 32,767.
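
As a side note, this count can be verified directly in SAS; a minimal sketch using the dataset and Type variable from the code above:

proc sql;
	/* count the distinct firms (levels of Type) */
	select count(distinct type) as n_firms
	from merge_treat_con_copy;
quit;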

 

It seems it is time to consider the ABSORB statement. Does that make sense to you, @PaigeMiller?

 

Warm regards and thanks.

Thank you for your help, have a fabulous and productive day! I am a novice today, but someday when I accumulate enough knowledge, I can help others in my capacity.
PaigeMiller
Diamond | Level 26

ABSORB allows the model to be fit. You cannot get predicted values if you use the ABSORB statement. You cannot get least squares means of the ABSORBed variables. ABSORB does not help you understand multicollinearity, and it does not allow you to reduce multicollinearity.
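
For completeness, a hedged sketch of what the ABSORB statement would look like here, reusing the poster's earlier PROC GLM code (the data must first be sorted by the absorbed variable, and Type moves out of the CLASS and MODEL statements):

proc sort data=merge_treat_con_copy;
	by type;
run;

proc glm data=merge_treat_con_copy;
	absorb type;   /* absorb the firm fixed effect instead of fitting 39,048 dummies */
	class yr;
	model wINV_DAY= pt wFIRM_SIZE LNGDP UNEMPLOYMENT INFLATION
		wTANGIBILITY FCF wLOG_MVE wCAP_INT wGRO_MAR yr/solution ss3;
run;
quit;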

 

With regard to the fact that you have >39,000 companies, I am still skeptical that having a variable with >39,000 levels in your model is truly valuable. But I don't know your exact problem as well as you know it. Only you can make that decision. If it was me at this point, I would try to combine companies by some criteria, such as Manufacturing, Financial, Retail, ... (you can determine what set of categories makes best sense to you) or by the first three digits of the SIC code, and see if that produces a good fit, and see if the multicollinearity is low. I suspect with >39,000 levels, you will always have high multicollinearity, but again, I don't have your data.
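
For instance, a hypothetical sketch of grouping firms by the first three digits of a SIC code before refitting (the variables sic and sic3 are assumed names for illustration, not from the poster's data, and sic is assumed to be stored as a character variable):

data merge_grouped;
	set merge_treat_con_copy;
	/* hypothetical: collapse >39,000 firms into 3-digit SIC industry groups */
	sic3 = substr(sic, 1, 3);
run;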

--
Paige Miller

