Phil_NZ
Barite | Level 11

Hi all SAS Experts,

 

When regressing non-experimental data, I ran into Simpson's paradox. Here is a simple description from this post:

Imagine

  • You run a linear regression with four numeric predictors (IV1, ..., IV4)
  • When only IV1 is included as a predictor, the standardised beta is +.20
  • When you also include IV2 to IV4, the sign of the standardised regression coefficient of IV1 flips to -.25 (i.e., it has become negative)

In my case, when I run the linear regression:

y = b1x1 + b2x2 + b3x3.

b1 takes a positive value. However, when I add the variable x4 to the regression, b1 becomes negative.

I looked at a comment on this topic; one possible reason may be multicollinearity among the variables.

 


I am wondering how to check whether Simpson's paradox in my case is caused by multicollinearity. How can I generate a multicollinearity table for my variables using SAS?

Many thanks and warm regards.

Thank you for your help, have a fabulous and productive day! I am a novice today, but someday when I accumulate enough knowledge, I can help others in my capacity.
1 ACCEPTED SOLUTION

Accepted Solutions
PaigeMiller
Diamond | Level 26

ABSORB allows the model to be fit. You cannot get predicted values if you use the ABSORB statement. You cannot get least squares means of the ABSORBed variables. ABSORB does not help you understand multicollinearity, and it does not allow you to reduce multicollinearity.

 

With regard to the fact that you have >39,000 companies, I am still skeptical that having a variable with >39,000 levels in your model is truly valuable. But I don't know your exact problem as well as you know it. Only you can make that decision. If it was me at this point, I would try to combine companies by some criteria, such as Manufacturing, Financial, Retail, ... (you can determine what set of categories makes best sense to you) or by the first three digits of the SIC code, and see if that produces a good fit, and see if the multicollinearity is low. I suspect with >39,000 levels, you will always have high multicollinearity, but again, I don't have your data.

--
Paige Miller


12 REPLIES
PaigeMiller
Diamond | Level 26

Multicollinearity can be checked via the VIF calculations in PROC REG; high values of VIF indicate that multicollinearity is a serious problem. (What counts as a "high" VIF isn't settled: some people say >10, others say >5, and I have also seen >3.)

 

Example: https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.4/statug/statug_reg_details24.htm
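
For reference, a minimal sketch of requesting these diagnostics (TOL, VIF, and COLLIN are MODEL statement options in PROC REG; sashelp.cars is used here only as a stand-in dataset, not the poster's data):

proc reg data=sashelp.cars;
	/* request tolerance, variance inflation factors, and collinearity diagnostics */
	model msrp = horsepower enginesize weight / tol vif collin;
run;
quit;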

--
Paige Miller
Phil_NZ
Barite | Level 11

Hi @PaigeMiller 

Thank you for your reply and the SAS documentation. While reading the documentation, two things popped into my head:

1. I am wondering why they show the Tolerance when they already have the VIF.

[screenshot: Parameter Estimates table from the documentation, with Tolerance and Variance Inflation columns]

2. So, the VIF shows the multicollinearity between a given variable and all the other regressors, am I correct? For example, in the table above, 1.59 measures the multicollinearity between RunTime and all the other independent variables?

 

Many thanks and warm regards.

 

Thank you for your help, have a fabulous and productive day! I am a novice today, but someday when I accumulate enough knowledge, I can help others in my capacity.
PaigeMiller
Diamond | Level 26
  1. Some people prefer to look at Tolerance, others look at VIF. It doesn't really matter, since Tolerance is simply 1/VIF.
  2. Yes, the 1.59 for RunTime indicates that multicollinearity is not having a major impact on the estimate of the RunTime coefficient. The opposite is true for RunPulse: a big effect of multicollinearity.
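
For intuition, VIF_j = 1/(1 - R²_j), where R²_j comes from regressing predictor j on all the other predictors. A hedged sketch of reproducing a VIF by hand (sashelp.cars is a stand-in dataset; the EDF option writes the model R-square to the OUTEST= data set as _RSQ_):

proc reg data=sashelp.cars outest=aux edf;
	/* auxiliary regression: one predictor on all the others */
	model horsepower = enginesize weight;
run;
quit;

data vif_check;
	set aux;
	vif = 1 / (1 - _RSQ_);   /* should match PROC REG's VIF for horsepower */
run;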
--
Paige Miller
Ksharp
Super User

That is normal and not really due to multicollinearity. When you add another variable to the model, the coefficient of v1 will change; it all depends on the data and the GLM. So there is nothing to worry about. The MODEL is built on the data. If your data are not really good, the paradox can happen.

I also talked to @Rick_SAS about this topic (coefficients changing). He has the same opinion as me.

PaigeMiller
Diamond | Level 26

@Ksharp wrote:

That is normal and not really due to multicollinearity. When you add another variable to the model, the coefficient of v1 will change; it all depends on the data and the GLM. So there is nothing to worry about. The MODEL is built on the data. If your data are not really good, the paradox can happen.

I also talked to @Rick_SAS about this topic (coefficients changing). He has the same opinion as me.


If there is no multicollinearity, and all variables are orthogonal to each other, the coefficient of x1 will not change when another variable is added to the model. So the only explanation that I know of for coefficients changing is (partial) multicollinearity. What is another explanation for coefficients changing (and signs of the coefficient changing as well)?


I think it is something to worry about, especially if the coefficients change sign when variables are added.
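
A quick simulation illustrates both claims (simulated data, not the poster's): when x1 and x2 are independent, the coefficient of x1 barely moves when x2 enters the model; when they are strongly correlated, it can even flip sign.

data ortho corr;
	call streaminit(123);
	do i = 1 to 5000;
		x1 = rand('normal');
		x2 = rand('normal');                 /* independent of x1 */
		y  = 1*x1 - 2*x2 + rand('normal');
		output ortho;
		x2 = 0.95*x1 + 0.1*rand('normal');   /* strongly correlated with x1 */
		y  = 1*x1 - 2*x2 + rand('normal');
		output corr;
	end;
run;

proc reg data=ortho;
	m1: model y = x1;        /* slope of x1 is about +1 */
	m2: model y = x1 x2;     /* slope of x1 is still about +1 */
run;
quit;

proc reg data=corr;
	m1: model y = x1;        /* slope of x1 is about -0.9 */
	m2: model y = x1 x2;     /* slope of x1 flips back to about +1 */
run;
quit;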

--
Paige Miller
Ksharp
Super User
"and all variables are orthogonal to each other, "
In real world , any two variables have correlation . You will NOT be able to find orthogonal variables .

"if the coefficients change sign when variables are added."
It is all about model and data.
For any kind of statistic model , when you add one more variable, the variance of model would be assign to this variable (more or less). it is inevitable .
PaigeMiller
Diamond | Level 26

@Ksharp wrote:
"and all variables are orthogonal to each other, "
In real world , any two variables have correlation . You will NOT be able to find orthogonal variables .


If we are discussing the math, you can have orthogonal variables and understand the effect of adding another variable. If the variables are orthogonal, adding a variable does not cause the other coefficients to change. If the variables are not orthogonal, adding another variable causes coefficients to change, and the only reason I know of is multicollinearity.

 

In the real world, you can have orthogonal variables in a designed experiment.

 


"if the coefficients change sign when variables are added."
It is all about model and data.
For any kind of statistic model , when you add one more variable, the variance of model would be assign to this variable (more or less). it is inevitable .

 

Okay, I agree with the above if you don't have orthogonal variables. But earlier you said it is "nothing to worry about" and "not really due to multicollinearity"; those are the parts I disagree with. In my opinion, it is multicollinearity, and it is something to worry about.

--
Paige Miller
Ksharp
Super User
So how do you judge multicollinearity? rho = 0.3 or rho = 0.8? Even with rho = 0.1, the coefficient can change to another sign. Sometimes, with good data, the coefficient does not change; sometimes it does. It depends on your data; it is not the fault of the GLM.
Phil_NZ
Barite | Level 11

Hi @PaigeMiller  and @Ksharp 

 

Thank you so much for the discussion so far. I really and truly appreciate it. I tried my best to dig into my data to see what is really going on.

First, I ran the regression adding one variable at a time to see which variable causes the sign reversal (I ran it in Stata because Stata lets me arrange a tidy table quickly using the est sto and esttab commands).

[screenshot: regression results adding predictors one at a time; the coefficient of pt changes sign at column 8]

 

As can be seen, the coefficient of my variable of interest pt changes sign right at column 8. So it seems that wCAP_INT and wGRO_MAR are the two variables causing the sign reversal. Following @PaigeMiller's suggestion, I get the VIF with the code below:

proc reg data=merge_treat_con_copy;
	model wINV_DAY= pt wFIRM_SIZE LNGDP UNEMPLOYMENT  INFLATION 
		wTANGIBILITY FCF wLOG_MVE wCAP_INT wGRO_MAR/ tol vif collin;
run;
quit;

And the result is as below:

[screenshot: Tolerance and Variance Inflation output from PROC REG]

However, in my regression above I control for firm (variable Type) and year (variable yr) fixed effects, but I do not know how to add those fixed effects to the PROC REG above. I know we can put them in PROC GLM:

proc glm data=merge_treat_con_copy;
    class type yr;
    model wINV_DAY= pt wFIRM_SIZE LNGDP UNEMPLOYMENT  INFLATION 
		wTANGIBILITY FCF wLOG_MVE wCAP_INT wGRO_MAR type yr/solution ss3;
run;
quit;

But I cannot run this regression, because the log shows:

ERROR: Number of levels for some effects > 32767.

Following this discussion, @Reeza said that it may be because my Type variable has more than 32K values (and that is true). But I do not know about ABSORB yet.

 

 

So, can you please tell me how to get the VIF and related diagnostics in PROC GLM?

 

Many thanks and warm regards,

Thank you for your help, have a fabulous and productive day! I am a novice today, but someday when I accumulate enough knowledge, I can help others in my capacity.
PaigeMiller
Diamond | Level 26

So, can you please tell me how to get the VIF and related diagnostics in PROC GLM?

I doubt there is real value in having >32,767 levels of a variable. So my advice is to re-think your design, or combine some of the levels so that there are far fewer than 32,767 distinct levels.

--
Paige Miller
Phil_NZ
Barite | Level 11

Hi @PaigeMiller 

Thank you very much for your help so far.

I think having >32,767 distinct companies is reasonable in my case, because I am dealing with all firms in 64 countries from 1990 to 2019 (I also compared with some international studies). I summed the total number of distinct firms in my sample and it comes to 39,049 firms. So when controlling for the firm fixed effect there would be 39,049 - 1 = 39,048 levels, which is more than 32,767.
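
As a side note, this count can be verified directly in SAS; a minimal sketch using the dataset and Type variable from the code above:

proc sql;
	/* count the distinct firms (levels of Type) */
	select count(distinct type) as n_firms
	from merge_treat_con_copy;
quit;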

 

It seems it is time to consider the ABSORB statement. Does that make sense to you, @PaigeMiller?

 

Warm regards and thanks.

Thank you for your help, have a fabulous and productive day! I am a novice today, but someday when I accumulate enough knowledge, I can help others in my capacity.
PaigeMiller
Diamond | Level 26

ABSORB allows the model to be fit. You cannot get predicted values if you use the ABSORB statement. You cannot get least squares means of the ABSORBed variables. ABSORB does not help you understand multicollinearity, and it does not allow you to reduce multicollinearity.
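
For completeness, a hedged sketch of what the ABSORB statement would look like here, reusing the poster's earlier PROC GLM code (the data must first be sorted by the absorbed variable, and Type moves out of the CLASS and MODEL statements):

proc sort data=merge_treat_con_copy;
	by type;
run;

proc glm data=merge_treat_con_copy;
	absorb type;   /* absorb the firm fixed effect instead of fitting 39,048 dummies */
	class yr;
	model wINV_DAY= pt wFIRM_SIZE LNGDP UNEMPLOYMENT INFLATION
		wTANGIBILITY FCF wLOG_MVE wCAP_INT wGRO_MAR yr/solution ss3;
run;
quit;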

 

With regard to the fact that you have >39,000 companies, I am still skeptical that having a variable with >39,000 levels in your model is truly valuable. But I don't know your exact problem as well as you know it. Only you can make that decision. If it was me at this point, I would try to combine companies by some criteria, such as Manufacturing, Financial, Retail, ... (you can determine what set of categories makes best sense to you) or by the first three digits of the SIC code, and see if that produces a good fit, and see if the multicollinearity is low. I suspect with >39,000 levels, you will always have high multicollinearity, but again, I don't have your data.
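
For instance, a hypothetical sketch of grouping firms by the first three digits of a SIC code before refitting (the variables sic and sic3 are assumed names for illustration, not from the poster's data, and sic is assumed to be stored as a character variable):

data merge_grouped;
	set merge_treat_con_copy;
	/* hypothetical: collapse >39,000 firms into 3-digit SIC industry groups */
	sic3 = substr(sic, 1, 3);
run;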

--
Paige Miller

