Better statistical analyses or issue of multicollinearity?

joebacon · Posted 02-01-2019 12:47 PM

Hi all,

I am at a standstill in a project I am working on and was hoping that someone who has dealt with this kind of situation could help point me in the right direction.

I have a large time-series dataset, pooled from 5 different studies, containing participants' daily responses about alcohol intake and the desire to drink, covering one year before and one year after they decided to quit drinking. I also have data on what these individuals spent their money on, broken down into categories.

My problem is that I am trying to show that the allocation of money predicts what their drinking status will be one year after resolution. I have 14 expenditure variables (one per category). To do this, I ran 14 separate ANOVAs, almost all of which came back non-significant. I checked the distributions of those variables and they are VERY skewed. So, I took log(Post+1)/log(Pre+1) to try to normalize the data and repeated the ANOVAs, to no avail. There were three different drinking statuses. I collapsed these to two and tried again: nothing. Then I tried running logistic regressions with covariates added to see if that could account for some of the issue. The global test was not significant, and only 1 term (which I know predicts drinking status) was significant in each model.
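For readers following along, here is a minimal Python sketch of the transform described above, using only the standard library. The function names and the spending amounts are hypothetical and purely illustrative. One thing the sketch makes visible: the ratio is undefined whenever Pre is 0, because log(0+1) = 0. A difference of logs is shown next to it as one possible alternative that stays defined for zero spending; this is an assumption on my part, not something taken from the analysis in the post.

```python
import math

def log_ratio(pre, post):
    """log(post + 1) / log(pre + 1), as written in the post; None when pre == 0."""
    denom = math.log1p(pre)          # log1p(x) computes log(1 + x)
    if denom == 0.0:
        return None                  # log(0 + 1) = 0, so the ratio is undefined
    return math.log1p(post) / denom

def log_difference(pre, post):
    """log(post + 1) - log(pre + 1); defined even when pre or post is 0."""
    return math.log1p(post) - math.log1p(pre)

# Hypothetical (pre, post) spending amounts in one category, made up for illustration.
pairs = [(0, 50), (9, 99), (120, 30)]
for pre, post in pairs:
    print(pre, post, log_ratio(pre, post), round(log_difference(pre, post), 3))
```

If many people spent nothing in a category before resolution, the ratio version silently loses those cases, which is worth checking before interpreting any test on it.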

From here, I tried re-categorizing the expenditure variables based on a correlation matrix and repeating all the steps to no avail. There was one global regression that was significant but only one term was significant. All of the ANOVAs came back with nothing.

I was wondering if I am going about this incorrectly at a conceptual level. Based on past studies, some behavioral economics, and some anecdotal experience, I have good reason to believe that changes in the allocation of funds should predict drinking status. I am wondering if maybe the categories are too non-normal or there is an issue of multicollinearity.

I am more than willing to provide any extra details, but was hoping someone could shed some light, since there are some very bright-minded statisticians on this forum!

I apologize if this is in the wrong community and thank you in advance.

PaigeMiller · Posted 02-01-2019 12:54 PM

So it's not completely clear what you did.

The part where you ran 14 different ANOVAs means there's no multicollinearity there. Then you say "I checked the distributions of said variables" ... which variables? Can you be more specific? Where does the ANOVA actually come from? ANOVA requires categorical X variables, but you say they are expenditures, which are not categorical? Did you check the distributions of the Y variables or the X variables? Have you accounted for time-series autocorrelation? Are you running models with all 14 expenditure variables in the model at the same time?
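Since autocorrelation comes up here, a minimal Python sketch of what the term means may help: it is the correlation of a series with a lagged copy of itself. The function name and the daily values below are made up for illustration and are not from the dataset under discussion.

```python
from statistics import fmean

def lag_autocorrelation(series, lag=1):
    """Sample autocorrelation of a series at the given lag."""
    n = len(series)
    if not 0 < lag < n:
        raise ValueError("lag must be between 1 and len(series) - 1")
    mean = fmean(series)
    denom = sum((x - mean) ** 2 for x in series)
    if denom == 0:
        raise ValueError("series is constant")
    # Cross-products of each value with the value `lag` steps later.
    num = sum((series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag))
    return num / denom

# Hypothetical daily drink counts for one person (made-up numbers).
daily_drinks = [0, 0, 1, 3, 4, 4, 2, 0, 0, 1, 3, 5, 4, 1]
r1 = lag_autocorrelation(daily_drinks, lag=1)
print(f"lag-1 autocorrelation: {r1:.2f}")
```

A value well above zero means that adjacent days resemble each other, so daily observations from the same person are not independent, which is what standard ANOVA and regression assume.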

--
Paige Miller

joebacon · Posted 02-01-2019 01:06 PM

Hi Paige!

I checked the distributions of the expenditure variables which are EXTREMELY skewed.

The ANOVAs were run on the drink status by each of the 14 expenditure variables. You are correct in saying that the expenditure variables are not categorical.

There are only 3 (or 2 if I change it) levels to the drink status variable: there are 235 in level 1, 113 in level 2, and 64 in level 3.

I am not sure what time-series autocorrelation is, so no, I have not.

I have run a model with all of the expenditure variables in the model, which was globally significant, but the individual terms were not.

PaigeMiller · Posted 02-01-2019 02:23 PM

@joebacon wrote:

Hi Paige!

I checked the distributions of the expenditure variables which are EXTREMELY skewed.

It is the residuals of the Y variable that have to be normally distributed in an ANOVA or regression. There is no requirement that the X-variables have any particular distribution.

There are only 3 (or 2 if i change it) levels to the drink status variable there are 235 in level 1, 113 in level 2, and 64 in level 3.

So you are saying your Y variable is categorical and not continuous, as it would be if you were doing a regression or ANOVA. With categorical Y, you probably need logistic regression, and again, there is no requirement that the X-variables have any particular distribution.
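To illustrate the point about a categorical Y, here is a minimal Python sketch of a two-level logistic regression with a single skew-prone predictor, fitted by plain gradient ascent on the log-likelihood. Everything in it (function names, learning rate, the data) is an assumption made for illustration; in practice you would use a dedicated procedure such as PROC LOGISTIC, and a three-level status would need a multinomial or ordinal model instead.

```python
import math

def predict(b0, b1, x):
    """P(y = 1 | x) under the logistic model."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

def fit_logistic(x, y, lr=0.1, steps=5000):
    """Fit intercept b0 and slope b1 by gradient ascent on the mean log-likelihood."""
    b0 = b1 = 0.0
    n = len(x)
    for _ in range(steps):
        g0 = g1 = 0.0
        for xi, yi in zip(x, y):
            resid = yi - predict(b0, b1, xi)   # observed minus fitted probability
            g0 += resid
            g1 += resid * xi
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1

# Hypothetical data: x is a transformed expenditure change, y is a 0/1 drinking status.
x = [-1.2, -0.8, -0.5, -0.3, 0.0, 0.2, 0.4, 0.7, 1.0, 1.3]
y = [0, 0, 1, 0, 0, 1, 0, 1, 1, 1]
b0, b1 = fit_logistic(x, y)
print(f"intercept={b0:.3f} slope={b1:.3f} P(y=1 | x=0.5)={predict(b0, b1, 0.5):.3f}")
```

Note that nothing in the fit depends on x being normally distributed; the model only assumes a linear relationship between x and the log-odds of Y.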

I have ran a model with all of the expenditure variables in the model which was globally significant but the individual terms were not.

And at this point, you probably do have multicollinearity among your X-variables, which can affect the statistical significance levels of the X-variable slopes. My preferred solution for modeling categorical Y-variables with multicollinearity in the X-variables is Logistic Partial Least Squares regression, which sadly is not available in SAS at this time (it is available in R).
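One common way to check for multicollinearity before choosing a remedy is the variance inflation factor (VIF). The Python sketch below is a minimal, standard-library-only illustration; it relies on the fact that the VIF of each predictor is the corresponding diagonal element of the inverse correlation matrix. The column names and values are hypothetical, and the helper functions are written for this example rather than taken from any library.

```python
from statistics import fmean

def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(n)] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("correlation matrix is singular (perfect collinearity)")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * c for v, c in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def vifs(columns):
    """VIF per predictor: the diagonal of the inverse correlation matrix."""
    names = list(columns)
    corr = [[pearson(columns[a], columns[b]) for b in names] for a in names]
    inv = invert(corr)
    return {name: inv[i][i] for i, name in enumerate(names)}

# Hypothetical expenditure categories (made-up values); the first two move together.
spending = {
    "alcohol": [10, 20, 30, 40, 50, 60],
    "entertainment": [12, 19, 33, 38, 52, 61],
    "housing": [5, 9, 4, 8, 3, 7],
}
for name, value in vifs(spending).items():
    print(f"{name}: VIF = {value:.1f}")
```

A frequently quoted rule of thumb treats VIFs above roughly 5 to 10 as a sign of problematic collinearity, though that threshold is a convention rather than a hard rule.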

As a suggestion for this type of discussion in the future, it would help if you clearly indicate which variables are X-variables and which variable(s) are Y-variables, which are continuous and which are categorical.

--
Paige Miller

joebacon · Posted 02-01-2019 03:09 PM

The thing is, the variables changed. So, expenditure variables were the X variables in the logistic regression.

However, they were the Y variables in the ANOVA.

I will check out a logistic partial least squares though!

I apologize for not being clear, but thank you for the response!
