fredkho
Calcite | Level 5

Hello,

I have a dataset containing 1 dependent variable and 180 independent variables, from which I should choose the 4 best possible independent variables.

What would be the best way to do such a thing?

Thanks,

8 REPLIES
PaigeMiller
Diamond | Level 26

To be honest, I would try a different approach. I've been through this myself, and my view has evolved from picking a small number of variables to concluding that this doesn't make sense, and instead using a different approach in which all variables are used in the model.

The problem is that there are many different definitions of "best possible independent variables". Also, forcing yourself to use only 4 may or may not be harmful, and furthermore, if your independent variables are correlated with one another, then least squares fitting and picking a few variables may not be the best approach at all.

I would use PROC PLS and fit a model that includes all independent variables, and which accounts for correlation between the independent variables; from there you could judge a small number of important variables (although that number may not be 4).
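
To give a rough idea, the fit with all 180 predictors might look something like this (an untested sketch; the data set name HAVE and the variable names Y and X1-X180 are placeholders for your own):

/* Untested sketch -- HAVE, Y, and X1-X180 are placeholder names */
proc pls data=have method=pls nfac=10;      /* NFAC= caps the number of PLS factors extracted */
   model y = x1-x180 / solution;            /* SOLUTION prints a coefficient for every predictor */
run;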

--
Paige Miller
SteveDenham
Jade | Level 19

I'll expand some on PaigeMiller's response.  It also very much depends on what your objectives are. Do you wish to describe the existing data, or use the derived equation to predict future data?  Your answer to that will very much determine what you might do.  Additionally, how many observations do you have?  Consider that with 180 variables, you have 180*179/2 = 16,110 pairwise correlations between those variables.  Unless you have roughly an order of magnitude larger number of observations, you are likely to have multicollinearity between predictors due strictly to sampling.
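
To see the scale of that, something along these lines would write all of those pairwise correlations to a data set rather than printing them (a sketch only; HAVE and X1-X180 are placeholder names):

/* Sketch only -- HAVE and X1-X180 are placeholder names */
proc corr data=have noprint outp=corrmat;   /* OUTP= writes the 180x180 correlation matrix to a data set */
   var x1-x180;
run;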

Are there any reasons to believe that all 180 variables have equal meaning in an interpretive sense?  If so, then PROC PLS may be your only hope.  If they do NOT all have equal meaning, then why include those that don't seem to have a sensible relationship with the response variable?  Just because they were measured does not necessarily mean that they are either explanatory or predictive.

Good luck.

Steve Denham

fredkho
Calcite | Level 5

Hello Steve,

Hello Paige,

Thank you very much for your amazing answers.

The number of observations is 462.

My independent variables are highly correlated since they are all items coming from financial statements; it would not make sense to keep them all, since we would run into major multicollinearity.

My final model should have between 4 and 10 independent variables. More would not be meaningful or helpful in my opinion since the objective of my model is to be predictive.

What would be helpful is a function that would try all possible sets of 4 independent variables (I have a total of 180 independent variables) and tell me which set is the most explanatory with the least amount of multicollinearity, heteroskedasticity, and serial correlation.

I wonder how that could be done in fact.

Any ideas?

PaigeMiller
Diamond | Level 26

... it would not make sense to keep them all, since we would run into major multicollinearity.

This is NOT true if you use PROC PLS. PLS was specifically designed to work in the presence of multicollinearity, and to keep all variables in the model (although many will be of no practical importance). Least squares regression has known problems in this case.

My final model should have between 4 and 10 independent variables.

My problem here is that specifying in advance how many independent variables to use might lead to trouble if, for example, there is only 1 significant predictor, or if there are 16 significant predictors.

What would be helpful is a function that would try all possible sets of 4 independent variables (I have a total of 180 independent variables) and tell me which set is the most explanatory with the least amount of multicollinearity, heteroskedasticity, and serial correlation.

In addition to my previously stated misgivings, I don't think there is a way to do this in SAS, other than by writing your own MACRO or PROC IML code.

You could use one of the STEPWISE methods in PROC REG and force the options START=4 and STOP=4 (I have never actually tried this, so I can't guarantee it will do what I think it will do). That will give you the best-fitting 4-parameter model it can find in a sequential fashion (it does not iterate through all possible 4-parameter models). But as I said, the drawback is that ordinary least squares has known problems in the case of multicollinearity, and it doesn't try to minimize the multicollinearity or adjust for it in any way. It also assumes 4 is the right number, which it may not be. Which again leads me back to PROC PLS, which has none of these drawbacks.
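
In code, that would be something along these lines (again untested, with placeholder names; check the PROC REG documentation for how START= and STOP= behave with SELECTION=STEPWISE before relying on it):

/* Untested sketch -- HAVE, Y, and X1-X180 are placeholder names */
proc reg data=have;
   model y = x1-x180 / selection=stepwise start=4 stop=4;   /* stepwise entry/removal, capped at 4 regressors */
run;
quit;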

--
Paige Miller
SteveDenham
Jade | Level 19

Wait a minute--serial correlation?  Are some of these 180 "independent" variables really indicators of time dependent measurements?

How about a description of how the data were collected, and why you think you might need to worry about serial correlation?  There may be a possibility for data reduction here.

But in the end, model building of this sort depends a lot on "prior art."  Someone decided that 180 variables were important at some point, or else they would not have been collected.  Someone decided to look at all of these, rather than a reasonable subset that they had an interest in, or knew from previous experience were associated with the response variable.

And so on to PROC PLS (provided there is not a time component of particular interest).  If you are truly interested in reducing variables, then once the eigenvector loadings are examined, you should be able to select those that are most influential in predicting your outcome.  PROC PLS also enables you to cross-validate the results.
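
As a rough illustration (placeholder names again), cross validation and the factor-by-factor details could be requested along these lines:

/* Rough sketch -- HAVE, Y, and X1-X180 are placeholder names */
proc pls data=have method=pls cv=one cvtest details;   /* leave-one-out CV, van der Voet test, per-factor loadings and weights */
   model y = x1-x180 / solution;
   output out=pls_out predicted=yhat;                  /* predicted values for checking the fit */
run;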

But I still want to know why the concern about serial correlation.  It may lead to an entirely different approach.

Steve Denham

fredkho
Calcite | Level 5

Thanks to both of you for these great answers.

I think there is no major concern about serial correlation here, since the data come from financial statements; the main concern is multicollinearity.

I will let you know how it goes later.

Best,

Reeza
Super User

Do you need variables you understand? What about principal component or factor analysis?

PaigeMiller
Diamond | Level 26

Do you need variables you understand? What about principal component or factor analysis?

Neither Principal Components nor Factor Analysis would be appropriate in a regression situation (even though you can certainly read about using those techniques for regression in the literature). It is more logical to use (here we go again) PLS, which creates vectors of independent variables that are designed to be highly predictive of the dependent variable. PCA and FA create vectors with no regard to the relationship to the dependent variable.

--
Paige Miller

