Variable reduction

MinalMMurkhande — Thu, 12 Feb 2015 21:09:16 GMT

Hi All,

I am pretty new to modeling, and I am stuck with over 900 interval input variables. I need some ideas to reduce them. Is there any way to find the correlation between these variables so that redundancy can be handled? I am using Enterprise Miner and Enterprise Guide.

Re: Variable reduction

M_Maldonado — Fri, 13 Feb 2015 15:14:23 GMT

Hey Minal,

Enterprise Miner is really easy to learn. Use the reference help a lot (press F1 on your keyboard) and google for white papers on the most common Analytics problems you are trying to solve.

Take a look at this thread where we list a good number of ways to do variable selection: .

Good luck!

Miguel

Re: Variable reduction

MinalMMurkhande — Sun, 15 Feb 2015 16:22:04 GMT

Hi Miguel,

Thank you for your reply! I had been waiting for someone to answer. Given a set of census data (400 interval variables) and financial data (another 400 interval variables), how do I find the correlation within these variables? Will correlation help as a first screening, or should I start directly with decision tree / GBM models?

I can calculate Spearman and Hoeffding coefficients and also the VIF, but all that comes later, once I run the model. How do I start off with the initial screening? It would be very good if I could screen them using the Pearson correlation statistic, but it would give me a matrix with 400 rows and 400 columns.

Also, I did variable clustering. I select the one variable with the lowest 1-R^2 in each cluster. Would that work as well?

I don't know if I am thinking about this in the right way. Any help would be appreciated.

Thanks,

Minal

Re: Variable reduction

gergely_batho — Mon, 16 Feb 2015 00:23:29 GMT

Hi,

You didn't mention the purpose of your study. Prediction? What is your target variable?

If census data are predictors, and financial variables are the target, then try PLS.

Yes, variable clustering is a good tool for exploratory analysis or for dimensionality reduction.

You can also do a PCA or variable selection (node). Or you can use some of the modeling nodes (tree, forest, regression, LAR/LASSO, PLS, etc.) to select useful variables.

Gergely

Re: Variable reduction

MinalMMurkhande — Mon, 16 Feb 2015 02:23:49 GMT

Hello,

I have a target variable which is binary. Both the census data and the finance data are predictors. Each has close to 400 variables, so the total number of variables that I need to reduce is 800. Variable clustering did help, but I was looking for more accurate solutions like correlation. Is there no way I can find the correlation between these variables? Also, would finding correlations for so many variables work?

Thanks,

Minal

Re: Variable reduction

gergely_batho — Mon, 16 Feb 2015 12:48:02 GMT

Hi Minal,

SAS is able to calculate the correlation matrix of those 800 variables. But as you already noted, it is quite hard to look at 800x800/2 coefficients manually.
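One way to avoid reading the whole matrix is to list only the strongly correlated pairs. Here is a minimal sketch of that idea in plain Python rather than SAS; the column names, the toy values, and the 0.9 cutoff are all made up for illustration:

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences.
    Assumes neither column is constant (otherwise this divides by zero)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def correlated_pairs(columns, threshold=0.9):
    """Return (name_a, name_b, r) for every pair with |r| >= threshold,
    strongest correlation first."""
    pairs = []
    for (name_a, xa), (name_b, xb) in combinations(columns.items(), 2):
        r = pearson(xa, xb)
        if abs(r) >= threshold:
            pairs.append((name_a, name_b, r))
    return sorted(pairs, key=lambda p: -abs(p[2]))

# Toy data with hypothetical column names.
data = {
    "income": [10, 20, 30, 40, 50],
    "income_k": [1.0, 2.1, 2.9, 4.0, 5.2],
    "age": [30, 25, 41, 38, 22],
}
flagged = correlated_pairs(data, threshold=0.9)
for name_a, name_b, r in flagged:
    print(f"{name_a} ~ {name_b}: r = {r:.3f}")
```

Each flagged pair is a candidate for dropping one of its two variables. With 800 inputs this checks about 320,000 pairs, so the cutoff mostly decides how long the list you have to read is.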

One way to handle it is to calculate the first K principal components (PCA node in Enterprise Miner) and use them in a predictive model. PCA is based on the correlation matrix. Instead of the original variables, you will have K factor scores.
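To make the "first K components" idea concrete, here is a small sketch in plain Python (not what the PCA node does internally). It standardizes the columns, builds the correlation matrix, and pulls out the top K eigenvectors with power iteration and deflation, a technique chosen here only to keep the example dependency-free. The toy data and all function names are hypothetical:

```python
import random
from math import sqrt

def standardize(col):
    """Center a column and scale it to unit sample standard deviation."""
    n = len(col)
    mean = sum(col) / n
    sd = sqrt(sum((v - mean) ** 2 for v in col) / (n - 1))
    return [(v - mean) / sd for v in col]

def correlation_matrix(z):
    """Correlation matrix of already-standardized columns."""
    n = len(z[0])
    return [[sum(a * b for a, b in zip(zi, zj)) / (n - 1) for zj in z] for zi in z]

def first_components(R, k, iters=200):
    """Top-k (eigenvalue, eigenvector) pairs of the symmetric matrix R,
    found by power iteration with deflation."""
    p = len(R)
    A = [row[:] for row in R]
    rng = random.Random(0)  # fixed seed so the sketch is repeatable
    pairs = []
    for _ in range(k):
        v = [rng.random() + 0.1 for _ in range(p)]
        for _ in range(iters):
            w = [sum(A[i][j] * v[j] for j in range(p)) for i in range(p)]
            norm = sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        lam = sum(v[i] * A[i][j] * v[j] for i in range(p) for j in range(p))
        pairs.append((lam, v))
        # Deflate: remove the component just found before looking for the next.
        A = [[A[i][j] - lam * v[i] * v[j] for j in range(p)] for i in range(p)]
    return pairs

# Toy data: the first two columns are nearly redundant, the third is not.
cols = [
    [1, 2, 3, 4, 5, 6],
    [1.1, 1.9, 3.2, 3.9, 5.1, 5.8],
    [3, 1, 4, 1, 5, 2],
]
z = [standardize(c) for c in cols]
components = first_components(correlation_matrix(z), k=2)
# scores[c][row] is the value of component c for each observation.
scores = [[sum(zj[r] * vj for zj, vj in zip(z, v)) for r in range(len(z[0]))]
          for _, v in components]
```

The `scores` lists are what would replace the original variables as model inputs. Each eigenvalue is the variance carried by its component, so two components here cover almost all of the three standardized variables.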

PCA factors and factor scores are hard to interpret, because each factor is a mixture (linear combination) of all variables. With variable clustering you also get factors but each of them depends only on some of the variables. Variable clustering is also based on correlations. It iteratively calculates PCA on the original variables (and on linear combinations of variables).

You can keep 1 variable from each cluster (as you described), or you can keep a linear combination of the variables in the cluster. The former is more interpretable, the latter is more “precise” in some sense.
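Picking one variable per cluster can be sketched as follows, in plain Python rather than SAS. This assumes the cluster assignments are already known and, as a simplification, uses the centroid (the mean of the standardized variables) as the cluster component instead of a principal component. The 1-R^2 ratio is taken as (1 - R^2 with own cluster) / (1 - R^2 with the nearest other cluster). Variable names and data are invented, and at least two clusters are required:

```python
from math import sqrt

def zscore(col):
    """Center a column and scale it to unit sample standard deviation."""
    n = len(col)
    mean = sum(col) / n
    sd = sqrt(sum((v - mean) ** 2 for v in col) / (n - 1))
    return [(v - mean) / sd for v in col]

def r_squared(x, y):
    """Squared Pearson correlation of two sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)

def pick_representatives(columns, clusters):
    """columns: name -> values; clusters: cluster id -> list of names.
    Returns cluster id -> (name, ratio) for the lowest 1-R^2 ratio."""
    z = {name: zscore(vals) for name, vals in columns.items()}
    # Centroid component: row-wise mean of the cluster's standardized variables.
    centroid = {
        cid: [sum(row) / len(names) for row in zip(*(z[n] for n in names))]
        for cid, names in clusters.items()
    }
    best = {}
    for cid, names in clusters.items():
        scored = []
        for name in names:
            own = r_squared(z[name], centroid[cid])
            nearest = max(r_squared(z[name], c)
                          for other, c in centroid.items() if other != cid)
            scored.append(((1 - own) / (1 - nearest), name))
        ratio, name = min(scored)
        best[cid] = (name, ratio)
    return best

# Toy data: a2 and a3 are noisy copies of a1; b1 and b2 form a second cluster.
columns = {
    "a1": [-5, -3, -1, 1, 3, 5],
    "a2": [-4, -5, 0, 2, 1, 6],
    "a3": [-6, -1, -2, 0, 5, 4],
    "b1": [2, 7, 1, 8, 2, 8],
    "b2": [3, 6, 2, 9, 1, 7],
}
clusters = {"A": ["a1", "a2", "a3"], "B": ["b1", "b2"]}
best = pick_representatives(columns, clusters)
```

A low ratio means the variable is well explained by its own cluster and poorly explained by the others, which is why it makes a reasonable stand-in for the cluster.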

Gergely

Re: Variable reduction

MinalMMurkhande — Mon, 16 Feb 2015 14:38:39 GMT

Thank you, Gergely!

Re: Variable reduction

M_Maldonado — Mon, 16 Feb 2015 14:44:15 GMT

@Gergely, thanks, that is some solid advice!

@Minal, if you are interested in calculating the VIF, here is one way to approach it: .
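As a rough illustration of what the VIF measures (plain Python, not SAS, with invented data and names): the VIF of variable j is 1 / (1 - R_j^2), where R_j^2 comes from regressing variable j on all the other inputs. That is equal to the j-th diagonal entry of the inverse correlation matrix, which is what this sketch computes:

```python
from math import sqrt

def correlation_matrix(cols):
    """Pearson correlation matrix of a list of numeric columns."""
    def z(col):
        n = len(col)
        mean = sum(col) / n
        sd = sqrt(sum((v - mean) ** 2 for v in col) / (n - 1))
        return [(v - mean) / sd for v in col]
    zs = [z(c) for c in cols]
    n = len(cols[0])
    return [[sum(a * b for a, b in zip(zi, zj)) / (n - 1) for zj in zs] for zi in zs]

def invert(M):
    """Gauss-Jordan matrix inverse with partial pivoting.
    Assumes M is non-singular (no perfectly collinear inputs)."""
    n = len(M)
    A = [row[:] + [float(i == j) for j in range(n)] for i, row in enumerate(M)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]
        d = A[c][c]
        A[c] = [v / d for v in A[c]]
        for r in range(n):
            if r != c:
                f = A[r][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [row[n:] for row in A]

def vif(columns):
    """Map each column name to its variance inflation factor."""
    names = list(columns)
    R_inv = invert(correlation_matrix([columns[n] for n in names]))
    return {name: R_inv[i][i] for i, name in enumerate(names)}

# Toy data: x1 and x2 are nearly collinear, so both get a large VIF.
data = {
    "x1": [1, 2, 3, 4, 5, 6],
    "x2": [1.1, 1.9, 3.2, 3.9, 5.1, 5.8],
    "x3": [3, 1, 4, 1, 5, 2],
}
vifs = vif(data)
for name, value in vifs.items():
    print(f"{name}: VIF = {value:.1f}")
```

A VIF of 1 means the variable is uncorrelated with the rest. Which cutoff counts as too high (5 and 10 are commonly quoted) is a judgment call and depends on the model.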

Thanks,

Miguel