Solved: Re: Variable reduction

MinalMMurkhande · Posted 02-12-2015 04:09 PM

Hi All,

I am pretty new to modeling, and I am stuck with over 900 interval input variables. I need some ideas for reducing them. Is there any way to find the correlations between these variables so that redundancy can be handled? I am using Enterprise Miner and Enterprise Guide.

M_Maldonado · Posted 02-13-2015 10:14 AM

Hey Minal,

Enterprise Miner is really easy to learn. Use the reference help a lot (press F1 on your keyboard) and google for white papers on the most common Analytics problems you are trying to solve.

Take a look at this thread where we list a good number of ways to do variable selection: .

Good luck!

Miguel

MinalMMurkhande · Posted 02-15-2015 11:22 AM

Hi Miguel,

Thank you for your reply! I had been waiting for someone to answer. Given a set of census data (400 interval variables) and financial data (another 400 interval variables), how do I find the correlations within these variables? Will correlation help as a first screening step, or should I start directly with decision tree / GBM models?

I can calculate Spearman and Hoeffding coefficients and also the VIF, but all of that comes later, once I run the model. How do I start off with the initial screening? It would be very good if I could screen the variables using the Pearson correlation statistic, but that would give me a matrix with 400 rows and 400 columns.

Also, I did variable clustering and selected the variable with the lowest 1-R^2 ratio in each cluster. Would that work as well?

I don't know if I am thinking about this in the right way. Any help would be appreciated.

Thanks,

Minal

gergely_batho · Posted 02-15-2015 07:23 PM

Hi,

You didn't mention the purpose of your study. Prediction? What is your target variable?

If census data are predictors, and financial variables are the target, then try PLS.

Yes, Variable Clustering is a good tool for exploratory analysis or for dimensionality reduction.

You can also do a PCA or variable selection (node). Or you can use some of the modeling nodes (tree, forest, regression, LAR/LASSO, PLS, etc.) to select useful variables.

Gergely

Message was edited by: Gergely Bathó

MinalMMurkhande · Posted 02-15-2015 09:23 PM

Hello,

I have a binary target variable. Both the census data and the finance data are predictors. Each has close to 400 variables, so the total number of variables that I need to reduce is 800. Variable Clustering did help, but I was looking for a more direct approach, such as correlation. Is there no way to find the correlations between these variables? Also, would computing correlations for so many variables even work?

Thanks,

Minal

gergely_batho · Posted 02-16-2015 07:48 AM

Hi Minal,

SAS is able to calculate the correlation matrix of those 800 variables. But as you already noted, it is quite hard to look at roughly 800x800/2 (about 320,000) coefficients manually.
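
One way to make a matrix of that size readable is to list only the pairs whose absolute correlation exceeds a cutoff, instead of scanning every cell. A minimal sketch in plain Python (not SAS code); the variable names, the toy data, and the 0.9 cutoff are assumptions made purely for illustration:

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def high_correlation_pairs(data, cutoff=0.9):
    """Return (name_a, name_b, r) for every pair with |r| >= cutoff,
    sorted from the strongest correlation to the weakest."""
    pairs = []
    for a, b in combinations(sorted(data), 2):
        r = pearson(data[a], data[b])
        if abs(r) >= cutoff:
            pairs.append((a, b, r))
    return sorted(pairs, key=lambda p: -abs(p[2]))


# Hypothetical toy data: three interval inputs, five observations each.
data = {
    "income": [30, 45, 50, 62, 80],
    "spend": [15, 22, 26, 30, 41],
    "age": [25, 60, 31, 48, 39],
}
redundant = high_correlation_pairs(data, cutoff=0.9)
for a, b, r in redundant:
    print(f"{a} ~ {b}: r = {r:.3f}")
```

With 800 inputs the same idea applies, but the list of flagged pairs is usually far shorter than the full matrix, and one variable from each flagged pair is a candidate for removal.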

One way to handle it is to calculate the first K principal components (PCA Node in Enterprise Miner) and use them in a predictive model. PCA is based on the correlation matrix. Instead of the original variables, you will have K factor scores.
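
To make "factor scores" concrete, here is a rough sketch of the first principal component computed from a correlation matrix. It is plain Python using power iteration, not what the Enterprise Miner node does internally, and it only extracts one component; the three input columns are made up for illustration:

```python
from math import sqrt


def standardize(column):
    """Center a column and scale it to unit sample standard deviation."""
    n = len(column)
    mean = sum(column) / n
    sd = sqrt(sum((v - mean) ** 2 for v in column) / (n - 1))
    return [(v - mean) / sd for v in column]


def correlation_matrix(z_columns):
    """Correlation matrix of columns that are already standardized."""
    n = len(z_columns[0])
    return [[sum(a * b for a, b in zip(ci, cj)) / (n - 1) for cj in z_columns]
            for ci in z_columns]


def first_component(corr, iterations=200):
    """Power iteration: leading eigenvector (loadings) and eigenvalue of corr.
    Assumes the equal-weight start vector is not orthogonal to the answer."""
    p = len(corr)
    vec = [1.0 / sqrt(p)] * p
    for _ in range(iterations):
        nxt = [sum(corr[i][j] * vec[j] for j in range(p)) for i in range(p)]
        norm = sqrt(sum(v * v for v in nxt))
        vec = [v / norm for v in nxt]
    # The Rayleigh quotient gives the variance explained by this component.
    rv = [sum(corr[i][j] * vec[j] for j in range(p)) for i in range(p)]
    eigenvalue = sum(a * b for a, b in zip(vec, rv))
    return vec, eigenvalue


# Hypothetical inputs: x2 is an exact multiple of x1, x3 is uncorrelated with both.
columns = [[1, 2, 3, 4], [2, 4, 6, 8], [1, -1, -1, 1]]
z = [standardize(c) for c in columns]
loadings, eigenvalue = first_component(correlation_matrix(z))
# One score per observation: the weighted sum of the standardized inputs.
scores = [sum(w * col[i] for w, col in zip(loadings, z)) for i in range(len(z[0]))]
print(loadings, eigenvalue, scores)
```

Here the two redundant inputs share the first component equally and the third input gets a loading near zero, so a single score column replaces the correlated pair.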

PCA factors and factor scores are hard to interpret, because each factor is a mixture (linear combination) of all variables. With variable clustering you also get factors, but each of them depends on only some of the variables. Variable clustering is also based on correlations. It iteratively calculates PCA on the original variables (and on linear combinations of variables).

You can keep 1 variable from each cluster (as you described), or you can keep a linear combination of the variables in the cluster. The former is more interpretable, the latter is more “precise” in some sense.
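
For the first option, the usual selection rule (as I understand the variable clustering output) is the 1-R^2 ratio, (1 - R^2 with own cluster) / (1 - R^2 with next closest cluster), keeping the variable with the lowest ratio in each cluster. A small sketch in plain Python; the variable names, cluster labels, and R^2 values are invented for illustration and would normally come from the clustering results:

```python
def one_minus_r2_ratio(r2_own, r2_nearest):
    """Low values mean: well explained by its own cluster, poorly by the next one."""
    return (1.0 - r2_own) / (1.0 - r2_nearest)


def pick_representatives(rows):
    """rows: (variable, cluster, r2_own, r2_nearest) tuples.
    Returns {cluster: variable with the lowest 1-R^2 ratio}."""
    best = {}
    for name, cluster, r2_own, r2_nearest in rows:
        ratio = one_minus_r2_ratio(r2_own, r2_nearest)
        if cluster not in best or ratio < best[cluster][1]:
            best[cluster] = (name, ratio)
    return {cluster: name for cluster, (name, _) in best.items()}


# Hypothetical clustering summary: two clusters, two variables each.
rows = [
    ("median_income", "CLUS1", 0.90, 0.20),
    ("mean_income", "CLUS1", 0.85, 0.30),
    ("loan_balance", "CLUS2", 0.70, 0.10),
    ("credit_limit", "CLUS2", 0.80, 0.15),
]
representatives = pick_representatives(rows)
print(representatives)
```

The second option, keeping a linear combination per cluster, corresponds to using the cluster component score instead of a single member variable.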

Gergely

MinalMMurkhande · Posted 02-16-2015 09:38 AM

Thank you, Gergely!

M_Maldonado · Posted 02-16-2015 09:44 AM

@Gergely, thanks, that is some solid advice!

@Minal, if you are interested in calculating the VIF, here is one way to approach it: .

Thanks,

Miguel
