
Hi All,

I am pretty new to modeling and am stuck with over 900 interval input variables. I need some ideas for reducing them. Is there a way to find the correlations between these variables so that redundancy can be handled? I am using Enterprise Miner and Enterprise Guide.


M_Maldonado
Barite | Level 11

Hey Minal,

Enterprise Miner is really easy to learn. Use the reference help a lot (press F1 on your keyboard) and google for white papers on the most common Analytics problems you are trying to solve.

Take a look at this thread where we list a good number of ways to do variable selection: .

Good luck!

Miguel

MinalMMurkhande
Calcite | Level 5

Hi Miguel,

Thank you for your reply! I had been waiting for someone to answer. Given a set of census data (400 interval variables) and financial data (another 400 interval variables), how do I find the correlations among these variables? Would correlation help as a first screening step, or should I start directly with decision tree / GBM models?

I can calculate Spearman and Hoeffding coefficients and also the VIF, but all of that comes later, once I have run the model. How do I start off with the initial screening? It would be very good if I could screen the variables using the Pearson correlation statistic, but that would give me a matrix with 400 rows and 400 columns :(

Also, I did Variable Clustering. I selected the one variable with the lowest 1-R^2 in each cluster. Would that work as well?

I don't know if I am thinking about this in the right way. Any help would be appreciated.

Thanks,

Minal

gergely_batho
SAS Employee

Hi,

You didn't mention the purpose of your study. Prediction? What is your target variable?

If census data are predictors, and financial variables are the target, then try PLS.

Yes, Variable Clustering is a good tool for explanatory analysis or for dimensionality reduction.

You can also do a PCA or variable selection (node). Or you can use some of the modeling nodes (tree, forest, regression, LAR/LASSO, PLS, etc.) to select useful variables.
Gergely

Message was edited by: Gergely Bathó

MinalMMurkhande
Calcite | Level 5

Hello,

I have a binary target variable. Both the census data and the finance data are predictors. Each has close to 400 variables, so the total number of variables that I need to reduce is 800. Variable Clustering did help, but I was looking for more accurate approaches, such as correlation. Is there no way to find the correlations between these variables? Also, would computing correlations for so many variables even work?

Thanks,

Minal

gergely_batho
SAS Employee

Hi Minal,

SAS is able to calculate the correlation matrix of those 800 variables. But, as you already noted, it is quite hard to look through roughly 800x800/2 (about 320,000) coefficients manually.
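One way around reading the whole matrix is to list only the pairs whose correlation is above some cutoff. Here is a minimal sketch of that idea in Python with NumPy rather than SAS; the function name, the 0.9 threshold, and the synthetic data are all assumptions made for illustration, not anything from Enterprise Miner.

```python
import numpy as np

def high_correlation_pairs(X, names, threshold=0.9):
    """Return (name_a, name_b, r) for every pair with |Pearson r| >= threshold."""
    corr = np.corrcoef(X, rowvar=False)            # p x p correlation matrix
    rows, cols = np.triu_indices_from(corr, k=1)   # upper triangle, diagonal excluded
    mask = np.abs(corr[rows, cols]) >= threshold
    pairs = [(names[i], names[j], float(corr[i, j]))
             for i, j in zip(rows[mask], cols[mask])]
    # Strongest relationships first
    return sorted(pairs, key=lambda t: -abs(t[2]))

# Synthetic stand-in data (hypothetical): 5 variables, x1 is nearly a copy of x0
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
X[:, 1] = X[:, 0] + 0.05 * rng.normal(size=500)
names = [f"x{i}" for i in range(5)]

pairs = high_correlation_pairs(X, names, threshold=0.9)
for a, b, r in pairs:
    print(f"{a} ~ {b}: r = {r:.3f}")
```

With 800 variables the same call would return only the flagged pairs, which is usually a much shorter list than the full matrix.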

One way to handle it is to calculate the first K principal components (PCA node in Enterprise Miner) and use them in a predictive model. PCA is based on the correlation matrix. Instead of the original variables, you will have K factor scores.
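To make the "K factor scores" concrete, here is a rough sketch of PCA on the correlation matrix, written in Python with NumPy as a stand-in for what the PCA node does internally. The function name, the choice of K = 2, and the synthetic data are assumptions for illustration only.

```python
import numpy as np

def pca_scores(X, k):
    """Project X onto the first k principal components of its correlation matrix."""
    # Standardize, so the covariance of Z equals the correlation matrix of X
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    corr = np.corrcoef(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)        # eigh handles symmetric matrices
    order = np.argsort(eigvals)[::-1][:k]          # indices of the k largest eigenvalues
    explained = eigvals[order] / eigvals.sum()     # share of total variance per component
    return Z @ eigvecs[:, order], explained

# Hypothetical data: 10 observed variables driven by 2 hidden factors plus noise
rng = np.random.default_rng(1)
latent = rng.normal(size=(1000, 2))
loadings = rng.normal(size=(2, 10))
X = latent @ loadings + 0.1 * rng.normal(size=(1000, 10))

scores, explained = pca_scores(X, k=2)
print(scores.shape, explained.round(3))
```

The columns of `scores` are uncorrelated with each other, and they would replace the original inputs in the predictive model.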

PCA factors and factor scores are hard to interpret, because each factor is a mixture (linear combination) of all variables. With variable clustering you also get factors but each of them depends only on some of the variables. Variable clustering is also based on correlations. It iteratively calculates PCA on the original variables (and on linear combinations of variables).

You can keep 1 variable from each cluster (as you described), or you can keep a linear combination of the variables in the cluster. The former is more interpretable, the latter is more “precise” in some sense.
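Both options can be sketched in a few lines. The Python/NumPy code below assumes the cluster assignments are already known (here they are simply hard-coded), computes each cluster's first principal component as the "linear combination", and picks the variable with the lowest (1 - R^2 own) / (1 - R^2 nearest) ratio as the representative. The function names and data are hypothetical, and this is not the Variable Clustering node's actual algorithm.

```python
import numpy as np

def first_component(Z):
    """Scores on the first principal component of the standardized columns Z."""
    _, vecs = np.linalg.eigh(np.corrcoef(Z, rowvar=False))
    return Z @ vecs[:, -1]                         # eigh sorts ascending; last is largest

def pick_representatives(X, clusters):
    """clusters: dict mapping a label to a list of column indices (at least 2 each)."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    comps = {c: first_component(Z[:, cols]) for c, cols in clusters.items()}
    chosen = {}
    for c, cols in clusters.items():
        ratios = []
        for j in cols:
            # Squared correlation of variable j with every cluster component
            r2 = {d: np.corrcoef(Z[:, j], comps[d])[0, 1] ** 2 for d in comps}
            own = r2.pop(c)
            nearest = max(r2.values(), default=0.0)
            ratios.append((1 - own) / (1 - nearest))
        chosen[c] = cols[int(np.argmin(ratios))]   # lowest ratio = best representative
    return chosen, comps

# Hypothetical data: two hidden factors, three noisy copies of each
rng = np.random.default_rng(2)
f, g = rng.normal(size=(2, 2000))
noise = rng.normal(size=(2000, 6))
X = np.column_stack([f, f, f, g, g, g]) + noise * [0.2, 0.8, 1.5, 1.5, 0.2, 0.8]
clusters = {"A": [0, 1, 2], "B": [3, 4, 5]}

chosen, comps = pick_representatives(X, clusters)
print(chosen)
```

Here `chosen` holds one original variable per cluster (the interpretable option), while `comps` holds one cluster component per cluster (the linear-combination option).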

Gergely

MinalMMurkhande
Calcite | Level 5


Thank you Gergely !

M_Maldonado
Barite | Level 11

@Gergely, thanks, that is some solid advice!

@Minal, if you are interested in calculating the VIF, here is one way to approach it: .
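Separately from that link, the VIF itself is easy to sketch: the VIF of variable j is 1 / (1 - R_j^2), which equals the j-th diagonal element of the inverse correlation matrix. The Python/NumPy code below is a minimal illustration under that identity; the function name and synthetic data are assumptions, and it is not a SAS procedure.

```python
import numpy as np

def vif(X):
    """Variance inflation factor per column: diagonal of the inverse correlation matrix."""
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))

# Hypothetical data: column 3 is almost a linear combination of columns 0 and 1
rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 4))
X[:, 3] = X[:, 0] + X[:, 1] + 0.1 * rng.normal(size=1000)

vifs = vif(X)
print(vifs.round(1))
```

Columns involved in the near-linear relationship get large values, while the independent column stays close to 1. With hundreds of highly redundant inputs the correlation matrix can be close to singular, so the inverse may be numerically unstable until some variables have been dropped.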

Thanks,

Miguel
