## Correlation analysis on large dataset with 500 variables

Hello,

I have a large dataset with more than 500 variables. I would like to check the correlations among the independent variables and drop any variables that are highly correlated with each other.

But eyeballing the correlation table to find the highly correlated variables has become a huge task. Is there an easy way to identify them? I would then decide which ones to drop based on their predictive value.

Many thanks!!

## Re: Correlation analysis on large dataset with 500 variables

Use the BEST=n option in PROC CORR, where n is the number of largest correlations to show for each variable. With BEST=2, for example, you get a table showing just the two largest correlations for each variable. If that is still too much to look at, you can use an ODS OUTPUT statement to save that table in a data set, then process that data set in any way you like to further reduce the number of correlations to examine.
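The idea behind this suggestion (keep only the n largest correlations per variable, then filter that saved table further) can be sketched outside SAS. The following is a minimal illustration in plain Python, not PROC CORR syntax. The function names, the toy data, the 0.9 cutoff, and the choice to rank by absolute value are all assumptions made for the example.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def best_correlations(columns, n=2):
    """For each variable, keep its n largest correlations (by absolute
    value, an assumption) with the *other* variables."""
    best = {}
    for name, values in columns.items():
        others = [(other, pearson(values, columns[other]))
                  for other in columns if other != name]
        others.sort(key=lambda pair: abs(pair[1]), reverse=True)
        best[name] = others[:n]
    return best

def flag_pairs(best, threshold=0.9):
    """Second pass over the reduced table: list each pair whose absolute
    correlation meets the (hypothetical) threshold, once per pair."""
    pairs = set()
    for name, partners in best.items():
        for other, r in partners:
            if abs(r) >= threshold:
                pairs.add(tuple(sorted((name, other))))
    return sorted(pairs)

# Toy data: x2 is roughly 2 * x1, x3 is only weakly related to either.
columns = {
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "x2": [2.1, 3.9, 6.2, 8.1, 9.8],
    "x3": [5.0, 1.0, 4.0, 2.0, 3.0],
}

best = best_correlations(columns, n=1)
flagged = flag_pairs(best, threshold=0.9)
print(flagged)
```

With 500 variables, the reduced table has at most 500 × n rows instead of roughly 125,000 distinct pairs, so the second filtering pass is cheap.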

## Re: Correlation analysis on large dataset with 500 variables

@Chapi wrote:

> Hello,
>
> I have a large dataset with more than 500 variables. I would like to check the correlations among the independent variables and drop any variables that are highly correlated with each other.
>
> But eyeballing the correlation table to find the highly correlated variables has become a huge task. Is there an easy way to identify them? I would then decide which ones to drop based on their predictive value.

You could also choose an analysis method that is extremely robust to highly correlated variables and skip the step of deleting variables entirely. One such method is Partial Least Squares (PROC PLS), which can take 500 highly correlated variables and build useful predictive models. Spectroscopy is a common application of PLS, in which large numbers of highly correlated variables are input into a predictive model. For an introduction, see https://support.sas.com/rnd/app/stat/papers/pls.pdf, in which Randall Tobias says:

> Partial least squares (PLS) is a method for constructing predictive models when the factors are many and highly collinear.

P.S. Ignore the SAS code in that paper; the syntax has changed since then.
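To give a feel for why PLS copes with collinear predictors, here is a minimal sketch of a single-component PLS regression (the first step of the NIPALS-style PLS1 algorithm) in plain Python. It is not PROC PLS syntax and not how SAS implements the procedure. The function names and toy data are assumptions for illustration, and a real analysis would use more components and cross-validation.

```python
import math

def center(values):
    """Return the mean-centered values and the mean."""
    mean = sum(values) / len(values)
    return [v - mean for v in values], mean

def pls1_one_component(x_cols, y):
    """Fit one PLS component. x_cols is a list of predictor columns.
    Returns the weight vector and a predict(row) function."""
    centered = []
    means = []
    for col in x_cols:
        c, m = center(col)
        centered.append(c)
        means.append(m)
    yc, y_mean = center(y)

    # Weights: covariance of each predictor with the response, normalized.
    w = [sum(a * b for a, b in zip(col, yc)) for col in centered]
    norm = math.sqrt(sum(v * v for v in w))
    w = [v / norm for v in w]

    # Scores: one latent variable combining all predictors.
    n = len(y)
    t = [sum(w[j] * centered[j][i] for j in range(len(w))) for i in range(n)]

    # Regress the response on the single score vector.
    q = sum(a * b for a, b in zip(t, yc)) / sum(a * a for a in t)

    def predict(row):
        score = sum(w[j] * (row[j] - means[j]) for j in range(len(w)))
        return y_mean + q * score

    return w, predict

# Toy data: two nearly collinear predictors, response close to x1 + x2.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [1.1, 1.9, 3.2, 3.9, 5.1]
y = [2.0, 4.1, 6.0, 8.1, 10.0]

w, predict = pls1_one_component([x1, x2], y)
fitted = [predict([a, b]) for a, b in zip(x1, x2)]
print(w, fitted)
```

Because the two predictors carry almost the same information, they receive almost equal weights and are folded into one latent score, so neither has to be dropped beforehand.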

--
Paige Miller