Chapi
Obsidian | Level 7

Hello, 

I have a large dataset with more than 500 variables. I would like to check the correlations among the independent variables and drop variables that are highly correlated with one another.

 

But it has become a huge task to eyeball the correlation table and find the highly correlated variables. Is there an easy way to identify them? I would then decide which variable of each correlated pair to drop, keeping the one with the higher predictive value.

 

Many thanks!!

2 REPLIES
StatDave
SAS Super FREQ

Use the BEST=n option in PROC CORR, where n is the number of largest correlations to show for each variable. With BEST=2, for example, you get a table showing just the two largest correlations for each variable. If that is still too much to look at, you can use an ODS OUTPUT statement to save that table in a data set and then process that data set in any way you like to further reduce the number of correlations to examine.
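A minimal sketch of that approach, under assumptions that are not in the original post: the data set name work.mydata, the variable names x1-x500, and the 0.8 cutoff are all hypothetical. The first step saves the BEST= table with ODS OUTPUT. The second step is a swapped-in alternative that uses the OUTP= data set from PROC CORR rather than the ODS table, because its layout (one row per variable with _TYPE_='CORR') is easy to scan for pairs above a threshold.

```sas
/* Step 1: keep only the largest correlations per variable and save the table. */
/* work.mydata and x1-x500 are placeholder names; substitute your own.         */
ods output PearsonCorr=work.best_corr;
proc corr data=work.mydata best=3;
   var x1-x500;
run;

/* Step 2 (alternative route): write the full correlation matrix to a data set */
/* with OUTP=, then list each pair whose absolute correlation is >= 0.8.       */
proc corr data=work.mydata outp=work.corr_out noprint;
   var x1-x500;
run;

data work.high_pairs;
   set work.corr_out;
   where _TYPE_ = 'CORR';          /* skip the MEAN, STD and N rows            */
   length var1 var2 $32;
   array c {*} x1-x500;
   var1 = _NAME_;
   /* Rows follow the VAR order, so starting at _N_ + 1 visits each pair once  */
   /* and skips the diagonal.                                                  */
   do j = _N_ + 1 to dim(c);
      if abs(c{j}) >= 0.8 then do; /* 0.8 is an arbitrary example cutoff       */
         var2  = vname(c{j});
         r     = c{j};
         abs_r = abs(r);
         output;
      end;
   end;
   keep var1 var2 r abs_r;
run;

/* Strongest pairs first */
proc sort data=work.high_pairs;
   by descending abs_r;
run;
```

The resulting work.high_pairs data set has one row per highly correlated pair, which is much easier to review than a 500 x 500 table. It is worth checking in the BEST= output whether each variable's correlation with itself counts toward n; if it does, choose n one larger than the number of other variables you want to see.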

PaigeMiller
Diamond | Level 26

You could also choose an analysis method that is extremely robust to highly correlated variables and skip the variable-deletion step entirely. One such method is Partial Least Squares (PROC PLS), which can take 500 highly correlated variables and build useful predictive models. Spectroscopy is a common application of PLS, in which large numbers of highly correlated variables are input into a predictive model. For an introduction, see https://support.sas.com/rnd/app/stat/papers/pls.pdf, in which Randall Tobias says:

 

Partial least squares (PLS) is a method for constructing predictive models when the factors are many and highly collinear.

 

P.S. Ignore the SAS code in that paper; the syntax has changed since then.
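Since the paper's code is out of date, here is a minimal sketch of a PROC PLS call in current syntax. The data set work.mydata, the response y, and the predictors x1-x500 are hypothetical names, and the choices of 15 candidate factors and split-sample cross validation are illustrative assumptions, not recommendations.

```sas
/* Fit a PLS model on all 500 predictors without dropping any of them.     */
/* work.mydata, y and x1-x500 are placeholder names; substitute your own.  */
proc pls data=work.mydata method=pls nfac=15 cv=split;
   model y = x1-x500;   /* extract up to 15 factors; cross validation      */
                        /* helps choose how many of them to retain         */
run;
```

With CV= specified, the procedure uses cross validation to select the number of factors, so NFAC= acts as an upper bound on the factors considered.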

--
Paige Miller
