Statistical Procedures

Correlation analysis on large dataset with 500 variables

Chapi · Posted 11-26-2020 10:55 AM

Hello,

I have a large dataset with 500-plus variables. I would like to check the correlations among the independent variables and drop any variables that are highly correlated with each other.

But it has become a huge task to eyeball the correlation table and find the highly correlated variables. Is there an easy way to identify them? I would later decide which one of each correlated pair to drop, based on which has the higher predictive value.

Many thanks!!

StatDave · Posted 11-26-2020 11:05 AM

Use the BEST=n option in PROC CORR, where n is the number of largest correlations to show for each variable. So, with BEST=2 you will get a table showing just the two largest correlations for each variable. If that is still too much to look at, you can use an ODS OUTPUT statement to save that table in a data set, and you can then process that data set in any way you like to further reduce the number of correlations to examine.
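As a rough sketch of what this could look like (not tested against the original poster's data): the data set name HAVE, the variable list x1-x500, the output data set names, and the 0.8 cutoff below are all placeholders chosen for illustration. The first step saves the BEST= table through ODS OUTPUT. The second step is a different route from BEST=: it writes the full correlation matrix with OUTP= and keeps only the pairs whose absolute correlation reaches the cutoff.

```sas
/* Assumed names: data set HAVE with predictors x1-x500 (placeholders). */

/* Step 1: show only the 3 largest correlations per variable and    */
/* save that table. Check the saved columns with PROC CONTENTS      */
/* before processing them further.                                  */
proc corr data=have best=3 nosimple;
   var x1-x500;
   ods output PearsonCorr=top_corr;
run;

/* Step 2 (alternative to BEST=): write the full matrix to a data   */
/* set, then list each highly correlated pair once.                 */
proc corr data=have outp=corr_mat noprint;
   var x1-x500;
run;

data high_pairs (keep=var1 var2 corr);
   set corr_mat (where=(_type_ = 'CORR'));
   array cols{*} _numeric_;      /* the x1-x500 columns of the matrix */
   length var1 var2 $32;
   var1 = _name_;                /* row variable                      */
   do j = 1 to dim(cols);
      var2 = vname(cols{j});     /* column variable                   */
      corr = cols{j};
      /* var1 < var2 skips the diagonal and mirrored duplicates;    */
      /* 0.8 is an arbitrary illustrative cutoff.                   */
      if var1 < var2 and abs(corr) >= 0.8 then output;
   end;
run;

proc sort data=high_pairs;
   by descending corr;
run;
```

The resulting HIGH_PAIRS data set has one row per variable pair, which is usually easier to scan than a 500-by-500 table. Note that sorting by CORR puts strong negative correlations at the bottom of the list rather than the top.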

PaigeMiller · Posted 11-26-2020 05:09 PM

You could also choose an analysis method that is extremely robust to highly correlated variables, and skip the step of deleting variables entirely. One such method is Partial Least Squares (PROC PLS), which can take 500 highly correlated variables and build useful predictive models. Spectroscopy is a common application of PLS, in which large numbers of highly correlated variables are fed into a predictive model. For an introduction, see https://support.sas.com/rnd/app/stat/papers/pls.pdf, in which Randall Tobias says:

Partial least squares (PLS) is a method for constructing predictive models when the factors are many and highly collinear.

P.S. Ignore the SAS code in that paper; the syntax has changed since then.

--
Paige Miller
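For readers who want a starting point in current syntax, here is a minimal sketch of a PROC PLS call. The data set name HAVE, the response Y, the predictor list x1-x500, and the seed value are assumptions made for illustration and do not come from the thread. Cross validation is requested so that the procedure chooses the number of factors instead of using its default.

```sas
/* Assumed names: data set HAVE, response Y, predictors x1-x500. */
proc pls data=have method=pls cv=split cvtest(seed=12345);
   /* All 500 predictors go in together; no variables are dropped beforehand. */
   model y = x1-x500;
run;
```

CV=SPLIT holds out every nth observation in turn, and CVTEST asks for a randomization-based test when comparing models with different numbers of factors. Other CV= methods may suit a particular data set better.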
