<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Correlation analysis on large dataset with 500 variables in Statistical Procedures</title>
    <link>https://communities.sas.com/t5/Statistical-Procedures/Correlation-analysis-on-large-dataset-with-500-variables/m-p/701849#M33873</link>
    <description>&lt;P&gt;Use the BEST=n option in PROC CORR, where n is the number of largest correlations to show for each variable. For example, with BEST=2 you get a table showing just the two largest correlations for each variable. If that is still too much to look at, you can use an ODS OUTPUT statement to save that table in a data set and then process that data set in any way you like to further reduce the number of correlations to examine.&lt;/P&gt;</description>
    <pubDate>Thu, 26 Nov 2020 16:05:04 GMT</pubDate>
    <dc:creator>StatDave</dc:creator>
    <dc:date>2020-11-26T16:05:04Z</dc:date>
    <item>
      <title>Correlation analysis on large dataset with 500 variables</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/Correlation-analysis-on-large-dataset-with-500-variables/m-p/701846#M33872</link>
      <description>&lt;P&gt;Hello,&amp;nbsp;&lt;/P&gt;&lt;P&gt;I have a large dataset with 500-plus variables. I would like to check the correlations among the independent variables and drop any variables that are highly correlated with each other.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;But it has become a huge task to eyeball the table and find the highly correlated variables. Is there an easy way to identify them? I would later decide which ones to drop based on their predictive value.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Many thanks!!&lt;/P&gt;</description>
      <pubDate>Thu, 26 Nov 2020 15:55:34 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/Correlation-analysis-on-large-dataset-with-500-variables/m-p/701846#M33872</guid>
      <dc:creator>Chapi</dc:creator>
      <dc:date>2020-11-26T15:55:34Z</dc:date>
    </item>
    <item>
      <title>Re: Correlation analysis on large dataset with 500 variables</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/Correlation-analysis-on-large-dataset-with-500-variables/m-p/701849#M33873</link>
      <description>&lt;P&gt;Use the BEST=n option in PROC CORR, where n is the number of largest correlations to show for each variable. For example, with BEST=2 you get a table showing just the two largest correlations for each variable. If that is still too much to look at, you can use an ODS OUTPUT statement to save that table in a data set and then process that data set in any way you like to further reduce the number of correlations to examine.&lt;/P&gt;</description>
      <pubDate>Thu, 26 Nov 2020 16:05:04 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/Correlation-analysis-on-large-dataset-with-500-variables/m-p/701849#M33873</guid>
      <dc:creator>StatDave</dc:creator>
      <dc:date>2020-11-26T16:05:04Z</dc:date>
    </item>
    <item>
      <title>Re: Correlation analysis on large dataset with 500 variables</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/Correlation-analysis-on-large-dataset-with-500-variables/m-p/701930#M33884</link>
      <description>&lt;BLOCKQUOTE&gt;&lt;HR /&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/343121"&gt;@Chapi&lt;/a&gt;&amp;nbsp;wrote:&lt;BR /&gt;
&lt;P&gt;Hello,&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I have a large dataset with 500-plus variables. I would like to check the correlations among the independent variables and drop any variables that are highly correlated with each other.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;But it has become a huge task to eyeball the table and find the highly correlated variables. Is there an easy way to identify them? I would later decide which ones to drop based on their predictive value.&lt;/P&gt;
&lt;HR /&gt;&lt;/BLOCKQUOTE&gt;
&lt;P&gt;You could also choose an analysis method that is extremely robust to highly correlated variables, and skip this step of deleting variables entirely. One such analysis method is Partial Least Squares (PROC PLS), which can take 500 highly correlated variables and build useful predictive models. Spectroscopy is a common application of PLS in which large numbers of highly correlated variables are input into a predictive model. Read an introduction about it here: &lt;A href="https://support.sas.com/rnd/app/stat/papers/pls.pdf" target="_blank" rel="noopener"&gt;https://support.sas.com/rnd/app/stat/papers/pls.pdf&lt;/A&gt; in which Randall Tobias says:&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;Partial least squares (PLS) is a method for constructing predictive models when the factors are many and highly collinear.&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
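&lt;P&gt;For orientation, here is a minimal sketch of what a current PROC PLS call might look like. The data set name (work.mydata), the response (y) and the predictor list (x1-x500) are placeholders, not names from the question, and the option values are only a starting point rather than a recommendation:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;/* Hypothetical names: work.mydata, response y, predictors x1-x500 */
/* method=pls : ordinary partial least squares extraction          */
/* cv=split   : split-sample cross validation                      */
/* nfac=15    : extract at most 15 factors (assumed value)         */
proc pls data=work.mydata method=pls cv=split nfac=15;
   model y = x1-x500;
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;With cross validation requested, the output reports the predicted residual sum of squares for each number of factors, which helps in choosing how many to keep. Adjust NFAC= and the CV= method to suit your data.&lt;/P&gt;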
&lt;P&gt;P.S. Ignore the SAS code in that paper; the syntax has changed since then.&lt;/P&gt;</description>
      <pubDate>Thu, 26 Nov 2020 22:24:02 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/Correlation-analysis-on-large-dataset-with-500-variables/m-p/701930#M33884</guid>
      <dc:creator>PaigeMiller</dc:creator>
      <dc:date>2020-11-26T22:24:02Z</dc:date>
    </item>
  </channel>
</rss>

