<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Re: Variable importance in random forest in SAS Data Science</title>
    <link>https://communities.sas.com/t5/SAS-Data-Science/Variable-importance-in-random-forest/m-p/407270#M6213</link>
    <description>&lt;P&gt;The standard approach to reducing variables is to run Forest several times, eliminating a small number (such as 4) of the least important variables after each run, until fit statistics such as recall and precision start getting worse.&amp;nbsp; As Doug alluded, Forests can benefit from using many variables to create a complex model.&lt;/P&gt;
&lt;P&gt;-Padraic&lt;/P&gt;</description>
    <pubDate>Wed, 25 Oct 2017 13:48:25 GMT</pubDate>
    <dc:creator>PadraicGNeville</dc:creator>
    <dc:date>2017-10-25T13:48:25Z</dc:date>
    <item>
      <title>Variable importance in random forest</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/Variable-importance-in-random-forest/m-p/406307#M6189</link>
      <description>Hi,&lt;BR /&gt;&lt;BR /&gt;To predict a binary target variable, I trained a random forest with 84 explanatory variables (using 10 variables randomly selected at each split) on a training set of 8,500 observations.&lt;BR /&gt;&lt;BR /&gt;For practical reasons, I had to test the performance of the algorithm on a test set of 100,000 observations.&lt;BR /&gt;&lt;BR /&gt;Performance on the test set at a decision threshold of 0.6:&lt;BR /&gt;&lt;BR /&gt;Recall: 50%&lt;BR /&gt;Precision: 70%&lt;BR /&gt;&lt;BR /&gt;After that, I used the variable importance plot to select the most important variables and retrained the model with the top 20.&lt;BR /&gt;&lt;BR /&gt;Performance on the test set decreased dramatically:&lt;BR /&gt;&lt;BR /&gt;Recall: 20%&lt;BR /&gt;Precision: 6%&lt;BR /&gt;&lt;BR /&gt;Does anyone know a scientific explanation for this counterintuitive phenomenon?&lt;BR /&gt;&lt;BR /&gt;Thank you for your help,&lt;BR /&gt;Marco</description>
      <pubDate>Sat, 21 Oct 2017 20:18:58 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/Variable-importance-in-random-forest/m-p/406307#M6189</guid>
      <dc:creator>mmaccora</dc:creator>
      <dc:date>2017-10-21T20:18:58Z</dc:date>
    </item>
    <item>
      <title>Re: Variable importance in random forest</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/Variable-importance-in-random-forest/m-p/406552#M6203</link>
      <description>&lt;P&gt;A few thoughts&amp;nbsp;come to mind...&lt;/P&gt;
&lt;P&gt;&amp;nbsp; * Forests were designed to deal with massive numbers of variables and observations where the structure is unknown and investigating individual variables is temporally or computationally inefficient&lt;/P&gt;
&lt;P&gt;&amp;nbsp; * 8,500 observations and 84 variables is not a lot of data for such a flexible modeling method, which can make it very easy to overfit, particularly with a random forest, which builds models on random subsets of observations using random subsets of variables as predictors&lt;/P&gt;
&lt;P&gt;&amp;nbsp; * Of the 84 variables used as predictors, selecting only the 20 most important discarded three-quarters of the variables, which appear to have helped the model make full use of the most important ones.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp; * Precision &amp;amp; Recall are fine metrics, but they depend in part on the threshold you are using.&amp;nbsp; I would be interested in knowing the distribution of differences in the predicted probability of the event of interest between the two models.&amp;nbsp; It is possible that the differences in probabilities are relatively small even though the Precision &amp;amp; Recall differences seem dramatic&lt;/P&gt;
&lt;P&gt;&amp;nbsp; * There could be great variability if the event were rare, but either way you are not looking at a large number of observations for such a flexible modeling strategy.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&amp;nbsp;&lt;/P&gt;
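The threshold-dependence point above can be illustrated with a short sketch. This uses scikit-learn purely as a stand-in (the thread is about SAS Forest); the dataset, model settings, and threshold values are illustrative assumptions, not anything from the original posts.

```python
# Hedge: illustrative data and model; only the threshold sweep matters.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

rf = RandomForestClassifier(n_estimators=200, random_state=1).fit(X_tr, y_tr)
proba = rf.predict_proba(X_te)[:, 1]  # predicted probability of the event

# The same fitted model yields different precision/recall at each cutoff,
# so comparing two models at one fixed threshold can be misleading.
for threshold in (0.4, 0.5, 0.6, 0.7):
    pred = (proba >= threshold).astype(int)
    p = precision_score(y_te, pred, zero_division=0)
    r = recall_score(y_te, pred, zero_division=0)
    print(f"threshold {threshold}: precision {p:.2f}, recall {r:.2f}")
```

Raising the threshold can only shrink the set of predicted positives, so recall never increases with the cutoff, while precision typically moves the other way.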
&lt;P&gt;Hope this helps!&lt;BR /&gt;Doug&lt;/P&gt;</description>
      <pubDate>Mon, 23 Oct 2017 14:38:30 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/Variable-importance-in-random-forest/m-p/406552#M6203</guid>
      <dc:creator>DougWielenga</dc:creator>
      <dc:date>2017-10-23T14:38:30Z</dc:date>
    </item>
    <item>
      <title>Re: Variable importance in random forest</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/Variable-importance-in-random-forest/m-p/407270#M6213</link>
      <description>&lt;P&gt;The standard approach to reducing variables is to run Forest several times, eliminating a small number (such as 4) of the least important variables after each run, until fit statistics such as recall and precision start getting worse.&amp;nbsp; As Doug alluded, Forests can benefit from using many variables to create a complex model.&lt;/P&gt;
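A rough sketch of this backward-elimination loop, using scikit-learn's RandomForestClassifier as a stand-in for SAS Forest; the dataset, the step size of 4, and the recall-plus-precision scoring rule below are illustrative assumptions only.

```python
# Hedge: scikit-learn stand-in for the SAS Forest procedure; the data,
# step size (4), and scoring rule are illustrative assumptions only.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=30,
                           n_informative=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

features = list(range(X.shape[1]))
best_score, best_features = -1.0, features[:]

while len(features) > 4:
    rf = RandomForestClassifier(n_estimators=200, random_state=0)
    rf.fit(X_tr[:, features], y_tr)
    pred = rf.predict(X_te[:, features])
    score = recall_score(y_te, pred) + precision_score(y_te, pred)
    if score > best_score:
        best_score, best_features = score, features[:]
    # Drop the 4 least important variables, then refit on the rest.
    order = np.argsort(rf.feature_importances_)
    features = [features[i] for i in order[4:]]
```

When the loop finishes, best_features holds the subset that scored best on the holdout set, rather than an arbitrary fixed cutoff such as the top 20.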
&lt;P&gt;-Padraic&lt;/P&gt;</description>
      <pubDate>Wed, 25 Oct 2017 13:48:25 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/Variable-importance-in-random-forest/m-p/407270#M6213</guid>
      <dc:creator>PadraicGNeville</dc:creator>
      <dc:date>2017-10-25T13:48:25Z</dc:date>
    </item>
  </channel>
</rss>

