<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Random Forest Overfitting in SAS Data Science</title>
    <link>https://communities.sas.com/t5/SAS-Data-Science/Random-Forest-Overfitting/m-p/814883#M9188</link>
    <description>&lt;P&gt;Hello!&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I have built a random forest in SAS Enterprise Miner for a classification task. The target variable is Target (1 = event, 0 = non-event). From a first run I identified the 20 most important variables, then reran the HPForest node using only those 20. All my metrics are consistent between train (80% split) and test (20% split), except cumulative % captured response, which differs significantly: about 30% in the 1st decile on train versus about 20% on test. I found that changing parameters such as mtry and the maximum number of trees changes these results, but is there a way to find the optimal parameters? Trying different combinations by hand is tedious, and I have not been able to achieve good results.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I have already tried this methodology:&amp;nbsp;&lt;A href="https://communities.sas.com/t5/SAS-Communities-Library/Tip-Getting-the-Most-from-your-Random-Forest/ta-p/223949" target="_blank"&gt;Tip: Getting the Most from your Random Forest - SAS Support Communities&lt;/A&gt;&amp;nbsp;but it only considers interval inputs, whereas I have both interval and categorical ones, and I could not achieve better results with this approach either.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thanks&lt;/P&gt;</description>
    <pubDate>Tue, 24 May 2022 14:50:46 GMT</pubDate>
    <dc:creator>msf2021</dc:creator>
    <dc:date>2022-05-24T14:50:46Z</dc:date>
    <item>
      <title>Random Forest Overfitting</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/Random-Forest-Overfitting/m-p/814883#M9188</link>
      <description>&lt;P&gt;Hello!&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I have built a random forest in SAS Enterprise Miner for a classification task. The target variable is Target (1 = event, 0 = non-event). From a first run I identified the 20 most important variables, then reran the HPForest node using only those 20. All my metrics are consistent between train (80% split) and test (20% split), except cumulative % captured response, which differs significantly: about 30% in the 1st decile on train versus about 20% on test. I found that changing parameters such as mtry and the maximum number of trees changes these results, but is there a way to find the optimal parameters? Trying different combinations by hand is tedious, and I have not been able to achieve good results.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I have already tried this methodology:&amp;nbsp;&lt;A href="https://communities.sas.com/t5/SAS-Communities-Library/Tip-Getting-the-Most-from-your-Random-Forest/ta-p/223949" target="_blank"&gt;Tip: Getting the Most from your Random Forest - SAS Support Communities&lt;/A&gt;&amp;nbsp;but it only considers interval inputs, whereas I have both interval and categorical ones, and I could not achieve better results with this approach either.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thanks&lt;/P&gt;</description>
      <pubDate>Tue, 24 May 2022 14:50:46 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/Random-Forest-Overfitting/m-p/814883#M9188</guid>
      <dc:creator>msf2021</dc:creator>
      <dc:date>2022-05-24T14:50:46Z</dc:date>
    </item>
    <item>
      <title>Re: Random Forest Overfitting</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/Random-Forest-Overfitting/m-p/814962#M9189</link>
      <description>&lt;P&gt;Hello&amp;nbsp;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/381594"&gt;@msf2021&lt;/a&gt;,&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;What is the variable importance table / importance plot telling you?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Maybe the top 20 variables account for only about 50% of the total importance?&lt;/P&gt;
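&lt;P&gt;A quick way to check this is to rank the importance values and compute the share held by the top k. The sketch below is plain Python for illustration, not SAS code; it assumes the importance column has been exported from the variable importance table as a list of non-negative numbers, and the function name and toy values are made up.&lt;/P&gt;
&lt;PRE&gt;
```python
def cumulative_importance_share(importances, top_k):
    """Return the fraction of total importance held by the top_k largest values."""
    ranked = sorted(importances, reverse=True)
    total = sum(ranked)
    if total == 0:
        raise ValueError("total importance is zero")
    return sum(ranked[:top_k]) / total


# Toy values (made up, not taken from any real model).
toy_importances = [4.0, 3.0, 2.0, 1.0]
share = cumulative_importance_share(toy_importances, top_k=2)
print(f"top 2 share: {share:.0%}")
```
&lt;/PRE&gt;
&lt;P&gt;A low share for the 20 retained variables would suggest that the dropped variables still carry signal.&lt;/P&gt;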
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;You can also have a look at these videos:&lt;/P&gt;
&lt;P&gt;SAS Tutorial | How to train forest models in SAS?&lt;BR /&gt;&lt;A href="https://www.youtube.com/watch?v=FWragzNF59U" target="_blank"&gt;https://www.youtube.com/watch?v=FWragzNF59U&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;SAS Tutorial | How to Pick Hyperparameters of Machine Learning Models?&lt;/P&gt;
&lt;P&gt;&lt;A href="https://www.youtube.com/watch?v=AOR7XnCB_JA" target="_blank"&gt;https://www.youtube.com/watch?v=AOR7XnCB_JA&lt;/A&gt;&lt;/P&gt;
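&lt;P&gt;The general idea of a systematic search, as opposed to trying combinations by hand, can be sketched as an exhaustive grid search. This is plain Python for illustration, not SAS code or an Enterprise Miner feature; score_fn is a hypothetical placeholder for whatever trains a forest with the given settings and returns a metric on a validation partition (for example 1st-decile captured response, higher being better), and the parameter names and grid values are made up.&lt;/P&gt;
&lt;PRE&gt;
```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Try every combination in param_grid and return the best-scoring one."""
    names = list(param_grid)
    candidates = [
        dict(zip(names, values))
        for values in product(*(param_grid[name] for name in names))
    ]
    # Higher score is assumed to be better.
    return max(candidates, key=score_fn)


def toy_score(params):
    # Stand-in for "train a forest, score it on validation data".
    # Made-up formula that peaks at mtry=5, n_trees=200.
    return -abs(params["mtry"] - 5) - abs(params["n_trees"] - 200) / 100


grid = {"mtry": [3, 5, 8], "n_trees": [100, 200, 500]}
best = grid_search(grid, toy_score)
print(best)
```
&lt;/PRE&gt;
&lt;P&gt;Scoring on a validation partition rather than on the test partition keeps the test data untouched for the final train versus test comparison.&lt;/P&gt;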
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;You can also select the most important variables up front with other techniques.&lt;/P&gt;
&lt;P&gt;I am not sure whether &lt;EM&gt;PROC VARREDUCE&lt;/EM&gt; was already available in the Enterprise Miner era.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Thanks,&lt;/P&gt;
&lt;P&gt;Koen&lt;/P&gt;</description>
      <pubDate>Tue, 24 May 2022 20:39:02 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/Random-Forest-Overfitting/m-p/814962#M9189</guid>
      <dc:creator>sbxkoenk</dc:creator>
      <dc:date>2022-05-24T20:39:02Z</dc:date>
    </item>
  </channel>
</rss>

