<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Handling imbalanced data in SAS Data Science</title>
    <link>https://communities.sas.com/t5/SAS-Data-Science/Handling-imbalanced-data/m-p/978765#M11091</link>
    <description>&lt;P data-unlink="true"&gt;When you have lots of "good" and relatively few "bad", there is definitely the possibility that the variables you have do not predict the bads. In fact, this is a common situation. This is not necessarily your fault or the fault of the model; that's often the way it is. You can try oversampling (see &lt;A href="https://support.sas.com/kb/22/601.html" target="_self"&gt;here&lt;/A&gt;&amp;nbsp;and &lt;A href="https://communities.sas.com/t5/SAS-Communities-Library/Tip-How-to-model-a-rare-target-using-an-oversample-approach-in/ta-p/223599" target="_self"&gt;here&lt;/A&gt;). You can also read a gazillion commentaries on oversampling: go to your favorite internet search engine and type in&amp;nbsp;&lt;/P&gt;
&lt;P data-unlink="true"&gt;&amp;nbsp;&lt;/P&gt;
&lt;P data-unlink="true"&gt;statistical oversampling good vs bad&lt;/P&gt;</description>
    <pubDate>Wed, 12 Nov 2025 18:17:08 GMT</pubDate>
    <dc:creator>PaigeMiller</dc:creator>
    <dc:date>2025-11-12T18:17:08Z</dc:date>
    <item>
      <title>Handling imbalanced data</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/Handling-imbalanced-data/m-p/978540#M11087</link>
      <description>&lt;P&gt;Greetings, I need your assistance with handling my heavily imbalanced dataset. I am predicting the probability of a student passing after being accepted into the school. As part of the application process, prospective students complete a survey that includes details such as their study periods, study habits, location, academic marks, and other related information. Using this data, I aim to predict the probability of failure. The issue is that we are working with historical data from 2021 to the present, and it is heavily imbalanced. In the training set, we have 1,465 students who failed and 58,744 who passed. My model is not performing well, as it fails to correctly predict students who are likely to fail at various thresholds (class_pred = 0.3 to 0.6). Could you please assist me in addressing this problem? I have tried oversampling, but I am unsure if this is the best approach. I also plan to experiment with techniques such as undersampling and SMOTE. I am currently working in SAS Enterprise Guide and also have access to Enterprise Miner.&lt;/P&gt;</description>
      <pubDate>Fri, 07 Nov 2025 21:46:17 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/Handling-imbalanced-data/m-p/978540#M11087</guid>
      <dc:creator>lukholoman</dc:creator>
      <dc:date>2025-11-07T21:46:17Z</dc:date>
    </item>
    <item>
      <title>Re: Handling imbalanced data</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/Handling-imbalanced-data/m-p/978549#M11088</link>
      <description>&lt;P&gt;It's quite possible your model has poor predictive ability because failure in this case might be largely explained by factors your data does not contain, and there's really not anything that's going to fix that. However, you might start by posting the parameter estimates and other model output from, e.g., a Cox model (PROC PHREG), assuming you have &lt;STRONG&gt;time to failure&lt;/STRONG&gt;. Given the time period, I would also definitely try to incorporate something related to the pandemic, as the effect of that on academic success might vary quite a lot by place and over time. A more detailed list of the predictors you're using (and how they're captured: categorical, continuous, etc.) would help us answer your question better.&lt;/P&gt;</description>
      <pubDate>Sat, 08 Nov 2025 00:44:31 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/Handling-imbalanced-data/m-p/978549#M11088</guid>
      <dc:creator>quickbluefish</dc:creator>
      <dc:date>2025-11-08T00:44:31Z</dc:date>
    </item>
    <item>
      <title>Re: Handling imbalanced data</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/Handling-imbalanced-data/m-p/978556#M11089</link>
      <description>Yeah. That is a big issue.&lt;BR /&gt;You could try a tree-based statistical method, &lt;BR /&gt;such as a &lt;BR /&gt;decision tree:&lt;BR /&gt;PROC HPSPLIT &lt;BR /&gt;&lt;BR /&gt;or a random forest:&lt;BR /&gt;PROC HPFOREST&lt;BR /&gt;&lt;BR /&gt;and partial least squares regression:&lt;BR /&gt;PROC PLS&lt;BR /&gt;&lt;BR /&gt;or try a nonparametric version of the logistic model:&lt;BR /&gt;&lt;A href="https://blogs.sas.com/content/iml/2016/03/23/nonparametric-regression-binary-response-sas.html" target="_blank"&gt;https://blogs.sas.com/content/iml/2016/03/23/nonparametric-regression-binary-response-sas.html&lt;/A&gt;</description>
      <pubDate>Sat, 08 Nov 2025 08:39:58 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/Handling-imbalanced-data/m-p/978556#M11089</guid>
      <dc:creator>Ksharp</dc:creator>
      <dc:date>2025-11-08T08:39:58Z</dc:date>
    </item>
    <item>
      <title>Re: Handling imbalanced data</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/Handling-imbalanced-data/m-p/978756#M11090</link>
      <description>&lt;UL&gt;
&lt;LI&gt;Undersampling the majority class in your binary classification model might be worthwhile.&lt;/LI&gt;
&lt;LI&gt;Oversampling the minority class in your binary classification model might be worthwhile.&lt;BR /&gt;&lt;A href="https://support.sas.com/resources/papers/proceedings18/3604-2018.pdf" target="_blank" rel="noopener"&gt;MITIGATING THE EFFECTS OF CLASS IMBALANCE USING SMOTE&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;Adding business features might be worthwhile (extra inputs, or derived or composite inputs, that are also relevant for explaining and predicting the target).&lt;/LI&gt;
&lt;LI&gt;Calculating statistical and machine learning features might be worthwhile. &lt;BR /&gt;For example, in Enterprise Miner there is a node for variable clustering. Cluster your variables and model with the first principal component of every cluster as inputs / candidate predictors.&lt;/LI&gt;
&lt;LI&gt;Do not forget to adjust your posterior probabilities for the real priors. You can use the target profiler for this.&lt;BR /&gt;&lt;A href="https://support.sas.com/documentation/cdl/en/emxndg/67980/HTML/default/viewer.htm#p1vqpbjwoo4bv7n1sw77e0z64xxs.htm" target="_blank" rel="noopener"&gt;Prior Probabilities :: SAS(R) Enterprise Miner(TM) 14.1 Extension Nodes: Developer's Guide&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;Search for the threshold that gives a good balance between precision and recall (true positive rate), or look at the F1-score.&lt;BR /&gt;&lt;A href="https://en.wikipedia.org/wiki/Confusion_matrix" target="_blank" rel="noopener"&gt;https://en.wikipedia.org/wiki/Confusion_matrix&lt;/A&gt;&lt;/LI&gt;
&lt;/UL&gt;
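&lt;P&gt;A minimal sketch of the last two points in the list, written in Python purely for illustration (it is not Enterprise Miner's target profiler). It assumes the model was trained on a resampled set with a known event proportion. The 50/50 proportion, the toy labels and scores, and the function names &lt;CODE&gt;adjust_posterior&lt;/CODE&gt;, &lt;CODE&gt;precision_recall_f1&lt;/CODE&gt; and &lt;CODE&gt;best_f1_threshold&lt;/CODE&gt; are all hypothetical. Only the real event rate, 1,465 / (1,465 + 58,744), comes from the original question.&lt;/P&gt;
&lt;PRE&gt;
```python
def adjust_posterior(p_sample, sample_prior, true_prior):
    """Rescale a probability from a resampled training set to the real event rate."""
    event = p_sample * true_prior / sample_prior
    non_event = (1.0 - p_sample) * (1.0 - true_prior) / (1.0 - sample_prior)
    return event / (event + non_event)


def precision_recall_f1(y_true, scores, threshold):
    """Confusion-matrix metrics when scores >= threshold are flagged as the event."""
    flagged = [s >= threshold for s in scores]
    tp = sum(1 for y, f in zip(y_true, flagged) if f and y == 1)
    fp = sum(1 for y, f in zip(y_true, flagged) if f and y == 0)
    fn = sum(1 for y, f in zip(y_true, flagged) if not f and y == 1)
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    total = precision + recall
    f1 = 2.0 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f1


def best_f1_threshold(y_true, scores):
    """Try each distinct score as a cut-off and keep the one with the highest F1."""
    best_threshold, best_f1 = None, -1.0
    for t in sorted(set(scores)):
        f1 = precision_recall_f1(y_true, scores, t)[2]
        if f1 > best_f1:
            best_threshold, best_f1 = t, f1
    return best_threshold, best_f1


# Real event rate, from the counts in the original question (failed / all students).
TRUE_PRIOR = 1465 / (1465 + 58744)

# Hypothetical labels (1 = failed) and scores from a model trained on a 50/50 resample.
toy_labels = [1, 1, 0, 1, 0, 0, 0, 0]
toy_scores = [0.92, 0.81, 0.74, 0.66, 0.41, 0.33, 0.20, 0.08]

# Step 1: put the scores back on the scale of the real population.
adjusted = [adjust_posterior(p, 0.5, TRUE_PRIOR) for p in toy_scores]

# Step 2: search the cut-off on the adjusted scale.
threshold, f1 = best_f1_threshold(toy_labels, adjusted)
print(f"true prior = {TRUE_PRIOR:.4f}")
print(f"best threshold = {threshold:.4f}, F1 = {f1:.3f}")
```
&lt;/PRE&gt;
&lt;P&gt;The prior adjustment is monotonic, so it changes the scale of the probabilities but not the ranking of students. The cut-off should therefore be searched on the adjusted scale, ideally on held-out data, and with a real event rate near 2.4% it can land well below the 0.3 to 0.6 range.&lt;/P&gt;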
&lt;P&gt;Ciao,&lt;/P&gt;
&lt;P&gt;Koen&lt;/P&gt;</description>
      <pubDate>Wed, 12 Nov 2025 16:57:43 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/Handling-imbalanced-data/m-p/978756#M11090</guid>
      <dc:creator>sbxkoenk</dc:creator>
      <dc:date>2025-11-12T16:57:43Z</dc:date>
    </item>
    <item>
      <title>Re: Handling imbalanced data</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/Handling-imbalanced-data/m-p/978765#M11091</link>
      <description>&lt;P data-unlink="true"&gt;When you have lots of "good" and relatively few "bad", there is definitely the possibility that the variables you have does not predict the bads. In fact, this is a common situation. This is not necessarily your fault or the fault of the model, that's often the way it is. You can try oversampling (see &lt;A href="https://support.sas.com/kb/22/601.html" target="_self"&gt;here&lt;/A&gt;&amp;nbsp;and &lt;A href="https://communities.sas.com/t5/SAS-Communities-Library/Tip-How-to-model-a-rare-target-using-an-oversample-approach-in/ta-p/223599" target="_self"&gt;here&lt;/A&gt;). You can also read a gazillion commentaries on oversampling, go to your favorite internet search engine and type in&amp;nbsp;&lt;/P&gt;
&lt;P data-unlink="true"&gt;&amp;nbsp;&lt;/P&gt;
&lt;P data-unlink="true"&gt;statistical oversampling good vs bad&lt;/P&gt;</description>
      <pubDate>Wed, 12 Nov 2025 18:17:08 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/Handling-imbalanced-data/m-p/978765#M11091</guid>
      <dc:creator>PaigeMiller</dc:creator>
      <dc:date>2025-11-12T18:17:08Z</dc:date>
    </item>
  </channel>
</rss>

