Oversample and Score classification example

jlh368 · Posted 07-26-2017 02:19 PM

Enterprise Miner 14.1

Hello,

I am following this example https://communities.sas.com/t5/SAS-Communities-Library/Tip-How-to-model-a-rare-target-using-an-overs... to familiarize myself with oversampling. As an additional learning exercise, I connected a Score node to the Model Comparison node. My thought was to copy the original data set and the first sample, and then score that copy. So I added a copy of the original German Credit data set with a role of Score, copied the first Sample node (same seed, same sample size, and same event percentages, 0.05/0.95), and ran the workflow. The Score node reported the following:

Class Variable Summary Statistics

Data Role=SCORE Output Type=CLASSIFICATION

              Numeric    Formatted    Frequency
Variable      Value      Value        Count        Percent

I_good_bad    .          BAD          204          34
I_good_bad    .          GOOD         396          66

Data Role=SCORE Output Type=MODELDECISION

              Numeric    Formatted    Frequency
Variable      Value      Value        Count        Percent

D_good_bad    .          BAD          226          37.6667
D_good_bad    .          GOOD         374          62.3333

I had expected the results to be closer to the sample proportions (Bad 0.05 vs. Good 0.95), but they appear closer to the proportions in the original data set. When I look at the score code, I see the original data set's posterior probabilities with no adjustment:

Label P_good_badgood='Predicted: good_bad=good';
P_good_badgood = 0.7;
Label P_good_badbad='Predicted: good_bad=bad';
P_good_badbad = 0.3;
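As a rough illustration of the kind of adjustment in question, the sketch below rescales posterior probabilities from the proportions a model was trained on to a different set of prior probabilities. It is written in Python purely as arithmetic, not as Enterprise Miner score code. The function name `adjust_posteriors` is hypothetical, and treating 0.3/0.7 as the training proportions and 0.05/0.95 as the target priors is an assumption made only for this example.

```python
def adjust_posteriors(posteriors, sample_props, priors):
    """Rescale class posteriors from training-sample proportions to target priors.

    Each posterior is weighted by (target prior / training proportion) for its
    class, and the weighted values are renormalised so they sum to 1.
    """
    weights = {
        level: posteriors[level] * priors[level] / sample_props[level]
        for level in posteriors
    }
    total = sum(weights.values())
    return {level: w / total for level, w in weights.items()}


# Assumed values for illustration only (not taken from the actual workflow).
sample_props = {"bad": 0.3, "good": 0.7}   # proportions the model was trained on
priors = {"bad": 0.05, "good": 0.95}       # proportions expected in the scored data

# A record scored at the training base rate moves to the target base rate.
base_rate = adjust_posteriors({"bad": 0.3, "good": 0.7}, sample_props, priors)

# A record the model considers riskier than average stays riskier than average,
# but its adjusted probability of "bad" is much lower than the raw 0.6.
risky = adjust_posteriors({"bad": 0.6, "good": 0.4}, sample_props, priors)

print(base_rate)
print(risky)
```

Under these assumptions, unadjusted posteriors classify far more records as BAD than a 5 percent event rate would suggest, which is consistent with the counts reported above.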

Am I just approaching this problem incorrectly? Have I made an error in the workflow, or is it just an error in my understanding? I've attached a copy of my workflow, renamed with a .jpg extension. If you drop the extension, you should be able to import it into EM. Thanks!

jlh368 · Posted 08-02-2017 05:07 PM

I took a deeper dive into the example listed above and realized that many inputs affect the score percentages. The behavior I had questioned in my original post, the scoring percentages being closer to the original data set percentages, was the effect of the sample proportion. I changed the data partition percentages from Train/Validate 50/50 to 70/30 and noticed a change in the model. This change, in turn, affected the scoring proportions. I also did see the updated prior probabilities in the SAS score code. In short, it was doing what it was supposed to do, and I learned a bit. Any suggestions on topics to follow up on from here?
