Oversample and Score classification example

Enterprise Miner 14.1

Hello,

I am following this example https://communities.sas.com/t5/SAS-Communities-Library/Tip-How-to-model-a-rare-target-using-an-overs... to familiarize myself with oversampling. As an additional learning exercise, I connected a Score node to the Model Comparison node. My idea was to take a copy of the original data set, sample it the same way as the first sample, and score the result. So I added a copy of the original German Credit data set with a role of Score, copied the first Sample node (same seed, same sample size, and same event percentages, 0.05/0.95), and ran the workflow.

Class Variable Summary Statistics

Data Role=SCORE Output Type=CLASSIFICATION

              Numeric    Formatted    Frequency
Variable      Value      Value        Count        Percent

I_good_bad    .          BAD          204          34
I_good_bad    .          GOOD         396          66


Data Role=SCORE Output Type=MODELDECISION

              Numeric    Formatted    Frequency
Variable      Value      Value        Count        Percent

D_good_bad    .          BAD          226          37.6667
D_good_bad    .          GOOD         374          62.3333

 

I had expected the results to be closer to the sample proportions (Bad 0.05 vs. Good 0.95), but they appear close to the proportions in the original data set. When I look at the score code, I see the original data set's posterior probabilities with no adjustment.

Label P_good_badgood='Predicted: good_bad=good';
P_good_badgood = 0.7;
Label P_good_badbad='Predicted: good_bad=bad';
P_good_badbad = 0.3; 
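For readers following along, the kind of adjustment being looked for here can be sketched with the standard prior-correction formula: each posterior is reweighted by the ratio of the population prior to the sample proportion, then the results are renormalized. This is a minimal Python sketch of that general formula, not the Enterprise Miner score code itself. The function name `adjust_posteriors` and the example posterior (0.10/0.90) are hypothetical; the 0.05/0.95 sample proportions and 0.3/0.7 priors are the figures mentioned in this post.

```python
def adjust_posteriors(posteriors, sample_props, priors):
    """Reweight posteriors from an oversampled model back to population priors.

    adjusted_i = (p_i * prior_i / sample_i) / sum_j (p_j * prior_j / sample_j)
    """
    weighted = {
        level: posteriors[level] * priors[level] / sample_props[level]
        for level in posteriors
    }
    total = sum(weighted.values())
    return {level: w / total for level, w in weighted.items()}


# Proportions quoted in the post: sample 0.05/0.95, original data 0.3/0.7.
sample_props = {"bad": 0.05, "good": 0.95}
priors = {"bad": 0.30, "good": 0.70}

# Hypothetical posterior for one case, as estimated on the sample.
posteriors = {"bad": 0.10, "good": 0.90}

adjusted = adjust_posteriors(posteriors, sample_props, priors)
print(adjusted)
```

With these numbers, a case the sample-trained model scores as 10% bad moves to roughly 47.5% bad once the priors are restored, which is why scored proportions can land near the original data set rather than near the sample.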

 

Am I approaching this problem incorrectly? Have I made an error in the workflow, or just an error in understanding? I've attached a copy of my workflow, with the file extension renamed to .jpg. If you remove that extension, you should be able to import it into EM. Thanks!

 


Re: Oversample and Score classification example

I took a deeper dive into the example listed above and realized that many inputs affect the score percentages. The behavior I questioned above, the scoring percentages being closer to the original data set's percentages, was the effect of the sample proportion. I changed the data partition percentages from 50/50 train/validate to 70/30 and noticed a change in the model. That change, in turn, affected the scoring proportions. I also saw the updated prior probabilities in the SAS Score Code node. In short, it was doing what it was supposed to do, and I learned a bit. Any suggestions on topics to follow up on from here?
