Hi there
I am trying to build a classifier with Miner, and my issue comes from unbalanced data. My dataset has 109,194 records, of which 1,379 have target=1 and the remaining 107,815 have target=0, a 98.74%/1.26% split. My 30 predictors are all numeric.
I have tested three ways to handle this imbalance. In the first one, I do no sampling at all, as per the following diagram:
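As a quick check of the figures above, here is a minimal Python sketch that recomputes the class split from the stated counts. The variable names are illustrative only and do not come from the tool.

```python
# Class counts as stated in the post.
n_events = 1379        # records with target = 1
n_non_events = 107815  # records with target = 0
n_total = n_events + n_non_events

event_rate = n_events / n_total
non_event_rate = n_non_events / n_total
non_events_per_event = n_non_events / n_events

print(f"total records       : {n_total}")
print(f"event rate          : {event_rate:.2%}")
print(f"non-event rate      : {non_event_rate:.2%}")
print(f"non-events per event: {non_events_per_event:.2f}")
```

Note that the ratio computed from the raw counts (about 78.18) differs slightly from the one computed from the rounded percentages (98.74/1.26, about 78.36). The gap is small and only comes from rounding.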
method1 (raw)
In the second one, I oversample the minority class (target=1) so that it represents about 30% of the dataset, using the Sampling node (criterion property set to level-based).
method2 (over sampling)
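To make the 30% target concrete, here is a rough Python sketch of one way to reach that share: keep every rare event and randomly sample the common class. This is an assumption for illustration, not necessarily what the Sampling node does internally, and `rebalance` is a hypothetical helper.

```python
import random

def rebalance(records, target_share, seed=0):
    """Keep all events and sample non-events so events make up about target_share."""
    events = [r for r in records if r["target"] == 1]
    non_events = [r for r in records if r["target"] == 0]
    # Number of non-events needed so that events / total is close to target_share.
    n_keep = round(len(events) * (1 - target_share) / target_share)
    n_keep = min(n_keep, len(non_events))
    rng = random.Random(seed)
    sample = events + rng.sample(non_events, n_keep)
    rng.shuffle(sample)
    return sample

# Dummy records with the class counts from the post (predictors omitted).
records = [{"target": 1}] * 1379 + [{"target": 0}] * 107815
balanced = rebalance(records, target_share=0.30)
share = sum(r["target"] for r in balanced) / len(balanced)
print(len(balanced), f"{share:.2%}")
```

With these counts the rebalanced sample has only a few thousand rows, so a model trained on it sees a very different event rate from the 1.26% in the full data.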
In the last one, I do not oversample. Instead, I change the diagonal values in the Decision Weights tab from the Input Node options and give the rare event a weight equal to the ratio of the common-event probability to the rare-event probability, namely 98.74/1.26=78.36.
method3 (Decision Weights)
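The effect of such a weight can be sketched in Python, assuming the decision is chosen by maximum expected weight with zeros off the diagonal (an assumption about the decision rule, not confirmed in this thread). The function names are hypothetical.

```python
def best_decision(p_event, weight_event, weight_non_event=1.0):
    """Pick the decision with the larger expected weight (off-diagonal weights are 0)."""
    expected_if_event = p_event * weight_event
    expected_if_non_event = (1 - p_event) * weight_non_event
    return 1 if expected_if_event > expected_if_non_event else 0

def implied_cutoff(weight_event, weight_non_event=1.0):
    """Posterior probability above which 'event' becomes the better decision."""
    return weight_non_event / (weight_event + weight_non_event)

w = 98.74 / 1.26  # weight on the rare event, as in the post
cutoff = implied_cutoff(w)
print(f"implied probability cutoff: {cutoff:.4f}")
print(best_decision(0.02, w), best_decision(0.01, w))
```

Under that assumption, the weight amounts to lowering the classification cutoff from 0.5 to roughly the event rate itself, so a record is flagged as soon as its predicted probability exceeds the base rate.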
The results are as follows:
Method1 results
Method2 results
Method3 results
I do not find the results tremendously convincing (and I am still confused as to why the false/true positive counts are non-integer for method2). Am I doing anything wrong? I know there is a lot written about unbalanced data, but I cannot seem to find a way to apply any of the solutions to my case. Thanks
Nicolas
Accepted Solutions
Hi Nicolas,
Maybe this thread can help you while someone takes a second look into what you did?
When I oversample, I usually test the model on a hold-out test data set that I saved somewhere else and didn't use for modeling. That gives me some confidence that I didn't fool myself 🙂
Would that be an option for you?
Best,
-Miguel
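The hold-out idea in the reply above can be sketched in Python as follows: set aside a stratified test set first, so it keeps the original event rate, and only then rebalance the training part. The 30% test fraction and the helper name `stratified_holdout` are assumptions for illustration.

```python
import random

def stratified_holdout(labels, test_fraction=0.3, seed=0):
    """Split record indices into train/test, preserving the class mix in both parts."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return train_idx, test_idx

# Labels with the class counts from the original post.
labels = [1] * 1379 + [0] * 107815
train_idx, test_idx = stratified_holdout(labels)

# The test set is left untouched; any oversampling is applied to train_idx only.
test_rate = sum(labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), f"{test_rate:.2%}")
```

Scoring the model on the untouched test indices gives error counts at the real 1.26% event rate, which is what the model will face in practice.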