Hi there
I am trying to build a classifier with Miner, and my issue comes from unbalanced data. My dataset has 109,194 records, of which 1,379 have target=1 and the remaining 107,815 have target=0, giving a 98.74%/1.26% split. My 30 predictors are all numeric.
I have tested three ways to handle this unbalanced data. In the first, I do not sample at all, as per the following diagram:
method1 (raw)
In the second, I oversample the minority class (target=1) so that it represents about 30% of the dataset, using the Sampling node (Criterion property set to level-based):
method2 (over sampling)
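To make the 30% target concrete, here is a rough sketch of the arithmetic in Python. It is not what the Sampling node does internally (that is not shown in this post); it only works out the class sizes under two possible readings of "about 30%": keeping all rare events and dropping common ones, or keeping all common events and replicating rare ones. The function names are made up for illustration.

```python
# Class counts taken from the post.
n_pos, n_neg = 1379, 107815

def majority_size_for_share(n_minority: int, target_share: float) -> int:
    """Majority records to keep so the minority makes up target_share of the sample."""
    return round(n_minority * (1 - target_share) / target_share)

def minority_size_for_share(n_majority: int, target_share: float) -> int:
    """Minority records needed (after replication) to reach target_share."""
    return round(n_majority * target_share / (1 - target_share))

# Reading 1: keep every target=1 record, sample down target=0.
kept_neg = majority_size_for_share(n_pos, 0.30)

# Reading 2: keep every target=0 record, replicate target=1.
replicated_pos = minority_size_for_share(n_neg, 0.30)

print("keep all events, sample non-events down to:", kept_neg)
print("keep all non-events, replicate events up to:", replicated_pos)
```

The two readings give very different training set sizes, so it is worth checking which one the sample actually corresponds to before comparing results across methods.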
In the last, I do not oversample. Instead, I change the diagonal values in the Decision Weights tab of the Input node options and set the weight for the rare event to the ratio of the common-event probability to the rare-event probability, namely 98.74/1.26 ≈ 78.36:
method3 (Decision Weights)
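As a quick check of that weight, the ratio can be recomputed in Python from the raw counts instead of the rounded percentages. This is only the arithmetic from the post; the variable names are illustrative and nothing here touches the Miner settings themselves.

```python
# Counts taken from the post.
n_total = 109_194
n_pos = 1_379
n_neg = n_total - n_pos  # 107,815 records with target=0

# Class shares (the post rounds these to 1.26% and 98.74%).
share_pos = n_pos / n_total
share_neg = n_neg / n_total

# Weight for the rare event = common share / rare share.
weight_exact = n_neg / n_pos      # from raw counts
weight_rounded = 98.74 / 1.26     # from the rounded percentages

print(f"shares: {share_neg:.2%} / {share_pos:.2%}")
print(f"weight from counts:      {weight_exact:.2f}")
print(f"weight from percentages: {weight_rounded:.2f}")
```

The count-based ratio comes out near 78.18, slightly below the figure obtained from the rounded percentages. The gap is small, but using the counts avoids carrying a rounding error into the weight.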
The results are as follows:
Method1 results
Method2 results
Method3 results
I do not find the results tremendously convincing, and I am still confused as to why the false/true positive counts are non-integer for method 2. Am I doing anything wrong? I know there is a lot written about unbalanced data, but I cannot seem to find a way to apply any of the solutions to my case. Thanks,
Nicolas
Hi Nicolas,
Maybe this thread can help you while someone takes a second look into what you did?
When I oversample, I usually test the model on a hold-out test data set that I saved somewhere else and didn't use for modeling. That gives me some confidence that I didn't fool myself 🙂
Would that be an option for you?
Best,
-Miguel
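For anyone following along outside of Miner, the hold-out idea above can be sketched in plain Python: set the test set aside first, at the original class ratio, and oversample only what is left for training. This is a minimal sketch under assumptions, not Miguel's actual workflow. The 30% test fraction, the 30% target share, the seed, and both helper functions are hypothetical choices made for the example, and the labels are synthetic stand-ins with the counts from the question.

```python
import random

def stratified_holdout(labels, test_fraction=0.3, seed=0):
    """Return (train_idx, test_idx), keeping the class ratio in both parts."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    train_idx, test_idx = [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return train_idx, test_idx

def oversample_minority(train_idx, labels, target_share=0.3, seed=0):
    """Replicate target=1 rows (with replacement) until they reach target_share."""
    rng = random.Random(seed)
    pos = [i for i in train_idx if labels[i] == 1]
    neg = [i for i in train_idx if labels[i] == 0]
    n_pos_needed = round(len(neg) * target_share / (1 - target_share))
    extra = [rng.choice(pos) for _ in range(max(0, n_pos_needed - len(pos)))]
    return neg + pos + extra

# Synthetic labels with the class counts from the question.
labels = [1] * 1379 + [0] * 107815

train_idx, test_idx = stratified_holdout(labels)       # split first
train_balanced = oversample_minority(train_idx, labels)  # then oversample training only

test_share = sum(labels[i] for i in test_idx) / len(test_idx)
train_share = sum(labels[i] for i in train_balanced) / len(train_balanced)
print(f"event share in hold-out test set: {test_share:.2%}")
print(f"event share in oversampled training set: {train_share:.2%}")
```

Because the test indices are removed before any replication, no duplicated rare event can end up in both sets, and the hold-out metrics reflect the original 1.26% event rate.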