05-08-2014 01:37 AM
I am working on a class imbalance problem where the Y:N ratio is approximately 1:49, with high overlap between the two classes (binary classification problem). There are roughly 1,850 Y records (tgt = Y, the class of interest) and 90,000 tgt = N instances.
One strategy for approaching this class imbalance problem is to use SMOTE together with, say, Tomek Links (through R or similar software) to achieve a balanced dataset with Y:N close to 50:50. After balancing, data manipulations such as transformations and binning have been implemented, and 2 modeling techniques have been used to find the results.
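For readers who have not used SMOTE, here is a minimal sketch of its core idea: each synthetic Y record is placed at a random position on the line segment between an existing Y record and one of its nearest Y neighbours. This is an illustration only, not the R implementation used above; the function name `smote_sketch`, the toy points, and the parameter values are all made up, and the Tomek Link cleaning step is not shown.

```python
import math
import random

def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority points by interpolating between a
    randomly chosen minority point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding the point itself)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other, in [0, 1)
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic

# Toy data (hypothetical): 4 minority records with 2 numeric inputs each
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_sketch(minority, n_new=6, k=2)
print(new_points)
```

Because every synthetic point lies between two real Y records, heavy class overlap in the original data carries over into the balanced data.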
The original dataset is then used once again, with its role set to "Score", to find out how many of the pre-balancing instances have been correctly classified. In such situations, is it necessary to use the "Adjusted Priors" in the decision processing matrix of the balanced input dataset (where I enter the original priors and use them) and then do the scoring? Or can I do the scoring straight away without using adjusted priors?
In the first scenario (adjusted priors = original priors) the count of true positives is around 150 (both predicted and actual = Y) whereas in the second scenario (no adjusted priors used) it is around 600.
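The gap between the two scenarios can be illustrated with the standard prior-correction formula for posterior probabilities. The sketch below assumes that the balanced training data has an event rate of 0.5 and that SAS EM's prior adjustment behaves like this textbook formula; the function name `adjust_posterior` is hypothetical.

```python
def adjust_posterior(p_balanced, prior_orig, prior_train=0.5):
    """Rescale P(Y) from a model trained on resampled data back to the original event rate."""
    # Re-weight each class by (original prior / training prior), then renormalise
    event = p_balanced * prior_orig / prior_train
    non_event = (1.0 - p_balanced) * (1.0 - prior_orig) / (1.0 - prior_train)
    return event / (event + non_event)

# Original event rate from the counts in the post: about 0.02
prior_orig = 1850 / (1850 + 90000)

for p in (0.50, 0.90, 0.98, 0.99):
    print(p, round(adjust_posterior(p, prior_orig), 3))
```

Under these assumptions, a record needs a score of about 1 - prior_orig (roughly 0.98) on the balanced scale before its adjusted probability reaches 0.5. With a fixed 0.5 cutoff, far fewer records are then predicted as Y, which is consistent with the true positive count falling from around 600 to around 150.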
I would appreciate help on this from the community members!
05-09-2014 11:42 AM
I am not familiar with SMOTE or Tomek Link, but it sounds like something worth trying in the Open Source Integration node in SAS Enterprise Miner. I will check that out and compare it to my regular rare target event methodology!
Off the top of my head, I would say that tree ensembles are well suited to rare targets. Give it a try with boosting, bagging, and gradient boosting.
In addition to a rare target, what else are you dealing with in your Analytics task?
Feel free to follow up.
05-09-2014 01:59 PM
Thanks for your quick response. Yes, you guessed it right, I am working on a binary classification problem with imbalanced proportions of Y and N.
Now, class imbalance is not solely responsible for the poor classification performance of the models. Combined with an overlap among the target classes (or rare event instances occurring in small disjuncts/islands), it further complicates the rare event classification problem. SMOTE+Tomek Links is just one of a handful of techniques aimed at achieving a "balanced" dataset, on which traditional classifiers work well.
I would rephrase my original question as: if I create a balanced dataset with Y:N almost equal, should I still use the adjusted priors (decision processing settings for input data in SAS EM) before running any model? Later on, I would still use my original dataset for scoring, just to check how many instances fall under TP, TN, etc.
I think this should be correct, as I have seen a few examples (PVK'97 Donor dataset or similar) where they started with a balanced dataset but then set the adjusted priors to the original priors before running the decision tree models etc.
I shall go through the link you shared in detail once again, as I find it very useful.
Earlier, I tried using boosting and gradient boosting on a similar dataset with the following results:
1) Embed a Decision Tree node between the Start Groups and End Groups nodes, select boosting with 5 iterations, and run the process flow: SAS EM gives good TP, but there are also a large number of FP (needless to say, interpreting these results as patterns was a daunting task, which I somehow managed). However, a higher number of boosting iterations leads to diminished performance.
2) Gradient Boosting: the SAS EM Gradient Boosting node didn't yield any results (I am not sure why). I assume gradient boosting works for binary targets as well.
Do share your thoughts/experiences on the same. Hope this information is useful to you.
05-15-2014 01:44 PM
You should look at the Cutoff node, which shows the number of TP at different cutoff points. By default, SAS EM uses 0.5, so you need to adjust this cutoff point since you have imbalanced data. I believe you need to use the original prior probabilities because they reflect your true population. But I don't think oversampling or undersampling will help you much.
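What a cutoff sweep reports can be sketched outside SAS EM in a few lines. This is only an illustration of the idea, not the Cutoff node's implementation; the function name `confusion_at_cutoff` and the scores and labels are made-up toy values.

```python
def confusion_at_cutoff(scores, actual, cutoff):
    """Return (TP, FP, FN, TN) when records with score >= cutoff are predicted Y."""
    tp = fp = fn = tn = 0
    for score, is_event in zip(scores, actual):
        predicted_event = score >= cutoff
        if predicted_event and is_event:
            tp += 1
        elif predicted_event and not is_event:
            fp += 1
        elif is_event:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

# Toy scores (hypothetical): predicted P(Y) and the actual target (True = Y)
scores = [0.92, 0.71, 0.55, 0.40, 0.35, 0.20, 0.10, 0.05]
actual = [True, True, False, True, False, False, False, False]

for cutoff in (0.5, 0.3, 0.1):
    print(cutoff, confusion_at_cutoff(scores, actual, cutoff))
```

Lowering the cutoff picks up more TP at the price of more FP, so the cutoff is best chosen against the cost of each kind of error rather than left at 0.5.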
Hope it helps
05-15-2014 03:20 PM
The tip Miguel posted on the cutoff node might be useful to you: