Hi Miguel, I really appreciate your advice. Below are the results from your suggestions:

A) Gradient Boosting (Data -> Gradient Boosting)

Fit statistics:

Model | Misclassification Rate | ASE      | ROC | Gini Coef.
Boost (Gradient Boosting) | 0.005701 | 0.005669 | 0.5 | 0

Classification table (TRAIN, target = RESPONSE_IND):

False Negative | True Negative | False Positive | True Positive
6445           | 1124016       | 0              | 0

This model classified all cases as non-response, just like the other models I built, because my response rate is only 0.0057 (6,445 cases) versus 0.9943 non-response (1,124,016 cases); both True Positives and False Positives are zero.

B) Ensemble of Reg2 and Reg3 (using 10% oversampled data)

Model            | Valid: Avg. Profit | Train: ASE | Train: Misclass. | Valid: ASE | Valid: Misclass.
Reg2 (selected)  | 1.71               | 0.090012   | 0.10002          | 0.089988   | 0.099984
Reg3             | 1.71               | 0.090012   | 0.10002          | 0.089988   | 0.099984
Ensmbl           | 1.71               | 0.098908   | 0.10002          | 0.098877   | 0.099984

C) Bagging, Boosting, and Rotation Forest

The paper you suggested is great!! I am still working on this part, since I have not used these methods before; I will let you know how it goes.

Findings from my search (I am not sure how reliable they are):

I was reading the SAS support page, which says: "Over-weighting or under-sampling can improve predictive accuracy when there are three or more classes, including at least one rare class and two or more common classes." My data has only two classes, so I suspect this explains my case: my oversampling (omitting cases from the common class) is not helping me obtain a better model, regardless of sample size. However, when I do not apply adjusted priors (i.e., I keep the priors of the sample data), misclassification increases as I increase the sample size, but so does the number of True Positives. I care more about the number of True Positives than about the error rate. Is it wrong to keep the current prior probabilities (not using adjusted priors)?
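To sanity-check the prior question for myself, here is a small sketch (in Python rather than SAS; `adjust_posterior` is my own illustrative name) of the standard prior-correction formula, which I believe is what the adjusted-prior option applies, using my population response rate of 0.0057 and a 10% oversampled training set:

```python
def adjust_posterior(p_sample, pop_prior, samp_prior):
    """Map a posterior estimated on an oversampled training set
    back to the population scale (standard prior-correction formula)."""
    # p_sample:   P(response) scored on the oversampled data
    # pop_prior:  population proportion of responders (e.g. 0.0057)
    # samp_prior: responder proportion in the training sample (e.g. 0.10)
    num = p_sample * pop_prior / samp_prior
    den = num + (1.0 - p_sample) * (1.0 - pop_prior) / (1.0 - samp_prior)
    return num / den

# A case scored at 0.50 on the oversampled scale falls well below
# a 0.50 cutoff once mapped back to the population scale:
print(round(adjust_posterior(0.50, 0.0057, 0.10), 3))  # 0.049
```

If this is right, then with adjusted priors and a 0.50 cutoff almost every case drops below the cutoff, which would explain why the True Positives vanish. Keeping the sample priors is mathematically equivalent to classifying at a much lower cutoff on the population scale, which seems like a defensible choice when True Positives matter more than overall error.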
Below is the link to the website: http://support.sas.com/documentation/cdl/en/emxndg/64759/HTML/default/viewer.htm#p1w6fewo0jhzxdn1rytuk1kt0pqj.htm

The other question is how to evaluate models when you need to detect a rare event:

Detection rate (recall): the ratio between the number of correctly detected rare events and the total number of rare events.

False alarm (false positive) rate: the ratio between the number of majority-class records misclassified as rare events and the total number of majority-class records.

The ROC curve shows the trade-off between detection rate and false alarm rate, so am I right that Misclassification, ASE, Average Profit, etc. are not sufficient metrics for evaluation when the event is rare?

Here is the title of the article: "Data Mining for Analysis of Rare Events: A Case Study in Security, Financial and Medical Applications"

Thanks, Miguel! I really appreciate your help.
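As a quick illustration of why overall error is misleading here, this minimal Python sketch (function name is my own) computes the two rates from the gradient boosting confusion counts in section A:

```python
def rare_event_metrics(tp, fn, fp, tn):
    """Detection rate (recall), false-alarm rate, and plain accuracy."""
    detection = tp / (tp + fn)         # caught rare events / all rare events
    false_alarm = fp / (fp + tn)       # majority records flagged / all majority records
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    return detection, false_alarm, accuracy

# Counts from the gradient boosting model above (every case called non-response):
det, fa, acc = rare_event_metrics(tp=0, fn=6445, fp=0, tn=1124016)
print(det, fa, round(acc, 4))  # 0.0 0.0 0.9943
```

A misclassification rate of 0.0057 (accuracy 0.9943) looks excellent even though the detection rate is literally zero, which is exactly why recall and false-alarm rate (or the ROC curve built from them) seem like the right lens for a rare event.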