04-25-2014 08:13 AM
I had a quick question. I have created several models, and I use AUC and MAPE to assess them. MAPE is calculated as below. My question: the AUC is not good, but the MAPE looks OK. How is that possible? Below are my thresholds and AUC/MAPE results (table at the bottom). So what type of decision should I make? Should I look at AUC only, or MAPE? Your help will be much appreciated. Thank you.
MAPE Calculation = (1/N) * Sum(|Actual - Predicted| / |Actual|) * 100
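The thread has no code, so here is a minimal Python sketch of the MAPE formula above, assuming plain lists of numbers for the actual and predicted values:

```python
def mape(actual, predicted):
    """Mean Absolute Percentage Error: (1/N) * sum(|a - p| / |a|) * 100."""
    n = len(actual)
    return sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted)) / n * 100

# Interval-valued example: per-case errors are 10%, 5%, and 10%
mape([100, 200, 300], [110, 190, 330])  # ≈ 8.33
```

Note that the formula divides by |Actual|, which matters for the discussion that follows.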
04-27-2014 10:41 AM
MAPE is usually for models with interval targets (regression, time series, etc.) and not appropriate for scenarios where the actual values can be 0, as this could cause a division by 0 during the MAPE calculation.
AUC is typically for binary classifiers like logistic regression.
Do you have an interval or binary target?
If you have a binary target, what is the event occurrence rate for your target? The situation you describe is common for rare target event occurrences.
To increase your c-statistic/AUC for rare targets:
- Disproportionately over-sample the rare events
- Add a weight to the rare events
- Use an inverse prior distribution
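As an illustration of the weighting idea above, here is a small stdlib-only Python sketch (the helper name is hypothetical) that computes per-class weights inversely proportional to class frequency, so rare events contribute as much to the fit as common ones:

```python
from collections import Counter

def inverse_prior_weights(y):
    """Weight each class inversely to its frequency:
    weight(class) = N / (num_classes * count(class))."""
    counts = Counter(y)
    n = len(y)
    return {cls: n / (len(counts) * c) for cls, c in counts.items()}

y = [0] * 90 + [1] * 10        # 10% event occurrence rate
w = inverse_prior_weights(y)   # {0: ≈0.56, 1: 5.0} — rare class up-weighted
```

These weights would then be passed to whatever fitting routine you use; most logistic regression implementations accept case or class weights.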
04-28-2014 06:22 AM
Thank you for your response...My target is binary and I have used Logistic Regression to build the model...The response rate varies, some models will have 50%, other 20%, other 10%.. and the lowest has around
So the response rate is not that rare. Are you saying that for a binary target I shouldn't use MAPE? What should I use then to compare actual and predicted?
04-28-2014 09:57 AM
I would use misclassification rate instead. It depends on your data, but I would be OK with a misclassification rate of 0.3 or less. Gini, AUC, c-statistic, and logarithmic loss are other common measures of binary classification accuracy.
If you have a traditional binary target whose values are 0 and 1, then you should not use MAPE because you may be dividing by 0. Even if your binary target has values other than 0 and 1, MAPE and other measures like ASE and RMSE are meant for interval targets. These measures help you understand the average distance between your numeric regression predictions and your numeric observed values. In logistic regression, you are doing a classification, not a prediction. You are labeling cases as belonging to one group or another. The distance between these groups might be arbitrary or hard to interpret, and that is why we look at the misclassification rate.
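To make the misclassification rate concrete, here is a short Python sketch, assuming predicted event probabilities and a 0.5 cutoff (the threshold is a choice, not a fixed rule):

```python
def misclassification_rate(actual, predicted, threshold=0.5):
    """Fraction of cases whose predicted class (probability >= threshold
    labels the event) disagrees with the actual class."""
    labels = [1 if p >= threshold else 0 for p in predicted]
    wrong = sum(1 for a, lab in zip(actual, labels) if a != lab)
    return wrong / len(actual)

# Two of the four cases are labeled incorrectly at the 0.5 cutoff
misclassification_rate([1, 0, 1, 0], [0.9, 0.2, 0.4, 0.6])  # 0.5
```

Note that with a binary 0/1 target, the MAPE formula above would divide by zero on every non-event case, which is exactly why it is unsuitable here.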
If your misclassification rate is between 0.3 and 0.5, then there are many steps you can take to find a more accurate model, with feature selection being the foremost. Have you tried forward, backward or stepwise variable selection? Another common problem with logistic regression is quasi-complete separation. Are any of your parameters greater than 15 or 20?
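Forward selection can be sketched in a few lines of stdlib Python. The `score` callable below is a hypothetical placeholder you would supply, e.g. cross-validated AUC of a logistic regression fit on the given feature subset:

```python
def forward_select(features, score, max_features=None):
    """Greedy forward selection: repeatedly add the feature that most
    improves score(selected) (higher is better); stop when nothing helps."""
    selected, best = [], float("-inf")
    remaining = list(features)
    while remaining and (max_features is None or len(selected) < max_features):
        trial = max(remaining, key=lambda f: score(selected + [f]))
        trial_score = score(selected + [trial])
        if trial_score <= best:
            break  # no remaining feature improves the model
        selected.append(trial)
        remaining.remove(trial)
        best = trial_score
    return selected

# Toy demonstration with a made-up additive score per feature
vals = {"x1": 3.0, "x2": 2.0, "x3": -1.0}
forward_select(vals, lambda s: sum(vals[f] for f in s))  # ["x1", "x2"]
```

Backward and stepwise selection follow the same pattern, removing (or alternately adding and removing) one feature per iteration.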
Also, your data may just be noisy and difficult to model.
04-30-2014 10:25 AM
Many thanks for coming back to me. When you say "Are any of your parameters greater than 15 or 20?", what do you mean exactly? Do you mean the number of predictors?
04-30-2014 01:19 PM
I mean: Are your actual estimated parameters very large?
I should have said "magnitude" of 15 or 20, because in the logistic function (1/(1+e^-(B0+B1*x1 + ... + Bk*xk))) a large-magnitude parameter makes the exponential values very large - close to a machine infinity - and can cause problems with your model. It is one of the most common problems with logistic regression. For more information:
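A quick Python sketch (not from the thread, just an illustration) of why large-magnitude parameters are a warning sign: even a coefficient of magnitude 20 pushes the predicted probability to essentially 0 or 1, and in double precision the exponential overflows once the exponent magnitude passes roughly 709:

```python
import math

def logistic(b0, b1, x):
    """P(event) = 1 / (1 + e^-(b0 + b1*x)) for a single predictor."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

logistic(0, 2, 1)    # ≈ 0.88, a moderate coefficient
logistic(0, 20, 1)   # ≈ 0.999999998, essentially 1
logistic(0, 20, -1)  # ≈ 2e-9, essentially 0
# With an exponent magnitude beyond ~709, math.exp raises OverflowError:
# logistic(0, 800, -1) would fail — the "machine infinity" problem above.
```

Probabilities pinned at exactly 0 or 1 for every case are the classic symptom of quasi-complete separation.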