YG1992
Obsidian | Level 7

Hi everyone,

 

At present I am trying different models on my dataset; the task is two-class classification. My findings so far:

(1) Although SVM, Random Forest and Neural Network perform well, with validation AUCs ranging from 0.87 to 0.90, traditional Logistic Regression is also a good model, with a validation AUC above 0.89.

(2) The results of the three machine learning methods are relatively stable as long as I don't set extreme values for the hyper-parameters.

(3) The distribution of the dependent variable in my dataset is extremely imbalanced, with 98.5% of observations in class 1 and 1.5% in class 2, and the second class is the one I care about. The sample size is 100k (70k for training, 30k for validation) and the population is even larger.

 

My questions are:

(1) Do you think the imbalanced distribution of the dataset impairs the performance of the model?

(2) If oversampling, undersampling, SMOTE and other resampling techniques are not allowed, how can I improve my model (from the perspective of AUC)?

(3) How can I improve the prediction precision for the second class?

 

Thanks very much!

Accepted Solution
DougWielenga
SAS Employee

A few comments first on your initial findings --

(1) Although SVM, Random Forest and Neural Network perform well, with validation AUCs ranging from 0.87 to 0.90, traditional Logistic Regression is also a good model, with a validation AUC above 0.89.

         -->  When highly flexible models do not improve the fit much over less flexible models such as those fit by Regression, it suggests that the added flexibility is not necessary. It is also possible that you don't have enough events to create a meaningful pattern. I have been told by consultants who routinely modeled 2% response rates that they did not have great confidence in model performance when there were fewer than 5,000 events. I would also question whether AUC is the best metric given the low event rate. With really rare events, performance in the top few percentiles (up to 5 or 10%, perhaps) seems most relevant, since you are only likely to consider taking action on that small subset of observations.
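
If it helps to make that concrete, here is a rough sketch of such a depth-based check (in Python rather than SAS, purely for illustration; y_valid and p_valid are hypothetical stand-ins for your 0/1 validation labels and predicted event probabilities):

    import numpy as np

    def depth_metrics(y_true, y_score, depth=0.05):
        # Capture rate and precision within the top `depth` fraction of scores
        y_true = np.asarray(y_true)
        y_score = np.asarray(y_score)
        order = np.argsort(-y_score)                # highest scores first
        n_top = int(np.ceil(depth * len(y_score)))
        top = y_true[order][:n_top]
        capture_rate = top.sum() / y_true.sum()     # share of all events found in the top slice
        precision = top.mean()                      # event rate within the top slice
        return capture_rate, precision

    # Hypothetical usage on the 30k validation observations:
    # capture, precision = depth_metrics(y_valid, p_valid, depth=0.05)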

(2) The results of the three machine learning methods are relatively stable as long as I don't set extreme values for the hyper-parameters.

          --> How are you rating the stability?  With such a low event count, I would expect that a few observations one way or the other might change a model's ranking at certain critical percentiles.

(3) The distribution of the dependent variable in my dataset is extremely imbalanced, with 98.5% of observations in class 1 and 1.5% in class 2, and the second class is the one I care about. The sample size is 100k (70k for training, 30k for validation) and the population is even larger.

            --> With 100,000 observations split 70/30, that is only ~1,050 events in training and ~450 events in validation. It is not clear that you have enough to create a meaningful split; you might be better served doing cross-validation on the entire 100,000 observations containing all ~1,500 events.
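
A minimal sketch of that idea, using Python/scikit-learn with synthetic data standing in for your own, might look like the following; stratified folds keep the event rate roughly constant in every fold so that all of the events contribute to both fitting and assessment:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import StratifiedKFold, cross_val_score

    # Synthetic stand-in for the real data: 100k rows, ~1.5% event rate
    X, y = make_classification(n_samples=100_000, n_features=20,
                               weights=[0.985], random_state=42)

    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    aucs = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                           cv=cv, scoring="roc_auc")
    print(aucs.mean().round(3), aucs.std().round(3))  # the spread across folds hints at stability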

 

***

 

Regarding your questions...

 

(1) Do you think the imbalanced distribution of the dataset impairs the performance of the model?

 

       While AUC is not directly impacted by a rare event (unlike metrics such as Lift), modeling rare events is challenging: the null model that assigns every observation to the common class is already 98.5% accurate, which is exactly why misclassification rate is a poor metric in rare-event scenarios. True model performance is probably best measured on more recent/future data, perhaps focused on the observations on which you plan to consider taking action (e.g. the top 1%, 2%, or 3% of scores).
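
As a toy illustration of why misclassification rate misleads here (Python, with simulated labels at a 1.5% event rate):

    import numpy as np

    # Toy 0/1 labels with a 1.5% event rate
    rng = np.random.default_rng(0)
    y = (rng.random(30_000) < 0.015).astype(int)

    pred_null = np.zeros_like(y)            # "null" model: always predict the common class
    accuracy = (pred_null == y).mean()      # ~0.985 even though no events are found
    events_found = pred_null[y == 1].sum()  # 0 -- the rare class is never flagged
    print(accuracy, events_found)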

(2) If oversampling, undersampling, SMOTE and other resampling techniques are not allowed, how can I improve my model (from the perspective of AUC)?

 

      As I mentioned, I would consider not splitting the data and using a cross-validation approach instead. I'm not sure why you would not be allowed to attempt techniques such as oversampling the event, but even if you did, it might result in unreasonably optimistic estimates of performance on the actual population, even after adjusting for the oversampling.

 

You might be able to improve the performance on the training data by considering...

    ... additional variables

    ... additional models (or ensembles of existing models; a simple averaging sketch follows below)

    ... different sampling strategies

 

but keep in mind that improving 'performance' on training does not necessarily translate to better performance when the model is deployed, particularly given the relatively small number (not just small percent) of target events.
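
On the ensemble point above, one simple variant is just averaging the predicted event probabilities from models you already have. A rough sketch (Python/scikit-learn, with synthetic data standing in for yours):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    # Synthetic stand-in for the real data (~1.5% event rate)
    X, y = make_classification(n_samples=50_000, n_features=20,
                               weights=[0.985], random_state=0)
    X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3,
                                              stratify=y, random_state=0)

    lr = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

    # Average the two models' predicted event probabilities and compare AUCs
    p_lr = lr.predict_proba(X_va)[:, 1]
    p_rf = rf.predict_proba(X_va)[:, 1]
    p_avg = (p_lr + p_rf) / 2
    for name, p in [("logistic", p_lr), ("forest", p_rf), ("average", p_avg)]:
        print(name, round(roc_auc_score(y_va, p), 3))

Whether the average actually helps depends on how correlated the models' errors are, so treat this as something to try, not a guaranteed gain.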

(3) How can I improve the prediction precision for the second class?

 

Precision is likewise a challenging concept with rare events. At a 1.5% event rate, an observation with a predicted probability of 0.45 (a 45% chance) is 30 times as likely as average to have the event, yet it is still more likely to have the common outcome (a 55% chance). Unless you have very strong predictors, you are unlikely to have many observations with a very high predicted probability of the rare event, so you are often in the situation of flagging an observation as an event even when it is more likely to be a non-event.
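
In practice that usually means classifying with a threshold well below 0.5. A minimal sketch (Python/scikit-learn, where y_valid and p_valid are hypothetical arrays of 0/1 labels and predicted event probabilities) that tabulates precision and recall across candidate thresholds so you can pick a trade-off you can act on:

    from sklearn.metrics import precision_score, recall_score

    # y_valid, p_valid are hypothetical: validation labels and predicted probabilities
    for threshold in (0.5, 0.2, 0.1, 0.05, 0.03):
        pred = (p_valid >= threshold).astype(int)
        print(threshold,
              round(precision_score(y_valid, pred, zero_division=0), 3),
              round(recall_score(y_valid, pred), 3))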

 

Hope this helps!

Doug

