YG1992
Obsidian | Level 7

Hi everyone,

 

At present I am trying different models on my dataset; the task is two-class classification. My findings so far:

(1) Although SVM, Random Forest and Neural Network perform well, with validation AUCs ranging from 0.87 to 0.90, traditional Logistic Regression is also a good model, with a validation AUC above 0.89.

(2) The results of the three machine learning methods are relatively stable as long as I don't set extreme values for the hyper-parameters.

(3) The distribution of the dependent variable in my dataset is extremely imbalanced, with 98.5% of observations in class 1 and 1.5% in class 2, and the second class is the one I care about. The sample size is 100k (70k for training, 30k for validation) and the population is even larger.

 

My questions are:

(1) Do you think the imbalanced distribution of the dataset impairs the performance of the model?

(2) If oversampling, undersampling, SMOTE and other resampling techniques are not allowed, how can I improve my model (from the perspective of AUC)?

(3) How can I improve the prediction precision for the second class?

 

Thanks very much!

Accepted Solution
DougWielenga
SAS Employee

A few comments first on your initial findings --

(1) Although SVM, Random Forest and Neural Network perform well, with validation AUCs ranging from 0.87 to 0.90, traditional Logistic Regression is also a good model, with a validation AUC above 0.89.

         -->  When highly flexible models do not improve the fit much over less flexible models such as those fit by Regression, it suggests that the added flexibility is not necessary. It is also possible that you don't have enough events to create a meaningful pattern. I have been told by consultants who routinely modeled 2% response rates that they did not have great confidence in model performance when there were fewer than 5,000 events. I would also question whether AUC is the best metric given the low event rate. With really rare events, performance in the top few percentiles (up to 5 or 10%, perhaps) seems most relevant, since you are only likely to consider taking action on that small subset of observations.
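
If it helps to make that concrete, here is a rough sketch of such a depth-based check (in Python rather than SAS, purely for illustration; y_valid and p_valid are hypothetical stand-ins for your 0/1 validation labels and predicted event probabilities):

    import numpy as np

    def depth_metrics(y_true, y_score, depth=0.05):
        # Capture rate and precision within the top `depth` fraction of scores
        y_true = np.asarray(y_true)
        y_score = np.asarray(y_score)
        order = np.argsort(-y_score)                # highest scores first
        n_top = int(np.ceil(depth * len(y_score)))
        top = y_true[order][:n_top]
        capture_rate = top.sum() / y_true.sum()     # share of all events found in the top slice
        precision = top.mean()                      # event rate within the top slice
        return capture_rate, precision

    # Hypothetical usage on the 30k validation observations:
    # capture, precision = depth_metrics(y_valid, p_valid, depth=0.05)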

(2) The results of the three machine learning methods are relatively stable as long as I don't set extreme values for the hyper-parameters.

          --> How are you rating the stability?  With such a low event count, I would expect that a few observations one way or the other might change a model's ranking at certain critical percentiles.

(3) The distribution of the dependent variable in my dataset is extremely imbalanced, with 98.5% of observations in class 1 and 1.5% in class 2, and the second class is the one I care about. The sample size is 100k (70k for training, 30k for validation) and the population is even larger.

            --> With 100,000 observations split 70/30, that is only ~1,050 events in training and ~450 events in validation. It is not clear that you have enough to create a meaningful split; you might be better served doing cross-validation on the entire 100,000 observations containing all ~1,500 events.
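
A minimal sketch of that idea, using Python/scikit-learn with synthetic data standing in for your own, might look like the following; stratified folds keep the event rate roughly constant in every fold so that all of the events contribute to both fitting and assessment:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import StratifiedKFold, cross_val_score

    # Synthetic stand-in for the real data: 100k rows, ~1.5% event rate
    X, y = make_classification(n_samples=100_000, n_features=20,
                               weights=[0.985], random_state=42)

    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    aucs = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                           cv=cv, scoring="roc_auc")
    print(aucs.mean().round(3), aucs.std().round(3))  # the spread across folds hints at stability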

 

***

 

Regarding your questions...

 

(1) Do you think the imbalanced distribution of the dataset impairs the performance of the model?

 

       While AUC is not directly impacted by a rare event (unlike metrics such as Lift), modeling rare events is challenging: the null model that assigns every observation to the common class is already 98.5% accurate, which is exactly why misclassification rate is a poor metric in rare-event scenarios. True model performance is probably best measured on more recent/future data, perhaps focused on the observations on which you plan to consider taking action (e.g. the top 1%, 2%, or 3% of scores).
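
As a toy illustration of why misclassification rate misleads here (Python, with simulated labels at a 1.5% event rate):

    import numpy as np

    # Toy 0/1 labels with a 1.5% event rate
    rng = np.random.default_rng(0)
    y = (rng.random(30_000) < 0.015).astype(int)

    pred_null = np.zeros_like(y)            # "null" model: always predict the common class
    accuracy = (pred_null == y).mean()      # ~0.985 even though no events are found
    events_found = pred_null[y == 1].sum()  # 0 -- the rare class is never flagged
    print(accuracy, events_found)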

(2) If oversampling, undersampling, SMOTE and other resampling techniques are not allowed, how can I improve my model (from the perspective of AUC)?

 

      As I mentioned, I would consider not splitting the data and using a cross-validation approach instead. I'm not sure why you would not be allowed to attempt techniques such as oversampling the event, but even if you did, it might result in unreasonably optimistic estimates of performance on the actual population, even after adjusting for the oversampling.

 

You might be able to improve the performance on the training data by considering...

    ... additional variables

    ... additional models (or ensembles of existing models; a simple averaging sketch follows below)

    ... different sampling strategies

 

but keep in mind that improving 'performance' on training does not necessarily translate to better performance when the model is deployed, particularly given the relatively small number (not just small percent) of target events.
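
On the ensemble point above, one simple variant is just averaging the predicted event probabilities from models you already have. A rough sketch (Python/scikit-learn, with synthetic data standing in for yours):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    # Synthetic stand-in for the real data (~1.5% event rate)
    X, y = make_classification(n_samples=50_000, n_features=20,
                               weights=[0.985], random_state=0)
    X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3,
                                              stratify=y, random_state=0)

    lr = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

    # Average the two models' predicted event probabilities and compare AUCs
    p_lr = lr.predict_proba(X_va)[:, 1]
    p_rf = rf.predict_proba(X_va)[:, 1]
    p_avg = (p_lr + p_rf) / 2
    for name, p in [("logistic", p_lr), ("forest", p_rf), ("average", p_avg)]:
        print(name, round(roc_auc_score(y_va, p), 3))

Whether the average actually helps depends on how correlated the models' errors are, so treat this as something to try, not a guaranteed gain.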

(3) How can I improve the prediction precision for the second class?

 

Precision is likewise a challenging concept with rare events. At a 1.5% event rate, an observation with a predicted probability of 0.45 (a 45% chance) is 30 times as likely as average to have the event, yet it is still more likely to have the common outcome (a 55% chance). Unless you have very strong predictors, you are unlikely to have many observations with a very high predicted probability of the rare event, so you are often in the situation of flagging an observation as an event even when it is more likely to be a non-event.
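
In practice that usually means classifying with a threshold well below 0.5. A minimal sketch (Python/scikit-learn, where y_valid and p_valid are hypothetical arrays of 0/1 labels and predicted event probabilities) that tabulates precision and recall across candidate thresholds so you can pick a trade-off you can act on:

    from sklearn.metrics import precision_score, recall_score

    # y_valid, p_valid are hypothetical: validation labels and predicted probabilities
    for threshold in (0.5, 0.2, 0.1, 0.05, 0.03):
        pred = (p_valid >= threshold).astype(int)
        print(threshold,
              round(precision_score(y_valid, pred, zero_division=0), 3),
              round(recall_score(y_valid, pred), 3))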

 

Hope this helps!

Doug

