Investigating the Effect of Severe Class Imbalances in ML Classification Scenarios

We often face classification problems in which the class frequencies are severely imbalanced. If one class is severely underrepresented in a two-class problem, we worry that our results will be biased towards the majority class, because a model can achieve high overall accuracy simply by predicting that class. For example, in a fraud problem the incidence of fraudulent transactions may be very low, say well under 1%. In such a situation, using a naive stratified sample as input to a classification algorithm may simply result in the algorithm choosing 'non-fraudulent' most of the time, producing false negatives because legitimate transactions overwhelmingly dominate the data.
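To make the concern concrete, here is a minimal sketch in Python of a "classifier" that always predicts the majority class. The data are hypothetical (10,000 transactions with a 0.5% fraud rate, chosen only for illustration) and the function name is our own. It shows how overall accuracy can look excellent while every fraudulent transaction is missed.

```python
def majority_baseline_metrics(labels):
    """Score a rule that always predicts 0 (non-fraudulent).

    labels: list of 0/1 values, where 1 marks a fraudulent transaction.
    """
    n = len(labels)
    n_fraud = sum(labels)
    # Every legitimate transaction is classified correctly ...
    accuracy = (n - n_fraud) / n
    # ... but no fraudulent transaction is ever flagged.
    recall = 0.0
    return {"accuracy": accuracy, "recall": recall, "false_negatives": n_fraud}


# Hypothetical data: 50 fraudulent and 9,950 legitimate transactions.
labels = [1] * 50 + [0] * 9950
metrics = majority_baseline_metrics(labels)
print(metrics)
```

With these assumed counts the accuracy is 99.5%, yet recall on the fraud class is zero, which is why accuracy alone is a poor yardstick under severe imbalance.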
One remedy is to oversample the known fraudulent transactions, in the expectation that a classification algorithm will then define the boundary between 'fraud' and 'non-fraud' transactions more accurately. We explore this approach in our paper, "Hybrid Rare Event Sampling Technique", for which we wrote the %HYRES macro to generate samples containing specified frequencies or percentages of events and nonevents. For the paper, we selected a number of datasets with different characteristics and frequencies of events and nonevents.
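The general idea of building a sample with a specified event percentage can be sketched in a few lines of Python. This is not the %HYRES macro and makes no claim about how that macro works internally. It is only an illustrative sketch that assumes events are drawn with replacement (oversampling) and nonevents are drawn without replacement when enough are available. The function name, its parameters, and the toy data are all hypothetical.

```python
import random


def resample_to_event_rate(rows, label_key, n_total, event_fraction, seed=0):
    """Return n_total rows in which about event_fraction are events.

    rows: list of dicts; rows[i][label_key] is 1 for an event, 0 otherwise.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    events = [r for r in rows if r[label_key] == 1]
    nonevents = [r for r in rows if r[label_key] == 0]
    if not events or not nonevents:
        raise ValueError("need at least one event and one nonevent")

    n_events = round(n_total * event_fraction)
    n_nonevents = n_total - n_events

    # Rare events: sample with replacement, so rows may repeat (oversampling).
    sampled_events = rng.choices(events, k=n_events)

    # Nonevents: sample without replacement if enough rows exist.
    if n_nonevents <= len(nonevents):
        sampled_nonevents = rng.sample(nonevents, n_nonevents)
    else:
        sampled_nonevents = rng.choices(nonevents, k=n_nonevents)

    sample = sampled_events + sampled_nonevents
    rng.shuffle(sample)
    return sample


# Hypothetical data: 10,000 transactions, 50 of them fraudulent (0.5%).
rows = [{"id": i, "fraud": 1 if i < 50 else 0} for i in range(10000)]
balanced = resample_to_event_rate(rows, "fraud", n_total=2000, event_fraction=0.20)
print(len(balanced), sum(r["fraud"] for r in balanced))
```

Because the oversampled events are repeats of the same few rows, a model trained on such a sample should still be evaluated on data that keeps the original class frequencies.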

The %HYRES macro is included as an attachment to this post for the SAS community to use as a tool for creating specific datasets with class imbalances.
