We often face classification problems in which the class frequencies are severely imbalanced. If one class is severely underrepresented in a two-class problem, the results are likely to be biased toward the majority class, because a model can achieve high overall accuracy simply by predicting that class. For example, in a fraud problem the incidence of fraudulent transactions may be very low, say well under 1%. In that situation, feeding a naive stratified sample to a classification algorithm may simply lead it to choose 'non-fraudulent' nearly all of the time. The result is false negatives: fraudulent transactions are missed because legitimate transactions overwhelmingly dominate the data.
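To make the accuracy problem concrete, here is a small arithmetic sketch in Python. The counts (100,000 transactions, 500 of them fraudulent) are made-up illustrative numbers, not data from the paper, and the function name is hypothetical.

```python
def majority_baseline_metrics(n_total, n_events):
    """Metrics for a classifier that always predicts 'nonevent'.

    n_total and n_events are illustrative counts, not real data.
    """
    true_negatives = n_total - n_events   # every nonevent is labeled correctly
    false_negatives = n_events            # every event is missed
    accuracy = true_negatives / n_total
    recall = 0 / n_events                 # no event is ever detected
    return accuracy, recall, false_negatives


# Hypothetical example: 0.5% of 100,000 transactions are fraudulent.
accuracy, recall, missed = majority_baseline_metrics(n_total=100_000, n_events=500)
print(f"accuracy={accuracy:.3f}, recall={recall:.1f}, missed events={missed}")
```

The always-'non-fraudulent' rule looks excellent on accuracy while detecting no fraud at all, which is why accuracy alone is a poor guide for rare events.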
One remedy for this problem is simply to oversample the known fraudulent transactions, in the expectation that a classification algorithm will then define the boundary between 'fraud' and 'non-fraud' transactions more accurately. We explore this approach in our paper, "Hybrid Rare Event Sampling Technique", for which we wrote the %HYRES macro to generate samples containing specified frequencies or percentages of events and nonevents. In that paper, we selected a number of datasets with different characteristics and different frequencies of events and nonevents.
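The general idea of oversampling to a specified event percentage can be sketched as follows. This is a minimal Python illustration and not the %HYRES macro: the function name, its parameters, and the toy data are all assumptions, and it uses plain random oversampling with replacement, keeping every nonevent.

```python
import random


def oversample_events(records, is_event, target_event_fraction, seed=0):
    """Return a sample in which events make up roughly target_event_fraction.

    Hypothetical helper: keeps all nonevents and draws events with
    replacement until the requested proportion is reached.
    """
    if not 0 < target_event_fraction < 1:
        raise ValueError("target_event_fraction must be between 0 and 1")
    events = [r for r in records if is_event(r)]
    nonevents = [r for r in records if not is_event(r)]
    if not events:
        raise ValueError("no events available to oversample")

    # Solve n_events / (n_events + n_nonevents) = target for n_events.
    n_events_needed = round(
        target_event_fraction * len(nonevents) / (1 - target_event_fraction)
    )
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sampled_events = rng.choices(events, k=n_events_needed)  # with replacement

    sample = nonevents + sampled_events
    rng.shuffle(sample)
    return sample


# Toy data (assumed): 1,000 transactions, 10 of them fraudulent (1%).
transactions = [{"id": i, "fraud": i < 10} for i in range(1000)]
balanced = oversample_events(transactions, lambda r: r["fraud"], 0.25)
event_share = sum(r["fraud"] for r in balanced) / len(balanced)
print(len(balanced), event_share)
```

Because events are drawn with replacement, the same fraudulent record can appear many times in the output, so a sample like this is best used for model training only, with evaluation done on data at the original event rate.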
The %HYRES macro is included as an attachment to this post for the SAS community to use as a tool for creating specific datasets with class imbalances.