SAS Data Science

Building models with SAS Enterprise Miner, SAS Factory Miner, SAS Viya (Machine Learning), SAS Visual Text Analytics, with point-and-click interfaces or programming
GuyTreepwood
Obsidian | Level 7

Hello,

 

I am currently working on a binary classification problem with a severe class imbalance (~0.05% of records are event cases and ~99.95% are non-event cases). I built baseline models on the raw, imbalanced data with a stratified train/validation/test split (70%/15%/15%). All of the models were fitted using SAS Viya 3.5 procedures in SAS Studio (TREESPLIT, LOGSELECT, FOREST, GRADBOOST, NNET, SVMACHINE), each with a PARTITION statement.
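
For illustration, the model-fitting calls were along the lines of the sketch below. This is only a sketch: the caslib, table, input, and role-variable names are placeholders rather than my actual data, and the three-way split is assumed to be encoded in a role variable picked up by the PARTITION statement.

/* Sketch only: placeholder names throughout.                             */
/* _role_ is assumed to hold 1 = train, 2 = validation, 3 = test.         */
proc gradboost data=mycas.model_table seed=54132;
    input num_var1 num_var2 / level=interval;     /* placeholder inputs   */
    input cat_var1          / level=nominal;
    target target           / level=nominal;
    partition rolevar=_role_ (train='1' validate='2' test='3');
    output out=mycas.gb_scored copyvars=(target _role_);
    savestate rstore=mycas.gb_store;          /* astore for later scoring */
run;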

 

When I calculated the model performance statistics, the test-set statistics were better than the validation-set statistics for every model, and in some cases the test-set statistics were even better than the training-set statistics. I would expect the validation and test statistics to be about the same. Does anyone know why this happens? Can I trust these results and models?
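
To be concrete about how such statistics can be computed, here is a rough sketch of an assessment step with PROC ASSESS. The table name mycas.val_scored and the probability columns p_target1 / p_target0 are placeholders for a scored validation table, not my actual code.

/* Sketch only: placeholder table and column names.                       */
proc assess data=mycas.val_scored nbins=20;
    input p_target1;                              /* P(event)             */
    target target / level=nominal event='1';
    fitstat pvar=p_target0 / pevent='0';          /* P(non-event)         */
    ods output fitstat=work.fit_val rocinfo=work.roc_val liftinfo=work.lift_val;
run;

/* AUC is reported in the ROCInfo output; lift and cumulative captured    */
/* response come from the LIFTInfo output.                                 */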

 

Here are the AUC statistics for the models I built, to illustrate what I have been seeing. Please note that I have seen the same pattern for other model performance statistics (e.g., Lift and Cumulative Captured Response %).

 

Model                 Train     Validation   Test
Random Forest         99.97%    86.77%       92.54%
SVM                   98.34%    98.05%       99.00%
Decision Tree         91.93%    83.62%       89.36%
Logistic Regression   88.45%    84.63%       90.11%
Gradient Boosting     83.00%    78.05%       82.85%
Neural Network        82.57%    75.07%       79.57%

 

2 REPLIES
ballardw
Super User

Probably close to impossible to pinpoint why without things like the starting data, the actual method used to create the different sets, and the code used to compare them. You don't mention your original sample size or how many records are in the different sets. It may be a sample-size issue, since 0.05% is only 5 cases in 10,000 records.

 

If I were concerned about such a result I would go back and create different train/validation/test data sets with different random selections, possibly several times, and see if this repeats.
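
Something along these lines would repeat a stratified selection with several different seeds; HAVE and EVENT_FLAG are placeholder names for your data set and stratification variable.

/* Sketch only: repeat the stratified 70% selection with different seeds. */
%macro resplit(seeds=1111 2222 3333 4444 5555);
    %local i seed;
    %do i = 1 %to %sysfunc(countw(&seeds));
        %let seed = %scan(&seeds, &i);
        proc surveyselect data=have samprate=0.70 seed=&seed outall
                          out=have_split_&seed;
            strata event_flag;
        run;
    %end;
%mend resplit;

%resplit;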

GuyTreepwood
Obsidian | Level 7

Hello,

 

The data I am working with has about 1.8 million records and about 900 event cases. This means that I have about 630 event cases (with ~1.26 million non-event records) in the training dataset, and approximately 135 event cases (with ~270k non-event records) in each of the validation and test datasets.

 

This is the code that I used to create the three separate datasets:

 

/************** 70% stratified sample to flag the training records ************/
proc surveyselect data=whole_population samprate=0.70
                  out=whole_population_strat seed=54132 outall stratumseed=restore;
    strata target;
run;

 

/************** Creating training and validation/test datasets ************/
data train    (drop=selected)
     val_test (drop=train_set_ind SelectionProb SamplingWeight);
    set whole_population_strat;

    if selected = 1 then do;
        train_set_ind = 1;
        output train;
    end;
    else output val_test;
run;

 

*********** Creating Validation/Test split indicator *************;
proc surveyselect data=val_test (drop=selected) samprate=.50
out = val_test_strat seed=54133 outall stratumseed=restore;
strata target;
run;

 

data val (drop=selected) test (drop=selected);
    set val_test_strat;

    if selected = 1 then do;
        val_set_ind = 1;
        output val;
    end;
    else do;
        test_set_ind = 1;
        output test;
    end;
run;
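
As a sanity check on the event counts in each partition, stacking the three datasets created above and cross-tabulating the target by partition gives something like the sketch below.

/* Sanity check (sketch): confirm event counts in each partition.         */
data part_check;
    length part $5;
    set train (in=in_train) val (in=in_val) test (in=in_test);
    if in_train    then part = 'train';
    else if in_val then part = 'val';
    else                part = 'test';
run;

proc freq data=part_check;
    tables part*target / norow nocol nopercent;
run;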

 

I did re-run the code with different random selections and saw mostly similar results, with one run showing the opposite of what I posted in my initial question, i.e., performance was better on the validation set than on the test set. I would think I am OK as long as I am getting similar results with different random splits, and the performance metrics for the validation and test sets aren't too different, correct?

Thanks,

Al

