SAS Data Science

Building models with SAS Enterprise Miner, SAS Factory Miner, SAS Viya (Machine Learning), SAS Visual Text Analytics, with point-and-click interfaces or programming
GuyTreepwood
Obsidian | Level 7

Hello,

 

I am currently working on a binary classification problem with a severe class imbalance (~0.05% of records are event cases and ~99.95% are non-event cases). I built baseline models on the raw, imbalanced data with a stratified train/validation/test split (70%/15%/15%). All of the models were fitted using SAS Viya 3.5 procedures in SAS Studio (TREESPLIT, LOGSELECT, FOREST, GRADBOOST, NNET, SVMACHINE), each with a PARTITION statement.
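
For illustration, the model-fitting calls were along the lines of the sketch below. This is only a sketch: the caslib, table, input, and role-variable names are placeholders rather than my actual data, and the three-way split is assumed to be encoded in a role variable picked up by the PARTITION statement.

/* Sketch only: placeholder names throughout.                             */
/* _role_ is assumed to hold 1 = train, 2 = validation, 3 = test.         */
proc gradboost data=mycas.model_table seed=54132;
    input num_var1 num_var2 / level=interval;     /* placeholder inputs   */
    input cat_var1          / level=nominal;
    target target           / level=nominal;
    partition rolevar=_role_ (train='1' validate='2' test='3');
    output out=mycas.gb_scored copyvars=(target _role_);
    savestate rstore=mycas.gb_store;          /* astore for later scoring */
run;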

 

When I calculated the model performance statistics, the test-set statistics were better than the validation-set statistics for every model, and in some cases the test-set statistics were even better than the training-set statistics. I would expect the validation and test statistics to be about the same. Does anyone know why this happens? Can I trust these results and models?
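
To be concrete about how such statistics can be computed, here is a rough sketch of an assessment step with PROC ASSESS. The table name mycas.val_scored and the probability columns p_target1 / p_target0 are placeholders for a scored validation table, not my actual code.

/* Sketch only: placeholder table and column names.                       */
proc assess data=mycas.val_scored nbins=20;
    input p_target1;                              /* P(event)             */
    target target / level=nominal event='1';
    fitstat pvar=p_target0 / pevent='0';          /* P(non-event)         */
    ods output fitstat=work.fit_val rocinfo=work.roc_val liftinfo=work.lift_val;
run;

/* AUC is reported in the ROCInfo output; lift and cumulative captured    */
/* response come from the LIFTInfo output.                                 */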

 

Here are the AUC statistics for the models I built, to illustrate what I have been seeing. Please note that I have seen the same pattern for other model performance statistics (e.g., Lift and Cumulative Captured Response %).

 

Model                 Train     Validation   Test
Random Forest         99.97%    86.77%       92.54%
SVM                   98.34%    98.05%       99.00%
Decision Tree         91.93%    83.62%       89.36%
Logistic Regression   88.45%    84.63%       90.11%
Gradient Boosting     83.00%    78.05%       82.85%
Neural Network        82.57%    75.07%       79.57%

 

2 REPLIES
ballardw
Super User

Probably close to impossible to pinpoint why without things like the starting data, the actual method used to create the different sets, and the code used to compare them. You don't mention your original sample size or how many records are in the different sets. It may be a sample-size issue, since 0.05% is only 5 cases in 10,000 records.

 

If I were concerned about such a result I would go back and create different train/validation/test data sets with different random selections, possibly several times, and see if this repeats.
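
Something along these lines would repeat a stratified selection with several different seeds; HAVE and EVENT_FLAG are placeholder names for your data set and stratification variable.

/* Sketch only: repeat the stratified 70% selection with different seeds. */
%macro resplit(seeds=1111 2222 3333 4444 5555);
    %local i seed;
    %do i = 1 %to %sysfunc(countw(&seeds));
        %let seed = %scan(&seeds, &i);
        proc surveyselect data=have samprate=0.70 seed=&seed outall
                          out=have_split_&seed;
            strata event_flag;
        run;
    %end;
%mend resplit;

%resplit;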

GuyTreepwood
Obsidian | Level 7

Hello,

 

The data I am working with has about 1.8 million records and about 900 event cases. This means that I have about 630 event cases (with ~1.26 million non-event records) in the training dataset, and approximately 135 event cases (with ~270k non-event records) in each of the validation and test datasets.

 

This is the code that I used to create the three separate datasets:

 

/************** 70% stratified sample to flag the training records ************/
proc surveyselect data=whole_population samprate=0.70
                  out=whole_population_strat seed=54132 outall stratumseed=restore;
    strata target;
run;

 

/************** Creating training and validation/test datasets ************/
data train    (drop=selected)
     val_test (drop=train_set_ind SelectionProb SamplingWeight);
    set whole_population_strat;

    if selected = 1 then do;
        train_set_ind = 1;
        output train;
    end;
    else output val_test;
run;

 

*********** Creating Validation/Test split indicator *************;
proc surveyselect data=val_test (drop=selected) samprate=.50
out = val_test_strat seed=54133 outall stratumseed=restore;
strata target;
run;

 

data val (drop=selected) test (drop=selected);
    set val_test_strat;

    if selected = 1 then do;
        val_set_ind = 1;
        output val;
    end;
    else do;
        test_set_ind = 1;
        output test;
    end;
run;
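
As a sanity check on the event counts in each partition, stacking the three datasets created above and cross-tabulating the target by partition gives something like the sketch below.

/* Sanity check (sketch): confirm event counts in each partition.         */
data part_check;
    length part $5;
    set train (in=in_train) val (in=in_val) test (in=in_test);
    if in_train    then part = 'train';
    else if in_val then part = 'val';
    else                part = 'test';
run;

proc freq data=part_check;
    tables part*target / norow nocol nopercent;
run;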

 

I did re-run the code with different random selections and saw mostly similar results, with one run showing the opposite of what I posted in my initial question, i.e., performance was better on the validation set than on the test set. I would think I am OK as long as I am getting similar results with different random splits, and the performance metrics for the validation and test sets aren't too different, correct?

Thanks,

Al

