SlutskyFan
Obsidian | Level 7

This is very much an applied business/predictive modeling problem.

Typically, when I have developed predictive models in the past, I have trained on a large sample of data (i.e., n~10,000) and performed validation and testing on smaller out-of-sample data sets (n~3000) that are representative of the size of the population I will ultimately score in a production environment. If model performance holds across these data sets, and if the model continues to perform well in production, I feel that I have a useful model.

Typically my models classify subjects into different risk pools. They are not perfect. Sometimes a few people in my 'low risk' group will experience an event, and sometimes a few in my 'high risk' group will never experience an event. However, these 'misclassifications', if you want to call them that, are within an acceptable (practical) range of tolerance in the production environment, which again involves scoring 'live' cohorts (n~3000).

But now I have been asked: what if I build and validate a model using similar proportions of training, test, and validation data as before, but in the production environment I have very small cohorts (n~5-10)? Can I trust my model performance on the much larger training, validation, and test data sets when I'm scoring such a small number of subjects in production? For instance, with the previous business problem, I might put 250 subjects in the high risk group, but based on the test data 75 of them never actually experience the event. We might devote resources to 250 subjects when only 175 really needed intervention. However, the harms from intervention are minimal, and economies of scale put this rate of error within the range of practical tolerance.

I'm just not confident that I can go to scoring 7-10 people in a production environment when model performance is based on training, test, and validation data sets in the proportions I have described. And with such a small cohort, the cost per subject of intervention is much higher, so an error rate that was within the range of tolerance before may no longer be acceptable in such a small space.

So, I want to know anyone's thoughts on how to deal with this problem. Should I continue to train and validate on larger data sets, but test on very small data sets? Because of the uncertainty involved, should I validate on a number of small holdout data sets (almost like assessing the generalization error across, say, 500 bootstrapped holdout samples of size 10)? I typically work in a SAS Enterprise Miner environment using gradient boosting, but I have also used logistic regression in SAS EG for some of these projects.
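
To make the second idea concrete, here is the kind of thing I have in mind in SAS/STAT code, outside of Enterprise Miner. This is only a rough sketch: it assumes a logistic model was already saved with OUTMODEL=, and the data set and variable names (WORK.HOLDOUT, EVENT, and so on) are placeholders.

/* Rough sketch only: assumes the model was saved earlier with something like
   proc logistic data=work.train outmodel=work.trained_model; ... run;
   All data set and variable names here are made up.                        */

/* 1. Draw 500 simple random samples of size 10 from the holdout data       */
proc surveyselect data=work.holdout out=small_cohorts
                  method=srs sampsize=10 reps=500 seed=20170417;
run;

/* 2. Score every replicate with the stored model                           */
proc logistic inmodel=work.trained_model;
   score data=small_cohorts out=scored;
run;

/* 3. Flag misclassifications: F_EVENT is the observed level, I_EVENT is    */
/*    the level the model classifies the subject into                       */
data scored;
   set scored;
   miss = (f_event ne i_event);
run;

proc means data=scored noprint nway;
   class replicate;
   var miss;
   output out=cohort_rates mean=miss_rate;
run;

/* 4. How much does the misclassification rate bounce around at n=10?       */
proc univariate data=cohort_rates;
   var miss_rate;
   histogram miss_rate;
run;

If the distribution of miss_rate is wide but centered near the validation misclassification rate, then the model itself is probably fine and the variability is just small-sample noise; if it is centered somewhere else entirely, that would worry me more.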

Any suggestions would be helpful. Maybe I'm missing the (random) forest for the (decision) trees? That was a joke.


2 REPLIES
SlutskyFan
Obsidian | Level 7

I think my previous post was too long or unclear, so here is a pithy restatement of my problem:

I have about 20K observations in my training data. What is the best way to test a predictive model (in terms of fit/accuracy/generalization error) when I will be scoring very small cohorts (n~5-7) in an actual implementation/production environment? Would the initial ROC score on an 80/20 split of my training and validation data be relevant? Instead, would it be better to code some method that assesses performance across a number of small test samples (n=7-10) through some sort of cross-validation or series of bootstrap samples?
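
To make the first option concrete, this is the kind of quick check I could run for the logistic version in SAS EG. It is only a rough sketch; WORK.DEVELOP, EVENT, and X1-X3 are placeholder names.

/* 80/20 split: Selected=1 goes to training, Selected=0 to validation       */
proc surveyselect data=work.develop out=develop_split
                  samprate=0.8 outall seed=1234;
run;

data train valid;
   set develop_split;
   if selected = 1 then output train;
   else output valid;
run;

/* Fit on the 80% and score the 20%: FITSTAT requests fit statistics for    */
/* the scored data (including its AUC) and OUTROC= saves the ROC curve      */
proc logistic data=train;
   model event(event='1') = x1 x2 x3;
   score data=valid out=valid_scored outroc=valid_roc fitstat;
run;

The second option is essentially the repeated small-holdout loop I sketched in my first post, just wrapped around many samples of size 7-10 instead of one 20% partition.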

Thanks.

CatTruxillo
SAS Employee

Good question, and it got me thinking of two different possible reasons it is being asked. Other readers might have other suggestions, which would be great to hear.

For illustration purposes, I’ll assume a binary target.

First thing that I thought of: if the training and validation data (your large sample) are a good representation of the population of cases that you are later scoring, then model performance should be consistent over a large number of cohorts regardless of size. However, for any specific cohort, if the sample size is small, there will be substantially more variability than you might expect, because the difference between 0 and 1 misclassified cases when the cohort size is 5 is the difference between 0% and 20% misclassified. In other words, you can only get test assessments of 0, .2, .4, .6, .8, and 1.0 misclassification. You can’t replicate a validation misclassification rate of 0.12, because there’s no way to do that with 5 observations. But, over the long run, with the accumulated results from many small cohorts, you should see convergence toward the “true” misclassification rate.
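
If it helps to see that granularity point concretely, here is a quick simulation sketch. The 0.12 "true" misclassification rate, the cohort size of 5, and the 2,000 cohorts are all just illustrative numbers.

/* Simulate 2,000 scoring cohorts of size 5 when the long-run               */
/* misclassification rate is 0.12 (both numbers are only for illustration)  */
data cohort_sim;
   call streaminit(2017);
   do cohort = 1 to 2000;
      miss = 0;
      do subject = 1 to 5;
         miss = miss + rand('bernoulli', 0.12);   /* 1 = misclassified      */
      end;
      cohort_rate = miss / 5;      /* can only be 0, .2, .4, .6, .8, or 1.0 */
      output;
   end;
   keep cohort cohort_rate;
run;

/* Any single cohort can look nothing like 0.12 ...                         */
proc freq data=cohort_sim;
   tables cohort_rate;
run;

/* ... but the average across many cohorts converges toward 0.12            */
proc means data=cohort_sim mean;
   var cohort_rate;
run;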

Second thing I thought of—Why are the test cohorts so very small when you have really big data sets for training and validation? There are a couple of reasons that I see this happen:

SCENARIO A: You have a lot of historical data going back many decades, but only a few new cases pop up each year. Here I’d be concerned about the value of the older data. How similar is the population from 20 years ago to the population today? Are the influencing factors the same? Is the background noise the same? If not, then you might need to reconsider the training and validation data.

SCENARIO B: You have a really small scoring cohort because those cases are somehow special or different from the larger model-development population. I'll use a traffic example to illustrate my point. If you are predicting traffic accidents based on all of the driving activity in a city and its suburbs, you have a large amount of data to work from. However, city planners are most interested in traffic patterns in the areas with the most problematic traffic. So, all of the scoring is performed on one particularly busy intersection with a pedestrian crosswalk (the corner of Main and Front Streets). You would certainly expect the model, developed on the larger data, to perform poorly now. In a case like this, it would be much more appropriate to use a bootstrap or bagging approach to developing the model using just the population to which the model score code will be applied (heavily trafficked, pedestrian-heavy intersections, preferably just Main and Front Streets!).
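
If Scenario B is what is going on, here is roughly what I mean by a bootstrap/bagging approach built on just the relevant subpopulation. This is only a sketch of the idea in PROC LOGISTIC terms; the WHERE filter, the data set names, the inputs, and the SUBJECT_ID variable are all placeholders, and in Enterprise Miner you could get similar behavior from an ensemble of models trained on the filtered data.

/* Keep only development cases that resemble the cohorts you will score     */
data subpop;
   set work.develop;
   where intersection_type = 'busy_pedestrian';   /* placeholder filter     */
run;

/* 100 bootstrap samples (unrestricted random sampling = with replacement)  */
proc surveyselect data=subpop out=boot_samples
                  method=urs samprate=1 reps=100 seed=8675309 outhits;
run;

/* Fit a logistic model to each bootstrap sample and score the new cohort;  */
/* because WORK.NEW_COHORT does not contain the Replicate variable, the     */
/* whole cohort is scored once for every BY group                           */
proc logistic data=boot_samples noprint;
   by replicate;
   model event(event='1') = x1 x2 x3;
   score data=work.new_cohort out=bagged_scores;
run;

/* Bagged prediction per subject = average of the 100 predicted             */
/* probabilities of the event                                               */
proc means data=bagged_scores noprint nway;
   class subject_id;
   var p_1;
   output out=bagged_final mean=p_event_bagged;
run;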

I hope this helps! Cat

