This is very much an applied business/predictive modeling problem. Typically when I have developed predictive models in the past, I have trained on a large sample of data (n ~ 10,000) and performed validation and testing on smaller out-of-sample data sets (n ~ 3,000) that are representative of the size of the population I will ultimately score in a production environment. If model performance holds across these data sets, and if the model continues to perform well in production, I feel that I have a useful model.

Typically my models classify subjects into different risk pools. They are not perfect: sometimes a few people in my 'low risk' group will experience an event, and sometimes a few in my 'high risk' group will never experience one. However, these 'misclassifications', if you want to call them that, are within an acceptable (practical) range of tolerance in the production environment, which again involves scoring 'live' cohorts of n ~ 3,000.

Now I have been asked: what if I build and validate a model using similar proportions of training, validation, and test data as before, but in the production environment I have very small cohorts (n ~ 5-10)? Can I trust model performance estimated on the much larger training, validation, and test data sets when I'm scoring such a small number of subjects in production?

For instance, with the previous business problem, I might put 250 subjects in the high risk group, of whom 75 never actually experience the event based on the test data. We might devote resources to 250 subjects when only 175 really needed intervention. However, the harms from intervention are minimal, and economies of scale put this rate of error within the range of practical tolerance. I'm just not confident that I can score only 7-10 people in production while basing my view of model performance on training, validation, and test data sets in the proportions I have described. With such a small cohort, the cost per subject of intervention is much higher, and an error rate that was within the range of tolerance before may no longer be acceptable.

So, I want to know anyone's thoughts on how to deal with this problem. Should I continue to train and validate on larger data sets, but test on very small data sets? Because of the uncertainty involved, should I validate on a number of small holdout data sets, almost like assessing the generalization error across a bootstrapped collection of, say, 500 holdout samples of size 10 (sketched below)?

I typically work in a SAS Enterprise Miner environment using gradient boosting, but I have also used logistic regression in SAS EG for some of these projects. Any suggestions would be helpful. Maybe I'm missing the (random) forest for the (decision) trees? That was a joke.
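To make that second idea concrete, here is a rough sketch of what I mean by assessing performance across many small holdouts. It's written in Python with simulated data and a placeholder logistic model purely for illustration (in practice I would implement it in SAS Enterprise Miner / EG), so the data, features, and model are assumptions, not my real pipeline:

# Sketch: train on a large sample, then score many small cohorts (n = 10)
# drawn from a held-out pool and look at the spread of cohort-level error.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Placeholder data: X (features) and y (event indicator) stand in for my
# real modeling dataset; they are simulated here just so the sketch runs.
X = rng.normal(size=(13000, 10))
y = (rng.random(13000) < 1 / (1 + np.exp(-X[:, 0]))).astype(int)

# Train on the large sample, keep a large pool aside for repeated small holdouts.
X_train, X_pool, y_train, y_pool = train_test_split(
    X, y, train_size=10000, random_state=42)
model = LogisticRegression().fit(X_train, y_train)

# Draw many small cohorts and record the per-cohort false-positive rate
# among subjects the model flags as high risk.
cohort_size, n_cohorts = 10, 500
error_rates = []
for _ in range(n_cohorts):
    idx = rng.choice(len(X_pool), size=cohort_size, replace=False)
    flagged = model.predict(X_pool[idx]) == 1           # predicted high risk
    if flagged.sum() > 0:
        false_pos = (y_pool[idx][flagged] == 0).mean()  # flagged but no event
        error_rates.append(false_pos)

# Distribution of cohort-level error, not just one aggregate number.
print(np.percentile(error_rates, [5, 50, 95]))

The spread of those cohort-level error rates (e.g., the 5th-95th percentile range) is what I would use to judge whether my error tolerance still holds at n = 10, rather than relying on the single aggregate rate from a large test set.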