03-19-2015 03:15 PM
Should missing value imputation and outlier treatment be done before splitting the data into training and validation data sets? Suppose I have split my full data into training and validation data. I have done median imputation for missing values and capped the data at the 1st and 99th percentiles in the training data set. When imputing missing data and treating outliers in the validation data set, should I use the same median and capping values that were calculated on the training data? Or would it be fine to calculate the median and percentile values from the validation data set itself? And in future, will the same process hold for a new data set that we score? I know it's not a SAS question. As many analytics professionals are active in this forum, I thought I would get an answer if I posted my question here :-)
03-19-2015 11:12 PM
If you have a target variable (your final goal is creating a predictive model), then you can use the validation dataset in all your calculations, as long as you leave out the target variable of the validation dataset. This includes calculating means and medians, or even fitting a decision tree that predicts an input variable (a model for imputation). You can also use training and validation together to do your favorite dimension reduction (PCA, VarClus, SVD, etc.).
The ultimate goal of having a validation dataset is to protect against over-fitting the target. So it can still be used to discover structure in the input variables.
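To make the idea concrete, here is a minimal Python sketch (not SAS) of computing an imputation median from the pooled inputs of training and validation while never reading the target. The column names `income` and `target`, the toy rows and the helper names are all made up for illustration.

```python
from statistics import median

# Hypothetical toy data: one input column ("income") and a target column.
train = [
    {"income": 40.0, "target": 1},
    {"income": None, "target": 0},
    {"income": 55.0, "target": 0},
]
valid = [
    {"income": 60.0, "target": 1},
    {"income": None, "target": 1},
    {"income": 45.0, "target": 0},
]

def pooled_input_median(datasets, column):
    """Median of one input column across several datasets.

    Only the named input column is read; the target is never touched.
    """
    observed = [
        row[column]
        for rows in datasets
        for row in rows
        if row[column] is not None
    ]
    return median(observed)

def impute(rows, column, fill):
    """Return copies of the rows with missing values in `column` replaced by `fill`."""
    return [
        {**row, column: fill if row[column] is None else row[column]}
        for row in rows
    ]

# One fill value, learned from the inputs of both partitions together.
fill = pooled_input_median([train, valid], "income")
train_imputed = impute(train, "income", fill)
valid_imputed = impute(valid, "income", fill)
```

Both partitions end up imputed with the same value, and the target columns are carried through unchanged.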
Do you know of any techniques or best practices that use validation data to protect against detecting a too complex input space? (For example, using validation to determine the optimal number of PCA components.)
I'm also interested in what others think about this topic.
03-20-2015 04:07 PM
*you can use the validation dataset for all your calculations* - Do you mean that I don't need to insist on calculating the mean/median on the training data set and then applying that value to the validation data set? Can I independently calculate the mean/median for missing value imputation in the validation data set? I use random forests, which use validation data to protect against detecting a too complex input space and so help with the overfitting problem.
03-20-2015 04:58 PM
I would strive to use the same preprocessing steps on training and validation (and on test, and on all observations to be scored). So doing a separate mean calculation on training and on validation is... risky. But calculating the overall mean from training and validation together and then using it for imputation is acceptable.
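As a rough illustration in Python (not SAS), one way to keep the preprocessing identical everywhere is to learn the median and the percentile caps once from a reference set, store them, and apply the stored values to every other set. The function names and the toy numbers below are my own assumptions, and the reference set here happens to be the training data; it could just as well be training and validation pooled.

```python
from statistics import median, quantiles

def fit_preprocessing(values):
    """Learn the median and the 1st/99th percentile caps from one reference set."""
    observed = [v for v in values if v is not None]
    # n=100 returns 99 cut points: index 0 is the 1st percentile,
    # index 98 is the 99th percentile.
    cuts = quantiles(observed, n=100, method="inclusive")
    return {"median": median(observed), "low": cuts[0], "high": cuts[98]}

def apply_preprocessing(values, params):
    """Impute missing values with the stored median, then cap at the stored limits."""
    filled = [params["median"] if v is None else v for v in values]
    return [min(max(v, params["low"]), params["high"]) for v in filled]

# Hypothetical toy data; None marks a missing value.
train = [3.0, 5.0, None, 7.0, 9.0, 250.0, 4.0, 6.0, 8.0, 5.0]
valid = [None, 2.0, 400.0, 6.0]

params = fit_preprocessing(train)                  # learned once
train_clean = apply_preprocessing(train, params)   # same parameters...
valid_clean = apply_preprocessing(valid, params)   # ...reused here
```

The same `params` dictionary can later be reused for new observations to be scored, so every record passes through exactly one transformation.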
03-21-2015 02:17 PM
Thank you for your reply. What risks do you see if I use a separate mean calculation on training and validation? If someone uses a multiple imputation technique for missing values, how would they ensure the same calculation in both training and validation data?
03-23-2015 06:51 AM
Imputation modifies the distribution of the input variables. It is something we don't like, but sometimes we need to live with it.
But modifying it in two ways (one way for training, another way for validation) is even worse, I think.
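A tiny Python sketch (not SAS, and not an experiment on real data) of what "two ways" means in practice: with made-up numbers, a record with a missing value is filled differently depending on which partition it happened to land in.

```python
from statistics import median

def median_of_observed(values):
    """Median of the non-missing values."""
    return median(v for v in values if v is not None)

# Hypothetical toy partitions of the same input variable; None is missing.
train = [10.0, 12.0, None, 14.0, 16.0]
valid = [20.0, None, 22.0, 30.0]

train_fill = median_of_observed(train)  # fill value learned on training only
valid_fill = median_of_observed(valid)  # fill value learned on validation only

# The same kind of record (input missing) is mapped to two different values.
gap = valid_fill - train_fill
print(f"training fill: {train_fill}, validation fill: {valid_fill}, gap: {gap}")
```

With partitions this small the gap is exaggerated; on large random splits the two medians would usually be much closer, but they are still two different transformations.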
As you see, this is not a formal answer, and I admit, I never experimented with the two approaches.
If your final goal is to create a predictive model (?), which imputation technique will you use when you do prediction for the unknown cases? The one derived from the training or from validation dataset?
By multiple imputation, do you mean what PROC MI does? Mixing multiple imputation with some validation technique might be difficult.
Someone please correct me if I'm wrong:
- MI is typically used when we have a rather small dataset (with missing values) and we have a (theoretical) model that we want to estimate.
- A validation dataset is usually used when we don't know the model exactly, so we try a series of models and use the validation dataset to select the best. Typically we have more observations (and also more columns) in this case.
As a first try I would simply concatenate the training and validation datasets and run it. (But without using the target variable.) But you will need to decide how to impute with multiple imputation when you predict one unknown case.
If you use random forests (or a tree), do you need imputation at all?