Should missing value imputation and outlier treatment be done prior to splitting the data into training and validation data sets? Suppose I have split my full data into training and validation sets. I have done median imputation for missing values and capped the data at the 1st and 99th percentiles in the training data set. When imputing missing data and treating outliers in the validation data set, should I use the same median and capping values that were calculated on the training data? Or would it be fine if I calculated the median and percentile values from the validation data set itself? In future, the same process will hold for a new data set on which we do scoring. I know it's not a SAS question, but as many analytics professionals are active in this forum, I thought I would get an answer if I posted my question here 🙂
If you have a target variable (i.e., your final goal is creating a predictive model), then you can use the validation dataset for all your calculations except those involving the target variable in the validation dataset. This includes calculating means, medians, or even a decision tree that predicts an input variable (a model for imputation), or you can use training and validation together to do your favorite dimension reduction (PCA, VarClus, SVD, etc.).
The ultimate goal of having a validation dataset is to protect against over-fitting the target. So it can still be used to discover structure in the input variables.
Do you know of any techniques or best practices that use validation data to protect against fitting a too-complex input space? (For example, using validation to determine the optimal number of PCA components, as in the sketch below.)
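To make the question concrete, here is a rough sketch of what I mean, with hypothetical data set names TRAIN and VALID and hypothetical inputs x1-x10: derive the components on training only, score validation with the same loadings, and choose the number of components by validation performance.

/* Hypothetical names: TRAIN, VALID, inputs x1-x10                 */
/* Principal components derived from the training data only        */
proc princomp data=train outstat=pca_stats noprint;
   var x1-x10;
run;

/* Apply the same loadings to the validation data                  */
proc score data=valid score=pca_stats out=valid_pcs;
   var x1-x10;
run;

One would then fit a model on the first k components (Prin1, Prin2, ...) for increasing k and keep the k that gives the best validation performance.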
I'm also interested in what others think about this topic.
*you can use the validation dataset for all your calculations* - You mean to say I don't need to insist on calculating the mean/median on the training data set and applying the values derived from the training data set to the validation dataset? That I can independently calculate the mean/median for missing value imputation in the validation dataset? I use random forests, which use validation data to guard against a too-complex input space and overcome the overfitting problem.
I would strive to use the same preprocessing steps on training and validation (and test, and all observations to be scored). So doing a separate mean calculation on training and validation is... risky. But calculating the overall mean from training and validation together and then using it in imputation is acceptable. A sketch of the training-derived approach is below.
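For illustration, here is a minimal sketch of the training-derived approach, assuming hypothetical data set names TRAIN and VALID and a single numeric input X1 (the same data step would also be applied to TRAIN itself, and later to any scoring data):

/* Hypothetical names: TRAIN, VALID, input X1                         */
/* 1) Derive the median and the 1st/99th percentiles from TRAIN only  */
proc univariate data=train noprint;
   var x1;
   output out=train_stats median=x1_med pctlpts=1 99 pctlpre=x1_p;
run;

/* 2) Apply the training-derived values to VALID                      */
data valid_prepped;
   if _n_ = 1 then set train_stats;    /* loads x1_med, x1_p1, x1_p99 */
   set valid;
   if missing(x1) then x1 = x1_med;    /* median imputation           */
   x1 = max(min(x1, x1_p99), x1_p1);   /* cap at 1st/99th percentiles */
run;

Because the statistics are computed once, on training only, every data set is preprocessed the same way.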
Thank you for your reply. What risks do you see if I use a separate mean calculation on training and validation? If someone uses a multiple imputation technique for missing values, how would they ensure the same calculation in both the training and validation data?
Imputation modifies the distribution of the input variables. It is something we don't like, but sometimes we need to live with it.
But modifying it in two ways (one way for training, another way for validation) is even worse, I think.
As you see, this is not a formal answer, and I admit I have never experimented with the two approaches.
If your final goal is to create a predictive model (?), which imputation technique will you use when you predict the unknown cases? The one derived from the training dataset or the one from the validation dataset?
By multiple imputation, do you mean what PROC MI does? Mixing multiple imputation with a validation technique might be difficult.
Please someone correct me:
- MI is typically used when we have a rather small dataset (with missing values) and we have a (theoretical) model that we want to estimate.
- A validation dataset is usually used when we don't know the model exactly, so we try a series of models and use the validation dataset to select the best one. Typically we have more observations (and also more columns) in this case.
As a first try, I would simply concatenate the training and validation datasets and run the imputation on that (but not using the target variable); see the sketch below. But you will need to decide how to impute with multiple imputation when you predict a single unknown case.
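A minimal sketch of that first try, assuming hypothetical data set names TRAIN and VALID, a hypothetical target variable TARGET, and inputs x1-x5:

/* Hypothetical names: TRAIN, VALID, TARGET, inputs x1-x5               */
/* Stack the training and validation inputs, leaving the target out     */
data both;
   set train(drop=target in=intrain)
       valid(drop=target);
   in_train = intrain;   /* flag so the stacked data can be split back  */
run;

/* Multiple imputation on the combined inputs; 5 completed data sets,   */
/* distinguished by the automatic _Imputation_ variable                  */
proc mi data=both nimpute=5 seed=27513 out=both_mi;
   var x1-x5;
run;

Each completed copy in BOTH_MI can then be split back into its training and validation parts via IN_TRAIN; how to impute a single new case at scoring time remains the open question above.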
If you use random forests (or a tree), do you need imputation at all?
Hello there, I am a medical data analyst who has also worked on multiple imputation. First of all, the history of methods for dealing with missing data is rather short. One of the earliest endeavors dates back to the 1920s (some 100 years ago), but it was not until the late 1970s (1979) that massive investigation and research on missing data was carried out. The history of intensive research into missing data may be younger than some of the users of the SAS Community. (See van Buuren's Flexible Imputation of Missing Data, Second Edition, for a more detailed description of the history of the endeavor to handle missing data.) To date, many of the problems regarding missing data remain unsolved. For instance, the theory of generalized additive models (GAMs) was first proposed in the 1980s, but it was not until 2017 (to the best of my knowledge) that the first article on handling missing data in GAMs was published.
Now I can answer some of your questions. I cannot answer all of them, because some may remain unsolved; perhaps you should browse the Internet to see whether someone other than me, or anyone in this Community, has given an answer to your question.
@Ujjawal wrote:
Should missing value imputation and outlier treatment be done prior to splitting data into training and validation data sets?
The answer to the question of whether to split prior to imputation is "yes": split first, then derive the imputation from the training data. See the paper entitled The estimation and use of predictions for the assessment of model performance using large samples wi... for details. Please note that this paper was published on 29th January 2015, months before you raised the question in this Community. So it is possible that people had been working on your problem but had not yet reached a solution viable enough to be published.
This may also be the case for outlier detection and treatment, which is troubling me as well. To date (15th October 2023), despite the presence of multiple statistics capable of detecting outliers in a single complete dataset (e.g., Cook's Distance in linear regression), there seems to be no counterpart in the arena of multiple imputation (MI). I guess one reason for this absence may be that multiple samples are created in MI, causing confusion as to whether the data analyst should use one of the imputed samples or the pooled imputed sample to calculate the statistics. So I guess that single imputation might be a potential choice for detecting outliers, as in the sketch below. But of course, my idea has not been validated, so you should browse the Internet to look for answers.
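To illustrate the single-imputation idea, here is a sketch of my (unvalidated) guess, with hypothetical names HAVE for the data set, Y for the response, and x1-x5 for the inputs: impute once, then compute Cook's Distance on the completed data.

/* Hypothetical names: HAVE, response Y, inputs x1-x5             */
/* Single median imputation so that a complete data set exists    */
proc stdize data=have out=have_imp method=median reponly;
   var x1-x5;
run;

/* Cook's Distance from a linear regression on the completed data */
proc reg data=have_imp plots=none;
   model y = x1-x5;
   output out=diag cookd=cd;
run;
quit;

Observations with a large CD value would then be candidate outliers; again, this is only a guess, not a validated procedure.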
Good luck!
Don't expect much response from the original poster, who apparently hasn't been on the forum for nearly 6 years.
Thank you for your reminder! Actually, I helped that person not because I anticipated a reply from him/her. It seems that he/she did not know much about multiple imputation back then, so I had not expected a fruitful reply from the original poster in the first place. Maybe his/her reply to my message would simply have been "Thank you". But of course things change in six years, so maybe the poster has become a master of multiple imputation by now. It's OK if he/she never responds to my message.
The reason I gave my reply was that I felt happy in the course of doing so. Also, I feel that the problems he/she encountered are in fact commonplace, and other people may benefit from viewing my post. In addition, by replying, I would like to draw the attention of statisticians in this Community to the fact that there are still frequently encountered, unsolved problems in the realm of multiple imputation. Such problems are also ubiquitous in resampling, where multiple samples are created. It would be my great honor if the questions raised here eventually turned into research projects for statisticians, and then into solutions published in the literature. This is a win-win situation for both of us.