11-11-2014 03:04 AM

Hi.

During the modelling process, suppose we have selected some variables and then find that one highly significant variable has missing values. Is there a threshold or industry-standard percentage of missing values below which the variable can be kept in the model and the missing values ignored?

I understand that we can replace the missing values with either the mean or the median of the variable; however, I wanted to see if there is a statistically based threshold.

Regards, Shivi

11-11-2014 11:37 PM

There is no agreed threshold. If you are doing a regression-type model, and the variable with missing values is a RHS (explanatory) variable, then one common workaround is to add 'missing' as a separate variable.

More specifically, if the variable with missing values is categorical, you will be adding dummy variables for each category anyway. You can then just add another category, 'missing'.
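As a rough illustration of the categorical case, here is a minimal sketch in plain Python. The function name `one_hot_with_missing`, the `is_` column prefix, and the `regions` data are all made up for the example; missing values are assumed to be stored as `None`.

```python
def one_hot_with_missing(values, missing_label="missing"):
    """Build 0/1 dummy columns, treating None as its own category.

    Hypothetical helper for illustration only.
    """
    # Give missing entries an explicit label so they form a category.
    labelled = [missing_label if v is None else v for v in values]
    categories = sorted(set(labelled))
    # One dict per observation, one 0/1 entry per category.
    rows = [{f"is_{c}": int(v == c) for c in categories} for v in labelled]
    return categories, rows


# Made-up example data: the second observation is missing.
regions = ["north", None, "south", "north"]
categories, dummies = one_hot_with_missing(regions)

for row in dummies:
    print(row)
```

In a regression with an intercept you would normally leave one of these dummies out as the reference category, exactly as you would without the 'missing' level.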

If the variable with missing values is continuous, then you can code the missing values to some arbitrary value (e.g. zero) and also include an additional dummy variable equal to one if the variable is missing.
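For the continuous case, a minimal sketch of the same idea in plain Python. The function name `fill_with_indicator` and the `income` data are invented for the example, and missing values are again assumed to be stored as `None`.

```python
def fill_with_indicator(values, fill_value=0.0):
    """Return (filled values, 0/1 missing indicator) for a continuous variable.

    Hypothetical helper for illustration only.
    """
    # Replace each missing entry with the arbitrary fill value.
    filled = [fill_value if v is None else float(v) for v in values]
    # Dummy variable: 1 where the original value was missing, else 0.
    is_missing = [int(v is None) for v in values]
    return filled, is_missing


# Made-up example data with two missing observations.
income = [52.0, None, 61.5, None]
income_filled, income_missing = fill_with_indicator(income)

print(income_filled)
print(income_missing)
```

Both columns then go into the model together, so the dummy's coefficient can pick up the offset introduced by the arbitrary fill value.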