Re: Missing values

NicolasC · Posted 05-24-2018 11:37 AM

Hi there

I have a probably trivial question.

Is there a rule of thumb for the maximum percentage of missing values a variable can have and still be used: 20%, 25%, etc.? In the extreme case of a variable with, let's say, 60% missing values, is it better to drop it or to replace it with a binary variable coded 1 (if not missing) and 0 (if missing), which would give 60% zeros and 40% ones?

Thanks

Nicolas

ballardw · Posted 05-24-2018 12:03 PM

Sounds like a project specific rule. Life and death decisions? I might not trust any data with missing values for key values. Analysis of my local D&D gaming group characters would be pretty lenient.

The question might be "why are those missing?". In survey data you quite often have missing data due to "skip patterns": a respondent did not qualify for a block of questions (males should seldom be asked about pregnancy status, or non-smokers how many cigarettes they smoke daily). The "missing" are excluded from analysis for those topics as not applicable (SAS is good about that).

If there is a systematic reason they are missing (John forgot to enter that value but Mary entered it) then the reason (data recorder) might become an effect in the analysis (or at least be examined to see if you get different results for the two data entry personnel).

If the data were collected in such a way that zero values were not recorded (not completely uncommon, e.g. amount of money spent on rutabagas last week) it might be appropriate to replace missing with 0, or possibly another default value, without recoding to binary.
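To make the two options concrete, here is a minimal Python sketch (not SAS) of the share of missing values, the 1/0 indicator from the original question, and the replace-with-0 approach. It assumes missing values are represented as `None`; the function names and the `rutabaga_spend` data are made up for illustration.

```python
from typing import List, Optional, Sequence


def missing_share(values: Sequence[Optional[float]]) -> float:
    """Fraction of entries that are missing (None)."""
    if not values:
        return 0.0
    return sum(v is None for v in values) / len(values)


def missing_indicator(values: Sequence[Optional[float]]) -> List[int]:
    """1 where a value is present, 0 where it is missing."""
    return [0 if v is None else 1 for v in values]


def fill_missing(values: Sequence[Optional[float]], default: float = 0.0) -> List[float]:
    """Replace missing entries with a default, keeping recorded values as-is."""
    return [default if v is None else v for v in values]


# Hypothetical weekly rutabaga spending; None means nothing was recorded.
rutabaga_spend = [None, 2.5, None, None, 4.0]

print(missing_share(rutabaga_spend))      # share of missing entries
print(missing_indicator(rutabaga_spend))  # binary recode
print(fill_missing(rutabaga_spend))       # zero fill
```

Zero fill is only defensible when the collection protocol really means "not recorded = zero"; otherwise the indicator keeps the fact of missingness without inventing a value.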

Do you have a large enough sample for reasonable analysis without recoding?

This is also related to the idea of imputing values for missing data.

I think this question falls under the "art" part of analysis, which is possibly a nicer way of saying "it depends".

NicolasC · Posted 05-25-2018 04:24 AM

Thanks for your answer. There are indeed missing values that have a reason for being missing, as you said. In my case, one of my variables is the average number of months between two orders from a customer. If this value is missing, it can be truly missing, but it can also simply mean that only one order was made. In the extreme case of a variable with, say, 70% missing values, and assuming you have other relevant variables you can build a model with, I see no point in using it. Even binning it (5 bins, say) and creating a separate class for the missing values means that one class covers 70% of the population. Hence my initial thought of a threshold of roughly 20%, so that with binning we get 5 classes of the same size.
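The binning arithmetic can be checked with a short Python sketch. It assumes the non-missing values are split into equal-sized bins and the missing values get one class of their own; both function names are hypothetical.

```python
def class_shares(missing_fraction: float, n_bins: int) -> dict:
    """Population share of the missing class and of each equal-sized bin."""
    if not 0.0 <= missing_fraction <= 1.0:
        raise ValueError("missing_fraction must be between 0 and 1")
    if n_bins < 1:
        raise ValueError("n_bins must be at least 1")
    return {
        "missing": missing_fraction,
        "each_bin": (1.0 - missing_fraction) / n_bins,
    }


def equal_size_missing_fraction(n_bins: int) -> float:
    """Missing fraction at which the missing class is as large as each bin.

    Solves m = (1 - m) / n_bins, which gives m = 1 / (n_bins + 1).
    """
    if n_bins < 1:
        raise ValueError("n_bins must be at least 1")
    return 1.0 / (n_bins + 1)


shares = class_shares(0.70, 5)
print(f"missing class: {shares['missing']:.1%}, each bin: {shares['each_bin']:.1%}")
print(f"4 bins + missing: {equal_size_missing_fraction(4):.1%}")
print(f"5 bins + missing: {equal_size_missing_fraction(5):.1%}")
```

Under these assumptions, 70% missing with 5 bins leaves about 6% of the population in each bin. A 20% threshold gives five equal classes when they are 4 bins plus the missing class; with 5 bins plus a missing class the balance point is nearer 17%.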

Nicolas

ballardw · Posted 05-25-2018 04:59 PM

I might start with that added "binary" variable and use it as a by group for summarizations or some basic analysis to see if there are real differences, whatever "real difference" may mean for your project, when examined across those two groups. This might tell you something else about why certain variables are missing or tend to be missing.

Suppose I am looking at a variable with many missing values and summarizing income by whether that variable is missing. I might see that the population where the variable is missing has a mean income of 10,000 and a standard deviation of 2,000, while the non-missing population has a mean income of 75,000 and a standard deviation of 12,000. There may be an income-related reason that values are missing (after excluding known systematic causes, like your example where the time between two orders requires a second order).
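As a rough illustration of that by-group summary, here is a minimal Python sketch. It assumes each record is a dict with `None` for missing; the `customers` data and the names `income` and `bottles` are made up, chosen only so the group means come out at the 10,000 and 75,000 of the example.

```python
from statistics import mean, stdev
from typing import Dict, List, Optional


def summarize_by_missingness(rows: List[dict], flag_var: str, summary_var: str) -> Dict[str, dict]:
    """Summarize summary_var separately for rows where flag_var is missing or present."""
    groups: Dict[str, List[float]] = {"missing": [], "present": []}
    for row in rows:
        key = "missing" if row.get(flag_var) is None else "present"
        value: Optional[float] = row.get(summary_var)
        if value is not None:  # skip rows where the summarized variable is itself missing
            groups[key].append(value)
    return {
        key: {
            "n": len(vals),
            "mean": mean(vals) if vals else None,
            "sd": stdev(vals) if len(vals) > 1 else None,  # sample sd needs 2+ values
        }
        for key, vals in groups.items()
    }


# Hypothetical customer records; "bottles" is the variable with many missing values.
customers = [
    {"income": 9000, "bottles": None},
    {"income": 11000, "bottles": None},
    {"income": 10000, "bottles": None},
    {"income": 70000, "bottles": 2},
    {"income": 80000, "bottles": 5},
]

summary = summarize_by_missingness(customers, "bottles", "income")
for group, stats in summary.items():
    print(group, stats)
```

A large gap between the two group means, as in this toy data, suggests that the missingness is related to income rather than random.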

If I now look at that variable with many missing values, I might find that it represents the number of bottles of a very expensive champagne purchased in the past year. That is a possible candidate for replacing missing with 0, and the missing values likely reflect the data collection protocol.