reterberb
Calcite | Level 5

Hello, 

 

Suppose I have a dataset containing input variables like this:

 

HaveAChild (Binary)    AgeOfChild (Interval)    ChildIsMarried (Binary)
1                      12                       1
0                      .                        .
0                      .                        .
1                      20                       0
1                      11                       1
0                      .                        .

 

In my predictive modelling, I would like to make use of models such as regression or neural networks, which require complete cases.

 

However, the AgeOfChild and ChildIsMarried variables are missing for observations where HaveAChild=0, which is expected since there is no child to begin with.

 

In this case, how can I handle these missing values without discarding the observations, given that imputation wouldn't really make sense (e.g. imputing a child's age for someone who has no child)?

 

Thank you. 

 

 


6 Replies
Ksharp
Super User

Then you should drop the observations with HaveAChild=0, since those obs don't mean anything.

reterberb
Calcite | Level 5
Unfortunately, in my scenario the cases with no child are still important, as there are other input variables that do not depend on whether HaveAChild is 1 or 0.

Furthermore, I would want my model to be able to score cases with no child in the future.

Is there any other alternative?
Ksharp
Super User

Then I think you should drop the variable AgeOfChild, since it is not valid for all obs.

JasonXin
SAS Employee
Hi,

If the missing values fall exactly along the line of HaveAChild = 1 and 0, then simple imputation does not work: the imputed values will run into complete or quasi-complete separation, and they will be, and should be, rejected outright by logistic regression or a neural network. It is, however, not entirely hopeless, besides the option to drop them.

If you do not have EM but have SAS/STAT, take a look at PROC MI. You may need to build your final models on each of the groups of values that MI plugs in for you. If you have a license for EM, look at the Distribution option under the Impute node. In some cases the Tree option may also work, but depending on the other variables you still may not be able to reduce the risk of quasi-separation, so Tree imputation should be the secondary option to try after Distribution. Given that your target=1 is typically very small proportionately, make sure the distribution of the non-missing values is large and 'normal' or sensible enough for you.

Hope this helps. Thank you for using SAS.

Jason Xin
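For reference, a minimal PROC MI sketch along the lines Jason describes, assuming the inputs live in a data set called CHILDREN (the data set name, seed, and FCS method choices are assumptions, not something from the thread):

/* Minimal PROC MI sketch (data set name, seed, and method choices are
   assumptions). FCS imputes ChildIsMarried with a logistic model and
   AgeOfChild with a regression model; HaveAChild has no missing values
   and only serves as a predictor in the imputation models. */
proc mi data=children nimpute=5 out=children_mi seed=27513;
   class ChildIsMarried;
   fcs logistic(ChildIsMarried) reg(AgeOfChild);
   var HaveAChild AgeOfChild ChildIsMarried;
run;

The five completed data sets in CHILDREN_MI would then be modeled separately (and, for inference, combined with PROC MIANALYZE), which is what "build your final models by the group of values MI plugs in for you" refers to.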
reterberb
Calcite | Level 5
Hi Jason,

Thanks for your reply. I am using EM for this project and my target variable is actually interval, meaning I would be using linear regression, GLMs, or a neural network.

Would this method work for interval targets? Another method I thought of is to convert AgeOfChild and ChildIsMarried into categorical variables, with the level "NA" for people without children.
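As a rough illustration of that recoding idea (the data set name, bin boundaries, and the 'NoChild' label are placeholders, not something from the thread), a short DATA step sketch:

/* Minimal sketch of the 'NA level' idea: recode the child variables into
   categorical inputs that have an explicit level for people without
   children. Names and bins are assumptions. */
data children_recode;
   set children;
   length AgeOfChildGrp ChildIsMarriedGrp $8;

   if HaveAChild = 0 then do;
      AgeOfChildGrp     = 'NoChild';
      ChildIsMarriedGrp = 'NoChild';
   end;
   else do;
      /* Bin the interval variable so the 'NoChild' level can coexist
         with the observed ages in one categorical input. */
      if      AgeOfChild < 13 then AgeOfChildGrp = '0-12';
      else if AgeOfChild < 18 then AgeOfChildGrp = '13-17';
      else                         AgeOfChildGrp = '18+';

      ChildIsMarriedGrp = put(ChildIsMarried, 1.);
   end;
run;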
JasonXin
SAS Employee
Hi,

Yes, Distribution and Tree should both work; you can try both and compare. The Tree method makes more use of the other inputs, while the Distribution method remains essentially univariate. Pay attention to the distribution inside the non-missing subgroup and to the percentage of non-missing observations: for argument's sake, if you only have 1% non-missing, I would be hard-pressed to impute at all.

Converting to 'flags': this idea is always intriguing, in the sense that the resulting indicators are by definition associated with the source variable. In the linear regression context we classically 'stay away' from categorical variables, almost by instinct, but the facilities in EM and SAS/STAT are equally robust in supporting categorical variables in variable selection and estimation, for example by way of the CLASS statement. The chances are that if you derive indicators, you can only use one of them, if any turns out to be useful at all. You could use a decision tree in EM to run a test. Make sure all the performance readings are off the validation data set.

Best Regards

Jason Xin
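To make the CLASS-statement point concrete, here is a minimal sketch of a linear regression on an interval target using the recoded categorical inputs, with performance read off a validation partition as suggested above (PROC GLMSELECT, the target name IntervalTarget, and the recoded variable names are assumptions, not from the thread):

/* Minimal sketch (target and variable names are assumptions). HaveAChild
   is left out of the model because the 'NoChild' level of the recoded
   inputs already encodes it, echoing the point that only one of the
   derived indicators may end up useful. */
proc glmselect data=children_recode seed=27513;
   partition fraction(validate=0.3);   /* hold out 30% as a validation set */
   class AgeOfChildGrp ChildIsMarriedGrp;
   model IntervalTarget = AgeOfChildGrp ChildIsMarriedGrp
         / selection=stepwise(choose=validate);
run;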
