topic Re: Missing/Not Applicable Values for Interval Variable in SAS Data Science

Missing/Not Applicable Values for Interval Variable

reterberb — Thu, 17 Nov 2016 09:27:38 GMT

Hello,

Suppose I have a dataset containing input variables like this:

(Binary)     (Interval)   (Binary)
HaveAChild   AgeOfChild   ChildIsMarried
1            12           1
0            .            .
1            20           0
1            11           1
0            .            .

In my predictive modelling, I would like to make use of models such as regression or neural networks, which require complete cases.

However, the AgeOfChild and ChildIsMarried variables are missing for observations where HaveAChild=0, which is expected since there is no child to begin with.

In this case, how can I handle these missing values without discarding them, given that imputation wouldn't really make sense (e.g. someone with no child would end up with a child's age)?

Thank you.

Re: Missing/Not Applicable Values for Interval Variable

Ksharp — Thu, 17 Nov 2016 10:21:04 GMT

Then you should drop the HaveAChild=0 observations, since these obs don't mean anything.

Re: Missing/Not Applicable Values for Interval Variable

reterberb — Thu, 17 Nov 2016 10:28:11 GMT

Unfortunately in my scenario, these cases with no child are still important, as there are other input variables which do not depend on whether HaveAChild=1 or 0.

Furthermore, I would want my model to be able to score cases with no child in the future.

Is there any other alternative?

Re: Missing/Not Applicable Values for Interval Variable

Ksharp — Thu, 17 Nov 2016 10:32:59 GMT

Then I think you should drop the variable AgeOfChild, since this variable is not valid for all obs.

Re: Missing/Not Applicable Values for Interval Variable

JasonXin — Sat, 19 Nov 2016 16:46:32 GMT

Hi,

If the missing values fall exactly along the line of HaveAChild=1 and 0, then simple imputation does not work, since the model will run into total or quasi separation. Such variables will be, and should be, rejected outright by logistic regression or a NN. It is, however, not entirely hopeless, and dropping them is not the only option.

If you do not have EM but do have SAS/STAT, take a look at PROC MI. You may need to build your final models by the group of values that MI plugs in for you.

If you have a license for EM, take a look at the Distribution option under the Impute node. In some cases the Tree option may work, but depending on the other variables, it is possible that you still may not be able to reduce the risk of 'quasi separation'. Tree imputation should be the secondary option to try after Distribution. Given that target=1 is typically a very small proportion, make sure the non-missing group is large enough and that its distribution is 'normal' or sensible enough for you.

Hope this helps. Thank you for using SAS.

Jason Xin
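As a rough illustration of the idea behind distribution-based imputation, here is a minimal Python sketch. It assumes the method amounts to filling each missing value with a random draw from the observed values of the same variable; that is a simplification for illustration, not EM's actual Impute node logic, and the function name `impute_from_distribution` is hypothetical.

```python
import random

def impute_from_distribution(values, seed=0):
    """Fill None entries with random draws from the observed values.

    Assumption: a simple empirical-distribution draw, used here only to
    illustrate the idea; it is not the EM Impute node implementation.
    """
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to draw from")
    rng = random.Random(seed)  # fixed seed so the fill is reproducible
    return [v if v is not None else rng.choice(observed) for v in values]

# AgeOfChild column from the example data (None marks a missing value)
age_of_child = [12, None, 20, 11, None]
imputed = impute_from_distribution(age_of_child)
```

The observed ages are left untouched and every filled value comes from the observed set, so the marginal distribution of AgeOfChild is roughly preserved. The draw is univariate: it does not look at any other input.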

Re: Missing/Not Applicable Values for Interval Variable

reterberb — Sun, 20 Nov 2016 08:23:02 GMT

Hi Jason,

Thanks for your reply. I am using EM for this project, and my target variable is actually interval, meaning I would be using linear regression, GLMs or NN.

Would this method work for interval targets? Another method I thought of is to convert AgeOfChild and ChildIsMarried into categorical variables, with the level "NA" for people without children.
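The recoding idea could look something like the Python sketch below. The age cut points (13 and 18) and the level labels are assumptions made up for illustration; the thread does not specify any binning.

```python
def age_band(have_a_child, age):
    """Map AgeOfChild to a categorical level, with "NA" for no child.

    The cut points are illustrative assumptions, not from the thread.
    """
    if have_a_child == 0 or age is None:
        return "NA"
    if age < 13:
        return "0-12"
    if age < 18:
        return "13-17"
    return "18+"

def married_level(have_a_child, married):
    """Map ChildIsMarried to "Yes"/"No", with "NA" for no child."""
    if have_a_child == 0 or married is None:
        return "NA"
    return "Yes" if married == 1 else "No"

# (HaveAChild, AgeOfChild, ChildIsMarried) rows from the example data
rows = [(1, 12, 1), (0, None, None), (1, 20, 0), (1, 11, 1), (0, None, None)]
recoded = [(h, age_band(h, a), married_level(h, m)) for h, a, m in rows]
```

After recoding there are no missing values left, so every row is a complete case. Note that the "NA" level coincides exactly with HaveAChild=0, so the recoded variables and HaveAChild carry overlapping information.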

Re: Missing/Not Applicable Values for Interval Variable

JasonXin — Mon, 21 Nov 2016 14:42:16 GMT

Hi,

Yes, Distribution and Tree should both work. You can try both and compare the results. The Tree method makes more use of the information in the other variables, while the Distribution method remains essentially univariate. Please pay attention to the distribution inside the non-missing subgroup and to the percentage of non-missing cases. For argument's sake, if you only have 1% non-missing, I would be hard-pressed to do it.

Converting to 'flags': this idea is always intriguing, in the sense that the resulting indicators are by definition associated with the source variable. In the linear regression context, classically we 'stay away' from categorical variables, almost by instinct. But the facilities in EM and SAS/STAT are equally robust in supporting categorical variables, in variable selection and estimation, by way of, say, the CLASS statement. The chances are that if you derive indicators, you can only use one of them, if it is useful at all. You could use a decision tree in EM to run a test. Make sure all performance readings are taken from the validation data set.

Best Regards,
Jason Xin
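The check on the non-missing subgroup can be sketched in a few lines of Python. The helper name `non_missing_summary` is hypothetical, and the choice of summary statistics is an assumption about what is worth inspecting before imputing.

```python
from statistics import mean

def non_missing_summary(values):
    """Report how much of a variable is observed and a few basic
    statistics of the observed (non-missing) subgroup."""
    observed = [v for v in values if v is not None]
    return {
        "n": len(values),
        "n_observed": len(observed),
        "share_observed": len(observed) / len(values),
        "mean": mean(observed) if observed else None,
        "min": min(observed) if observed else None,
        "max": max(observed) if observed else None,
    }

# AgeOfChild column from the example data (None marks a missing value)
age_of_child = [12, None, 20, 11, None]
summary = non_missing_summary(age_of_child)
```

In the toy data, 3 of the 5 rows are observed. On real data, a very small observed share (such as the 1% mentioned above) would be a warning sign before relying on either imputation method.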