Building models with SAS Enterprise Miner, SAS Factory Miner, SAS Visual Data Mining and Machine Learning or just with programming

Missing/Not Applicable Values for Interval Variable

New Contributor
Posts: 4

Hello, 

 

Suppose I have a dataset containing input variables like this:

 

(Binary)      (Interval)    (Binary)
HaveAChild    AgeOfChild    ChildIsMarried
1             12            1
0             .             .
0             .             .
1             20            0
1             11            1
0             .             .

 

In my predictive modelling, I would like to make use of models such as regression or neural networks, which require complete cases.

 

However, the AgeOfChild and ChildIsMarried variables are missing for observations where HaveAChild=0, which is expected since there is no child to begin with.

 

In this case, how can I handle these missing values without discarding the observations, considering that imputation wouldn't really make sense (e.g. a person with no child would end up with a child's age)?

 

Thank you. 

 

 



All Replies
Super User
Posts: 9,681

Re: Missing/Not Applicable Values for Interval Variable

Then you should drop the observations with HaveAChild=0, since these obs don't mean anything.

New Contributor
Posts: 4

Re: Missing/Not Applicable Values for Interval Variable

Unfortunately in my scenario, these cases with no child are still important, as there are other input variables which do not depend on whether HaveAChild=1 or 0.

Furthermore, I would want my model to be able to score cases with no child in the future.

Is there any other alternative?
Super User
Posts: 9,681

Re: Missing/Not Applicable Values for Interval Variable

Then I think you should drop the variable AgeOfChild, since it is not valid for all obs.

SAS Employee
Posts: 122

Re: Missing/Not Applicable Values for Interval Variable

Hi,

If the missing values line up exactly with the 1s and 0s, then simple imputation does not work, since the imputed variables will run into complete or quasi-complete separation. Logistic regression or a NN will, and should, reject them outright.

Dropping them is not the only option, however. If you do not have EM but do have SAS/STAT, take a look at PROC MI. You may need to build your final models for each group of values that MI plugs in for you.

If you have a license for EM, take a look at the Distribution option under the Impute node. In some cases the Tree option may work, but depending on the other variables, you may still not be able to reduce the risk of quasi-separation. Tree imputation should be the secondary option to try after Distribution.

Given that target=1 is typically a very small proportion, make sure the non-missing group is large enough, and its distribution 'normal' or sensible enough, for you.

Hope this helps. Thank you for using SAS.

Jason Xin
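For readers without EM at hand, the general idea behind a distribution-style imputation can be sketched roughly as follows. This is a Python illustration only, not the Impute node's actual algorithm: it assumes that missing values are replaced by random draws from the observed non-missing values, and the function name `impute_from_distribution` is hypothetical.

```python
import random

def impute_from_distribution(values, seed=0):
    """Fill None entries with random draws from the observed (non-missing) values.

    Assumption: this only approximates what a distribution-based imputation
    does; it is not the EM Impute node's implementation.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no non-missing values to draw from")
    return [v if v is not None else rng.choice(observed) for v in values]

# AgeOfChild from the question; None stands in for the SAS missing value (.)
age_of_child = [12, None, None, 20, 11, None]
imputed = impute_from_distribution(age_of_child)
print(imputed)
```

The draws look only at AgeOfChild itself and ignore every other variable, so a person with HaveAChild=0 still receives a child's age.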
New Contributor
Posts: 4

Re: Missing/Not Applicable Values for Interval Variable

Hi Jason,

Thanks for your reply. I am using EM for this project and my target variable is actually interval, meaning I would be using linear regression, GLMs or NN.

Would this method work for interval targets? Another method I thought of is to convert AgeOfChild and ChildIsMarried into categorical variables, with the level "NA" for people without children.
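A minimal sketch of that recoding idea, written in Python purely for illustration (in SAS it would be a DATA step or an EM transformation). The age bands (0-12, 13-17, 18+) and the helper names `age_band` and `married_level` are assumptions made up for the example; the rows are the sample data above.

```python
def age_band(age):
    """Map a child's age to a category; None (no child) becomes the level 'NA'."""
    if age is None:
        return "NA"
    if age < 13:
        return "0-12"   # hypothetical band boundaries
    if age < 18:
        return "13-17"
    return "18+"

def married_level(flag):
    """Map ChildIsMarried to 'Yes'/'No', with 'NA' when there is no child."""
    if flag is None:
        return "NA"
    return "Yes" if flag == 1 else "No"

# (HaveAChild, AgeOfChild, ChildIsMarried); None stands in for SAS missing (.)
rows = [
    (1, 12, 1),
    (0, None, None),
    (0, None, None),
    (1, 20, 0),
    (1, 11, 1),
    (0, None, None),
]

recoded = [(have, age_band(age), married_level(m)) for have, age, m in rows]
for r in recoded:
    print(r)
```

After this recoding no row has a missing value, so every observation can enter a regression or NN, at the cost of losing the within-band variation in age.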
Solution
‎11-23-2016 04:17 AM
SAS Employee
Posts: 122

Re: Missing/Not Applicable Values for Interval Variable

Hi,

Yes, Distribution and Tree should both work. You can try both and compare. The Tree method makes more use of the information in the other variables, while the Distribution method remains essentially univariate. Please pay attention to the distribution within the non-missing subgroup and to the percentage of observations that are non-missing. For argument's sake, if only 1% were non-missing, I would be hard-pressed to impute at all.

Converting to 'flags': this idea is always intriguing, in the sense that the resulting indicators are by definition associated with the source variable. In the linear regression context, we classically 'stay away' from categorical variables, almost by instinct. But the facilities in EM and SAS/STAT are equally robust in supporting categorical variables, in both variable selection and estimation, by way of, say, the CLASS statement. Chances are that if you derive indicators, you will be able to use only one of them, if it turns out to be useful at all. You could use a decision tree in EM to run a test. Make sure all performance readings are taken from the validation data set.

Best regards,
Jason Xin
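One way to see why only one of the derived indicators is likely to be usable: on the sample data from the question, a missing-value flag built from AgeOfChild is the exact complement of HaveAChild, and the flag built from ChildIsMarried is identical to it. A small Python check, with hypothetical variable names:

```python
# (HaveAChild, AgeOfChild, ChildIsMarried); None stands in for SAS missing (.)
rows = [
    (1, 12, 1),
    (0, None, None),
    (0, None, None),
    (1, 20, 0),
    (1, 11, 1),
    (0, None, None),
]

have_a_child = [r[0] for r in rows]
age_missing_flag = [1 if r[1] is None else 0 for r in rows]
married_missing_flag = [1 if r[2] is None else 0 for r in rows]

# Each flag carries the same information as HaveAChild, just inverted.
flag_is_complement = all(f == 1 - h for f, h in zip(age_missing_flag, have_a_child))
flags_identical = age_missing_flag == married_missing_flag

print(flag_is_complement, flags_identical)
```

Both checks hold on these six rows, so putting HaveAChild and either flag into the same linear model would make them perfectly collinear.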
☑ This topic is SOLVED.
