reterberb
Calcite | Level 5

Hello, 

 

Suppose I have a dataset containing input variables like this:

 

HaveAChild (Binary)    AgeOfChild (Interval)    ChildIsMarried (Binary)
1                      12                       1
0                      .                        .
0                      .                        .
1                      20                       0
1                      11                       1
0                      .                        .

 

In my predictive modelling, I would like to make use of models such as regression or neural networks, which require complete cases.

 

However, the AgeOfChild and ChildIsMarried variables are missing for observations where HaveAChild=0, which is expected since there is no child to begin with.

 

In this case, how can I handle these missing values without discarding the observations, given that imputation wouldn't really make sense (e.g. imputing a child's age for someone who has no child)?

 

Thank you. 

 

 


6 Replies
Ksharp
Super User

Then you should drop the observations with HaveAChild=0, since those obs don't mean anything.

reterberb
Calcite | Level 5
Unfortunately, in my scenario the cases with no child are still important, as there are other input variables that do not depend on whether HaveAChild is 1 or 0.

Furthermore, I would want my model to be able to score cases with no child in the future.

Is there any other alternative?
Ksharp
Super User

Then I think you should drop the variable AgeOfChild, since it is not valid for all obs.

JasonXin
SAS Employee
Hi,

If the missing values fall exactly along the line of HaveAChild = 1 and 0, then simple imputation does not work: the imputed values will run into complete or quasi-complete separation, and they will be, and should be, rejected outright by logistic regression or a neural network. It is, however, not entirely hopeless, besides the option to drop them.

If you do not have EM but have SAS/STAT, take a look at PROC MI. You may need to build your final models on each of the groups of values that MI plugs in for you. If you have a license for EM, look at the Distribution option under the Impute node. In some cases the Tree option may also work, but depending on the other variables you still may not be able to reduce the risk of quasi-separation, so Tree imputation should be the secondary option to try after Distribution. Given that your target=1 is typically very small proportionately, make sure the distribution of the non-missing values is large and 'normal' or sensible enough for you.

Hope this helps. Thank you for using SAS.

Jason Xin
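For reference, a minimal PROC MI sketch along the lines Jason describes, assuming the inputs live in a data set called CHILDREN (the data set name, seed, and FCS method choices are assumptions, not something from the thread):

/* Minimal PROC MI sketch (data set name, seed, and method choices are
   assumptions). FCS imputes ChildIsMarried with a logistic model and
   AgeOfChild with a regression model; HaveAChild has no missing values
   and only serves as a predictor in the imputation models. */
proc mi data=children nimpute=5 out=children_mi seed=27513;
   class ChildIsMarried;
   fcs logistic(ChildIsMarried) reg(AgeOfChild);
   var HaveAChild AgeOfChild ChildIsMarried;
run;

The five completed data sets in CHILDREN_MI would then be modeled separately (and, for inference, combined with PROC MIANALYZE), which is what "build your final models by the group of values MI plugs in for you" refers to.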
reterberb
Calcite | Level 5
Hi Jason,

Thanks for your reply. I am using EM for this project and my target variable is actually interval, meaning I would be using linear regression, GLMs, or a neural network.

Would this method work for interval targets? Another method I thought of is to convert AgeOfChild and ChildIsMarried into categorical variables, with the level "NA" for people without children.
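As a rough illustration of that recoding idea (the data set name, bin boundaries, and the 'NoChild' label are placeholders, not something from the thread), a short DATA step sketch:

/* Minimal sketch of the 'NA level' idea: recode the child variables into
   categorical inputs that have an explicit level for people without
   children. Names and bins are assumptions. */
data children_recode;
   set children;
   length AgeOfChildGrp ChildIsMarriedGrp $8;

   if HaveAChild = 0 then do;
      AgeOfChildGrp     = 'NoChild';
      ChildIsMarriedGrp = 'NoChild';
   end;
   else do;
      /* Bin the interval variable so the 'NoChild' level can coexist
         with the observed ages in one categorical input. */
      if      AgeOfChild < 13 then AgeOfChildGrp = '0-12';
      else if AgeOfChild < 18 then AgeOfChildGrp = '13-17';
      else                         AgeOfChildGrp = '18+';

      ChildIsMarriedGrp = put(ChildIsMarried, 1.);
   end;
run;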
JasonXin
SAS Employee
Hi,

Yes, Distribution and Tree should both work; you can try both and compare. The Tree method makes more use of the other inputs, while the Distribution method remains essentially univariate. Pay attention to the distribution inside the non-missing subgroup and to the percentage of non-missing observations: for argument's sake, if you only have 1% non-missing, I would be hard-pressed to impute at all.

Converting to 'flags': this idea is always intriguing, in the sense that the resulting indicators are by definition associated with the source variable. In the linear regression context we classically 'stay away' from categorical variables, almost by instinct, but the facilities in EM and SAS/STAT are equally robust in supporting categorical variables in variable selection and estimation, for example by way of the CLASS statement. The chances are that if you derive indicators, you can only use one of them, if any turns out to be useful at all. You could use a decision tree in EM to run a test. Make sure all the performance readings are off the validation data set.

Best Regards

Jason Xin
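To make the CLASS-statement point concrete, here is a minimal sketch of a linear regression on an interval target using the recoded categorical inputs, with performance read off a validation partition as suggested above (PROC GLMSELECT, the target name IntervalTarget, and the recoded variable names are assumptions, not from the thread):

/* Minimal sketch (target and variable names are assumptions). HaveAChild
   is left out of the model because the 'NoChild' level of the recoded
   inputs already encodes it, echoing the point that only one of the
   derived indicators may end up useful. */
proc glmselect data=children_recode seed=27513;
   partition fraction(validate=0.3);   /* hold out 30% as a validation set */
   class AgeOfChildGrp ChildIsMarriedGrp;
   model IntervalTarget = AgeOfChildGrp ChildIsMarriedGrp
         / selection=stepwise(choose=validate);
run;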
