04-19-2016 03:15 AM
This question is an overlap of methodology and SAS programming - hopefully it fits here ...
I wish to build a predictive model with explanatory variables that have different types of missing values.
e.g. (this is made up)
Response Variable: Primary policy holder of a current insurance policy purchases an additional insurance benefit (add-on)
Explanatory Variable 1: Number of customers on policy (no missing values)
Explanatory Variable 2: How the current policy was purchased (sales channel) (can include missing values. Missing values = unknown)
Explanatory Variable 3: Country of origin (can include missing values. Missing values = unknown)
Explanatory Variable 4: How many claims has the customer made (0+) (no missing values)
Explanatory Variable 5: Maximum settlement time of claims made (if missing, this is because no claims were made)
Explanatory Variable 6: Maximum claim amount (if missing, this is because no claims were made)
Is there a way to distinguish the "missing value" in explanatory variables 5 and 6 (missing because it is not applicable) from the missing values in explanatory variables 2 and 3 (missing because unknown)?
Effectively, I want to treat the missing values in explanatory variables 5 and 6 as a category of their own, but those in variables 2 and 3 as genuinely missing.
My first step was to use hpsplit to gauge which interactions to include in a logistic regression model (as per the paper: Methods for Interaction Detection in Predictive Modeling Using SAS). I see that SAS has special missing values (.a - .z); however, it doesn't seem that hpsplit treats them differently. There seems to be a blanket treatment for all missing values via assignmissing=BRANCH|NONE|POPULAR|SIMILAR (from SAS/STAT® 14.1 User’s Guide, The HPSPLIT Procedure).
Any suggestions would be greatly appreciated on how to handle such missing values / interrelated variables.
04-19-2016 07:01 PM - edited 04-28-2016 08:55 AM
Thank you! Yes, I will use a negative value. (This website had told me that my submission of this question had been unsuccessful, so I was surprised to receive this response here!)
Edit: I should also note that for the regression I made sure to include interaction terms between variables 4 and 5 and between variables 4 and 6.
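For anyone finding this later, here is a minimal sketch of the negative-value recoding, written in plain Python rather than SAS so the logic is easy to follow; the same rule should translate directly to a DATA step. The field names (n_claims, max_settle_days, max_claim_amt), the sentinel value -1 and the sample rows are all made up for illustration. The idea is that a claim field is set to the sentinel only when the customer has made no claims, so "not applicable" becomes an ordinary value, while any other missing value stays missing.

```python
# Assumed sentinel: any value outside the valid range (times and amounts are >= 0).
NOT_APPLICABLE = -1.0


def recode_claim_fields(row):
    """Return a copy of row with 'not applicable' claim fields set to the sentinel.

    row is a dict with hypothetical keys n_claims, max_settle_days and
    max_claim_amt. None stands for a missing value.
    """
    out = dict(row)
    if out["n_claims"] == 0:
        # No claims were made, so these fields are not applicable, not unknown.
        out["max_settle_days"] = NOT_APPLICABLE
        out["max_claim_amt"] = NOT_APPLICABLE
    return out


# Made-up example rows.
policies = [
    {"n_claims": 0, "max_settle_days": None, "max_claim_amt": None},
    {"n_claims": 2, "max_settle_days": 14.0, "max_claim_amt": 1200.0},
    # Hypothetical case: a claim exists but its settlement time is unknown.
    {"n_claims": 1, "max_settle_days": None, "max_claim_amt": 300.0},
]

recoded = [recode_claim_fields(p) for p in policies]
```

After this step only the third row still has a missing value, so whatever blanket missing-value rule the procedure applies affects only the genuinely unknown cases. In the regression, a sentinel like this probably only makes sense together with the interaction terms with variable 4 (or a separate no-claims indicator), since -1 is not a real settlement time or amount.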