12-01-2015 08:30 AM
I built a data set for scoring purposes. It required merging many tables, mostly with left joins.
That means I have many missing values for customers who aren't present in the second table of a join.
For example, the variable counting the number of times the customer has opened an email will be missing for those who do not have an email.
So a missing value carries different information from a 0, which indicates that the customer received emails but didn't open them!
I replaced the missing values with 0, but I don't think that's optimal.
My question is: is there a way to distinguish between missing values and 0, so that the difference can be taken into account by the model (a logistic regression, which doesn't accept missing values)?
Any advice is appreciated!
12-01-2015 11:01 AM
12-02-2015 09:23 AM
It's exactly about the "informativeness of the missing". I didn't know this name before.
You described the problem very well. Your suggested solution works very well when the variable is only a flag (a binary or even a nominal variable).
However, when the variable is numeric it's more complicated, because -1 or even -9999 is treated as a value within the variable's range and not as separate, independent information. This will affect the estimated parameter of the variable. I'm not sure how much that matters in practice.
I think the model can distinguish between both populations (those who don't have an email and those who didn't open emails) as long as a "flag email" variable is present in the final model. This is because it takes into account the different interactions between variables... but this remains a hypothesis.
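To make the "flag email" idea concrete, here is a minimal sketch of how the flag could be built before zero-filling the count. The names work.scoring, nb_opened_email, has_email and target are assumptions for illustration only, not the actual names in the scoring table.

```sas
/* Hypothetical names: work.scoring, nb_opened_email, has_email, target */
data work.scoring_flagged;
    set work.scoring;
    /* 1 when the left join found an email record, 0 when it did not.
       This must be computed before the count is zero-filled. */
    has_email = not missing(nb_opened_email);
    /* Zero-fill so PROC LOGISTIC keeps the observation;
       has_email now carries the "no email" information. */
    if missing(nb_opened_email) then nb_opened_email = 0;
run;

proc logistic data=work.scoring_flagged;
    /* Main effects only: the flag plus the zero-filled count */
    model target(event='1') = has_email nb_opened_email;
run;
```

With both terms in the model, customers without an email get the linear predictor b0, while customers with an email get b0 + b1 + b2 * nb_opened_email. The slope b2 is therefore driven only by customers who have an email, and b1 separates "no email" from "email but zero opens", so the filled-in 0 does not distort the slope the way a -1 or -9999 sentinel would.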
I'm not sure whether SAS/STAT or SAS Enterprise Miner offers a built-in solution for this. But it could be a format like this:
proc format;
    value nb_opened_email_fmt
        . = "No email";
run;
Or maybe a special missing value.
I haven't tried either of them yet, but one of them might solve the problem.
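For reference, a rough sketch of what the special missing value could look like, combined with a format. The data set names and the choice of .N are assumptions for illustration. As far as I know, special missing values are still treated as missing by PROC LOGISTIC, so observations with .N in a continuous covariate would still be excluded; this mainly documents why the value is missing.

```sas
/* Hypothetical names: work.scoring, work.scoring_special, nb_opened_email */
proc format;
    value nb_opened_email_fmt
        .N = "No email";   /* label only the special missing value */
run;

data work.scoring_special;
    set work.scoring;
    /* Recode the join-induced missing values to the special missing .N */
    if missing(nb_opened_email) then nb_opened_email = .N;
    format nb_opened_email nb_opened_email_fmt.;
run;

/* The MISSING option shows .N as its own row, labelled "No email" */
proc freq data=work.scoring_special;
    tables nb_opened_email / missing;
run;
```

If the count were binned and used as a CLASS variable, the MISSING option of the CLASS statement in PROC LOGISTIC would treat the missing level as a valid category, but that changes the variable from numeric to categorical.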