Re: missing values VS 0 in a logistic regression

majdi_ka · Posted 12-01-2015 08:30 AM

Hi,

I built a data set for scoring purpose. And it needed multiple merges between many tables especially left joins.

that means I have many missing values for customers who aren't present in the second table of the join.

For example, the variable indicating the number of times the customer has opened an email will be missing for those who have not an email.

So a missing value is a different information from a 0 which indicates that the customer recieved emails but didn't open them!

I replaced missing values with 0, but I think it's not optimal.

My question is is there a way to distinguish between missing values and 0, so that it can be taken into account by the model (logistic regression which don't accept missing values)

Any advice is appreciated!

Thanks,

MK

Reeza · Posted 12-01-2015 10:36 AM

It depends on your model. You could model a multinomial logistic model with a -1 being never received an email, 0 didn't open and 1 opened the email.

majdi_ka · Posted 12-02-2015 08:56 AM

@Reeza,

Thanks for your answer.

In fact, it's about my input variables and not my target variables.

It's what @JasonXin talked about!

JasonXin · Posted 12-01-2015 11:01 AM

MK,
This is a great question. In modeling communities, sometime you hear things like 'informativeness of the missing". This is what this question is about.

Suppose this is your code.

data merge2;
merge source1 (in=a) source(=b);
by accountnumber;
if a;
run;

Since you do left join, so I add the line "If a" to keep everybody from source1. Your question therefore is: for those in source2 but not in source1, since they never got emails. you are comfortable setting the variable to 0. But, there are a group of accounts (that are on both source1 and source2) who had opportunity to have receive emails, but did not respond. Their value, before mergeing, is set at 0. You DO NOT want to mix them with those who are not found on source1. One idea is this

data merge2;
merge source1(in=a) source2(in=b);
by accountnumber;
if a;
if a and not b the new_flag2=-1; else
if a and current_responseflag=1 then new_flag2=1;
else new_flag2=0;

Hope this helps. Best Regards. Jason Xin

majdi_ka · Posted 12-02-2015 09:23 AM

Jason,

It's exactly about 'informativeness of the missing". I didn't know about this name before.

You described the problem very well. Your suggested solution works very well if it's only about a flag (Binary or even nominal variable).

However, when the variable is numeric it's more complicated to deal with because -1 or even -9999 is considered as a value of the variable's range and not as an extra independent information. this will effect the estimated parameter of the variable. Not sure it's relevant.

I think that the model can distinguish between both populations (those who don't have an email and those who didn't open emails) as long as a variable "flag email" is present in the final model. This is because it takes into account different interactions between variables...But this it remains a hypothesis.

I'm not sure if there is a solution that sas stat or SAS Enterprise Miner consider. But it could be a format like this

Proc format;
     value nb_opened_email_fmt . = "No email";
run;

Or may be a special missing value.

I didn't try anyone of them yet but it can solve the problem.

Best Regards,

MK

missing values VS 0 in a logistic regression