BookmarkSubscribeRSS Feed
majdi_ka
Obsidian | Level 7

Hi,

 

I built a data set for scoring purpose. And it needed multiple merges between many tables especially left joins.

that means I have many missing values for customers who aren't present in the second table of the join.

 

For example, the variable indicating the number of times the customer has opened an email will be missing for those who have not an email.

So a missing value is a different information from a 0 which indicates that the customer recieved emails but didn't open them!

 

I replaced missing values with 0, but I think it's not optimal.

 

My question is is there a way to distinguish between missing values and 0, so that it can be taken into account by the model (logistic regression which don't accept missing values)

 

Any advice is appreciated!

 

 

Thanks,

 

MK

4 REPLIES 4
Reeza
Super User
It depends on your model. You could model a multinomial logistic model with a -1 being never received an email, 0 didn't open and 1 opened the email.
majdi_ka
Obsidian | Level 7

@Reeza,

 

Thanks for your answer.

 

In fact, it's about my input variables and not my target variables.

 

It's what @JasonXin talked about!

 

 

 

JasonXin
SAS Employee
MK,
This is a great question. In modeling communities, sometime you hear things like 'informativeness of the missing". This is what this question is about.

Suppose this is your code.

data merge2;
merge source1 (in=a) source(=b);
by accountnumber;
if a;
run;

Since you do left join, so I add the line "If a" to keep everybody from source1. Your question therefore is: for those in source2 but not in source1, since they never got emails. you are comfortable setting the variable to 0. But, there are a group of accounts (that are on both source1 and source2) who had opportunity to have receive emails, but did not respond. Their value, before mergeing, is set at 0. You DO NOT want to mix them with those who are not found on source1. One idea is this

data merge2;
merge source1(in=a) source2(in=b);
by accountnumber;
if a;
if a and not b the new_flag2=-1; else
if a and current_responseflag=1 then new_flag2=1;
else new_flag2=0;

Hope this helps. Best Regards. Jason Xin
majdi_ka
Obsidian | Level 7

Jason,

 

It's exactly about 'informativeness of the missing". I didn't know about this name before. 

 

You described the problem very well. Your suggested solution works very well if it's only about a flag (Binary or even nominal variable).

 

However, when the variable is numeric it's more complicated to deal with because -1 or even -9999 is considered as a value of the variable's range and not as an extra independent information. this will effect the estimated parameter of the variable. Not sure it's relevant.

 

I think that the model can distinguish between both populations (those who don't have an email and those who didn't open emails) as long as a variable "flag email" is present in the final model. This is because it takes into account different interactions between variables...But this it remains a hypothesis.

 

I'm not sure if there is a solution that sas stat or SAS Enterprise Miner consider. But it could be a format like this

Proc format;
     value nb_opened_email_fmt . = "No email";
run;

Or may be a special missing value.

 

I didn't try anyone of them yet but it can solve the problem. 

 

Best Regards,

MK

 

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to choose a machine learning algorithm

Use this tutorial as a handy guide to weigh the pros and cons of these commonly used machine learning algorithms.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 4 replies
  • 941 views
  • 0 likes
  • 3 in conversation