04-12-2016 02:45 PM
I’m using SAS/STAT and PROC LOGISTIC to build some basic retail product propensity models. My questions have less to do with the procedure itself than with a problem in the data I’m using for some of these “on the shelf” logistic regression models. I thought this was a good place to get some initial advice on how to handle it.
In a nutshell, the customer IDs on which I base the model build samples are not an accurate representation of customers. A number of customers were assigned more than one customer ID (more than one email, more than one address, and similar issues). So two customer IDs could actually be one “customer”.
Because I built my propensity models on customer ID, each model saw only part of the behavior of those customers, and counted some customers more than once.
Here are my questions:
Any feedback will be greatly appreciated! Thanks!
04-12-2016 03:02 PM
An assumption for regression is independence between observations.
The ID issue violates this assumption, so yes, you should fix it if possible.
04-12-2016 03:14 PM
Thanks! If the issue cannot be corrected, do you have any suggestions on how to take it into account when building new models based on this data?
04-25-2016 10:06 AM
Please don't think this is a flippant answer, but if you cannot set up the IDs as independent, then I would strongly suggest that you not build new models from the data, but rather spend your available time and money on collecting usable data.
However, if that really can't be done, then some sort of hierarchical modeling might be attempted, with the multiple IDs nested within each unique customer. If the unique identifier can be found, then you might treat the multiple records as repeated measures on the individual. From there, it gets considerably murkier, as model selection procedures in the mixed model realm are not easily defined. You will have to depend on subject knowledge more than you may want.
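To make the grouping idea concrete, here is a minimal sketch in Python (not SAS) of the data preparation only, not of the mixed model itself. It assumes a crosswalk from each customer ID to a unique person key can be built; the names `rows`, `id_to_person`, `group_by_person`, `collapse`, and all of the values are hypothetical. Each person becomes a cluster of records, which is the unit you would then treat as the subject of the repeated measures.

```python
from collections import defaultdict

# Hypothetical model-build rows, keyed by the customer ID the model was built on.
rows = [
    {"customer_id": "A1", "purchases": 2, "responded": 0},
    {"customer_id": "A2", "purchases": 1, "responded": 1},
    {"customer_id": "B1", "purchases": 4, "responded": 0},
    {"customer_id": "C1", "purchases": 0, "responded": 0},
    {"customer_id": "C2", "purchases": 3, "responded": 1},
    {"customer_id": "C3", "purchases": 1, "responded": 0},
]

# Hypothetical crosswalk from each customer ID to the unique person behind it.
id_to_person = {
    "A1": "P1", "A2": "P1",
    "B1": "P2",
    "C1": "P3", "C2": "P3", "C3": "P3",
}


def group_by_person(rows, id_to_person):
    """Group records into one cluster per unique person."""
    clusters = defaultdict(list)
    for row in rows:
        # Assumption: an ID with no known link is treated as its own person.
        person = id_to_person.get(row["customer_id"], row["customer_id"])
        clusters[person].append(row)
    return dict(clusters)


def collapse(clusters):
    """Collapse each cluster to a single row per person.

    This is a simpler alternative to a hierarchical model: it restores
    one observation per customer at the cost of discarding the
    within-customer structure.
    """
    collapsed = {}
    for person, members in clusters.items():
        collapsed[person] = {
            "n_ids": len(members),
            "purchases": sum(m["purchases"] for m in members),
            # The person counts as a responder if any of their IDs responded.
            "responded": max(m["responded"] for m in members),
        }
    return collapsed


clusters = group_by_person(rows, id_to_person)
per_person = collapse(clusters)
for person in sorted(per_person):
    print(person, per_person[person])
```

If the records are kept separate rather than collapsed, the person key is what would be supplied as the subject in a repeated-measures or random-intercept specification, for example through the SUBJECT= option in SAS. How to sum or otherwise combine behavior across IDs is a subject-matter decision, not something the sketch settles.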