Hi all,
I’m using SAS/STAT and PROC LOGISTIC to build some basic retail product propensity models. My questions concern an issue that has come up with the data behind some of these “on the shelf” logistic regression models. I thought this was a good place to get some initial advice on how to handle it.
In a nutshell, the customer IDs I used as the basis for the model build samples do not accurately represent customers. A number of customers were assigned more than one customer ID (more than one email, more than one address, and similar issues). So, two customer IDs could actually be one “customer”.
Since I built my propensity models on customer ID, I modeled only portions of some customers' behavior and counted some customers more than once.
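To make the problem concrete, here is a minimal Python sketch of how customer IDs that share an email or an address could be grouped under one unified identifier. The record layout `(customer_id, email, address)`, the function name `link_customer_ids`, and the rule "same email or same address means same customer" are all assumptions for illustration, not a description of the actual data; real matching rules would need business input.

```python
def link_customer_ids(records):
    """Group customer IDs that share an email or an address (assumed rule).

    records: iterable of (customer_id, email, address) tuples (hypothetical layout).
    Returns a dict mapping each customer_id to a unified ID, taken here as the
    smallest customer_id in its linked group.
    """
    parent = {}

    def find(x):
        # Walk up to the group root, shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            # Keep the smaller ID as the root so the result is deterministic.
            if rb < ra:
                ra, rb = rb, ra
            parent[rb] = ra

    first_seen = {}  # (field name, value) -> first customer_id seen with it
    for cust_id, email, address in records:
        parent.setdefault(cust_id, cust_id)
        for key in (("email", email), ("address", address)):
            if key[1] is None:
                continue  # a missing value links nothing
            if key in first_seen:
                union(cust_id, first_seen[key])
            else:
                first_seen[key] = cust_id

    return {cust_id: find(cust_id) for cust_id in parent}


# Made-up example records
records = [
    (101, "a@example.com", "1 Main St"),
    (102, "a@example.com", "9 Oak Ave"),  # same email as 101
    (103, "b@example.com", "9 Oak Ave"),  # same address as 102
    (104, "c@example.com", "5 Elm Rd"),   # no overlap with anyone
]
unified = link_customer_ids(records)
```

Under these assumptions, IDs 101, 102 and 103 end up in one group even though 101 and 103 share nothing directly, which is why the linking has to be transitive.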
Here are my questions:
- Should I rebuild the models once the data is corrected?
- Should I do some validation work on the existing models now, by combining all associated customer IDs, creating a new identifier, and creating gains charts?
- If the data issue cannot be corrected, is there a way to take these duplicate customer IDs into account when I rebuild or create new models?
Any feedback will be greatly appreciated! Thanks!
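The validation idea in the second question could be sketched roughly as follows, in Python. This assumes a mapping from customer ID to unified ID already exists, and it assumes one way of collapsing duplicates (highest score per unified customer, responder if any linked ID responded); both the function name `gains_table` and that collapsing rule are illustrative choices, not an established method for this data.

```python
def gains_table(rows, id_map, n_bins=10):
    """Cumulative gains after collapsing duplicate customer IDs.

    rows: (customer_id, model_score, responded) tuples (hypothetical layout).
    id_map: customer_id -> unified ID; unmapped IDs are kept as they are.
    """
    # Collapse to one row per unified customer:
    # highest score, and responder if any linked ID responded (assumed rule).
    merged = {}
    for cust_id, score, responded in rows:
        uid = id_map.get(cust_id, cust_id)
        best, hit = merged.get(uid, (float("-inf"), False))
        merged[uid] = (max(best, score), hit or bool(responded))

    # Rank customers from highest to lowest score.
    ranked = sorted(merged.values(), key=lambda r: r[0], reverse=True)
    total_resp = sum(hit for _, hit in ranked)
    n = len(ranked)

    table, cum_resp = [], 0
    for b in range(n_bins):
        lo, hi = b * n // n_bins, (b + 1) * n // n_bins
        cum_resp += sum(hit for _, hit in ranked[lo:hi])
        table.append({
            "bin": b + 1,
            "cum_customers": hi,
            "cum_response_share": cum_resp / total_resp if total_resp else 0.0,
        })
    return table


# Made-up scores: IDs 101 and 102 belong to the same customer.
id_map = {101: 101, 102: 101}
rows = [
    (101, 0.80, 0),
    (102, 0.30, 1),
    (103, 0.60, 1),
    (104, 0.20, 1),
    (105, 0.10, 0),
]
table = gains_table(rows, id_map, n_bins=2)
```

Comparing this table with the one built on raw customer IDs would give a first indication of how much the duplicate IDs distort the apparent performance of the existing models.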
Regression assumes that the observations are independent of one another.
The ID issue violates this assumption, so yes, you should fix it if possible.
Thanks! If the issue cannot be corrected, do you have any suggestions on how to take it into account when building new models based on this data?
Please don't think this is a flippant answer, but if you cannot set up the IDs as independent, then I would strongly suggest that you not build new models from the data, but rather spend your available time and money on collecting usable data.
However, if that really can't be done, then some sort of hierarchical modeling might be attempted to account for the multiple IDs per unique customer. If the unique identifier can be found, then you might treat the multiple IDs as repeated measures on the individual. From there, it gets considerably murkier, as model selection procedures in the mixed model realm are not easily defined. You will have to depend on subject knowledge more than you may want.
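As a sketch of what the repeated-measures idea could look like, one common formulation is a random-intercept logistic model. The notation here is assumed for illustration: $i$ indexes the unique customer, $j$ indexes the customer IDs belonging to that customer, $y_{ij}$ is the purchase indicator, and $\mathbf{x}_{ij}$ holds the predictors for that ID.

```latex
\operatorname{logit}\,\Pr(y_{ij} = 1 \mid u_i)
  = \mathbf{x}_{ij}^{\top}\boldsymbol{\beta} + u_i,
\qquad u_i \sim N(0, \sigma_u^2)
```

The shared intercept $u_i$ is what lets records from the same customer be correlated; with $\sigma_u^2 = 0$ the model reduces to ordinary logistic regression on independent observations. In SAS, models of this kind are usually fitted with a mixed-model procedure such as PROC GLIMMIX rather than PROC LOGISTIC.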
Steve Denham