Statistical Procedures

RobertNYC · Posted 04-12-2016 02:45 PM

Hi all,

I’m using SAS/STAT and PROC LOGISTIC to build some basic retail product propensity models. My questions have to do with an issue that has developed in the data I’m using for some of these “on the shelf” logistic regression models. I thought this was a good place to get some initial advice on how to handle it.

In a nutshell, the customer IDs on which I base the model build samples are not an accurate representation of customers. A number of customers were assigned more than one customer ID (more than one email, more than one address, issues like that). So, two customer IDs could actually be one “customer”.

Since I built my propensity models using Customer ID, that means I only modeled on portions of customer behavior, and duplicated customers as well.

Here are my questions:

Should I rebuild the models, once the data is corrected?
Should I do some validation work on the existing models now, by combining all associated customer IDs, creating a new identifier, and creating gains charts?
If the data issue cannot be corrected, is there a way to take these duplicate customer IDs into account when I rebuild or create new models?
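
The validation idea in the second question could be sketched roughly as follows. This is only an illustration in Python, not SAS: the function name `customer_level_gains`, the `id_to_person` lookup, and the roll-up rules (take the highest score across a person's IDs, count the person as a responder if any of their IDs responded) are all assumptions made for the example, not something stated in the thread.

```python
from collections import defaultdict

def customer_level_gains(scores, outcomes, id_to_person, n_bins=10):
    """Cumulative gains table after rolling customer IDs up to persons.

    scores:       {customer_id: model score}
    outcomes:     {customer_id: 1 if purchased else 0}
    id_to_person: {customer_id: consolidated person key} (hypothetical lookup)
    Returns a list of (bin, persons_so_far, cumulative_share_of_responders).
    """
    person_score = defaultdict(float)
    person_outcome = defaultdict(int)
    for cid, score in scores.items():
        person = id_to_person.get(cid, cid)  # unlinked IDs stand alone
        # Assumption: a person's score is the max over their IDs, and they
        # count as a responder if any of their IDs responded.
        person_score[person] = max(person_score[person], score)
        person_outcome[person] = max(person_outcome[person], outcomes[cid])

    # Rank persons from highest to lowest score, then cut into equal bins.
    ranked = sorted(person_score, key=person_score.get, reverse=True)
    total_responders = sum(person_outcome.values())
    rows = []
    captured = 0
    for b in range(n_bins):
        start = b * len(ranked) // n_bins
        stop = (b + 1) * len(ranked) // n_bins
        captured += sum(person_outcome[p] for p in ranked[start:stop])
        share = captured / total_responders if total_responders else 0.0
        rows.append((b + 1, stop, share))
    return rows
```

Comparing this table with the one built on raw customer IDs would give a first impression of how much the duplicate IDs distort the apparent lift of the existing models.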

Any feedback will be greatly appreciated! Thanks!

Reeza · Posted 04-12-2016 03:02 PM

An assumption for regression is independence between observations.

The ID issue violates this assumption, so yes, you should fix it, if possible.
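
As a rough illustration of what such a fix might look like, here is a minimal sketch in Python (not SAS) that groups customer IDs into one key per person. It assumes that a list of (customer ID, attribute) pairs is available, where the attribute is something like an email or an address, and that IDs sharing any attribute belong to the same person. That assumption will not always hold (household members share an address, for example), and the function name `consolidate_ids` is made up for the example.

```python
def consolidate_ids(links):
    """Map each customer ID to one person key via union-find.

    links: iterable of (customer_id, attribute) pairs, e.g. an email or an
    address. IDs that share any attribute, directly or through a chain of
    other IDs, are treated as one person (an assumption of this sketch).
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)  # smallest ID becomes the key

    first_id_for_attr = {}
    for cid, attr in links:
        find(cid)  # register the ID even if it never links to another
        if attr in first_id_for_attr:
            union(cid, first_id_for_attr[attr])
        else:
            first_id_for_attr[attr] = cid
    return {cid: find(cid) for cid in parent}


links = [
    ("C1", "a@x.com"),
    ("C2", "a@x.com"),
    ("C2", "12 Main St"),
    ("C3", "12 Main St"),
    ("C4", "z@y.com"),
]
person_key = consolidate_ids(links)
```

In this made-up example C1, C2 and C3 end up under the key C1, while C4 stays on its own. The resulting key could then serve as the new identifier when rebuilding the build samples.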

RobertNYC · Posted 04-12-2016 03:14 PM

Thanks! If the issue cannot be corrected, do you have any suggestions on how to take it into account when building new models based on this data?

SteveDenham · Posted 04-25-2016 10:06 AM

Please don't think this is a flippant answer, but if you cannot set up the IDs as independent, then I would strongly suggest that you not build new models from the data, but rather spend your available time and money on collecting usable data.

However, if that really can't be done, then some sort of hierarchical modeling might be attempted, regarding the multiple IDs per unique customer. If the unique identifier can be found, then you might consider the multiple measures as a repeated measure on the individual. From there, it gets considerably murkier, as model selection procedures in the mixed model realm are not easily defined. You will have to depend on subject knowledge more than you may want.

Steve Denham

Proc Logistic: Rebuild models when errors in build data are discovered?