RobertNYC
Obsidian | Level 7

 

Hi all,

I’m using SAS/STAT and PROC LOGISTIC to build some basic retail product propensity models. My questions concern an issue that has developed with the data I’m using for some of these “on the shelf” logistic regression models, and I thought this was a good place to get some initial advice on how to handle it.

 

In a nutshell, the customer IDs I use as the basis for the model-build samples are not an accurate representation of customers. A number of customers were assigned more than one customer ID (more than one email, more than one address, and similar issues). So two customer IDs could actually be one “customer”.

 

Since I built my propensity models using customer ID, I modeled only portions of some customers' behavior and counted some customers more than once.

 

Here are my questions:

  • Should I rebuild the models once the data is corrected?
  • Should I do some validation work on the existing models now, by combining all associated customer IDs, creating a new identifier, and producing gains charts?
  • If the data issue cannot be corrected, is there a way to take these duplicate customer IDs into account when I rebuild the models or create new ones?

Any feedback will be greatly appreciated! Thanks!

3 REPLIES
Reeza
Super User

An assumption of regression is independence between observations.

The ID issue violates this assumption, so yes, you should fix it if possible.
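
As a rough illustration of what "fixing it" could look like, here is a minimal Python sketch (standard library only) that groups customer IDs sharing an email or an address under one key. The field layout `(cust_id, email, address)`, the exact-match rule, and the function name `resolve_customers` are all assumptions made for illustration; real matching would likely need fuzzier rules and a manual review step. The same logic could be reproduced in a SAS data step.

```python
def resolve_customers(records):
    """Map each customer ID to a single key shared by all linked IDs.

    `records` is an iterable of (cust_id, email, address) tuples. Two IDs
    are linked if they share an email or an address (case-insensitive exact
    match, an assumed rule). Links are transitive, so a union-find
    structure is used. The key for a group is its smallest customer ID.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)  # smaller ID stays the root

    first_seen = {}  # (field, normalized value) -> first cust_id with it
    for cust_id, email, address in records:
        find(cust_id)  # register the ID even if it has no usable fields
        for field, value in (("email", email), ("address", address)):
            if not value:
                continue  # skip missing or empty values
            key = (field, value.strip().lower())
            if key in first_seen:
                union(cust_id, first_seen[key])
            else:
                first_seen[key] = cust_id

    return {cust_id: find(cust_id) for cust_id in parent}


# Hypothetical data: 101 and 102 share an email, 102 and 103 share an address.
records = [
    (101, "a@x.com", "1 Main St"),
    (102, "A@X.com", "9 Oak Ave"),
    (103, "b@y.com", "9 Oak Ave"),
    (104, "c@z.com", "5 Elm Rd"),
]
person_of = resolve_customers(records)
print(person_of)  # {101: 101, 102: 101, 103: 101, 104: 104}
```

The resulting key can then serve as the new identifier when behavior is rolled up for rebuilding or validating the models.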

 

 

RobertNYC
Obsidian | Level 7

Thanks! If the issue cannot be corrected, do you have any suggestions on how to take it into account when building new models based on this data?

SteveDenham
Jade | Level 19

Please don't think this is a flippant answer, but if you cannot set up the IDs as independent, then I would strongly suggest that you not build new models from the data, but rather spend your available time and money on collecting usable data.

 

However, if that really can't be done, then some sort of hierarchical modeling might be attempted, with the multiple IDs nested within each unique customer. If the unique identifier can be found, then you might consider the multiple IDs as repeated measures on the individual. From there, it gets considerably murkier, as model selection procedures in the mixed model realm are not easily defined. You will have to depend on subject knowledge more than you may want.

 

Steve Denham
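
To make the repeated-measures idea a little more concrete, here is a minimal Python sketch (standard library only) of the data layout such a model would need: one row per customer ID, tagged with the unique customer it belongs to and a within-customer repeat index. It only prepares the data and does not fit a mixed model. The column names (`cust_id`, `bought`, `person`, `rep`, `n_ids`) and the `person_of` mapping are hypothetical, and the sketch assumes the unique customer has already been identified somehow.

```python
from collections import defaultdict


def to_repeated_measures(rows, person_of):
    """Arrange ID-level rows as repeated measures within each unique customer.

    rows      : list of dicts, each with a 'cust_id' key plus model variables
    person_of : dict mapping cust_id -> unique customer key (assumed known)

    Returns new rows sorted by customer, each with three added fields:
    'person' (the subject key), 'rep' (1-based index within the customer),
    and 'n_ids' (how many IDs that customer has).
    """
    by_person = defaultdict(list)
    for row in rows:
        by_person[person_of[row["cust_id"]]].append(row)

    out = []
    for person in sorted(by_person):
        group = sorted(by_person[person], key=lambda r: r["cust_id"])
        for rep, row in enumerate(group, start=1):
            out.append({**row, "person": person, "rep": rep, "n_ids": len(group)})
    return out


# Hypothetical example: IDs 101 and 102 are the same person, 104 is separate.
rows = [
    {"cust_id": 102, "bought": 1},
    {"cust_id": 104, "bought": 0},
    {"cust_id": 101, "bought": 0},
]
person_of = {101: 101, 102: 101, 104: 104}
for r in to_repeated_measures(rows, person_of):
    print(r)
```

In SAS, a column like `person` is the kind of variable one would name in a `SUBJECT=` option, for example on the `REPEATED` statement of PROC GENMOD or the `RANDOM` statement of PROC GLIMMIX. Which of those fits this data is a modeling judgment the sketch does not settle.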
