04-22-2013 08:24 AM
Hi, I am planning to build a logistic regression model, but my original dataset contains only 5 variables:
- Gender
- Region
- Acorn group
- Transactions (Recency, Frequency, Value)
- Number of products
What's the minimum number of variables required to build a decent model? In my previous experience, I would have at least 100 variables, then use decision trees to select the most significant ones (the top 10, for example), then run them through logistic regression using stepwise selection. Many Thanks Alice
04-22-2013 09:17 AM
I don't believe there is a required minimum number of covariates in a model. I assume that for 'effective' data mining one may prefer a large number of covariates, but given you only have five, you could just run the logistic regression using all five... there's probably no need to mine the data first.
Ehehe, these are my thoughts.
04-22-2013 09:53 AM
The minimum number of predictors is 0. With that degenerate model, the logistic regression becomes just a test of whether the outcome proportions are equal.
If you have one predictor, the logistic model also has equivalents from Stat 101. If you have a binary outcome and a categorical predictor, it is a form of chi-squared test. If the predictor is continuous, it is roughly the same as a z-test. The calculations are slightly different, but the results are asymptotically equivalent.
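To make that equivalence concrete, here is a quick self-contained sketch (the 2x2 counts are invented for illustration). With a single binary predictor, the maximum-likelihood fitted probabilities from a logistic regression are just the per-group proportions, so the likelihood-ratio test can be computed directly and compared with the Pearson chi-squared statistic:

```python
import math

# Hypothetical 2x2 table: (successes, failures) per group.
table = {"A": (120, 380), "B": (90, 410)}

n_succ = sum(s for s, f in table.values())
n_fail = sum(f for s, f in table.values())
n_tot = n_succ + n_fail

# Pearson chi-squared statistic: sum of (observed - expected)^2 / expected.
pearson = 0.0
for s, f in table.values():
    n_g = s + f
    exp_s = n_g * n_succ / n_tot
    exp_f = n_g * n_fail / n_tot
    pearson += (s - exp_s) ** 2 / exp_s + (f - exp_f) ** 2 / exp_f

# Likelihood-ratio statistic: 2 * (loglik of the one-predictor logistic
# model minus loglik of the intercept-only model). With one binary
# predictor, the fitted probabilities are the per-group proportions.
def loglik(s, f, p):
    return s * math.log(p) + f * math.log(1 - p)

ll_full = sum(loglik(s, f, s / (s + f)) for s, f in table.values())
ll_null = loglik(n_succ, n_fail, n_succ / n_tot)
lr_stat = 2 * (ll_full - ll_null)

print(f"Pearson chi-squared: {pearson:.3f}")
print(f"Logistic LR test:    {lr_stat:.3f}")  # close, not identical
```

The two statistics are not identical in finite samples, but both are compared to the same chi-squared reference distribution and converge to each other as the sample grows.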
Having a lot of predictors may get you a model that fits that particular dataset more closely (higher c-index), but it is not necessarily more accurate or reproducible in general. When you get new or additional data, the model with many variables may break down. Harrell and others have done a lot of simulation work on the impact of over-specifying a model. If you have lots of data (as we often do in data mining), it is less of an issue.
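Here is a small simulation sketch of that break-down (all numbers are hypothetical, and the gradient-descent fitter is just a stand-in for a proper logistic routine). None of the 50 predictors has any real relationship to the outcome, yet the training c-index looks impressive while the test c-index stays near chance:

```python
import numpy as np

rng = np.random.default_rng(0)

# Pure-noise data: 50 predictors, outcome generated independently of them.
n, p = 200, 50
X_train = rng.standard_normal((n, p))
y_train = rng.integers(0, 2, n)
X_test = rng.standard_normal((n, p))
y_test = rng.integers(0, 2, n)

def fit_logistic(X, y, lam=1e-2, lr=0.1, steps=2000):
    """Plain gradient-descent logistic regression with a small L2 penalty."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        pred = 1 / (1 + np.exp(-X @ w))
        grad = X.T @ (pred - y) / len(y) + lam * w
        w -= lr * grad
    return w

def c_index(y, score):
    """Concordance: fraction of (event, non-event) pairs ranked correctly."""
    pos, neg = score[y == 1], score[y == 0]
    diff = pos[:, None] - neg[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

w = fit_logistic(X_train, y_train)
print(f"train c-index: {c_index(y_train, X_train @ w):.3f}")  # well above 0.5
print(f"test  c-index: {c_index(y_test, X_test @ w):.3f}")    # near 0.5 (chance)
```

The gap between the two numbers is exactly the over-specification problem: the model has memorised noise in the training sample.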
04-22-2013 10:09 AM
Many Thanks Duke! That's really helpful... and what about the size of the sample? For example, my population is 100,000 and Cheers Alice
04-22-2013 10:25 AM
Many Thanks Duke! That's really helpful... and what about the size of the sample? My original population is 110,000 and I am only building my model on 7,000. How robust will the model be? Basically, I am trying to build a WinBack model (customers who are more likely to reactivate). In total there are 110,000 customers who have lapsed (haven't made any transaction in the last 3 months). I have a file of 7,500 contacted customers, containing lapsed customers who have responded successfully (Winback='Yes') and others who haven't reactivated (Winback='No'). They have only called 7,000 customers to encourage them to come back. My question: are the 7,000 records that I have sufficient to build a winback model?
04-22-2013 12:58 PM
14,000 is certainly enough to build a model. It may or may not be informative; only validation will tell you that. One concern with 'region' is that it may have a lot of levels, so some of them may end up with too few observations to estimate reliably.
Another thing you may be able to get to sharpen your estimates is a measure of affluence (if you have addresses with ZIP+4, you can link them to census files to obtain mean income). There is some evidence that winback campaigns work less well for those at the extremes of the socio-economic spectrum.
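For a back-of-the-envelope sample-size check, a common rule of thumb (Harrell; Peduzzi et al.) is roughly 10-20 events per candidate model degree of freedom, where "events" means the rarer outcome. The response rate below is purely hypothetical; substitute the observed Winback='Yes' share from the contact file:

```python
# Rough events-per-variable (EPV) check for the winback model.
n_contacted = 7000
response_rate = 0.12  # hypothetical; use the actual Winback='Yes' rate

# Count the rarer of the two outcomes as the "events".
events = min(n_contacted * response_rate,
             n_contacted * (1 - response_rate))

max_df_conservative = events / 20  # conservative budget (20 events per df)
max_df_liberal = events / 10       # liberal budget (10 events per df)

print(f"events: {events:.0f}")
print(f"supported model df: {max_df_conservative:.0f} to {max_df_liberal:.0f}")
```

With a 12% response, 7,000 contacts give 840 events, enough for a model with roughly 40-80 degrees of freedom, which comfortably covers a handful of predictors unless a categorical variable has very many levels.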
04-22-2013 01:10 PM
Hi Doc, Thanks for your message. I only have 7,500, not 14,000. Sorry, maybe I didn't explain it properly in my previous message. The lapsed population is 110,000 customers, and I am building my model on 7,500 contacted customers. Many Thanks Alice
04-22-2013 02:40 PM
Might still be enough. Where you can get into sample-size trouble is when your categorical variables have many levels, since the real determinant is the number of degrees of freedom the model consumes, not the number of variables per se.
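As a quick illustration of counting degrees of freedom rather than variables (the level counts below are hypothetical, loosely based on the variables Alice listed), a categorical predictor with k levels costs k - 1 dummy-coded parameters, so one many-level variable can dwarf everything else:

```python
# Hypothetical level counts; None marks a continuous predictor (1 df).
predictors = {
    "Gender": 2,        # 2 levels -> 1 df
    "Region": 12,       # hypothetical: 12 regions -> 11 df
    "AcornGroup": 6,    # hypothetical: 6 groups -> 5 df
    "Recency": None,    # continuous -> 1 df
    "Frequency": None,  # continuous -> 1 df
    "Value": None,      # continuous -> 1 df
    "NumProducts": None # continuous -> 1 df
}

def df_cost(levels):
    """Dummy-coded parameter cost: k - 1 for k levels, 1 if continuous."""
    return 1 if levels is None else levels - 1

total_df = sum(df_cost(k) for k in predictors.values())
for name, k in predictors.items():
    print(f"{name:12s} df = {df_cost(k)}")
print(f"total model df = {total_df}")  # 7 predictors, but 21 parameters
```

Here a 12-level Region alone uses 11 of the 21 degrees of freedom, which is exactly the situation where some levels can end up too sparse to estimate.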