Question
Fluorite | Level 6

Hi,

I am planning to build a logistic regression model, but my original dataset contains only 5 variables:

- Gender
- Region
- Acorn group
- Transactions (Recency, Frequency, Value)
- Number of products

What's the minimum number of variables required to build a decent model? In my previous experience, I would have at least 100 variables, then use decision trees to select the most significant variables (the top 10, for example), and then run them through logistic regression using stepwise selection.

Many thanks,
Alice

9 REPLIES 9
AncaTilea
Pyrite | Level 9

Hi.

I don't believe there is a required minimum number of covariates in a model. I assume for 'effective' data mining, one may prefer a large number of covariates, but for you, given you have five covariates you could just run the logistic regression using the five covariates...there's probably no need to mine the data, first.

Ehehe, these are my thoughts.

Question
Fluorite | Level 6

Many Thanks Anca...I would run the Logistic Regression and see what I get...

Ksharp
Super User

Or use variable cluster analysis to pick out the group of variables that makes the most sense?

Ksharp

Doc_Duke
Rhodochrosite | Level 12

The minimum number of predictors is 0.  With that degenerate (intercept-only) model, the logistic regression becomes just a test of whether the two outcome proportions are equal, that is, whether the event probability is 0.5.
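To make the zero-predictor case concrete, here is a minimal sketch (the counts are hypothetical, not from this thread). With no predictors the fitted intercept is just the logit of the observed proportion, so the Wald test of "intercept = 0" can be set next to the ordinary one-sample z-test of a proportion against 0.5. The function names are made up for illustration.

```python
import math

def intercept_only_logit(events, n):
    """Closed-form fit of a logistic model with no predictors."""
    p_hat = events / n
    intercept = math.log(p_hat / (1 - p_hat))      # MLE is the logit of the observed proportion
    se = math.sqrt(1 / (n * p_hat * (1 - p_hat)))  # from the inverse Fisher information
    return intercept, se, intercept / se           # Wald z for H0: intercept = 0

def proportion_z(events, n, p0=0.5):
    """Classic one-sample z-test of a proportion against p0."""
    p_hat = events / n
    return (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)

# Hypothetical counts, chosen only for illustration
events, n = 540, 1000
b0, se, wald_z = intercept_only_logit(events, n)
z = proportion_z(events, n)
print(f"intercept={b0:.4f}  se={se:.4f}  Wald z={wald_z:.3f}  proportion z={z:.3f}")
```

The two statistics are not identical because they estimate the variance at different points, but they should land close together when the sample is large.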

If you have one predictor, the logistic also has equivalents from Stat 1.  If you have a binary outcome and a classification predictor, it is a type of chi-squared test.  If the predictor is continuous, it is roughly the same as a z-test.  The calculations are slightly different, but the results are asymptotically equivalent.
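As a rough illustration of the one-predictor case with a two-level classification variable: the logistic slope for a 0/1 predictor equals the log odds ratio of the 2x2 table, so its Wald chi-square can be compared directly with the Pearson chi-square. The cell counts and the function name below are hypothetical.

```python
import math

def two_by_two(a, b, c, d):
    """a, b = events / non-events in group 1; c, d = events / non-events in group 0."""
    log_or = math.log((a * d) / (b * c))         # equals the logistic slope for a 0/1 predictor
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # large-sample standard error of the log odds ratio
    wald_chi2 = (log_or / se) ** 2               # Wald chi-square on 1 df
    n = a + b + c + d
    # Pearson chi-square for a 2x2 table, no continuity correction
    pearson_chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return log_or, wald_chi2, pearson_chi2

# Hypothetical table: 120 of 500 respond in group 1, 90 of 500 in group 0
log_or, wald_chi2, pearson_chi2 = two_by_two(120, 380, 90, 410)
print(f"log OR={log_or:.4f}  Wald chi2={wald_chi2:.3f}  Pearson chi2={pearson_chi2:.3f}")
```

The two chi-square values differ slightly in finite samples, which is the "calculations are slightly different" part; they converge as the sample grows.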

Having a lot of predictors may get you a more precise model for that set of data (higher c-index), but it is not necessarily the most accurate or reproducible in general.  When you get new or additional data, the model with many variables may break down.  Harrell and others have done a lot of simulation work on the impact of over-specifying a model.  If you have lots of data (as we often do in data mining), it is less of an issue.
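For reference, the c-index mentioned above can be computed by hand as the share of (event, non-event) pairs in which the event received the higher predicted probability. The sketch below uses made-up outcomes and scores. One way to see the over-specification problem is to compute it once on the data used for fitting and once on held-out data, and compare the two.

```python
def c_index(y, p):
    """Concordance: fraction of (event, non-event) pairs ranked correctly; ties count half."""
    events = [pi for yi, pi in zip(y, p) if yi == 1]
    nonevents = [pi for yi, pi in zip(y, p) if yi == 0]
    concordant = 0.0
    for pe in events:
        for pn in nonevents:
            if pe > pn:
                concordant += 1.0
            elif pe == pn:
                concordant += 0.5
    return concordant / (len(events) * len(nonevents))

# Hypothetical outcomes and predicted probabilities
y = [1, 0, 1, 0, 0, 1]
p = [0.8, 0.3, 0.6, 0.6, 0.2, 0.4]
print(f"c-index = {c_index(y, p):.4f}")
```

A c-index that is high on the fitting data but drops noticeably on held-out data is the kind of breakdown described here.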

Doc Muhlbaier

Duke

Question
Fluorite | Level 6

Many Thanks Duke! That's really helpful...and what about the size of the sample. For example my population is 100,000 and Cheers Alice

Question
Fluorite | Level 6

Many thanks, Duke! That's really helpful. And what about the size of the sample? For example, my original population is 110,000 and I am only building my model on 7,000. How robust will the model be?

Basically, I am trying to build a WinBack model (customers who are more likely to reactivate). In total there are 110,000 customers who have lapsed (haven't made any transaction in the last 3 months). I have a file of contacted customers (7,500) containing lapsed customers who responded successfully (Winback='Yes') and others who haven't reactivated (Winback='No'). They have only called 7,000 customers to encourage them to come back.

My question: are the 7,000 that I have sufficient to build a winback model?

Doc_Duke
Rhodochrosite | Level 12

Alice,

14,000 is certainly enough to build a model.  It may or may not be informative, and only the validation will help on that.  One concern with 'region' is that it may have a lot of levels, so it may degenerate for some of them.

Another thing that you may be able to get to sharpen your estimates is an estimate of affluence (if you have addresses with ZIP+4, you can link them to census files for mean income).  There is some evidence that winback campaigns work less well for those at the extremes of the socio-economic spectrum.

Doc

Question
Fluorite | Level 6

Hi Doc, thanks for your message. I only have 7,500, not 14,000. Sorry, maybe I didn't explain properly in my previous message. The lapsed population is 110,000 customers, and I am building my model on the 7,500 contacted customers. Many thanks, Alice

Doc_Duke
Rhodochrosite | Level 12

Might still be enough.  Where you can get into sample size trouble is if your categorical variables have lots of levels, as the real determinant is the number of degrees of freedom, not the number of variables per se.
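A back-of-the-envelope sketch of that degrees-of-freedom count, under loudly stated assumptions: the number of levels for region and Acorn group and the 10% response rate are hypothetical, and the "10 to 20 events per degree of freedom" yardstick is a commonly quoted rule of thumb rather than a hard requirement.

```python
def model_df(categorical_levels, n_continuous):
    """Main-effect degrees of freedom: k - 1 per categorical variable, 1 per continuous one."""
    return sum(k - 1 for k in categorical_levels.values()) + n_continuous

# Level counts are hypothetical; the thread does not give them
levels = {"gender": 2, "region": 12, "acorn_group": 6}
n_continuous = 4  # recency, frequency, value, number of products

df = model_df(levels, n_continuous)  # 1 + 11 + 5 + 4 = 21

n = 7500               # contacted customers, from the thread
response_rate = 0.10   # hypothetical winback rate
# The limiting sample size is the count of the rarer outcome
limiting = min(n * response_rate, n * (1 - response_rate))
per_df = limiting / df

print(f"df={df}  limiting outcome count={limiting:.0f}  per df={per_df:.1f}")
```

Five variables turn into 21 degrees of freedom under these assumed level counts, which is why a many-level region variable matters more than the variable count suggests.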


Discussion stats
  • 9 replies
  • 1213 views
  • 3 likes
  • 4 in conversation