Building models with SAS Enterprise Miner, SAS Factory Miner, SAS Visual Data Mining and Machine Learning or just with programming

Minimum number of variables to build a Logistic Regression Model. Help Please. Thank You

Frequent Contributor
Posts: 96

Hi, I am planning to build a logistic regression model, but my original dataset contains only 5 variables:

- Gender
- Region
- Acorn group
- Transactions (Recency, Frequency, Value)
- Number of products

What's the minimum number of variables required to build a decent model? In my previous experience, I would have at least 100 variables, use decision trees to select the most significant ones (the top 10, for example), and then run those through logistic regression with stepwise selection.

Many thanks,
Alice

Super Contributor
Posts: 543

Hi.

I don't believe there is a required minimum number of covariates in a model. I assume that for 'effective' data mining one may prefer a large number of covariates, but since you have only five, you could just run the logistic regression on all five. There's probably no need to mine the data first.

Ehehe, these are my thoughts.

Frequent Contributor
Posts: 96

Posted in reply to AncaTilea

Many thanks, Anca. I will run the logistic regression and see what I get.

Super User
Posts: 10,023

Or run a variable cluster analysis to pick out the group of variables that makes the most sense?

Ksharp

Trusted Advisor
Posts: 2,115

The minimum number of predictors is 0. With that degenerate (intercept-only) model, the logistic regression becomes just a test of whether the two outcome proportions are equal, that is, whether the event probability is 0.5.
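
To make the zero-predictor case concrete, here is a minimal Python sketch (Python is used only for illustration, and the counts are hypothetical, not Alice's data). For an intercept-only model the fit has a closed form: the intercept is the log odds of the overall event rate, and the Wald test of "intercept = 0" is a test of whether that rate is 0.5.

```python
import math

def intercept_only_logit(events, n):
    """Closed-form fit of a logistic model with no predictors.

    Returns the intercept (log odds of the event rate), its standard
    error, and the Wald z statistic for H0: intercept = 0 (rate = 0.5).
    """
    p = events / n
    b0 = math.log(p / (1 - p))               # MLE of the intercept
    se = math.sqrt(1 / (n * p * (1 - p)))    # from Fisher information n*p*(1-p)
    return b0, se, b0 / se

# Hypothetical counts: 1,500 winbacks among 7,500 contacted customers.
b0, se, z = intercept_only_logit(events=1500, n=7500)
print(f"intercept={b0:.3f}  se={se:.4f}  wald_z={z:.1f}")
```

A rate far from 0.5 gives a large |z|, which says nothing yet about which customers respond; that is what the predictors are for.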

If you have one predictor, the logistic regression also has equivalents from Stat 1. If you have a binary outcome and a classification predictor, it is a type of chi-squared test. If the predictor is continuous, it is roughly the same as a z-test. The calculations are slightly different, but the results are asymptotically equivalent.
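
As a rough illustration of the one-predictor case, the sketch below takes a hypothetical 2x2 table (binary predictor by binary outcome, made-up counts). With a single binary predictor the fitted logistic slope is the log odds ratio, so its Wald chi-square can be computed in closed form and compared with the ordinary Pearson chi-square from the same table. The two numbers differ slightly but should be close at this sample size.

```python
import math

def wald_and_pearson(a, b, c, d):
    """Compare logistic Wald chi-square with Pearson chi-square.

    2x2 table layout: rows are the two predictor levels,
    columns are (event, non-event): [[a, b], [c, d]].
    """
    log_or = math.log((a * d) / (b * c))            # logistic slope estimate
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)   # its standard error
    wald_chi2 = (log_or / se) ** 2
    n = a + b + c + d
    pearson_chi2 = (n * (a * d - b * c) ** 2
                    / ((a + b) * (c + d) * (a + c) * (b + d)))
    return log_or, wald_chi2, pearson_chi2

# Hypothetical counts: level A has 400 winbacks / 3,100 not,
# level B has 350 winbacks / 3,650 not.
log_or, wald, pearson = wald_and_pearson(400, 3100, 350, 3650)
print(f"log OR={log_or:.3f}  Wald chi2={wald:.2f}  Pearson chi2={pearson:.2f}")
```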

Having a lot of predictors may get you a more precise model for that set of data (higher c-index), but it is not necessarily the most accurate or reproducible in general.  When you get new or additional data, the model with many variables may break down.  Harrell and others have done a lot of simulation work on the impact of over-specifying a model.  If you have lots of data (as we often do in data mining), it is less of an issue.
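
For reference, the c-index mentioned above is the share of (event, non-event) pairs in which the event got the higher predicted score, with ties counted as half. A minimal sketch, using made-up scores and a simple O(n^2) loop rather than anything optimized:

```python
def c_index(scores, outcomes):
    """Concordance index for a binary outcome (1 = event, 0 = non-event)."""
    events = [s for s, y in zip(scores, outcomes) if y == 1]
    nonevents = [s for s, y in zip(scores, outcomes) if y == 0]
    concordant = 0.0
    for e in events:
        for ne in nonevents:
            if e > ne:
                concordant += 1.0    # event ranked above non-event
            elif e == ne:
                concordant += 0.5    # tie counts as half
    return concordant / (len(events) * len(nonevents))

# Hypothetical predicted probabilities and observed outcomes.
example = c_index([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
print(example)
```

Computing it once on the data used to fit the model and once on held-out data is one simple way to see the breakdown described above: an over-specified model tends to show a noticeably lower value on the held-out set.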

Doc Muhlbaier

Duke

Frequent Contributor
Posts: 96

Many Thanks Duke! That's really helpful...and what about the size of the sample. For example my population is 100,000 and Cheers Alice

Frequent Contributor
Posts: 96

Many thanks, Duke! That's really helpful. What about the size of the sample? For example, my original population is 110,000 and I am only building my model on 7,000. How robust will the model be?

Basically, I am trying to build a winback model (customers who are more likely to reactivate). In total there are 110,000 customers who have lapsed (no transaction in the last 3 months). I have a file of contacted customers (7,500) containing lapsed customers who responded successfully (Winback='Yes') and others who haven't reactivated (Winback='No'). Only 7,000 customers were called to encourage them to come back. My question: are the 7,000 that I have sufficient to build a winback model?

Trusted Advisor
Posts: 2,115

Alice,

14,000 is certainly enough to build a model. It may or may not be informative, and only validation will tell you that. One concern with 'region' is that it may have a lot of levels, so the estimates may degenerate for levels that contain few customers.
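
One quick way to check the 'region' concern before fitting is to tabulate outcomes by level and flag any level with very few events or very few non-events. The sketch below does that in Python; the region names, the data, and the `min_count=5` cutoff are all hypothetical choices for illustration.

```python
from collections import defaultdict

def sparse_levels(levels, outcomes, min_count=5):
    """Return levels whose event or non-event count is below min_count.

    levels   : one category value per customer (e.g. region)
    outcomes : matching 0/1 values (1 = winback)
    """
    counts = defaultdict(lambda: [0, 0])    # level -> [non-events, events]
    for level, y in zip(levels, outcomes):
        counts[level][y] += 1
    return sorted(level for level, (no, yes) in counts.items()
                  if no < min_count or yes < min_count)

# Hypothetical data: two well-populated regions and one tiny one
# in which nobody reactivated.
regions = ["North"] * 20 + ["South"] * 20 + ["Islands"] * 6
winback = [1, 0] * 10 + [1, 0] * 10 + [0] * 6
flagged = sparse_levels(regions, winback)
print(flagged)
```

Flagged levels are candidates for merging with a neighbouring level or an "other" group before the model is fit.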

Another thing you may be able to get to sharpen your estimates is a measure of affluence (if you have addresses with ZIP+4, you can link them to census files for mean income). There is some evidence that winback campaigns work less well for those at the extremes of the socio-economic spectrum.

Doc

Frequent Contributor
Posts: 96

Hi Doc, thanks for your message. I only have 7,500, not 14,000. Sorry, maybe I didn't explain it properly in my previous message. The lapsed population is 110,000 customers, and I am building my model on the 7,500 contacted customers. Many thanks, Alice

Trusted Advisor
Posts: 2,115

Might still be enough. Where you can get into sample size trouble is if your categorical variables have lots of levels, as the real determinant is the number of degrees of freedom, not the number of variables per se.
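
A back-of-the-envelope version of that degrees-of-freedom count, as a Python sketch. The level counts and the number of winbacks are hypothetical, since the thread does not give them. It assumes each k-level categorical variable costs k-1 degrees of freedom and each continuous variable entered linearly costs 1. A commonly quoted rule of thumb (not a guarantee) is to want on the order of 10 to 20 of the rarer outcome per candidate degree of freedom.

```python
def model_degrees_of_freedom(categorical_levels, n_continuous):
    """k-level categoricals cost k-1 df each; linear continuous terms cost 1."""
    return sum(k - 1 for k in categorical_levels.values()) + n_continuous

def events_per_df(n_events, n_nonevents, df):
    """Size of the rarer outcome class per model degree of freedom."""
    return min(n_events, n_nonevents) / df

# Hypothetical level counts for the categorical variables.
levels = {"gender": 2, "region": 12, "acorn_group": 6}
# Recency, frequency, value, number of products entered as linear terms.
df = model_degrees_of_freedom(levels, n_continuous=4)

# Hypothetical split of the 7,500 contacted customers.
epd = events_per_df(n_events=900, n_nonevents=6600, df=df)
print(f"df={df}  events per df={epd:.1f}")
```

If 'region' had, say, 100 levels instead of 12, the same count would jump by 88 degrees of freedom, which is where the trouble described above starts.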
