Building models with SAS Enterprise Miner, SAS Factory Miner, SAS Visual Data Mining and Machine Learning or just with programming

500 Responders only? Is it sufficient to build a Logistic Regression model? Thank you

Frequent Contributor
Posts: 95

Hi,

I would like to build a logistic regression model, but I have only 500 responders and 3 million non-responders. I have always been told that at least 1,000 responders are needed for a decent model. Is there a way around this? Would it work to generate a number of random samples containing the 500 responders and stack them in the modelling dataset to reach 1,000 responders (similar to bootstrap sampling)?

• Say I create random 20% samples (i.e. select 20% of the universe). With a conversion rate of about 0.02%, I will need about 10 random samples to reach a response count of 1,000.

• For each sample I will need to create distinct dummy address IDs for the responders, e.g. address_id||1 for the 1st sample and address_id||2 for the 2nd sample.

• That way I will get a pool of about 1,000 responders.

• Then I randomly select 1,000 non-responders.
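A quick check of the arithmetic in the steps above, as a rough sketch. It assumes the 500 responders are spread evenly across the random samples; the variable names are illustrative only.

```python
# Rough arithmetic for the stacking scheme described above.
responders = 500      # responders in the full universe (from the post)
sample_pct = 20       # each random sample takes 20% of the universe
target = 1000         # desired number of responder rows

# Expected responders captured by one 20% sample.
per_sample = responders * sample_pct // 100

# Samples needed to reach the target (ceiling division).
samples_needed = -(-target // per_sample)

# Stacking samples repeats the same people, so the number of
# distinct responders can never exceed the original 500.
distinct_responders = min(responders, per_sample * samples_needed)

print(per_sample, samples_needed, distinct_responders)
```

The stacked dataset would hold about 1,000 responder rows, but they would still come from the same 500 responders.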

Your help would be much appreciated.

Many Thanks

Super User
Posts: 17,819

See some of the Suggested Answers in the MORE LIKE THIS section on the right hand side of your question.

There is a rule of thumb for the ratio of responders to the number of variables. Bayesian estimates, resampling and simulation methods are also options.
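One commonly quoted form of that rule of thumb is roughly 10 events (responders) per candidate predictor. The figure of 10 is an assumption here, not something stated in this thread, and the helper name is made up for illustration.

```python
def max_predictors(n_events: int, events_per_variable: int = 10) -> int:
    """Rough upper bound on candidate predictors under an events-per-variable rule.

    events_per_variable=10 is a commonly quoted guideline, assumed here.
    """
    if events_per_variable <= 0:
        raise ValueError("events_per_variable must be positive")
    return n_events // events_per_variable

# With the 500 responders from the question:
print(max_predictors(500))
```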

PS. I've never heard of a 1,000-responder rule in 10 years of stats, so perhaps it is specific to your field?

Frequent Contributor
Posts: 130

I've always approached logistic regression in terms of the response rate rather than the total number of responses. For instance, if you have a data set of 100,000 observations and 10,000 of them have a response (i.e. a 10% response rate), you're just fine. However, if you have a data set of 100,000 observations and only 100 have a response (i.e. a 0.1% response rate), then regular logistic regression is a problem because the maximum likelihood estimates suffer a degree of bias.

This is a good article by Paul Allison, whose work I really like (I own a couple of his books on logistic regression). It discusses techniques to use in this case.

http://statisticalhorizons.com/logistic-regression-for-rare-events

Good Luck!

Super User
Posts: 17,819

Paul Allison - good reference for stats and SAS :)

Occasional Contributor
Posts: 7

http://gking.harvard.edu/files/gking/files/0s.pdf  This seems to be the key paper referred to by Paul Allison.

He was right...it is subject to misinterpretation...as are all academic stats papers when used to answer sampling bias questions.

Graphical visualizations can certainly help, as can ad hoc help systems in software.

How can JMP visualize the bias issues and also define the sampling complexity scenario in simpler terms perhaps with case examples?

New Contributor
Posts: 3

I would suggest oversampling (select all the responders and a subset of the non-responders) and then correcting for the bias introduced by oversampling.
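A minimal sketch of one standard way to make that correction: shifting the intercept of the model fitted on the oversampled data, the "prior correction" described in the King and Zeng paper linked earlier in this thread. It assumes sampling depended only on the response, so only the intercept needs adjusting. The function names and the example sample of 5,000 non-responders are illustrative assumptions.

```python
import math

def corrected_intercept(b0_sample: float, pop_rate: float, sample_rate: float) -> float:
    """Shift an intercept fitted on an oversampled dataset back to the population scale.

    pop_rate    : fraction of responders in the full population (tau)
    sample_rate : fraction of responders in the modelling sample (y-bar)
    """
    offset = math.log(((1 - pop_rate) / pop_rate) * (sample_rate / (1 - sample_rate)))
    return b0_sample - offset

def predict_prob(intercept: float, linear_part: float = 0.0) -> float:
    """Logistic probability for a given intercept plus x'beta."""
    return 1.0 / (1.0 + math.exp(-(intercept + linear_part)))

# Hypothetical example: all 500 responders plus 5,000 sampled non-responders.
pop_rate = 500 / 3_000_500
sample_rate = 500 / 5_500

# Intercept of an intercept-only model fitted on the oversampled data.
b0_sample = math.log(sample_rate / (1 - sample_rate))
b0 = corrected_intercept(b0_sample, pop_rate, sample_rate)

# Uncorrected vs corrected predicted response probability.
print(predict_prob(b0_sample), predict_prob(b0))
```

In this intercept-only case the corrected prediction returns to the population response rate, which is a useful sanity check before applying the same shift to a model with predictors.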

Super User
Posts: 9,676

Oh, that is a good reason to use Poisson regression.

Take a look at the logistic link function:

log(p/(1-p)). If p ≈ 0, then 1-p ≈ 1 and this reduces to log(p), which is exactly the Poisson regression link function.
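A small numeric check of that approximation: the gap between the logit and the log link is -log(1-p), which is roughly p when p is small. The probabilities below are just example values.

```python
import math

def logit(p: float) -> float:
    """Log-odds of p."""
    return math.log(p / (1 - p))

for p in (0.1, 0.01, 0.0002):
    gap = logit(p) - math.log(p)  # equals -log(1 - p)
    print(f"p={p}: logit={logit(p):.4f}, log={math.log(p):.4f}, gap={gap:.6f}")
```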

Or use a negative binomial distribution.

Check PROC GENMOD; you can use both of these distributions.

BTW, there is an EXACT statement in PROC LOGISTIC that you can use for small-sample data.

Also consider the Monte Carlo method, which is also available in PROC LOGISTIC.

Xia Keshan
