07-09-2015 10:59 AM
I would like to build a logistic regression model, but I only have 500 responders and 3 million non-responders! I have always been told that we need at least 1,000 responders to have a decent model. Is there any solution for this? Would it work to generate a number of random samples from the 500 responders and add them to the modelling dataset to get to 1,000 responders (similar to bootstrap sampling)?
• Say I create random 20% samples (i.e. select 20% of the universe). With a conversion rate of 0.02%, I will need about 10 random samples to get to a response count of 1,000
• For each sample I will need to create different dummy address ids for the responders, e.g. actual address_id||1 (for the 1st sample) and address_id||2 (for the 2nd sample)
• That way I will get a pool of about 1000 responders
• And then randomly select 1000 non-responders.
Your help would be much appreciated.
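A rough sketch of what the plan above would produce, assuming each 20% sample is drawn independently from the same 500 responders. The integer ids, the `-` separator in the dummy id, and the fixed seed are placeholders for illustration, not part of the original plan. It counts the pooled responder rows and how many distinct responders stand behind them:

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

N_RESPONDERS = 500      # from the post
SAMPLE_FRACTION = 0.20  # each random sample takes 20% of the universe
N_SAMPLES = 10          # samples proposed to reach about 1,000 responder rows

responder_ids = range(N_RESPONDERS)  # stand-ins for the real address ids

pooled_rows = []
for sample_no in range(1, N_SAMPLES + 1):
    for rid in responder_ids:
        # Each responder falls into a given 20% sample with probability 0.2.
        if random.random() < SAMPLE_FRACTION:
            dummy_id = f"{rid}-{sample_no}"  # address id plus sample number
            pooled_rows.append((rid, dummy_id))

total_rows = len(pooled_rows)
unique_responders = len({rid for rid, _ in pooled_rows})

print("pooled responder rows:", total_rows)
print("distinct responders:  ", unique_responders)
```

The pooled row count lands near 1,000, but the number of distinct responders can never exceed the original 500, so the dummy ids relabel repeated rows rather than add new observations.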
07-09-2015 12:11 PM
See some of the Suggested Answers in the MORE LIKE THIS section on the right hand side of your question.
Options include a rule of thumb for the number of responders per variable, Bayesian estimation, resampling, and simulation methods.
PS. I've never heard of a rule of 1,000 responders in 10 years of stats, so perhaps it is specific to your field?
07-09-2015 12:34 PM
I've always approached logistic regression based on the response rate and not the total number of responses. For instance, if you have a data set of 100,000 observations and 10,000 of them have a response (i.e. a 10% response rate), you're just fine. However, if you have a data set of 100,000 observations and only 100 have a response (i.e. a 0.1% response rate), then regular logistic regression becomes a problem, because the maximum likelihood estimates suffer from a degree of bias.
This is a good article from Paul Allison, whose work I really like (I own a couple of his books on logistic regression). It discusses techniques to use in this case.
07-09-2015 02:31 PM
http://gking.harvard.edu/files/gking/files/0s.pdf This seems to be the key paper referred to by Paul Allison.
He was right... it is subject to misinterpretation, as are all academic stats papers when used to answer sampling bias questions.
Graphical visualizations can certainly help, as can ad hoc help systems in software.
How can JMP visualize the bias issues and also define the sampling complexity scenario in simpler terms, perhaps with case examples?
07-10-2015 05:03 AM
I would suggest oversampling (select all the responders and only part of the non-responders) and later correcting for the bias due to oversampling.
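One common way to do that correction is the intercept "prior correction" described in the King and Zeng paper linked above. Below is a minimal sketch, assuming non-responders are subsampled purely at random (selection depends only on the outcome) and that the true population response rate is known. The function names and the 500-versus-500 modelling sample are made up for illustration:

```python
import math

def intercept_offset(pop_rate, sample_rate):
    """Amount to subtract from the intercept fitted on the oversampled data."""
    return math.log(((1 - pop_rate) / pop_rate) * (sample_rate / (1 - sample_rate)))

def corrected_probability(p_sample, offset):
    """Map a probability predicted on the oversampled data back to the population scale."""
    logit = math.log(p_sample / (1 - p_sample))
    return 1 / (1 + math.exp(-(logit - offset)))

# Figures from the original question: 500 responders, 3 million non-responders.
pop_rate = 500 / 3_000_500
# Hypothetical modelling sample: all 500 responders plus 500 non-responders.
sample_rate = 500 / 1_000

offset = intercept_offset(pop_rate, sample_rate)
p_back = corrected_probability(0.5, offset)

print("intercept offset:", round(offset, 4))
print("sample-scale 0.5 maps back to:", p_back)
```

Only the intercept is shifted. Under the random-subsampling assumption the slope estimates are left as fitted.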
07-10-2015 08:05 AM
That is a good reason to use Poisson regression.
Take a look at the logistic link function:
log(p/(1-p)). If p ≈ 0, this is approximately log(p), which is exactly the Poisson regression link function.
Or use the negative binomial distribution.
Check PROC GENMOD; you can use both of these distributions.
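A quick numeric check of the link-function argument, as a sketch. The probabilities are illustrative, with the last one being the response rate from the original question. The gap between the two links is logit(p) - log(p) = -log(1 - p), which shrinks toward zero as p gets small:

```python
import math

def logit(p):
    """Logistic link: log(p / (1 - p))."""
    return math.log(p / (1 - p))

rows = []
for p in (0.1, 0.01, 500 / 3_000_500):
    gap = logit(p) - math.log(p)  # equals -log(1 - p)
    rows.append((p, logit(p), math.log(p), gap))
    print(f"p={p:.6f}  logit={logit(p):.5f}  log={math.log(p):.5f}  gap={gap:.6f}")
```

At the response rate in this thread the two links differ by less than 0.0002 on the log scale.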
BTW, there is an EXACT statement in PROC LOGISTIC, which you can use for small-sample data.
Also consider using the Monte Carlo method, which is also available in PROC LOGISTIC.