08-17-2015 04:12 PM
I am building a marketing model based on logistic regression. It's a customer attrition model. The event rate is very low, i.e. 0.1%. I have more than 1000 predictors. I know there is a rule of thumb: a minimum of 10 events per predictor. Does this rule apply before dimensionality reduction (feature extraction) with PCA and Information Value? Should I apply it to my original 1500 variables, or only to the significant variables that remain after applying variable selection techniques such as stepwise regression, PCA, etc.?
08-17-2015 04:23 PM
I understand this is a commonly asked question, but no one has clarified the background of this rule. Does the rule account for correlated predictors? Should it be applied before removing multicollinearity, or after removing collinearity and performing feature extraction?
08-17-2015 04:41 PM
08-17-2015 05:26 PM
My nickel - and it's my [somewhat] educated opinion.
I would argue that the 10-per-predictor rule of thumb isn't always valid; it depends on the variability of the variables being measured.
If you're not using the event (the outcome) in your dimensionality reduction and variable selection, I would argue the 'rule' applies to the variables that remain after reduction.
If you are using it, then the rule applies to the original variables.
08-18-2015 08:39 AM
According to some statistical experts, you need to run exact logistic regression. Check the EXACT statement in PROC LOGISTIC, if I remember correctly.
08-18-2015 09:45 AM
Most variable selection techniques start by evaluating all one-variable models, then trying to add a second variable, a third variable, and so forth. If you want to conform to the 10-events-per-predictor rule, then you should not try to build models that have more than NumEvents / 10 predictors. For example, if you have 51 events, you could limit the selection algorithm to consider only models that have up to 5 continuous variables.
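As a rough illustration of the bound described above, here is a minimal Python sketch. The function name `max_predictors` and the default of 10 events per predictor are assumptions made for this example; they are not part of any SAS procedure.

```python
def max_predictors(num_events: int, events_per_predictor: int = 10) -> int:
    """Largest model size allowed by the events-per-predictor rule of thumb.

    Hypothetical helper for illustration: it simply computes
    floor(num_events / events_per_predictor).
    """
    if num_events < 0:
        raise ValueError("num_events must be non-negative")
    if events_per_predictor <= 0:
        raise ValueError("events_per_predictor must be positive")
    return num_events // events_per_predictor


if __name__ == "__main__":
    # 51 events with the 10-events-per-predictor rule
    print(max_predictors(51))  # 5
```

The result would then serve as an upper limit on the number of effects the selection algorithm is allowed to put in a model.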
08-18-2015 10:51 AM
Thanks Xia and Rick for your replies. I am aware of exact and Firth logistic regression. I am curious to know the background of this rule.
@ Rick - Suppose I have 1000 predictors in my model. Do you mean that it requires at least 10k events before correcting for multicollinearity and doing feature extraction? I understand I can ignore this rule if I apply unsupervised learning (e.g. PCA or PROC VARCLUS), since those methods do not use the dependent variable. I am more curious about supervised methods for extracting important variables. By supervised methods, I mean the 'Information Value' and 'Chi-Square' methods. Does the model need to have sufficient events for that feature extraction step? Otherwise the feature extraction would be biased. Correct?
08-18-2015 11:34 AM
Model building from 1000 predictors, using 'supervised methods' will be biased. The question is how biased, and will the model adequately predict future data. It is well known that naive methods lead to problematic results with standard regression models (stepwise, backward, forward, all possible subsets). See Flom and Cassell's paper on Stopping Stepwise http://www.lexjansen.com/pnwsug/2008/DavidCassell-StoppingStepwise.pdf
The problem is exacerbated for logistic regression. However, PROC HPGENSELECT in SAS/STAT 14.1 does offer selection=LASSO, which gets around a lot of the difficulties with the other methods. Still, consider the result of putting things on a logit link, and what might happen with fewer than 10 events per predictor. You are going to have some points with very small logits that have a lot of influence on the fit.
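To get a feel for the logit scale mentioned above, here is a small Python sketch that evaluates the logit, log(p / (1 - p)), at a few probabilities, including the 0.1% event rate from this thread. The helper name `logit` is just for this example, and the sketch only shows where rare-event probabilities sit on the link scale; it does not measure influence.

```python
import math


def logit(p: float) -> float:
    """Log-odds of a probability p, defined for 0 < p < 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    return math.log(p / (1.0 - p))


if __name__ == "__main__":
    # Probabilities near 0 map to large negative logits.
    for p in (0.5, 0.1, 0.01, 0.001):
        print(f"p = {p:<6} logit = {logit(p):.3f}")
```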
08-18-2015 11:40 AM
No, I said that if you apply this rule, then in going from 1,000 potential explanatory variables to the k that you want in your final model, that the (number of events)/10 will bound the value for k.
08-18-2015 12:20 PM
Thanks Steve and Rick. @ Rick - "then in going from 1,000 potential explanatory variables to the k that you want in your final model, that the (number of events)/10 will bound the value for k." - Would each of these 1000 variables have enough events to assess its variable importance? I suspect univariate analysis of these variables against the dependent variable would fail. I am sorry to bug you again.
08-18-2015 01:25 PM
Here is a concrete example. Suppose in your training dataset you have 10,000 records with an event rate of 0.1%. That would be 10 events. Using the bounded value for k of events/10, you could adequately fit 1 variable to the data. If you had 20,000 records with the same event rate, you could adequately fit 2 variables, and so forth.
Of course, you will need additional records to validate your model against.
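The arithmetic in this example can be sketched in Python as follows. The function names and the default of 10 events per predictor are assumptions for illustration, and the counts cover training records only, not the additional records needed for validation.

```python
import math
from fractions import Fraction


def max_predictors_for(num_records: int, event_rate, events_per_predictor: int = 10) -> int:
    """Bound on k: floor(expected events / events_per_predictor)."""
    rate = Fraction(str(event_rate))  # exact arithmetic, e.g. 0.001 -> 1/1000
    return math.floor(num_records * rate / events_per_predictor)


def records_needed(num_predictors: int, event_rate, events_per_predictor: int = 10) -> int:
    """Training records needed so that events / events_per_predictor >= k."""
    rate = Fraction(str(event_rate))
    return math.ceil(num_predictors * events_per_predictor / rate)


if __name__ == "__main__":
    print(max_predictors_for(10_000, 0.001))  # 1
    print(max_predictors_for(20_000, 0.001))  # 2
    print(records_needed(2, 0.001))           # 20000
```

`Fraction` is used here only to avoid floating-point rounding when the counts are floored or rounded up.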
08-18-2015 02:27 PM
Thank you so much Steve for being so patient in replying to this thread. :-) My question still stands after your explanation. I understand I can fit only 2 variables with 20k records and an event rate of 0.1%. My question: can I perform INITIAL feature extraction (selecting important variables with supervised methods) to come up with 2 FINAL significant variables? Or do I need more events to perform the initial feature extraction step?
08-18-2015 03:41 PM
I think my original comment stands: if the feature extraction doesn't depend on the outcome, you can use the derived features as your variables. So you can use 2 derived features with 20K records and an event rate of 0.1%.