Hi Doug,

Thanks a ton for your input. To summarize my understanding of your thoughts:

1) PROC LOGISTIC (conventional MLE estimates) and PROC LOGISTIC (Firth's penalized maximum likelihood): not a viable option, per your comment quoted below, and I fully agree. Still, if I did go with this approach, would a 70:30 split between TRAIN and IN-TIME VALIDATION be advisable given the low response rate of 0.6%, or would you recommend only TRAIN and OUT-OF-TIME VALIDATION?

"I tend to think in terms of how many actual events I have rather than how many total records I have. If I have 100 events and a 100 non-events, I have 200 observations. If I have 100 events and 99,900 non-events, do I really have that much more information? The signal in that case is so low (0.1%) that it would be difficult to have much confidence in any fitted model."

2) PROC LOGISTIC (oversampled rate of 5.77%): splitting into TRAIN and IN-TIME VALIDATION is not recommended, per your comment quoted below, and I fully agree. Since in-time validation is not recommended, is OUT-OF-TIME VALIDATION the only option for testing the model in this scenario?

"The total number of events (374) is so low that I would consider not even splitting the data in this situation. Data Mining methods like partitioning assume that there are sufficient observations to represent the population in every partitioned data set. Splitting 70/30 leaves barely over 100 events in validation. Your data set might be better handled by classical statistical approaches given the limited data available."

3) PROC LOGISTIC (oversampled rate of 5.77%): apart from using a decision tree to understand the data better, do you have any other suggestions for classical statistical approaches given the limited data available?

Thanks,
Surajit