Appropriate to use firth method in proc logistic for rare events?

lavernal · Posted 02-07-2013 11:26 PM

Hi,

I am trying to fit a logistic regression, but the event is rare (~0.07%) in a total sample of 200,000+ observations. I understand that one approach is stratified sampling, but I have also read that the Firth method is an option (Logistic Regression for Rare Events | Statistical Horizons).

Can I check whether the Firth method is appropriate for rare events?

ranjan_mitre_org · Posted 04-08-2014 09:58 AM

You might want to check out the paper by King and Zeng, "Logistic Regression in Rare Events Data", which addresses the rare-events problem and also cites Firth's paper. I would be interested to know how you have progressed with modeling the rare data, as I have a similar, extremely rare-events dataset to process.

SAS_VA_Learner · Posted 07-26-2018 07:04 PM

Hi,

Let me explain my situation:

1) I have a dataset with a response rate of 0.6% (374 events in a total of 61279 records), and I need to build a logistic regression model on it.

2) Option 1: I can go with PROC LOGISTIC (conventional maximum likelihood), since the rule of thumb that "you should have at least 10 events for each parameter estimated" should hold, given that I start my model-building iterations with no more than 35 variables and finalize the model with fewer than 10.

Please let me know whether PROC LOGISTIC (conventional ML) is still recommended if I start the model build with more than 35 predictors, given that I may have to collapse some categorical levels to avoid quasi-complete or complete separation, and given the rule of thumb of at least 10 events per estimated parameter.

3) Option 2: I can go with PROC LOGISTIC using Firth's penalized likelihood method. The Firth method could help reduce any small-sample bias in the estimators.

Please let me know whether PROC LOGISTIC with Firth's penalized likelihood is recommended if I start the model build with more than 35 predictors, on the understanding that I would NOT have to collapse any categorical levels to avoid quasi-complete or complete separation.

4) Option 3: If neither of the two options above is recommended, the last option is to oversample the rare events. Since the total number of events (374) and the total number of records (61279) are both small enough not to pose any challenge for computing time or hardware, I would go with an oversampling rate of only about 5% (6487 records to model), because I want to keep as many non-event records as possible: with a higher oversampling rate, the total number of records available for modeling drops below 6487.
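As a quick check on the rule of thumb quoted in Options 1 and 2, here is a minimal Python sketch of the events-per-parameter arithmetic using the counts from the post. It assumes, for illustration only, that each predictor contributes a single parameter (a categorical predictor with k levels would contribute k-1) and that a 70:30 split keeps about 70% of the events in the training data. The function name `events_per_parameter` is hypothetical.

```python
# Events-per-parameter check using the counts quoted in the post.
EVENTS = 374          # events in the full dataset (from the post)
TRAIN_SHARE = 0.70    # assumed share of events that land in TRAIN

def events_per_parameter(n_events: float, n_params: int) -> float:
    """Events available per estimated parameter (intercept not counted)."""
    return n_events / n_params

for n_params in (35, 10):
    full = events_per_parameter(EVENTS, n_params)
    train = events_per_parameter(EVENTS * TRAIN_SHARE, n_params)
    print(f"{n_params} parameters: {full:.1f} events each on all data, "
          f"{train:.1f} on the training share")
```

Under these assumptions, 35 single-parameter predictors sit just above 10 events per parameter on the full data but fall below it once only the training share of events is counted, so the starting variable count is the tight spot rather than the final model.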

My thoughts on Option 1, Option 2 and Option 3 are given below:

-- With a 5.77% oversampling rate, there are 374 events and 6113 non-events, a total of 6487 records. With a 70:30 split between TRAIN and VALIDATION, I can build my model on 4541 records and perform in-time validation on 1946 records.

-- By comparison, under Option 1 or Option 2, a 70:30 split between TRAIN and VALIDATION lets me build my model on 42896 records and perform in-time validation on 18383 records.
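The record counts in the two points above can be reproduced with a short Python sketch. It assumes the TRAIN size is 70% of the records rounded up, which is a guess that happens to match the figures in the post; the helper name `split_70_30` is hypothetical.

```python
# Reproduce the sample composition and 70:30 split sizes quoted in the post.
EVENTS = 374
NON_EVENTS_SAMPLED = 6113      # non-events kept under Option 3
ALL_RECORDS = 61279            # full dataset used under Options 1 and 2

def split_70_30(n_records: int) -> tuple[int, int]:
    """Return (train, validation) sizes; TRAIN is 70% rounded up (assumption)."""
    train = (n_records * 70 + 99) // 100   # integer ceiling of 0.70 * n
    return train, n_records - train

sampled_total = EVENTS + NON_EVENTS_SAMPLED
event_share = EVENTS / sampled_total       # event fraction in the sampled data

print(f"Option 3: {sampled_total} records, event share {event_share:.2%}, "
      f"split {split_70_30(sampled_total)}")
print(f"Options 1/2: {ALL_RECORDS} records, split {split_70_30(ALL_RECORDS)}")
```

Note that the number of events is the same 374 in both cases; the sampled dataset differs only in how many non-events it keeps.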

Which of Option 1, Option 2 or Option 3 is recommended in my case? If Option 3, is it recommended to use an oversampling rate of 2% or 3% instead, in order to increase the number of records to be modeled to something above 6487?
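To make the 2% versus 3% question concrete, the sketch below computes how many records a given event rate implies when all 374 events are kept and only non-events are sampled. It is illustrative arithmetic only, not a recommendation of a rate, and the function name `total_records` is hypothetical.

```python
# Total sample size implied by a target event rate, keeping all events.
EVENTS = 374
NON_EVENTS_AVAILABLE = 61279 - 374   # non-events in the full dataset

def total_records(events: int, event_pct: float) -> int:
    """Records needed so that `events` make up `event_pct` percent of the sample."""
    return round(events * 100 / event_pct)

for pct in (2, 3, 5):
    total = total_records(EVENTS, pct)
    non_events = total - EVENTS
    # The sample cannot use more non-events than the dataset contains.
    assert non_events <= NON_EVENTS_AVAILABLE
    print(f"{pct}% event rate: {total} records ({non_events} non-events)")
```

A lower target rate keeps more non-events, so both 2% and 3% yield more records than the 6487 obtained at 5.77%.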

Thanks
Surajit
