Turn on suggestions

Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

- Home
- /
- Analytics
- /
- Stat Procs
- /
- Re: Appropriate to use firth method in proc logistic for rare events?

Options

- RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Posted 02-07-2013 11:26 PM
(2511 views)

Hi,

I am trying to perform logistic regression but am facing rare events (~0.07%) out of a total sample of 200,000+ observations. I understand that one method is to perform stratified sampling. But I also read that Firth method is possible too? (Logistic Regression for Rare Events | Statistical Horizons)

Can I check if Firth method is appropriate for rare events?

2 REPLIES 2

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Hi,

Let me explain my situation :

1) I have a dataset - where the response rate is 0.6% (374 events in a total of 61279 records) and I need to build a logistic regression model on this dataset.

2) Option 1 : I can go with PROC LOGISTIC (conventional Maximum Likelihood) as the thumb rule " that you should have at least 10 events for each parameter estimated" should hold good considering that I start my model build iteration with not more than 35 variables and finalize the model build with less than 10 variables.

Please do let me know if I have more than 35 predictors initially to start the model build process and if it is recommended to use PROC LOGISTIC (conventional ML) with the understanding that I may have to do certain categorical level collapses to rule out cases of quasi complete separation/ complete separation and considering the thumb rule " that you should have at least 10 events for each parameter estimated" ?

3) Option -2 : I can go with PROC LOGISTIC (Firth's Method using Penalized Likelihood) - The Firth method could be helpful in reducing any small-sample bias of the estimators.

Please do let me know if I have more than 35 predictors initially to start the model build process and if it is recommended to use PROC LOGISTIC (Firth's Method using Penalized Likelihood) with the understanding that I DO NOT have to do any categorical level collapses to rule out cases of quasi complete separation/ complete separation ?

4) Option -3 : If the above 2 options is not recommended , then the last option is to use the strategy for Over sampling of rare events. As the total number of events-374 and total records-61279 are both quite less with regards to posing any challenges on computing time or on hardware, I would obviously go with a oversampling rate of 5% only (Number of records to be modelled=6487) as I want to consider as many non-event records as possible as if I go for oversampling rate above 5% , the total number of records that can be modeled is less than 6487 .

My thoughts on Option-1,Option-2 or Option-3 as given below :

-- With a 5.77 oversampled rate, number of events = 374 and number of non-events=6113, a total of 6487 records. with a 70:30 split between TRAIN and VALIDATION , I can build my model on 4541 records and perform intime validation on 1946 records.

-- Comparing to Option-1 and Option-2, with a 70:30 split between TRAIN and VALIDATION , I can build my model on 42896 records and perform intime validation on 18383 records.

Regarding Option-1,Option-2 or Option-3 , Please do help me with which option is recommended for me - Option-1,Option-2 or Option-3 in my case ? If Option-3, then is it recommended to use a oversampling rate of either 2% or 3% in order to increase the number of records to be modeled to something above 6487 ?

Thanks

Surajit

Are you ready for the spotlight? We're accepting content ideas for **SAS Innovate 2025** to be held May 6-9 in Orlando, FL. The call is **open **until September 25. Read more here about **why** you should contribute and **what is in it** for you!

What is ANOVA?

ANOVA, or Analysis Of Variance, is used to compare the averages or means of two or more populations to better understand how they differ. Watch this tutorial for more.

Find more tutorials on the SAS Users YouTube channel.