SAS_VA_Learner
Fluorite | Level 6

Hi,

 

Let me explain my situation: I have a dataset where the response rate is 0.6% (374 events in a total of 61279 records), and I need to build a logistic regression model on this dataset. With these 61279 records, I have the option of splitting them in a 70:30 ratio between TRAIN (to develop the model) and IN-TIME VALIDATION (to validate the model). I work for a client and do not have the option of obtaining any data beyond these 61279 records, so an OUT-OF-TIME VALIDATION is not possible for me.

 

I am listing a couple of options below along with my questions about them, and would welcome different opinions/thoughts/recommendations.

 

1) Option 1: Mine is a rare-events case, so I can follow the oversampling technique for rare events. To keep the maximum number of records for building the model, the most logical approach seems to be an oversampled event rate of 5.77%, which gives me a total of 6487 records containing all 374 events.

 

As I cannot obtain extra data for OUT-OF-TIME VALIDATION, at the 5.77% oversampled event rate my dataset will have a total of 6487 records. Is it recommended to go for a 70:30 split between TRAIN and IN-TIME VALIDATION, so that I TRAIN the model on 4541 records and perform IN-TIME VALIDATION on 1946 records, or should I TRAIN the model on all 6487 records?
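In case it helps to see what I mean, here is a minimal sketch of how I would assemble that 5.77% file, assuming a hypothetical input dataset WORK.FULL with a 0/1 target variable RESPONSE (both names are placeholders):

/* Keep all 374 events and draw 6113 non-events at random, giving */
/* 6487 records with an event rate of about 5.77%.                */
data events nonevents;
   set work.full;
   if response = 1 then output events;
   else output nonevents;
run;

proc surveyselect data=nonevents out=nonevent_sample
                  method=srs sampsize=6113 seed=12345;
run;

data work.oversampled;
   set events nonevent_sample;
run;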

 

2) Option 2: This option does NOT use oversampling of rare events, since I cannot obtain extra data for OUT-OF-TIME VALIDATION.

This option uses PROC LOGISTIC (maximum likelihood). I then have the option of a 70:30 split between TRAIN and IN-TIME VALIDATION, so that I TRAIN the model on 42896 records and perform IN-TIME VALIDATION on 18383 records.

 

-- As I am not oversampling the rare events and also cannot obtain extra data for OUT-OF-TIME VALIDATION, please let me know whether it is advisable to TRAIN the model on all 61279 records with no IN-TIME VALIDATION, or to TRAIN the model on 42896 records and perform IN-TIME VALIDATION on the remaining 18383 records.
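To make Option 2 concrete, this is roughly the setup I have in mind. It is only a sketch, again assuming the placeholder dataset WORK.FULL and target RESPONSE, with CAT_VAR1/CAT_VAR2/NUM_VAR1 standing in for whatever predictors survive variable selection:

proc sort data=work.full out=full_sorted;
   by response;
run;

/* 70:30 split stratified on the response, so both parts keep roughly */
/* the 0.6% event rate.                                                */
proc surveyselect data=full_sorted out=full_split
                  samprate=0.7 seed=12345 outall;
   strata response;
run;

data train valid;
   set full_split;
   if selected then output train;   /* about 42896 records */
   else output valid;               /* about 18383 records */
run;

/* Conventional ML fit on TRAIN, scored on the in-time validation set. */
proc logistic data=train;
   class cat_var1 cat_var2 / param=ref;
   model response(event='1') = cat_var1 cat_var2 num_var1;
   score data=valid out=valid_scored;
run;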

 

3) Option 3: This option also does NOT use oversampling of rare events, since I cannot obtain extra data for OUT-OF-TIME VALIDATION. It uses PROC LOGISTIC with Firth's method (penalized likelihood); the Firth method could help reduce any small-sample bias in the estimators. Here too I have the option of a 70:30 split between TRAIN and IN-TIME VALIDATION, so that I TRAIN the model on 42896 records and perform IN-TIME VALIDATION on 18383 records.

 

-- As I am not oversampling the rare events and also cannot obtain extra data for OUT-OF-TIME VALIDATION, please let me know whether it is advisable to TRAIN the model on all 61279 records with no IN-TIME VALIDATION, or to TRAIN the model on 42896 records and perform IN-TIME VALIDATION on the remaining 18383 records.
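If it helps, the only change from the Option 2 sketch above would be the FIRTH option on the MODEL statement (same placeholder names):

proc logistic data=train;
   class cat_var1 cat_var2 / param=ref;
   /* Firth's penalized likelihood to reduce small-sample bias */
   model response(event='1') = cat_var1 cat_var2 num_var1 / firth;
   score data=valid out=valid_scored_firth;
run;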


Thanks
Surajit

12 REPLIES
SAS_VA_Learner
Fluorite | Level 6

Hi,

Let me explain my situation:

 

1) I have a dataset where the response rate is 0.6% (374 events in a total of 61279 records), and I need to build a logistic regression model on this dataset.

 

2) Option 1: I can go with PROC LOGISTIC (conventional maximum likelihood), as the rule of thumb that "you should have at least 10 events for each parameter estimated" should hold, considering that I start my model-build iterations with no more than 35 variables and finalize the model with fewer than 10 variables.
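As a quick back-of-the-envelope check of that rule of thumb against my counts: with all 374 events available for the build, 374 / 10 ≈ 37 parameters at most; after a 70:30 split, roughly 0.7 × 374 ≈ 262 events remain in TRAIN, i.e. 262 / 10 ≈ 26 parameters at most (counting each non-reference level of a CLASS variable as its own parameter).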

 

Please let me know: if I have more than 35 predictors at the start of the model-build process, is it still recommended to use PROC LOGISTIC (conventional ML), with the understanding that I may have to collapse certain categorical levels to rule out quasi-complete/complete separation, and keeping in mind the rule of thumb that "you should have at least 10 events for each parameter estimated"?
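For the level-collapsing point, the kind of check I would run first is a simple cross-tab of each candidate CLASS variable against the response; levels with zero events are the usual trigger for quasi-complete separation and the natural candidates for collapsing. A minimal sketch with placeholder names:

proc freq data=train;
   /* event counts by level; look for cells with zero events */
   tables (cat_var1 cat_var2 cat_var3) * response / nopercent norow nocol;
run;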


3) Option 2: I can go with PROC LOGISTIC (Firth's method, using penalized likelihood). The Firth method could help reduce any small-sample bias in the estimators.


Please let me know: if I have more than 35 predictors at the start of the model-build process, is it recommended to use PROC LOGISTIC (Firth's method, penalized likelihood), with the understanding that I DO NOT have to collapse any categorical levels to rule out quasi-complete/complete separation?

 

4) Option 3: If the above two options are not recommended, then the last option is to use the oversampling strategy for rare events. As the total number of events (374) and the total record count (61279) are both small enough not to pose any challenge for computing time or hardware, I would go with an oversampled event rate of about 5.77% only (number of records to be modelled = 6487), because I want to keep as many non-event records as possible; if I go for an oversampled rate above that, the total number of records that can be modelled drops below 6487.


My thoughts on Option 1, Option 2, and Option 3 are given below:

 

-- With a 5.77% oversampled event rate (374 events and 6113 non-events, 6487 records in total) and a 70:30 split between TRAIN and VALIDATION, I can build the model on 4541 records and perform in-time validation on 1946 records. (I would also adjust the scored probabilities back to the 0.6% population rate; see the sketch after my question below.)

 

-- By comparison, under Option 1 or Option 2, with a 70:30 split between TRAIN and VALIDATION I can build the model on 42896 records and perform in-time validation on 18383 records.

 

Regarding Option 1, Option 2, and Option 3: please help me with which option is recommended in my case. If Option 3, is it then recommended to use an oversampled event rate of 2% or 3% instead, in order to increase the number of records to be modelled above 6487?
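On the probability adjustment mentioned above: since an Option 3 model would be estimated at a 5.77% event rate but applied to a 0.6% population, I understand the scored probabilities can be mapped back to the true prior with the PRIOREVENT= option on the SCORE statement. A minimal sketch with placeholder names:

proc logistic data=oversampled_train;
   class cat_var1 cat_var2 / param=ref;
   model response(event='1') = cat_var1 cat_var2 num_var1;
   /* map posterior probabilities back to the original 0.6% event rate */
   score data=oversampled_valid out=valid_scored priorevent=0.006;
run;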

 


Thanks
Surajit

Ksharp
Super User

I would pick Option 3.

But the event ratio should be at least 10%; otherwise you may suffer from an overdispersion problem.

After oversampling, there is no need to split the data into TRAIN and VALIDATION, because the sample is small.

 

Option 4: try EXACT logistic regression (sketched below), but it would cost you a lot of time, and I doubt you could finish it.

Option 5: try another distribution, like the Poisson distribution or the negative binomial distribution; check PROC GENMOD (also sketched below).
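Just as rough sketches with placeholder names, not worked-out solutions:

/* Option 4: exact conditional logistic regression; usually feasible  */
/* only for a handful of predictors and can run for a very long time. */
proc logistic data=train;
   class cat_var1 / param=ref;
   model response(event='1') = cat_var1 num_var1;
   exact cat_var1 num_var1 / estimate=both;
run;

/* Option 5: the same 0/1 response fitted in PROC GENMOD under a */
/* Poisson and a negative binomial distribution.                 */
proc genmod data=train;
   class cat_var1 cat_var2 / param=ref;
   model response = cat_var1 cat_var2 num_var1 / dist=poisson link=log;
run;

proc genmod data=train;
   class cat_var1 cat_var2 / param=ref;
   model response = cat_var1 cat_var2 num_var1 / dist=negbin link=log;
run;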

SAS_VA_Learner
Fluorite | Level 6

Hi Ksharp,

 

Thanks a ton for your inputs.

1) The suggestion of at least a 10% oversampling rate, and of not splitting the data between TRAIN and IN-TIME VALIDATION, is much appreciated.

 

2) But with a binary response variable, and with the overall objective of rank ordering accounts based on the predicted probabilities, is using PROC GENMOD with other distributions like Poisson and negative binomial really recommended?
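For context, by rank ordering I mean something like the following: whichever model produces the probabilities, I would decile the scored accounts on the predicted event probability. A sketch, assuming a scored dataset VALID_SCORED from a PROC LOGISTIC SCORE statement in which P_1 holds the event probability (placeholder names):

proc rank data=valid_scored out=valid_deciles groups=10 descending;
   var p_1;                 /* scored event probability       */
   ranks prob_decile;       /* 0 = highest-probability decile */
run;

proc freq data=valid_deciles;
   tables prob_decile * response / nopercent nocol;
run;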

 

Thanks

Surajit

Ksharp
Super User

Sorry. I don't understand your question.

Distributions like Poisson and negative binomial are for sparse event ratios, just like yours, but they have limits on the event probability too.

SAS_VA_Learner
Fluorite | Level 6

Hi Ksharp,

 

When events are rare, the Poisson distribution provides a good approximation to the binomial distribution. But it’s still just an approximation, so it’s better to go with the binomial distribution, which is the basis for logistic regression.
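To illustrate what I mean by approximation, here is a quick numerical check at the rates in this thread (n = 61279 trials, p = 0.006), just as a sketch:

data approx_check;
   n = 61279; p = 0.006; lambda = n*p;     /* lambda is about 368 */
   do k = 330 to 410 by 20;
      binom   = pdf('binomial', k, p, n);  /* exact binomial probability */
      poisson = pdf('poisson', k, lambda); /* Poisson approximation      */
      output;
   end;
run;

proc print data=approx_check noobs;
run;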

 

Hence, I was asking about the basis of your recommendation to use the Poisson distribution in the case of rare events.

 

Thanks

Surajit

Ksharp
Super User

"the basis of your recommendation"

The answer is simple: the Poisson/negative binomial distribution is better than the binomial; you can check the BIC or AIC of these models to compare.
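Roughly like this, with placeholder names: fit the same predictors under both distributions and compare the AIC/BIC rows in the fit-criteria table PROC GENMOD prints for each run (captured here with ODS OUTPUT):

proc genmod data=train;
   class cat_var1 cat_var2 / param=ref;
   model response(event='1') = cat_var1 cat_var2 num_var1 / dist=binomial link=logit;
   ods output Modelfit=fit_binomial;   /* criteria for assessing goodness of fit */
run;

proc genmod data=train;
   class cat_var1 cat_var2 / param=ref;
   model response = cat_var1 cat_var2 num_var1 / dist=poisson link=log;
   ods output Modelfit=fit_poisson;
run;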

 

But I would pick your Option 3 if the event ratio is too small.

SAS_VA_Learner
Fluorite | Level 6

Hi Ksharp,

 

I am unable to convince myself of the idea of Poisson regression or negative binomial regression for binary outcomes, as Poisson and negative binomial regression are generally used for COUNT data, and in my case I have a binary response, which is not count data.

 

But even so, I understand that if the response variable follows a Poisson distribution or something similar, the modelling can still be done using PROC GENMOD with a Poisson or negative binomial distribution. I would appreciate your thoughts on this, please.

 

 

Thanks

Surajit

SAS_VA_Learner
Fluorite | Level 6

Hi Ksharp,

 

I am working for a client, and my situation does not allow me to go for Option 3. I say so because, under Option 3, out-of-sample validation becomes the only option for validating the model, and the client is not OK with providing data for out-of-time validation testing.

 

Hence, with Option 3 ruled out, I need to go with either Option 1 or Option 2 only, as listed below:

 

Option 1: I can go with PROC LOGISTIC (conventional maximum likelihood), as the rule of thumb that "you should have at least 10 events for each parameter estimated" should hold, considering that I start my model-build iterations with no more than 35 variables and finalize the model with fewer than 10 variables.

 

 

Option 2: I can go with PROC LOGISTIC (Firth's method, using penalized likelihood). The Firth method could help reduce any small-sample bias in the estimators.

 

Questions for Option 1:

 

-- With a response rate of 0.6% (374 events in a total of 61279 records), is it advisable to split the 61279 records in a 70:30 ratio for model build and in-time validation? I know that if I do, I reduce the number of events in the model-build dataset from 374 to 262. I am forced to think this way because I am unable to obtain any data for out-of-time validation testing, and if I do not split the 61279 records 70:30 for model build and in-time validation, I cannot validate my model in any way.

 

-- Please let me know: if I have more than 35 predictors at the start of the model-build process, is it recommended to use PROC LOGISTIC (conventional ML), with the understanding that I may have to collapse certain categorical levels to rule out quasi-complete/complete separation, and keeping in mind the rule of thumb that "you should have at least 10 events for each parameter estimated"?

 

Questions for Option 2:

 

-- With a response rate of 0.6% (374 events in a total of 61279 records), is it advisable to split the 61279 records in a 70:30 ratio for model build and in-time validation? I know that if I do, I reduce the number of events in the model-build dataset from 374 to 262. I am forced to think this way because I am unable to obtain any data for out-of-time validation testing, and if I do not split the 61279 records 70:30 for model build and in-time validation, I cannot validate my model in any way.

 

-- Please let me know: if I have more than 35 predictors at the start of the model-build process, is it recommended to use PROC LOGISTIC (Firth's method, penalized likelihood), with the understanding that I DO NOT have to collapse any categorical levels to rule out quasi-complete/complete separation?

 

 

 

 

Thanks

Surajit

Ksharp
Super User

Sorry, I have no time to go through your question, and I am not an expert on it either.

Maybe @StatDave or @sld could shed some light.

SAS_VA_Learner
Fluorite | Level 6

Hi Ksharp,

 

Thanks a ton for your reply. So that @StatDave_sas or @sld can shed some light, if possible, can you please let me know whether I need to tag the question appropriately or post it in a different community/discussion group?

 

Thanks

Surajit

Ksharp
Super User

I think you could post your question in the Statistics forum, since it is about PROC LOGISTIC.

And the Statistics forum is @StatDave's first stop.

ChrisHemedinger
Community Manager

Just a housekeeping note -- we've merged the discussion from its original home on the Data Mining board -- in case any of the interchanges seem confusing 😉
