About SAS_VA_Learner

SAS_VA_Learner · ‎08-08-2018

Hi, Let me explain my situation : I have a dataset - where the response rate is 0.6% (374 events in a total of 61279 records) and I need to build a logistic regression model on this dataset. With this dataset of 61279 records, I have the option of splitting it into 70:30 ratio between TRAIN (to develop the model) and IN-TIME-VALIDATION (to validate the model). I work for a client and I do not have the option of obtaining other data apart from these 61279 records for me to do a OUT-OF-TIME-VALIDATION. I am listing down a couple of options and my questions regarding them. Would request different opinions/thoughts/recommendations. 1) Option -1 : My case is of rare events, hence I can follow the technique for over sampling of rare events. In order to have maximum number of records to build the model, the most logical approach could be to opt for a oversampling rate of 5.77%, which could give me a total of 6487 records with all of 374 events. As I cannot obtain extra data for OUT-OF-TIME-VALIDATION ,with a 5.77 oversampled rate, my dataset shall have a total of 6487 records.Is it recommended to to go for a 70:30 split between TRAIN and IN-TIME-VALIDATION , by which I can TRAIN my model on 4541 records and perform IN-TIME-VALIDATION on 1946 records OR TRAIN my model solely on 6487 records ? 2) Option 2 : This option is NOT opting for over sampling of rare events as I cannot obtain extra data for OUT-OF-TIME-VALIDATION. This option pertains to using PROC LOGISTIC (Maximum Likelihood). I now have the option of a 70:30 split between TRAIN and IN-TIME-VALIDATION , by which I can TRAIN my model on 42896 records and perform IN-TIME-VALIDATION on 18383 records. -- As I am doing nothing for oversampling of rare events and also cannot obtain extra data for OUT-OF-TIME-VALIDATION, please do let me know if it is advisable to TRAIN the model solely with 61279 records and not opting for IN-TIME-VALIDATION OR TRAIN the model with 42896 records and opt for IN-TIME-VALIDATION with 18383 records ? 3) Option -3 : This option is also NOT opting for over sampling of rare events as I cannot obtain extra data for OUT-OF-TIME-VALIDATION. This option pertains to use of PROC LOGISTIC (Firth's Method using Penalized Likelihood) - The Firth method could be helpful in reducing any small-sample bias of the estimators.Hence, I now have the option of a 70:30 split between TRAIN and IN-TIME-VALIDATION , by which I can TRAIN my model on 42896 records and perform IN-TIME-VALIDATION on 18383 records. -- As I am doing nothing for oversampling of rare events and also cannot obtain extra data for OUT-OF-TIME-VALIDATION, please do let me know if it is advisable to TRAIN the model solely with 61279 records and not opting for IN-TIME-VALIDATION OR TRAIN the model with 42896 records and opt for IN-TIME-VALIDATION with 18383 records ? Thanks Surajit

SAS_VA_Learner · ‎08-07-2018

Hi Ksharp, Thanks a ton for your reply. In order for @StatDave_sas or @sld to shed light, if possible . can you please let me know if I need to tag the question appropriately or put it in a different community/discussion group ? Thanks Surajit

SAS_VA_Learner · ‎08-06-2018

Hi Ksharp, I am working for a client and my situation does not enable me to go for Option -3 . I say so because , in Option-3 , Out-Of-Sample Validation becomes the only option for validating the model and client is not ok with providing data for out of time validation testing. Hence, with Option-3 ruled out, I need to go with either Option-1 or Option-2 only as listed below : Option 1 : I can go with PROC LOGISTIC (conventional Maximum Likelihood) as the thumb rule " that you should have at least 10 events for each parameter estimated" should hold good considering that I start my model build iteration with not more than 35 variables and finalize the model build with less than 10 variables. Option -2 : I can go with PROC LOGISTIC (Firth's Method using Penalized Likelihood) - The Firth method could be helpful in reducing any small-sample bias of the estimators. Questions for Option-1 : -- With a response rate is 0.6% (374 events in a total of 61279 records), is it advisable to split 61279 records in 70:30 ratio for Model Build and In-time Validation ? I know that if I split 61279 records in 70:30 ratio for Model Build and In-time Validation, I am reducing the numbers of responses in the Model Build Dataset from 374 to 262. I am forced to think in this way as I am unable to obtain any data for Out-Of-Time Validation Testing and if I do not opt to to split 61279 records in 70:30 ratio for Model Build and In-time Validation, I cannot validate my model in any way. -- Please do let me know if I have more than 35 predictors initially to start the model build process and if it is recommended to use PROC LOGISTIC (conventional ML) with the understanding that I may have to do certain categorical level collapses to rule out cases of quasi complete separation/ complete separation and considering the thumb rule " that you should have at least 10 events for each parameter estimated" ? Questions for Option-2 : -- With a response rate is 0.6% (374 events in a total of 61279 records), is it advisable to split 61279 records in 70:30 ratio for Model Build and In-time Validation ? I know that if I split 61279 records in 70:30 ratio for Model Build and In-time Validation, I am reducing the numbers of responses in the Model Build Dataset from 374 to 262. I am forced to think in this way as I am unable to obtain any data for Out-Of-Time Validation Testing and if I do not opt to to split 61279 records in 70:30 ratio for Model Build and In-time Validation, I cannot validate my model in any way. -- Please do let me know if I have more than 35 predictors initially to start the model build process and if it is recommended to use PROC LOGISTIC (Firth's Method using Penalized Likelihood) with the understanding that I DO NOT have to do any categorical level collapses to rule out cases of quasi complete separation/ complete separation ? Thanks Surajit

SAS_VA_Learner · ‎08-06-2018

Hi Ksharp, I am unable to convince my self with the idea of Poission Regression or Negative Binomial Regression for Binary Outcomes as generally Poission Regression or Negative Binomial Regression is used for COUNT data and in my case , I have a Binary Response , which is not COUNT data. But still then , I understand that if the Response Variable follows Poission Distribution or is something similar , modeling can still be done using PROC GENMOD opting for Poission Distribution or Negative Binomial Distribution. Would request your thoughts on this please. Thanks Surajit

SAS_VA_Learner · ‎08-06-2018

Hi Doug, I am working for a client and my situation does not enable me to go for Option -3 . I say so because , in Option-3 , Out-Of-Sample Validation becomes the only option for validating the model and client is not ok with providing data for out of time validation testing. Hence, with Option-3 ruled out, I need to go with either Option-1 or Option-2 only as listed below : Option 1 : I can go with PROC LOGISTIC (conventional Maximum Likelihood) as the thumb rule " that you should have at least 10 events for each parameter estimated" should hold good considering that I start my model build iteration with not more than 35 variables and finalize the model build with less than 10 variables. Option -2 : I can go with PROC LOGISTIC (Firth's Method using Penalized Likelihood) - The Firth method could be helpful in reducing any small-sample bias of the estimators. Questions for Option-1 : -- With a response rate is 0.6% (374 events in a total of 61279 records), is it advisable to split 61279 records in 70:30 ratio for Model Build and In-time Validation ? I know that if I split 61279 records in 70:30 ratio for Model Build and In-time Validation, I am reducing the numbers of responses in the Model Build Dataset from 374 to 262. I am forced to think in this way as I am unable to obtain any data for Out-Of-Time Validation Testing and if I do not opt to to split 61279 records in 70:30 ratio for Model Build and In-time Validation, I cannot validate my model in any way. -- Please do let me know if I have more than 35 predictors initially to start the model build process and if it is recommended to use PROC LOGISTIC (conventional ML) with the understanding that I may have to do certain categorical level collapses to rule out cases of quasi complete separation/ complete separation and considering the thumb rule " that you should have at least 10 events for each parameter estimated" ? Questions for Option-2 : -- With a response rate is 0.6% (374 events in a total of 61279 records), is it advisable to split 61279 records in 70:30 ratio for Model Build and In-time Validation ? I know that if I split 61279 records in 70:30 ratio for Model Build and In-time Validation, I am reducing the numbers of responses in the Model Build Dataset from 374 to 262. I am forced to think in this way as I am unable to obtain any data for Out-Of-Time Validation Testing and if I do not opt to to split 61279 records in 70:30 ratio for Model Build and In-time Validation, I cannot validate my model in any way. -- Please do let me know if I have more than 35 predictors initially to start the model build process and if it is recommended to use PROC LOGISTIC (Firth's Method using Penalized Likelihood) with the understanding that I DO NOT have to do any categorical level collapses to rule out cases of quasi complete separation/ complete separation ? Thanks Surajit

SAS_VA_Learner · ‎07-30-2018

Hi Ksharp, When events are rare, the Poisson distribution provides a good approximation to the binomial distribution. But it’s still just an approximation, so it’s better to go with the binomial distribution, which is the basis for logistic regression. Hence, I was asking about the basis of your recommendation of using Poisson distribution in case of rare events. Thanks Surajit

SAS_VA_Learner · ‎07-29-2018

Hi Ksharp, Thanks a ton for your inputs. 1) The suggestion of atleast 10% oversampling rate and not to split data between TRAIN and INTIME Validation is really appreciable. 2) But with regards to a Binary Response Variable and the overall objective of rank ordering accounts based on probability value obtained , is using PROC GENMOD recommended along with looking into other distributions like POISSION and NEGATIVE BINOMIAL? Thanks Surajit

SAS_VA_Learner · ‎07-29-2018

Hi Doug, Thanks a ton for your inputs. Just trying to summarize from your thoughts : 1) PROC LOGISTIC (Conventional MLE Estimates) & PROC LOGISTIC (Firth's Penalized Maximum Likelihood) : Not a viable option as per your comment as given below to which I fully agree. But at the same time , just wanted to know if I go with this approach then is a 70:30 split advisable between TRAIN and IN TIME VALIDATION considering the low response rate of 0.6% or TRAIN and OUT OF TIME VALIDATION only recommended ? "I tend to think in terms of how many actual events I have rather than how many total records I have. If I have 100 events and a 100 non-events, I have 200 observations. If I have 100 events and 99,900 non-events, do I really have that much more information? The signal in that case is so low (0.1%) that it would be difficult to have much confidence in any fitted model." 2) PROC LOGISTIC (Oversampled Rate of 5.77 %) : Splitting into TRAIN and INTIME VALIDATION is not recommended as per your comment as given below to which I fully agree. As INTIME VALIDATION is not recommended , then is OUT OF TIME VALIDATION the only option for model testing in this scenario ? "* The total number of events (374) is so low that I would consider not even splitting the data in this situation. Data Mining methods like partitioning assume that there are sufficient observations to represent the population in every partitioned data set. Splitting 70/30 leaves barely over 100 events in validation. Your data set might be better handle by classical statistical approaches given the limited data available. " 3) PROC LOGISTIC (Oversampled Rate of 5.77 %) : Apart from using a decision tree to understand more about the data , is there any other suggestion with regards to use of classical statistical approaches given the limited data available ? Thanks Surajit

SAS_VA_Learner · ‎07-26-2018

Hi, Let me explain my situation : 1) I have a dataset - where the response rate is 0.6% (374 events in a total of 61279 records) and I need to build a logistic regression model on this dataset. 2) Option 1 : I can go with PROC LOGISTIC (conventional Maximum Likelihood) as the thumb rule " that you should have at least 10 events for each parameter estimated" should hold good considering that I start my model build iteration with not more than 35 variables and finalize the model build with less than 10 variables. Please do let me know if I have more than 35 predictors initially to start the model build process and if it is recommended to use PROC LOGISTIC (conventional ML) with the understanding that I may have to do certain categorical level collapses to rule out cases of quasi complete separation/ complete separation and considering the thumb rule " that you should have at least 10 events for each parameter estimated" ? 3) Option -2 : I can go with PROC LOGISTIC (Firth's Method using Penalized Likelihood) - The Firth method could be helpful in reducing any small-sample bias of the estimators. Please do let me know if I have more than 35 predictors initially to start the model build process and if it is recommended to use PROC LOGISTIC (Firth's Method using Penalized Likelihood) with the understanding that I DO NOT have to do any categorical level collapses to rule out cases of quasi complete separation/ complete separation ? 4) Option -3 : If the above 2 options is not recommended , then the last option is to use the strategy for Over sampling of rare events. As the total number of events-374 and total records-61279 are both quite less with regards to posing any challenges on computing time or on hardware, I would obviously go with a oversampling rate of 5% only (Number of records to be modelled=6487) as I want to consider as many non-event records as possible as if I go for oversampling rate above 5% , the total number of records that can be modeled is less than 6487 . My thoughts on Option-1,Option-2 or Option-3 as given below : -- With a 5.77 oversampled rate, number of events = 374 and number of non-events=6113, a total of 6487 records. with a 70:30 split between TRAIN and VALIDATION , I can build my model on 4541 records and perform intime validation on 1946 records. -- Comparing to Option-1 and Option-2, with a 70:30 split between TRAIN and VALIDATION , I can build my model on 42896 records and perform intime validation on 18383 records. Regarding Option-1,Option-2 or Option-3 , Please do help me with which option is recommended for me - Option-1,Option-2 or Option-3 in my case ? If Option-3, then is it recommended to use a oversampling rate of either 2% or 3% in order to increase the number of records to be modeled to something above 6487 ? Thanks Surajit

SAS_VA_Learner · ‎07-26-2018

Hi, Let me explain my situation : 1) I have a dataset - where the response rate is 0.6% (374 events in a total of 61279 records) and I need to build a logistic regression model on this dataset. 2) Option 1 : I can go with PROC LOGISTIC (conventional Maximum Likelihood) as the thumb rule " that you should have at least 10 events for each parameter estimated" should hold good considering that I start my model build iteration with not more than 35 variables and finalize the model build with less than 10 variables. Please do let me know if I have more than 35 predictors initially to start the model build process and if it is recommended to use PROC LOGISTIC (conventional ML) with the understanding that I may have to do certain categorical level collapses to rule out cases of quasi complete separation/ complete separation and considering the thumb rule " that you should have at least 10 events for each parameter estimated" ? 3) Option -2 : I can go with PROC LOGISTIC (Firth's Method using Penalized Likelihood) - The Firth method could be helpful in reducing any small-sample bias of the estimators. Please do let me know if I have more than 35 predictors initially to start the model build process and if it is recommended to use PROC LOGISTIC (Firth's Method using Penalized Likelihood) with the understanding that I DO NOT have to do any categorical level collapses to rule out cases of quasi complete separation/ complete separation ? 4) Option -3 : If the above 2 options is not recommended , then the last option is to use the strategy for Over sampling of rare events. As the total number of events-374 and total records-61279 are both quite less with regards to posing any challenges on computing time or on hardware, I would obviously go with a oversampling rate of 5% only (Number of records to be modelled=6487) as I want to consider as many non-event records as possible as if I go for oversampling rate above 5% , the total number of records that can be modeled is less than 6487 . My thoughts on Option-1,Option-2 or Option-3 as given below : -- With a 5.77 oversampled rate, number of events = 374 and number of non-events=6113, a total of 6487 records. with a 70:30 split between TRAIN and VALIDATION , I can build my model on 4541 records and perform intime validation on 1946 records. -- Comparing to Option-1 and Option-2, with a 70:30 split between TRAIN and VALIDATION , I can build my model on 42896 records and perform intime validation on 18383 records. Regarding Option-1,Option-2 or Option-3 , Please do help me with which option is recommended for me - Option-1,Option-2 or Option-3 in my case ? If Option-3, then is it recommended to use a oversampling rate of either 2% or 3% in order to increase the number of records to be modeled to something above 6487 ? Thanks Surajit

SAS_VA_Learner · ‎07-26-2018

Hi, Let me explain my situation : 1) I have a dataset - where the response rate is 0.6% (374 events in a total of 61279 records) and I need to build a logistic regression model on this dataset. 2) Option 1 : I can go with PROC LOGISTIC (conventional Maximum Likelihood) as the thumb rule " that you should have at least 10 events for each parameter estimated" should hold good considering that I start my model build iteration with not more than 35 variables and finalize the model build with less than 10 variables. Please do let me know if I have more than 35 predictors initially to start the model build process and if it is recommended to use PROC LOGISTIC (conventional ML) with the understanding that I may have to do certain categorical level collapses to rule out cases of quasi complete separation/ complete separation and considering the thumb rule " that you should have at least 10 events for each parameter estimated" ? 3) Option -2 : I can go with PROC LOGISTIC (Firth's Method using Penalized Likelihood) - The Firth method could be helpful in reducing any small-sample bias of the estimators. Please do let me know if I have more than 35 predictors initially to start the model build process and if it is recommended to use PROC LOGISTIC (Firth's Method using Penalized Likelihood) with the understanding that I DO NOT have to do any categorical level collapses to rule out cases of quasi complete separation/ complete separation ? 4) Option -3 : If the above 2 options is not recommended , then the last option is to use the strategy for Over sampling of rare events. As the total number of events-374 and total records-61279 are both quite less with regards to posing any challenges on computing time or on hardware, I would obviously go with a oversampling rate of 5% only (Number of records to be modelled=6487) as I want to consider as many non-event records as possible as if I go for oversampling rate above 5% , the total number of records that can be modeled is less than 6487 . My thoughts on Option-1,Option-2 or Option-3 as given below : -- With a 5.77 oversampled rate, number of events = 374 and number of non-events=6113, a total of 6487 records. with a 70:30 split between TRAIN and VALIDATION , I can build my model on 4541 records and perform intime validation on 1946 records. -- Comparing to Option-1 and Option-2, with a 70:30 split between TRAIN and VALIDATION , I can build my model on 42896 records and perform intime validation on 18383 records. Regarding Option-1,Option-2 or Option-3 , Please do help me with which option is recommended for me - Option-1,Option-2 or Option-3 in my case ? If Option-3, then is it recommended to use a oversampling rate of either 2% or 3% in order to increase the number of records to be modeled to something above 6487 ? Thanks Surajit

SAS_VA_Learner · ‎07-26-2018

Hi Doug, Thanks a ton for the very good explanation. Let me explain my situation : 1) I have a dataset - where the response rate is 0.6% (374 events in a total of 61279 records) and I need to build a logistic regression model on this dataset. 2) Option 1 : I can go with PROC LOGISTIC (conventional Maximum Likelihood) as the thumb rule " that you should have at least 10 events for each parameter estimated" should hold good considering that I start my model build iteration with not more than 35 variables and finalize the model build with less than 10 variables. Please do let me know if I have more than 35 predictors initially to start the model build process and if it is recommended to use PROC LOGISTIC (conventional ML) with the understanding that I may have to do certain categorical level collapses to rule out cases of quasi complete separation/ complete separation and considering the thumb rule " that you should have at least 10 events for each parameter estimated" ? 3) Option -2 : I can go with PROC LOGISTIC (Firth's Method using Penalized Likelihood) - The Firth method could be helpful in reducing any small-sample bias of the estimators. Please do let me know if I have more than 35 predictors initially to start the model build process and if it is recommended to use PROC LOGISTIC (Firth's Method using Penalized Likelihood) with the understanding that I DO NOT have to do any categorical level collapses to rule out cases of quasi complete separation/ complete separation ? 4) Option -3 : If the above 2 options is not recommended , then the last option is to use the strategy for Over sampling of rare events. As the total number of events-374 and total records-61279 are both quite less with regards to posing any challenges on computing time or on hardware, I would obviously go with a oversampling rate of 5% only (Number of records to be modelled=6487) as I want to consider as many non-event records as possible as if I go for oversampling rate above 5% , the total number of records that can be modeled is less than 6487 . My thoughts on this process as given below : -- With a 5.77 oversampled rate, number of events = 374 and number of non-events=6113, a total of 6487 records. with a 70:30 split between TRAIN and VALIDATION , I can build my model on 4541 records and perform intime validation on 1946 records. -- Comparing to Option-1 and Option-2, with a 70:30 split between TRAIN and VALIDATION , I can build my model on 42896 records and perform intime validation on 18383 records. Please do help me with which option is recommended for me - Option-1,Option-2 or Option-3 in my case ? If Option-3, then is it recommended to use a oversampling rate of either 2% or 3% in order to increase the number of records to be modeled to something above 6487 ? Thanks Surajit

SAS_VA_Learner · ‎07-25-2018

Hi, I true appreciate the thought that having an over sampled data set with a 50/50 ratio of events/non/events is not recommended in real world scenarios which basically eliminates a lot of non-events and the model is unable to learn about non-events. If 50/50 ratio of events/non/events is always not the right approach to model rare events because of above, any guidelines on what the right approach is considering total number of records in the dataset (non oversampled) ? Thanks Surajit

SAS_VA_Learner · ‎08-08-2016

Hi Harshil, This is regarding preparing for the SAS Visual Analytics certification exam preparation. You have talked about 2 books : Volume-1 and Volume-2. Can you explicity name the book please ? Thanks Surajit

SAS_VA_Learner · ‎10-22-2015

Hi, This is regarding the SAS Visual Analytics Fast Track Course which i had taken a couple of months ago. Now, to practice what is given in the course material, shall I find all the datasets and other required stuff at : http://www.sas.com/en_us/software/business-intelligence/visual-analytics/demo.html Thanks Surajit

Online Status	Offline
Date Last Visited	‎08-16-2018 08:15 PM

Logistic regression of rare events - Unable to use Oversampling for Ra...

Re: Logistic Regression of Rare Events

Re: Logistic Regression of Rare Events

Re: Logistic Regression of Rare Events

Re: Using proc logistic for rare events-low probability estimates on a...

Re: Logistic Regression of Rare Events

Re: Logistic Regression of Rare Events

Re: A Question on Modeling Rare Events Data

Logistic Regression of Rare Events

Re: Appropriate to use firth method in proc logistic for rare events?

Re: SAS VA Certification

Logistic regression of rare events - Unable to use Oversampling for Ra...

Re: Logistic Regression of Rare Events

Re: Logistic Regression of Rare Events

Re: Logistic Regression of Rare Events

Re: Using proc logistic for rare events-low probability estimates on a...

Re: Logistic Regression of Rare Events

Re: Logistic Regression of Rare Events

Re: A Question on Modeling Rare Events Data

Logistic Regression of Rare Events

Re: Appropriate to use firth method in proc logistic for rare events?

Re: A Question on Modeling Rare Events Data

Re: Using proc logistic for rare events-low probability estimates on a...

Re: Using proc logistic for rare events-low probability estimates on a...

Betreff: SAS VA Certification

Re: SAS VA Certification