Good day
I need to develop a logistic regression model with 12 predictor variables and 1 target variable. The target variable is the court hearing outcome: a guilty or not guilty verdict.
What is the best ratio for partitioning a data set of 1,600 records into test and validation parts?
What is the best method to mitigate an imbalanced target variable with a split of "guilty verdict" = 1,380 and "not guilty verdict" = 240?
In my opinion, you'll want to stratify the sampling rather than take a simple random sample. Stratify on judge in particular: different judges are biased in different directions, so you'll want the judge mix balanced across the three partitions. A sketch follows below.
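Not a definitive recipe, just a minimal sketch of what stratified selection could look like in PROC SURVEYSELECT, assuming your table is named COURT with variables VERDICT and JUDGE (all placeholder names):

/* Sort by the stratification variables first; SURVEYSELECT requires it */
proc sort data=court;
    by verdict judge;
run;

/* Flag a 60% selection that preserves the verdict-by-judge mix */
proc surveyselect data=court out=court_flagged outall
                  samprate=0.6 seed=12345;
    strata verdict judge;
run;

One caveat: with only 240 not-guilty records spread across judges, some verdict-by-judge cells may be too small to split reliably, so check the cell counts first.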
Thanks.
But what ratio should I use? 50/50 or 55/45 or 60/40?
Take a look at some of your breakdowns and see what's feasible; it also depends on how many variables you're using. For example, with 30 variables and roughly 25 observations needed per variable to get a good estimate, the minimum size for any partition would be about 750 - more than your data set can support once it's split - so you'd have to reduce the number of variables or get more observations. (With your 12 variables, that floor is around 300.)
AFAIK, there isn't a hard-and-fast rule for splitting the data, though 60/20/20 is what I've seen a lot these days. Don't forget the test data set, and make sure you only use it for final assessment. Unless you're doing CV, but even then I'd still recommend a neutral held-out test data set.
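If it helps, here is one hedged sketch of a stratified 60/20/20 partition done in two passes (the dataset name COURT and the seeds are placeholders):

/* Pass 1: send 60% of each verdict class to training */
proc sort data=court;
    by verdict;
run;

proc surveyselect data=court out=pass1 outall
                  samprate=0.6 seed=111;
    strata verdict;
run;

data train rest;
    set pass1;
    if selected then output train;
    else output rest;
    drop selected;
run;

/* Pass 2: split the remaining 40% evenly into validation and test */
proc surveyselect data=rest out=pass2 outall
                  samprate=0.5 seed=222;
    strata verdict;
run;

data validate test;
    set pass2;
    if selected then output validate;
    else output test;
    drop selected;
run;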
@Sachin51 wrote:
Thanks.
But what ratio should I use? 50/50 or 55/45 or 60/40?
I only need to split the data into train and validation data.
I only have 12 variables.
Hi
The rare event here is the not-guilty verdict, at roughly 15% of records; the rest are guilty.
I suggest you use stratified sampling first and then partition the data set accordingly, e.g. 55:20:25 (train:validation:test).
1. Random undersampling:
Take roughly 23% of the guilty (majority) observations without replacement, about 320 records, and merge them with the 240 not-guilty records.
Thus: 320 guilty + 240 not guilty = 560 total observations, and the not-guilty event rate rises to about 43%.
This discards a lot of data, so it's not very satisfactory on its own, but you can use it as a benchmark. A sketch follows below.
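A rough sketch of that undersampling step, again assuming a dataset COURT with VERDICT coded 'guilty'/'not guilty' (adjust to your own coding):

/* Keep every not-guilty record */
data minority;
    set court;
    if verdict = 'not guilty';
run;

/* Draw 320 guilty records without replacement (simple random sampling) */
proc surveyselect data=court(where=(verdict='guilty'))
                  out=majority_sample method=srs
                  sampsize=320 seed=333;
run;

/* 320 guilty + 240 not guilty = 560 records, about 43% not guilty */
data under_sampled;
    set majority_sample minority;
run;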
2. Random oversampling:
Sampling with replacement.
One approach is to double the instances of the not-guilty class, i.e. duplicate those 240 records; the data set then has 480 not-guilty records out of 1,860, an overall event rate of about 26%. But beware: straight duplication can cause over-fitting. See the sketch below.
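The duplication variant is a one-step DATA step; a sketch under the same placeholder names:

/* Write each not-guilty record twice and each guilty record once:
   480 of 1,860 records are then not guilty, roughly 26% */
data over_sampled;
    set court;
    output;
    if verdict = 'not guilty' then output;
run;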
3. Use SMOTE: rather than duplicating records outright, it generates synthetic minority-class records by interpolating between existing minority records and their nearest neighbours.
Thanks.
From the original 1,600 records I need to select an 80% sample and then split the data into train and validation data.
So what will be the best splitting ratio?
To take a sample containing 80% of the observations while treating the target classes equally, try what I mentioned above; otherwise, you can start with these ratios, in Train:Validation:Test order:
1. 40:30:30 (SAS default)
2. 45:25:30
3. 50:25:25
4. 50:30:20
Try these and see which split gives you the better generalized model.
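And to tie this back to your 80% requirement, a hedged sketch: draw a stratified 80% sample, then split it into train and validation (the 70/30 split here is only an example, not a recommendation):

proc sort data=court;
    by verdict;
run;

/* Stratified 80% sample of the full data set */
proc surveyselect data=court out=sample80
                  samprate=0.8 seed=444;
    strata verdict;
run;

/* Split the sample 70/30 into train and validation */
proc surveyselect data=sample80 out=flagged outall
                  samprate=0.7 seed=555;
    strata verdict;
run;

data train validate;
    set flagged;
    if selected then output train;
    else output validate;
    drop selected;
run;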