Sachin51
Calcite | Level 5

Good day

 

I need to develop a logistic regression model with 12 predictor variables and 1 target variable. The target variable is the court hearing outcome: a guilty or not guilty verdict.

 

What is the best partitioning ratio for splitting a data set of 1600 records into test and validation parts?

 

What is the best method to mitigate an imbalanced target variable with a split of "guilty verdict" = 1380 and "not guilty verdict" = 240?

7 REPLIES
Reeza
Super User

You'll want a stratified sample with some extra stratification variables rather than a plain random sample, in my opinion at least.

Specifically, stratify around judges: because different judges are biased in certain directions, you'll want to make sure each judge is represented evenly across the three partitions to balance things out.
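
As a minimal sketch with PROC SURVEYSELECT, assuming a data set named COURT, the target VERDICT, and a (hypothetical) JUDGE variable to stratify on; the STRATA statement keeps each judge and verdict class proportionally represented in the sample, and the 70% rate is just illustrative:

/* SURVEYSELECT requires the input sorted by the strata variables */
proc sort data=court;
   by judge verdict;
run;

/* draw a stratified 70% sample; OUTALL keeps every row and adds a
   SELECTED flag (1 = sampled) so the remainder is easy to recover */
proc surveyselect data=court out=court_flagged
                  samprate=0.7 seed=12345 outall;
   strata judge verdict;
run;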

 

 

Sachin51
Calcite | Level 5

Thanks.

 

But what ratio should I use? 50/50, 55/45, or 60/40?

Reeza
Super User

Take a look at some of your breakdowns and see what's feasible; it also depends on how many variables you're using. For example, if you have 30 variables and need roughly 25 observations per variable to get a good estimate, the minimum size for any one data set would be 30 x 25 = 750 - which your 1600 records can't support once they're split into several partitions, so you'd have to reduce the number of variables or get more observations.

 

AFAIK there isn't a hard and fast rule for splitting the data, though 60/20/20 is what I've seen a lot these days. Don't forget the test data set, and make sure you use it only for final testing. Unless you're doing cross-validation, but even then I would still recommend keeping a neutral test data set.
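
For example, here is one way to get a stratified 60/20/20 split in Base SAS (a sketch, assuming a data set COURT with target VERDICT; two passes of PROC SURVEYSELECT, each stratified on the target):

/* pass 1: 60% training sample, stratified on the target */
proc sort data=court;
   by verdict;
run;

proc surveyselect data=court out=court_flag
                  samprate=0.6 seed=2023 outall;
   strata verdict;
run;

data train rest;
   set court_flag;
   if selected then output train;
   else output rest;
run;

/* pass 2: split the remaining 40% in half -> 20% validation, 20% test */
proc sort data=rest;
   by verdict;
run;

proc surveyselect data=rest out=rest_flag
                  samprate=0.5 seed=2024 outall;
   strata verdict;
run;

data valid test;
   set rest_flag;
   if selected then output valid;
   else output test;
run;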

 


@Sachin51 wrote:

Thanks.

 

But what ratio should I use? 50/50, 55/45, or 60/40?


 

 

Sachin51
Calcite | Level 5

I only need to split the data into train and validation data.

 

I only have 12 variables.

sachinkalra
Obsidian | Level 7

Hi

The event rate of "not guilty" is around 15% (240 of 1620), and the rest are guilty verdicts.

I suggest you use stratified sampling first and then partition the data set accordingly, e.g. 55:20:25 (train:validation:test).

 

1. Random undersampling

Take around 23% of the guilty (majority) observations without replacement and merge them with the not guilty (minority) observations.

Thus: 320 guilty + 240 not guilty = 560 total observations, and the minority event rate becomes 240/560, roughly 43%.

 

But this discards most of the data, so it is not very satisfactory on its own; use it as a benchmark.
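
A minimal sketch of this undersampling step in SAS, assuming a data set named COURT with a character target VERDICT coded 'guilty' / 'not guilty' (adjust the names and codes to your data):

/* keep all 240 minority (not guilty) records */
data minority;
   set court;
   where verdict = 'not guilty';
run;

/* draw 320 of the 1380 majority records, without replacement */
proc surveyselect data=court(where=(verdict='guilty'))
                  out=majority_sub sampsize=320 seed=42;
run;

/* combine: 560 records, minority rate roughly 43% */
data court_under;
   set minority majority_sub;
run;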

 

2. Random oversampling

Sampling with replacement.

One approach you can use: double the instances of the not guilty (minority) class, i.e. duplicate those 240 records. The total data set will then have a not guilty event rate of about 26% (480 of 1860 observations). But beware: this can cause over-fitting, because the model sees exact copies of the same records.
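
One way to do this duplication in SAS is to sample the minority class with replacement via METHOD=URS; a sketch under the same assumed COURT/VERDICT names as above:

/* draw 480 minority records with replacement (URS = unrestricted
   random sampling); OUTHITS writes one row per selection */
proc surveyselect data=court(where=(verdict='not guilty'))
                  out=minority_over method=urs sampsize=480
                  seed=7 outhits;
run;

/* append to the untouched majority class:
   1380 + 480 = 1860 rows, minority rate about 26% */
data court_over;
   set court(where=(verdict='guilty')) minority_over;
run;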

 

 

3. Use SMOTE (Synthetic Minority Oversampling Technique): it generates synthetic minority-class records by interpolating between nearby minority observations rather than duplicating existing ones.

Sachin51
Calcite | Level 5

Thanks.

 

From the original 1600 records I need to select an 80% sample, and then split that into training and validation data.

 

So what will be the best splitting ratio?

 

 

sachinkalra
Obsidian | Level 7

To take a sample containing 80% of the observations while treating the target variable's classes equally, I believe you should try what I mentioned above. Otherwise, you can start with these Train:Validation:Test splits, in this order:

1. 40:30:30 (SAS default)

2. 45:25:30

3. 50:25:25

4. 50:30:20

 

Try them in that order and see which split gives you a better generalized model.

