Good day
I need to develop a logistic regression model with 12 predictor variables and 1 target variable. The target variable is the court hearing outcome: a guilty or not guilty verdict.
What is the best ratio for partitioning a data set of 1,600 records into test and validation parts?
What is the best method to mitigate an imbalanced target variable with a split of "guilty verdict" = 1,380 and "not guilty verdict" = 240?
In my opinion, you'll want to stratify the sampling rather than take a simple random sample. Stratify on judge in particular: different judges are biased in different directions, so you'll want the judge mix balanced across the three partitions. A sketch follows below.
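Not a definitive recipe, just a minimal sketch of what stratified selection could look like in PROC SURVEYSELECT, assuming your table is named COURT with variables VERDICT and JUDGE (all placeholder names):

/* Sort by the stratification variables first; SURVEYSELECT requires it */
proc sort data=court;
    by verdict judge;
run;

/* Flag a 60% selection that preserves the verdict-by-judge mix */
proc surveyselect data=court out=court_flagged outall
                  samprate=0.6 seed=12345;
    strata verdict judge;
run;

One caveat: with only 240 not-guilty records spread across judges, some verdict-by-judge cells may be too small to split reliably, so check the cell counts first.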
Thanks.
But what ratio should I use? 50/50 or 55/45 or 60/40?
Take a look at some of your breakdowns and see what's feasible; it also depends on how many variables you're using. For example, with 30 variables and roughly 25 observations needed per variable to get a good estimate, the minimum size for any partition would be about 750 - more than your data set can support once it's split - so you'd have to reduce the number of variables or get more observations. (With your 12 variables, that floor is around 300.)
AFAIK, there isn't a hard-and-fast rule for splitting the data, though 60/20/20 is what I've seen a lot these days. Don't forget the test data set, and make sure you only use it for final assessment. Unless you're doing CV, but even then I'd still recommend a neutral held-out test data set.
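If it helps, here is one hedged sketch of a stratified 60/20/20 partition done in two passes (the dataset name COURT and the seeds are placeholders):

/* Pass 1: send 60% of each verdict class to training */
proc sort data=court;
    by verdict;
run;

proc surveyselect data=court out=pass1 outall
                  samprate=0.6 seed=111;
    strata verdict;
run;

data train rest;
    set pass1;
    if selected then output train;
    else output rest;
    drop selected;
run;

/* Pass 2: split the remaining 40% evenly into validation and test */
proc surveyselect data=rest out=pass2 outall
                  samprate=0.5 seed=222;
    strata verdict;
run;

data validate test;
    set pass2;
    if selected then output validate;
    else output test;
    drop selected;
run;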
@Sachin51 wrote:
Thanks.
But what ratio should I use? 50/50 or 55/45 or 60/40?
I only need to split the data into train and validation data.
I only have 12 variables.
Hi
The rare event here is the not-guilty verdict, at roughly 15% of records; the rest are guilty.
I suggest you use stratified sampling first and then partition the data set accordingly, e.g. 55:20:25 (train:validation:test).
1. Random undersampling:
Take roughly 23% of the guilty (majority) observations without replacement, about 320 records, and merge them with the 240 not-guilty records.
Thus: 320 guilty + 240 not guilty = 560 total observations, and the not-guilty event rate rises to about 43%.
This discards a lot of data, so it's not very satisfactory on its own, but you can use it as a benchmark. A sketch follows below.
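A rough sketch of that undersampling step, again assuming a dataset COURT with VERDICT coded 'guilty'/'not guilty' (adjust to your own coding):

/* Keep every not-guilty record */
data minority;
    set court;
    if verdict = 'not guilty';
run;

/* Draw 320 guilty records without replacement (simple random sampling) */
proc surveyselect data=court(where=(verdict='guilty'))
                  out=majority_sample method=srs
                  sampsize=320 seed=333;
run;

/* 320 guilty + 240 not guilty = 560 records, about 43% not guilty */
data under_sampled;
    set majority_sample minority;
run;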
2. Random oversampling:
Sampling with replacement.
One approach is to double the instances of the not-guilty class, i.e. duplicate those 240 records; the data set then has 480 not-guilty records out of 1,860, an overall event rate of about 26%. But beware: straight duplication can cause over-fitting. See the sketch below.
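The duplication variant is a one-step DATA step; a sketch under the same placeholder names:

/* Write each not-guilty record twice and each guilty record once:
   480 of 1,860 records are then not guilty, roughly 26% */
data over_sampled;
    set court;
    output;
    if verdict = 'not guilty' then output;
run;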
3. Use SMOTE: rather than duplicating records outright, it generates synthetic minority-class records by interpolating between existing minority records and their nearest neighbours.
Thanks.
From the original 1,600 records I need to select an 80% sample and then split the data into train and validation data.
So what will be the best splitting ratio?
To take a sample containing 80% of the observations while treating the target classes equally, try what I mentioned above; otherwise, you can start with these ratios, in Train:Validation:Test order:
1. 40:30:30 (SAS default)
2. 45:25:30
3. 50:25:25
4. 50:30:20
Try these and see which split gives you the better generalized model.
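And to tie this back to your 80% requirement, a hedged sketch: draw a stratified 80% sample, then split it into train and validation (the 70/30 split here is only an example, not a recommendation):

proc sort data=court;
    by verdict;
run;

/* Stratified 80% sample of the full data set */
proc surveyselect data=court out=sample80
                  samprate=0.8 seed=444;
    strata verdict;
run;

/* Split the sample 70/30 into train and validation */
proc surveyselect data=sample80 out=flagged outall
                  samprate=0.7 seed=555;
    strata verdict;
run;

data train validate;
    set flagged;
    if selected then output train;
    else output validate;
    drop selected;
run;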