geniusgenie
Obsidian | Level 7

Hi,

I am using four different classifiers (Random Forest, SVM, Decision Tree, and Neural Network) on several datasets. On one of the datasets all of the classifiers give 100% accuracy, which I do not understand; on the other datasets they give accuracies above 90%. Random Forest performs best on every dataset. Could anyone please suggest how I can check whether my algorithms are overfitting, and if they are, how I can overcome that?

Regards
1 ACCEPTED SOLUTION
DougWielenga
SAS Employee

When you get 100% accuracy, you need to go back and check your input variables to make sure you have not inadvertently included a variable containing information you would not have available when scoring new data.   For example, I could easily predict which accounts were going to default if there was a field that indicated how much money was lost when the loan did default, but that information would never be available for new data.  
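
A quick way to expose such a variable (a sketch; mydata, suspect_var, and target are placeholder names) is a crosstab against the target: if every level of the input maps to a single outcome value, that input is leaking the answer.

   proc freq data=mydata;
      tables suspect_var*target / norow nocol nopercent;
   run;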

 

You can also get very high classification accuracy (although not typically 100%) when you have a rare event that only happens a small percentage of the time.  Suppose your event happens 1% of the time; then you can say "nobody has the event" and be 99% correct with respect to misclassification, yet have a model of no practical use.  More details would be needed to speculate further on the misclassification aspect.
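
To make that concrete, compare each model against the naive majority-class baseline (a sketch; mydata and target are placeholders). Any model worth keeping should clearly beat this number:

   proc sql;
      /* accuracy of always predicting the most common class */
      select max(pct) as baseline_accuracy format=percent8.2
      from (select count(*) / (select count(*) from mydata) as pct
            from mydata
            group by target);
   quit;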

 

In Data Mining scenarios, you typically have sufficient data to use holdout data (validation data) to demonstrate the model is useful empirically.  When you have more limited data, you are left with cross-validation options.   When you have very limited data, you are left with assessing things based on your business knowledge.  The less data there is, the more uncertainty you are likely to have.  
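
As one example, PROC HPSPLIT (a decision tree, one of the classifiers in question) can hold out a validation partition directly; judge the model by the fit statistics it reports on the validation rows, not the training rows. Variable names below are placeholders:

   proc hpsplit data=mydata seed=12345;
      class target;                      /* categorical outcome   */
      model target = x1 x2 x3;           /* candidate inputs      */
      partition fraction(validate=0.3);  /* hold out 30% of rows  */
   run;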


With regard to choosing the 'best' model, you need to incorporate your business objectives.  You can choose a model based on many different statistics, yet none of them might actually be best suited to your situation depending on the business objectives you are trying to accomplish.  You need to identify your goals and assess how costly it is to misclassify someone, which can be complex if you have more than two target levels.  In the end, your choice of strategy should support the goals you had when you started building the model in the first place.
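
As a simple illustration (made-up counts and costs), weighting a validation confusion matrix by business costs can rank models differently than raw accuracy does:

   data _null_;
      fn = 40;  fp = 120;           /* missed events vs. false alarms      */
      cost_fn = 50;  cost_fp = 2;   /* here a miss costs 25x a false alarm */
      total_cost = fn*cost_fn + fp*cost_fp;
      put 'Expected misclassification cost: ' total_cost;
   run;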

 

Hope this helps!

Doug


8 REPLIES
Reeza
Super User
Check the distribution of outcomes in the data. Is one dataset different than the others?
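
For instance (dataset and variable names are placeholders), run this on each dataset and compare the outcome frequencies:

   proc freq data=mydata;
      tables target / nocum;   /* event rate in this dataset */
   run;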
geniusgenie
Obsidian | Level 7

Yes, all datasets are different. 

Reeza
Super User

@geniusgenie wrote:

Yes, all datasets are different. 


That wasn't my question. Are the distributions of the outcome variable you're testing different in the data sets? And if so, can that be what's causing the issue?

geniusgenie
Obsidian | Level 7
Hi Reeza, yes, the distributions of the outcome variables are different across the datasets, and that is an issue for me.
WendyCzika
SAS Employee

Are you partitioning your data?

geniusgenie
Obsidian | Level 7

I am partitioning the data.
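
For reference, one common Base SAS way to build a stratified train/validation flag (a sketch; names are placeholders):

   proc surveyselect data=mydata out=mydata_part
                     samprate=0.7 outall seed=12345;
      strata target;   /* keep the outcome mix the same in both parts */
   run;
   /* Selected=1 marks the 70% training rows, Selected=0 the holdout */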

mandata_ad
Calcite | Level 5

Check for leakage on the 100% dataset. 

90% is not unusual for the others.
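
One quick screen along those lines (a sketch; assumes a numeric 0/1 target and numeric inputs, all names placeholders): correlate every input with the target and give any value near +/-1 a hard look.

   proc corr data=mydata noprint outp=leak_screen;
      var x1 x2 x3;    /* candidate inputs    */
      with target;     /* numeric 0/1 outcome */
   run;
   /* inspect leak_screen for correlations near 1 in absolute value */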

