<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: How to check overfitting in SAS Data Science</title>
    <link>https://communities.sas.com/t5/SAS-Data-Science/How-to-check-overfitting/m-p/388949#M5892</link>
    <description>&lt;P&gt;Are you partitioning your data?&lt;/P&gt;</description>
    <pubDate>Thu, 17 Aug 2017 19:56:26 GMT</pubDate>
    <dc:creator>WendyCzika</dc:creator>
    <dc:date>2017-08-17T19:56:26Z</dc:date>
    <item>
      <title>How to check overfitting</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/How-to-check-overfitting/m-p/388643#M5862</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;I am using four different classifiers (Random Forest, SVM, Decision Tree, and Neural Network) on several datasets. On one of the datasets, all of the classifiers give 100% accuracy, which I do not understand; on the other datasets they give accuracies above 90%. Random Forest performs best on all datasets. Could anyone please suggest how I can check whether my algorithms are overfitting, and if they are, how I can overcome that?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Regards&lt;/P&gt;</description>
      <pubDate>Wed, 16 Aug 2017 22:16:03 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/How-to-check-overfitting/m-p/388643#M5862</guid>
      <dc:creator>geniusgenie</dc:creator>
      <dc:date>2017-08-16T22:16:03Z</dc:date>
    </item>
    <item>
      <title>Re: How to check overfitting</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/How-to-check-overfitting/m-p/388646#M5863</link>
      <description>Check the distribution of outcomes in the data. Is one dataset different from the others?</description>
      <pubDate>Wed, 16 Aug 2017 22:17:41 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/How-to-check-overfitting/m-p/388646#M5863</guid>
      <dc:creator>Reeza</dc:creator>
      <dc:date>2017-08-16T22:17:41Z</dc:date>
    </item>
    <item>
      <title>Re: How to check overfitting</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/How-to-check-overfitting/m-p/388650#M5864</link>
      <description>&lt;P&gt;Yes, all datasets are different.&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Wed, 16 Aug 2017 22:29:52 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/How-to-check-overfitting/m-p/388650#M5864</guid>
      <dc:creator>geniusgenie</dc:creator>
      <dc:date>2017-08-16T22:29:52Z</dc:date>
    </item>
    <item>
      <title>Re: How to check overfitting</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/How-to-check-overfitting/m-p/388664#M5865</link>
      <description>&lt;BLOCKQUOTE&gt;&lt;HR /&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/112474"&gt;@geniusgenie&lt;/a&gt; wrote:&lt;BR /&gt;
&lt;P&gt;Yes, all datasets are different.&amp;nbsp;&lt;/P&gt;
&lt;HR /&gt;&lt;/BLOCKQUOTE&gt;
&lt;P&gt;That wasn't my question. Are the distributions of the outcome variable you're testing different across the datasets? And if so, could that be what's causing the issue?&lt;/P&gt;</description>
      <pubDate>Thu, 17 Aug 2017 00:05:51 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/How-to-check-overfitting/m-p/388664#M5865</guid>
      <dc:creator>Reeza</dc:creator>
      <dc:date>2017-08-17T00:05:51Z</dc:date>
    </item>
    <item>
      <title>Re: How to check overfitting</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/How-to-check-overfitting/m-p/388934#M5891</link>
      <description>Hi Reeza, yes, the distributions of the outcome variables are different across the datasets, and that is an issue for me.&lt;BR /&gt;</description>
      <pubDate>Thu, 17 Aug 2017 18:45:14 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/How-to-check-overfitting/m-p/388934#M5891</guid>
      <dc:creator>geniusgenie</dc:creator>
      <dc:date>2017-08-17T18:45:14Z</dc:date>
    </item>
    <item>
      <title>Re: How to check overfitting</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/How-to-check-overfitting/m-p/388949#M5892</link>
      <description>&lt;P&gt;Are you partitioning your data?&lt;/P&gt;</description>
      <pubDate>Thu, 17 Aug 2017 19:56:26 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/How-to-check-overfitting/m-p/388949#M5892</guid>
      <dc:creator>WendyCzika</dc:creator>
      <dc:date>2017-08-17T19:56:26Z</dc:date>
    </item>
    <item>
      <title>Re: How to check overfitting</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/How-to-check-overfitting/m-p/388982#M5895</link>
      <description>&lt;P&gt;Yes, I am partitioning the data.&lt;/P&gt;</description>
      <pubDate>Thu, 17 Aug 2017 22:12:39 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/How-to-check-overfitting/m-p/388982#M5895</guid>
      <dc:creator>geniusgenie</dc:creator>
      <dc:date>2017-08-17T22:12:39Z</dc:date>
    </item>
    <item>
      <title>Re: How to check overfitting</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/How-to-check-overfitting/m-p/389183#M5902</link>
      <description>&lt;P&gt;Check for target leakage in the dataset where you get 100% accuracy.&lt;/P&gt;&lt;P&gt;Above 90% is not unusual for the others.&lt;/P&gt;</description>
      <pubDate>Fri, 18 Aug 2017 18:38:40 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/How-to-check-overfitting/m-p/389183#M5902</guid>
      <dc:creator>mandata_ad</dc:creator>
      <dc:date>2017-08-18T18:38:40Z</dc:date>
    </item>
    <item>
      <title>Re: How to check overfitting</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/How-to-check-overfitting/m-p/390765#M5918</link>
      <description>&lt;P&gt;When you get 100% accuracy, you need to go back and check your input variables to make sure you have not inadvertently included a variable containing information that would not be available when scoring new data. For example, I could easily predict which accounts were going to default if there were a field indicating how much money was lost when the loan defaulted, but that information would never be available for new data.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;You can also get very high classification ratings (although typically not 100%) when you have a rare event that happens only a small percentage of the time. Suppose your event happens 1% of the time; then you can say "nobody has the event" and be 99% correct with respect to misclassification, yet not have a model that is of any use. More details would be needed to speculate further on the misclassification aspect.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;In data mining scenarios, you typically have sufficient data to use holdout (validation) data to demonstrate empirically that the model is useful. When you have more limited data, you are left with cross-validation options. When you have very limited data, you are left with assessing things based on your business knowledge. The less data there is, the more uncertainty you are likely to have.&lt;/P&gt;
&lt;P&gt;With regard to choosing the 'best' model, you need to incorporate your business objectives. You can choose a model based on many different statistics, yet none of them might actually be best suited to your situation, depending on the business objectives you are trying to accomplish. You need to identify your goals and assess how costly it is to misclassify someone, which can be complex if you have more than two levels. In the end, your choice of strategy should support the goals you had when you started building the model in the first place.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Hope this helps!&lt;/P&gt;
&lt;P&gt;Doug&lt;/P&gt;</description>
      <pubDate>Thu, 24 Aug 2017 21:19:04 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/How-to-check-overfitting/m-p/390765#M5918</guid>
      <dc:creator>DougWielenga</dc:creator>
      <dc:date>2017-08-24T21:19:04Z</dc:date>
    </item>
  </channel>
</rss>

