Hi,
I am using four classifiers (Random Forest, SVM, Decision Tree, and Neural Network) on several datasets. On one dataset all four classifiers give 100% accuracy, which I do not understand; on the other datasets they all give accuracies above 90%. Random Forest performs best on every dataset. Could anyone suggest how I can check whether my models are overfitting and, if they are, how I can overcome that?
Regards
When you get 100% accuracy, you need to go back and check your input variables to make sure you have not inadvertently included a variable containing information you would not have available when scoring new data. For example, I could easily predict which accounts were going to default if there was a field that indicated how much money was lost when the loan did default, but that information would never be available for new data.
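One rough way to screen for this kind of leakage is to ask whether any single input, on its own, separates the target almost perfectly. The sketch below is only an illustration under assumptions: the loan records, the field names (`income`, `balance`, `loss_amount`, `default`), and the 0.99 cutoff are all hypothetical, and a flagged field still needs a human to decide whether it would really be unavailable at scoring time.

```python
def best_single_feature_accuracy(values, labels):
    """Best accuracy of a one-threshold rule built on one numeric feature."""
    n = len(labels)
    best = 0.0
    for t in sorted(set(values)):
        # Rule: predict the event when value >= t; also score the flipped rule.
        correct = sum((v >= t) == bool(y) for v, y in zip(values, labels))
        best = max(best, correct / n, (n - correct) / n)
    return best


def flag_suspect_features(rows, target, threshold=0.99):
    """Return features that predict the target almost perfectly by themselves."""
    labels = [r[target] for r in rows]
    suspects = {}
    for name in rows[0]:
        if name == target:
            continue
        score = best_single_feature_accuracy([r[name] for r in rows], labels)
        if score >= threshold:
            suspects[name] = score
    return suspects


# Hypothetical loan records; loss_amount is only known after a default.
loans = [
    {"income": 40, "balance": 12, "loss_amount": 0, "default": 0},
    {"income": 55, "balance": 30, "loss_amount": 900, "default": 1},
    {"income": 38, "balance": 25, "loss_amount": 0, "default": 0},
    {"income": 61, "balance": 8, "loss_amount": 0, "default": 0},
    {"income": 47, "balance": 27, "loss_amount": 450, "default": 1},
    {"income": 52, "balance": 31, "loss_amount": 0, "default": 0},
]

suspects = flag_suspect_features(loans, "default")
print(suspects)
```

In this toy data only `loss_amount` is flagged, which matches the loan example above: it is a consequence of the outcome, not a predictor you would have for new accounts.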
You can also get very high classification accuracy (although typically not 100%) when the event you are predicting is rare. Suppose your event happens 1% of the time: you can then say "nobody has the event" and be 99% accurate, yet the model is of no practical use. More details would be needed to speculate further on the misclassification aspect.
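The rare-event arithmetic can be checked directly. This is a minimal sketch with made-up labels (10 events in 1,000 cases) and a "model" that always predicts the non-event; comparing accuracy with recall on the event class shows why accuracy alone is misleading here.

```python
# Hypothetical labels: 1% of cases have the event.
labels = [1] * 10 + [0] * 990

# A "model" that says nobody has the event.
predictions = [0] * len(labels)

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)

# Recall on the event class: share of actual events that were caught.
true_positives = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
actual_positives = sum(labels)
recall = true_positives / actual_positives

print(f"accuracy={accuracy:.2f} recall={recall:.2f}")
```

Accuracy comes out at 0.99 while recall on the event is 0.00, so any real model should be compared against this do-nothing baseline before its accuracy is taken at face value.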
In Data Mining scenarios, you typically have sufficient data to use holdout data (validation data) to demonstrate the model is useful empirically. When you have more limited data, you are left with cross-validation options. When you have very limited data, you are left with assessing things based on your business knowledge. The less data there is, the more uncertainty you are likely to have.
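To make the holdout and cross-validation idea concrete, here is a small sketch that compares accuracy on the training data with k-fold cross-validated accuracy. Everything in it is an assumption for illustration: the data are synthetic (one numeric input, every tenth label flipped to mimic noise), and the classifier is a deliberately memorizing 1-nearest-neighbour rule rather than any of the models discussed in this thread.

```python
def one_nn_predict(train, x):
    """Predict the label of the closest training point (pure memorization)."""
    nearest = min(train, key=lambda pair: abs(pair[0] - x))
    return nearest[1]


def accuracy(train, test):
    """Share of test points whose predicted label matches the true label."""
    return sum(one_nn_predict(train, x) == y for x, y in test) / len(test)


def k_fold_accuracy(data, k=5):
    """Average held-out accuracy over k interleaved folds."""
    scores = []
    for fold in range(k):
        test = [d for i, d in enumerate(data) if i % k == fold]
        train = [d for i, d in enumerate(data) if i % k != fold]
        scores.append(accuracy(train, test))
    return sum(scores) / k


# Synthetic data: label is 1 when x >= 50, with every tenth label flipped.
data = []
for i in range(100):
    label = int(i >= 50)
    if i % 10 == 0:
        label = 1 - label  # injected label noise
    data.append((i, label))

train_acc = accuracy(data, data)  # scored on the data it memorized
cv_acc = k_fold_accuracy(data)    # scored on data it has not seen

print(f"training accuracy={train_acc:.2f} cross-validated accuracy={cv_acc:.2f}")
```

The memorizing model scores perfectly on its own training data but lower on held-out folds. A gap like that between training and validation accuracy is the usual empirical sign of overfitting; similar scores on both suggest the model generalizes.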
With regard to choosing the 'best' model, you need to incorporate your business objectives. You can compare models on many different statistics, yet none of them may be the one best suited to your situation; that depends on what you are trying to accomplish. You need to identify your goals and assess how costly each kind of misclassification is, which can be complex if the target has more than two levels. In the end, your choice of strategy should support the goals you had when you started building the model in the first place.
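As an illustration of how misclassification costs can change which model looks best, the sketch below ranks two models by accuracy and by total cost. The confusion-matrix counts, the model names, and the two cost figures are all hypothetical; in practice the costs have to come from your own business case.

```python
# Hypothetical costs: missing an event is far worse than a false alarm.
COST_FALSE_NEGATIVE = 50.0
COST_FALSE_POSITIVE = 1.0

# Hypothetical confusion-matrix counts on the same 1,000 validation cases.
models = {
    "model_a": {"tp": 5, "fp": 5, "fn": 15, "tn": 975},
    "model_b": {"tp": 18, "fp": 60, "fn": 2, "tn": 920},
}


def accuracy(c):
    """Share of all cases classified correctly."""
    return (c["tp"] + c["tn"]) / sum(c.values())


def total_cost(c):
    """Total misclassification cost under the assumed cost figures."""
    return c["fn"] * COST_FALSE_NEGATIVE + c["fp"] * COST_FALSE_POSITIVE


best_by_accuracy = max(models, key=lambda m: accuracy(models[m]))
best_by_cost = min(models, key=lambda m: total_cost(models[m]))

for name, counts in models.items():
    print(name, f"accuracy={accuracy(counts):.3f}", f"cost={total_cost(counts):.0f}")
print("best by accuracy:", best_by_accuracy)
print("best by cost:", best_by_cost)
```

With these assumed numbers the more accurate model is the more expensive one, because it misses most of the costly events. The ranking can flip again under a different cost ratio, which is why the costs need to be settled before the models are compared.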
Hope this helps!
Doug
Yes, all datasets are different.
@geniusgenie wrote:
Yes, all datasets are different.
That wasn't my question. Are the distributions of the outcome variable you're testing different in the data sets? And if so, can that be what's causing the issue?
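A quick way to answer that is to tabulate the outcome variable in each dataset. This is a minimal sketch with invented labels; `dataset_a` and `dataset_b` are placeholder names, not the poster's data.

```python
from collections import Counter


def class_proportions(labels):
    """Share of each outcome level in a list of labels."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: n / total for cls, n in counts.items()}


# Hypothetical outcome columns from two datasets.
datasets = {
    "dataset_a": [0] * 50 + [1] * 50,
    "dataset_b": [0] * 95 + [1] * 5,
}

proportions = {name: class_proportions(y) for name, y in datasets.items()}

# The majority share is the accuracy of always guessing the most common level.
majority_share = {name: max(p.values()) for name, p in proportions.items()}

for name in datasets:
    print(name, proportions[name], f"majority share={majority_share[name]:.2f}")
```

If one dataset has a much larger majority share than the others, a high accuracy there says less about the model than the same accuracy on a balanced dataset.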
Are you partitioning your data?
I am partitioning data.
Check for leakage on the 100% dataset.
90% is not unusual for the others.