geniusgenie
Obsidian | Level 7

Hi,

I am using four different classifiers (Random Forest, SVM, Decision Tree, and Neural Network) on several datasets. On one of the datasets, all of the classifiers give 100% accuracy, which I do not understand; on the other datasets they give above 90% accuracy. Random Forest performs best on all datasets. Could anyone please suggest how I can check whether my algorithms are overfitting, and if they are, how I can overcome that?

Regards

1 ACCEPTED SOLUTION

Accepted Solutions
DougWielenga
SAS Employee

When you get 100% accuracy, you need to go back and check your input variables to make sure you have not inadvertently included a variable containing information you would not have available when scoring new data.   For example, I could easily predict which accounts were going to default if there was a field that indicated how much money was lost when the loan did default, but that information would never be available for new data.  
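One way to screen for this kind of leakage is to check how well each input predicts the target all by itself; a field that is near-perfect on its own deserves scrutiny. A minimal sketch in pure Python (the feature names and data below are hypothetical, not from the poster's datasets):

```python
# Leakage screen (illustrative sketch): a feature that by itself classifies
# the target almost perfectly is a leakage suspect.

def single_feature_accuracy(values, labels):
    """Best accuracy of a one-feature threshold rule over all split points."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = 0.0
    for t in sorted(set(values)):
        # Rule: predict 1 when value > t; 1 - acc covers the reversed rule.
        pred = [1 if v > t else 0 for v, _ in pairs]
        acc = sum(p == y for p, (_, y) in zip(pred, pairs)) / n
        best = max(best, acc, 1 - acc)
    return best

# "loss_amount" is only known after a default occurs -- a classic leaky field.
features = {
    "loss_amount":  [0, 0, 0, 1200, 0, 850, 0, 0, 2300, 0],
    "credit_score": [700, 650, 720, 660, 680, 530, 710, 690, 480, 705],
}
labels = [0, 0, 0, 1, 0, 1, 0, 0, 1, 0]

for name, vals in features.items():
    acc = single_feature_accuracy(vals, labels)
    flag = "  <-- leakage suspect" if acc >= 0.99 else ""
    print(f"{name}: {acc:.2f}{flag}")
```

The leaky field scores 1.00 alone while a legitimate predictor does not; in practice you would review any flagged field and ask whether it could exist at scoring time.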

 

You can also get very high classification rates (although not typically 100%) when you have a rare event that only happens a small percentage of the time. Suppose your event happens 1% of the time: a model that simply says "nobody has the event" is 99% correct with respect to misclassification, yet of no practical use. More details would be needed to speculate further on the misclassification aspect.
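The arithmetic behind that baseline is easy to demonstrate with made-up labels (a small Python sketch):

```python
# Majority-class baseline (sketch): with a 1% event rate, predicting
# "no event" for everyone is 99% accurate but catches zero events.

labels = [1] * 10 + [0] * 990          # 1% event rate, 1000 observations
predictions = [0] * len(labels)        # "nobody has the event"

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
true_pos = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
recall = true_pos / sum(labels)

print(f"accuracy = {accuracy:.2%}")   # 99.00%
print(f"recall   = {recall:.2%}")     # 0.00% -- the "model" finds no events
```

This is why, with rare events, a metric such as recall, precision, or expected profit tells you far more than raw accuracy.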

 

In data mining scenarios, you typically have sufficient data to use holdout (validation) data to demonstrate empirically that the model is useful. When you have more limited data, you are left with cross-validation options. When you have very limited data, you are left assessing things based on your business knowledge. The less data there is, the more uncertainty you are likely to have.
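For the limited-data case, a k-fold cross-validation split can be sketched in a few lines (pure Python, no ML library assumed; you would fit and score your classifier once per fold):

```python
# k-fold cross-validation sketch: every observation is held out exactly
# once, so all of the data contributes to validation even when there is
# too little for a dedicated holdout partition.

import random

def kfold_indices(n, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)          # shuffle once, reproducibly
    folds = [idx[i::k] for i in range(k)]     # k roughly equal folds
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

# Every index lands in exactly one test fold:
all_test = sorted(j for _, test in kfold_indices(20, 5) for j in test)
print(all_test == list(range(20)))  # True
```

The average score across the k held-out folds is a far more honest estimate than accuracy on the training data itself.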


With regard to choosing the 'best' model, you need to incorporate your business objectives. You can choose a model based on many different statistics, yet none of them might actually be best suited to your situation depending on the business objectives you are trying to accomplish. You need to identify your goals and assess how costly it is to misclassify someone, which can be complex if you have more than two target levels. In the end, your choice of strategy should support the goals you had when you started building the model in the first place.
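One way to make "how costly it is to misclassify" concrete is an expected-cost comparison over the confusion matrix. A sketch with purely illustrative cost figures (the models and costs are hypothetical):

```python
# Cost-based model choice (sketch): accuracy alone can prefer the model
# whose errors are the expensive kind.
# cost[(actual, predicted)]: missing a default (1 -> 0) is assumed to
# cost 20x more than flagging a good account (0 -> 1).
cost = {(0, 0): 0, (0, 1): 1, (1, 0): 20, (1, 1): 0}

def expected_cost(actual, predicted):
    """Average misclassification cost per observation."""
    return sum(cost[(a, p)] for a, p in zip(actual, predicted)) / len(actual)

actual  = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
model_a = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]   # 90% accurate, misses a default
model_b = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]   # 80% accurate, catches every default

print(expected_cost(actual, model_a))   # 2.0  (one miss at cost 20)
print(expected_cost(actual, model_b))   # 0.2  (two false alarms at cost 1)
```

Under these assumed costs, the less accurate model is the better business choice, which is exactly the point about letting objectives, not a single statistic, drive selection.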

 

Hope this helps!

Doug


8 REPLIES
Reeza
Super User
Check the distribution of outcomes in the data. Is one dataset different than the others?
geniusgenie
Obsidian | Level 7

Yes, all datasets are different. 

Reeza
Super User

@geniusgenie wrote:

Yes, all datasets are different. 


That wasn't my question. Are the distributions of the outcome variable you're testing different in the data sets? And if so, can that be what's causing the issue?

geniusgenie
Obsidian | Level 7
Hi Reeza, yes, the distributions of the outcome variables are different across the datasets, and for me that's an issue.
WendyCzika
SAS Employee

Are you partitioning your data?

geniusgenie
Obsidian | Level 7

I am partitioning data.

mandata_ad
Calcite | Level 5

Check for leakage on the 100% dataset. 

90% is not unusual for the others.


