herzizza88
Calcite | Level 5

Hi,

 

I am out of options to solve this. I built a decision tree and want to check the misclassification rate to see whether there is any overfitting and to find the optimal number of leaves, but it gave me this. It didn't help that all my lecturer said was "Doesn't look good" without any explanation.

Screen Shot 2018-12-13 at 10.01.56 AM.png

1 ACCEPTED SOLUTION

DougWielenga
SAS Employee

I did a decision tree and want to check the misclassification rate to see if there is any overfit and the optimal number of leaves, but it gave me this. Didn't help that all my own lecturer said was "Doesn't look good" without any explanation.

 

What you are looking at is a tree that did not split at all. As a result, the single-node tree predicts every observation as being in the most common category. It appears that your rarer event occurs about 34% of the time and your common event about 66% of the time: the 34% misclassification rate corresponds to the proportion of rarer events that were predicted to be the common event, since every observation gets the same prediction when no splits are made.
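To make the arithmetic concrete, here is a minimal Python sketch (not SAS code) of what a tree with no splits does. The 66/34 label counts are made up to match the rates described above, and the function name is purely illustrative.

```python
from collections import Counter


def majority_baseline_misclassification(labels):
    """Return the majority class and the misclassification rate of a
    single-node tree, which predicts that class for every observation."""
    counts = Counter(labels)
    majority_class, majority_count = counts.most_common(1)[0]
    return majority_class, 1 - majority_count / len(labels)


# Hypothetical target: 66 common events (0) and 34 rarer events (1).
labels = [0] * 66 + [1] * 34
majority_class, rate = majority_baseline_misclassification(labels)
print(majority_class, round(rate, 2))
```

Any tree worth keeping has to beat this baseline rate, so it is a useful number to compute before judging a model.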

 

It is common to encounter this in rare event scenarios (that is, when you have a small percentage of target events in the training data), but you have at least 1/3 of your data with the target event. It is possible that using some decision weights to increase the chance of the rarer event being chosen would achieve some splitting, but it is more likely that you need to determine whether you have one of the following situations:
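As a rough illustration of why decision weights can change the outcome, the Python sketch below applies a generic expected-profit rule to a single node. This is an assumption made for illustration, not a description of how SAS Enterprise Miner implements decision weights; the function name, the weights, and the 0.45 node proportion are all hypothetical.

```python
def best_decision(p_event, profit_event=1.0, profit_nonevent=1.0):
    """Pick the decision with the larger expected profit for one node.

    p_event: proportion of the rarer event among the node's observations.
    profit_event / profit_nonevent: hypothetical weights earned for
    correctly predicting the event / non-event (a wrong prediction earns 0).
    """
    expected_event = p_event * profit_event
    expected_nonevent = (1 - p_event) * profit_nonevent
    return "event" if expected_event > expected_nonevent else "nonevent"


# A hypothetical node where the event is over-represented (45% vs. 34% overall).
print(best_decision(0.45))  # equal weights
print(best_decision(0.45, profit_event=1 / 0.34, profit_nonevent=1 / 0.66))
```

With equal weights the node is still labeled with the common category, whereas weights inversely proportional to the overall class proportions let it be labeled as the event.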

 

1 - Very weak input variables which are only slightly related to the outcome, if at all -- if so, consider trying to fit the tree interactively to see if there are any useful predictors at all

2 - A target variable which is difficult to measure accurately (e.g. customer satisfaction) -- if so, consider what you can do to better define and/or measure the target of interest

3 - Improperly prepared input data (e.g. are you getting all the possible information out of your input data?) -- if so, consider what transformations might be helpful, such as taking timestamps, which are not typically useful outside of forecasting models, and transforming them into variables like year/quarter/month, or taking variables which have too many levels, such as SKU number, and creating one or more variables which represent meaningful categories of SKU numbers.
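The transformations mentioned in item 3 could look something like the following Python sketch. The SKU prefix scheme and category names are invented for illustration; a real mapping would come from your own product hierarchy.

```python
from datetime import datetime


def timestamp_features(ts):
    """Break a timestamp into coarser variables a tree can split on."""
    return {
        "year": ts.year,
        "quarter": (ts.month - 1) // 3 + 1,
        "month": ts.month,
    }


# Hypothetical mapping from a two-letter SKU prefix to a broader category.
SKU_PREFIX_TO_CATEGORY = {"EL": "electronics", "GR": "grocery"}


def sku_category(sku):
    """Collapse a high-cardinality SKU into a small set of categories."""
    return SKU_PREFIX_TO_CATEGORY.get(sku[:2], "other")


features = timestamp_features(datetime(2018, 12, 13, 10, 1, 56))
print(features)  # {'year': 2018, 'quarter': 4, 'month': 12}
print(sku_category("EL-00417"))  # electronics
```

The derived variables have only a handful of levels each, which gives the tree a realistic chance of finding a split.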

 

For information on how to use decision weights to better fit rare event scenarios (even though your event is not overly rare), see the solution in the community article linked below:

 

https://communities.sas.com/t5/SAS-Data-Mining-and-Machine/A-Question-on-Modeling-Rare-Events-Data/m...

 

Hope this helps!

Doug

