smb11
Calcite | Level 5

I am using the HPSPLIT procedure to fit a classification tree. In the GROW statement, I have specified ENTROPY. However, I recently learned that this criterion may be sensitive to imbalanced data. One of my outcome groups is almost double the size of the other. Does anyone have suggestions for which of the other GROW options (CHAID, CHISQUARE, FASTCHAID, and GINI) may be less sensitive to imbalanced data? Thank you!

1 REPLY
RalphAbbey
SAS Employee

Both Entropy and Gini can be sensitive to imbalanced data, because node purity is computed from the proportion of observations in the node at each response level. This is usually a bigger problem in rare-event modeling. One outcome group being twice the size of the other is unlikely to be a large issue.
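
To make the point about proportions concrete, here is a minimal Python sketch (not SAS code, and not how HPSPLIT is implemented) that computes the textbook entropy and Gini impurity of a root node for a few class mixes. The class counts and the helper names `entropy`, `gini`, and `class_proportions` are made up for illustration.

```python
import math

def entropy(proportions):
    """Shannon entropy (base 2) of the class proportions in a node."""
    return -sum(p * math.log2(p) for p in proportions if p > 0)

def gini(proportions):
    """Gini impurity of the class proportions in a node."""
    return 1.0 - sum(p * p for p in proportions)

def class_proportions(counts):
    """Convert per-class observation counts into proportions."""
    total = sum(counts)
    return [c / total for c in counts]

# Hypothetical class counts, chosen only to illustrate the three situations.
scenarios = {
    "balanced (1:1)": [500, 500],
    "two-to-one (2:1)": [600, 300],
    "rare event (99:1)": [990, 10],
}

impurities = {}
for label, counts in scenarios.items():
    p = class_proportions(counts)
    impurities[label] = (entropy(p), gini(p))
    print(f"{label:20s} entropy={impurities[label][0]:.3f} gini={impurities[label][1]:.3f}")
```

With a 2:1 split the root impurity (about 0.918 for entropy and 0.444 for Gini) is still close to the balanced maximum of 1.0 and 0.5, so candidate splits have plenty of impurity to reduce. With a 99:1 split the root is already nearly "pure" (about 0.081 and 0.020), which is why rare-event problems are where these criteria tend to struggle.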

Additionally, CHAID and FastCHAID should both be less sensitive than Entropy and Gini to imbalance in the number of observations in each outcome group. That said, if the imbalance is very large, it might be better practice to oversample the data beforehand.
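
As a rough illustration of what oversampling beforehand could look like, here is a minimal Python sketch (again, not SAS code). It assumes "oversample" means randomly duplicating rows of the smaller class until the classes are the same size; other schemes, such as undersampling the larger class, are equally plausible readings. The function `oversample_minority`, the `outcome` field, and the toy data are all hypothetical.

```python
import random

def oversample_minority(rows, label_of, seed=0):
    """Randomly duplicate rows of the smaller classes (sampling with
    replacement) until every class is as large as the largest one."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)
    target = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        # Draw the shortfall from the same class; k is 0 for the largest class.
        balanced.extend(rng.choices(group, k=target - len(group)))
    rng.shuffle(balanced)
    return balanced

# Toy data: 10 events and 990 non-events (a made-up rare-event mix).
rows = [{"id": i, "outcome": "event" if i < 10 else "non-event"} for i in range(1000)]
balanced = oversample_minority(rows, lambda r: r["outcome"])

counts = {}
for r in balanced:
    counts[r["outcome"]] = counts.get(r["outcome"], 0) + 1
print(counts)
```

If you do something like this, apply it only to the training portion of the data and leave the validation data at its original proportions, so that the comparison between trees reflects the population you actually care about.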

If you have the time and resources, I would recommend building several decision trees using different criteria and then using validation data to choose the best tree.
