10-29-2016 09:22 PM
I am using PROC HPSPLIT to run a classification tree. In the GROW statement, I have used "entropy." However, I recently learned that this may be sensitive to imbalanced data. One of my outcome groups is almost double the size of the other. Does anyone have suggestions for which of the other GROW options (CHAID, CHISQUARE, FASTCHAID, and GINI) may be less sensitive to imbalanced data? Thank you!
10-31-2016 03:57 PM
Both Entropy and Gini can be sensitive to unbalanced data, because the node purity value is based on the proportion of observations in the node at each response level. Usually this is a larger problem in rare event modeling. One outcome group being twice the size of the other is not likely to be a large issue.
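To make the dependence on class proportions concrete, here is a small Python sketch that computes both measures from the class counts in a node. It uses the textbook definitions of entropy and Gini impurity and is not HPSPLIT's internal code; the function names and the example counts are made up for illustration.

```python
import math

def class_proportions(counts):
    """Turn per-class observation counts into proportions."""
    total = sum(counts)
    return [c / total for c in counts]

def entropy(counts):
    """Shannon entropy in bits; 0 for a pure node, 1 for a 50/50 binary node."""
    return -sum(p * math.log2(p) for p in class_proportions(counts) if p > 0)

def gini(counts):
    """Gini impurity; 0 for a pure node, 0.5 for a 50/50 binary node."""
    return 1.0 - sum(p * p for p in class_proportions(counts))

# Hypothetical root nodes (illustrative counts, not from any real data set).
example_nodes = {
    "balanced 1:1": [500, 500],
    "imbalanced 2:1": [200, 100],
    "rare event 99:1": [990, 10],
}

for label, counts in example_nodes.items():
    print(f"{label:16s} entropy={entropy(counts):.4f}  gini={gini(counts):.4f}")
```

With a 2:1 split the root node still has an entropy of roughly 0.92 bits out of a possible 1, so there is plenty of impurity left for a split to reduce. At 99:1 the root is already close to pure by either measure, which is why rare event problems are where these criteria tend to struggle.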
Additionally, CHAID and FastCHAID should both be less sensitive than Entropy and Gini to imbalanced outcome groups. That being said, if the imbalance is too large, it might be better practice to oversample the data beforehand.
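As a rough illustration of what oversampling beforehand can look like, the Python sketch below duplicates randomly chosen rows of the smaller group until every group matches the largest one. This is only one possible approach (random oversampling with replacement), chosen here as an assumption; the helper name `oversample_minority` and the toy rows are hypothetical.

```python
import random

def oversample_minority(rows, label_of, seed=0):
    """Randomly duplicate rows of smaller classes until all classes
    have as many rows as the largest class."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible

    # Group the rows by their outcome label.
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)

    target = max(len(group) for group in by_class.values())

    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        # Sample the shortfall with replacement (k is 0 for the largest class).
        balanced.extend(rng.choices(group, k=target - len(group)))

    rng.shuffle(balanced)
    return balanced

# Toy data with a 2:1 imbalance: 200 rows of outcome "A", 100 of outcome "B".
toy_rows = [(i, "A") for i in range(200)] + [(i, "B") for i in range(100)]
balanced_rows = oversample_minority(toy_rows, label_of=lambda row: row[1])
```

Keep in mind that a tree trained on resampled data will produce predicted probabilities that reflect the resampled class mix rather than the original one.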
If you have the time and the setup for it, I would recommend building several decision trees using different criteria, and then using validation data to determine the best tree.