HPSPLIT Grow Statement for Imbalanced Data

Occasional Learner

I am using PROC HPSPLIT to fit a classification tree, with ENTROPY as the criterion in the GROW statement. However, I recently learned that entropy may be sensitive to imbalanced data. One of my outcome groups is almost twice the size of the other. Does anyone have suggestions for which of the other GROW options (CHAID, CHISQUARE, FASTCHAID, and GINI) may be less sensitive to imbalanced data? Thank you!

SAS Employee

Re: HPSPLIT Grow Statement for Imbalanced Data

Both entropy and Gini can be sensitive to imbalanced data, because the node purity value is based on the proportion of observations in the node at each response level. This is usually a bigger problem in rare-event modeling. One outcome group being twice the size of the other is unlikely to be a large issue.
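
As a rough illustration of why the class proportions matter, the DATA step below computes the root-node entropy and Gini index for three class mixes. The data set name, the variable names, and the 1% rare-event rate are made up for this sketch; only the 2:1 split comes from the question. With a 2:1 split both measures stay close to their balanced maximum, whereas at a 1% event rate both are already near zero, which leaves very little impurity for any split to remove.

```sas
/* Hypothetical illustration: impurity of a two-class node
   as a function of the proportion in the smaller class */
data work.impurity;
   input scenario :$12. p_event;
   p_other = 1 - p_event;

   /* Entropy in bits; the guards avoid log2(0) for a pure node */
   entropy = 0;
   if p_event > 0 then entropy = entropy - p_event * log2(p_event);
   if p_other > 0 then entropy = entropy - p_other * log2(p_other);

   /* Gini index: 1 minus the sum of squared class proportions */
   gini = 1 - p_event**2 - p_other**2;
   datalines;
balanced 0.5
two_to_one 0.3333
rare_event 0.01
;
run;

proc print data=work.impurity noobs;
   format p_event p_other entropy gini 6.4;
run;
```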

CHAID and FASTCHAID should both be less sensitive than entropy and Gini to imbalance between the outcome groups. That said, if the imbalance is too large, it may be better practice to oversample the data beforehand.
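
One common way to do this is to keep every row of the smaller group and only a random share of the larger one, so that the smaller group makes up a bigger fraction of the training data (strictly speaking, this undersamples the larger group). A minimal sketch of that idea, assuming a hypothetical data set work.train with a binary variable outcome in which 1 marks the smaller group; the 50% keep rate and the seed are arbitrary choices:

```sas
/* Hypothetical names: work.train, outcome, work.train_balanced */
data work.train_balanced;
   set work.train;

   /* Seed the random number stream once so the sample is repeatable */
   if _n_ = 1 then call streaminit(12345);

   /* Keep every row of the smaller group (assumed to be outcome = 1) */
   if outcome = 1 then output;

   /* Keep roughly half of the larger group */
   else if rand('uniform') < 0.5 then output;
run;
```

With a 2:1 split, keeping about half of the larger group brings the two groups to roughly equal size.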

If you have the time and the setup for it, I would recommend building several decision trees using different growth criteria and then using validation data to determine the best tree.
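
A sketch of that comparison, assuming a hypothetical data set work.train with a binary target outcome and interval inputs x1-x3 (none of these names come from the original question). The macro name and the 30% validation fraction are also just illustrative choices. Each pass fits a tree with a different GROW criterion on the same data; reusing one partition seed is meant to keep the training and validation split the same from run to run.

```sas
%macro compare_grow(criteria=entropy gini chaid fastchaid chisquare);
   %local i crit;
   %do i = 1 %to %sysfunc(countw(&criteria));
      %let crit = %scan(&criteria, &i);

      title "HPSPLIT with GROW &crit";
      proc hpsplit data=work.train;
         class outcome;                 /* add any categorical inputs here */
         model outcome = x1 x2 x3;      /* hypothetical predictors */
         grow &crit;                    /* criterion under comparison */
         prune costcomplexity;
         /* Hold out 30% of the rows for validation */
         partition fraction(validate=0.3 seed=12345);
      run;
   %end;
   title;
%mend compare_grow;

%compare_grow()
```

The fit statistics that each run reports for the validation partition, such as the misclassification rate, can then be compared across criteria.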
