SAS Data Science

JTho

Colleagues:

I am running a decision tree in SAS-EM. I have a sample with 5.2 million records. The prevalence of the EVENT is 3% (156,000 events).

It seems to me that a minimum terminal node sample size of 10, 50, 100, or even 1,000 may be too small. Is anyone aware of a rule (or rule of thumb) for setting a reasonable minimum node size for large datasets?

As always, thank you for any suggestions/advice.

-Josh

DougWielenga
SAS Employee

Josh,

You can get very different-looking trees depending on the settings you choose, and the right settings depend in part on your business objectives. For example, suppose I fit a tree model for a binary target and plan to take a certain action with anyone whose predicted probability of response is greater than 0.6. In this scenario, continuing to split a particular node several times when all of the resulting terminal nodes remain below the threshold does not change your action plan. A child node can have a higher probability than a parent node further up the tree, but if all child nodes are below the threshold in question, the additional splitting does not inform what you do, since the people in the parent node and the child nodes would still be treated the same. This does not mean there is no value in splitting further: doing so might reveal relationships in the data that warrant further investigation, whether or not they affect your eventual action plan.

In like manner, suppose I'm targeting groups of people in the terminal nodes but I need the groups to be of a certain size (say 1,000). Creating splits that generate terminal nodes smaller than some threshold (say 500) is not necessarily helpful, since you aren't planning on taking action on such a small group.
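To make the threshold and group-size reasoning concrete, here is a minimal Python sketch (not SAS Enterprise Miner code). The `Node` class, the `split_changes_action` helper, and the example counts are all hypothetical and made up for illustration; the 0.6 threshold and the 1,000-record minimum are the example values from above.

```python
from dataclasses import dataclass


@dataclass
class Node:
    n: int       # number of records in the node
    events: int  # number of records with the target event

    @property
    def rate(self) -> float:
        return self.events / self.n


def split_changes_action(parent, children, threshold=0.6, min_size=1000):
    """Return True if any child would be treated differently from the parent.

    A node is "actionable" here when its event rate exceeds the threshold
    AND it is large enough to act on. Both rules are illustrative assumptions.
    """
    def actionable(node):
        return node.n >= min_size and node.rate > threshold

    parent_action = actionable(parent)
    return any(actionable(child) != parent_action for child in children)


# Hypothetical counts, chosen only to illustrate the three cases.
parent = Node(n=10_000, events=3_000)               # 30% response
below = [Node(6_000, 1_200), Node(4_000, 1_800)]    # 20% and 45%
mixed = [Node(8_000, 1_600), Node(2_000, 1_400)]    # 20% and 70%
tiny = [Node(9_600, 2_700), Node(400, 300)]         # 75% child has only 400 records

print(split_changes_action(parent, below))  # both children stay under 0.6
print(split_changes_action(parent, mixed))  # one child crosses 0.6 with enough records
print(split_changes_action(parent, tiny))   # high-rate child is too small to act on
```

Only the second split changes who gets treated; the first and third may still be interesting descriptively, but they leave the action plan as it was.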

Think of the Decision Tree node properties as options for building, pruning, and stopping the growth of your tree. At any given point, some of these options may be affecting the splitting while others are not. In the end, it takes very little time to fit a variety of tree models, so you can evaluate each in light of your business objectives and make the best choice for your situation.
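When lining up candidate settings, it can help to see roughly how much information a leaf of each minimum size would carry at the 3% prevalence mentioned in the question. The sketch below is plain Python arithmetic, not a SAS procedure and not a rule; `leaf_size_summary` is a hypothetical helper, and the standard error uses a simple binomial approximation that assumes the leaf's true event rate equals the overall prevalence, which real leaves will not.

```python
import math


def leaf_size_summary(min_leaf_sizes, prevalence):
    """For each candidate minimum leaf size, return a tuple of
    (size, expected events, binomial standard error of the event rate).

    Assumes a leaf whose true event rate equals the overall prevalence.
    """
    rows = []
    for n in min_leaf_sizes:
        expected_events = n * prevalence
        std_error = math.sqrt(prevalence * (1 - prevalence) / n)
        rows.append((n, expected_events, std_error))
    return rows


# Candidate sizes from the question, plus one larger value for comparison.
for n, events, se in leaf_size_summary([10, 50, 100, 1_000, 10_000], prevalence=0.03):
    print(f"min leaf {n:>6}: ~{events:7.1f} expected events, "
          f"std. error of rate ~{se:.4f}")
```

A table like this does not pick the setting for you, but it gives a quick sense of which minimum sizes leave too few events in a leaf to estimate its rate with the precision your action plan needs.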

Hope this helps!

Doug
