JTho

Colleagues:

I am running a decision tree in SAS-EM. I have a sample with 5.2 million records. The prevalence of the EVENT is 3% (156,000 events).

 

It seems to me that a minimum terminal node sample size of 10, 50, 100, even 1,000 may be too small. Is anyone aware of a rule (or rule-of-thumb) for setting a reasonable min node for large datasets? 

 

As always, thank you for any suggestions/advice.

-Josh

DougWielenga
SAS Employee

Josh,

 

You can get very different-looking trees depending on the settings you choose, and the right settings depend in part on your business objectives. For example, suppose I fit a tree model for a binary target and plan to take a certain action with anyone whose probability of response is greater than 0.6. In this scenario, continuing to split a particular node when all of the resulting terminal nodes remain below the threshold does not change your action plan. A child node can have a higher probability than its parent, but if every child node is still below the threshold, the additional splitting does not tell you what to do differently, since the people in the parent node and the child nodes would be treated the same. That does not mean further splitting has no value: it might reveal relationships in the data that warrant further investigation, whether or not they affect your eventual action plan.

In like manner, suppose I'm targeting the groups of people in the terminal nodes but I need the groups to be of a certain size (say 1,000). Creating splits that generate terminal nodes smaller than some threshold (say 500) is then not necessarily helpful, since you aren't planning to take action on such a small group.
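To make the two checks above concrete, here is a minimal Python sketch. It is not SAS Enterprise Miner code; the `Node` class, the function names, and the example node sizes and event rates are all made up for illustration. Only the 0.6 action threshold and the 1,000 group size come from the scenarios described above.

```python
from dataclasses import dataclass


@dataclass
class Node:
    size: int          # number of records in the node (hypothetical values below)
    event_rate: float  # estimated probability of the event in the node


def split_changes_action(parent, children, threshold=0.6):
    """Return True if at least one child would be treated differently from its parent."""
    parent_acts = parent.event_rate > threshold
    return any((child.event_rate > threshold) != parent_acts for child in children)


def large_enough(children, min_size=1000):
    """Return True if every child meets the smallest group size worth acting on."""
    return all(child.size >= min_size for child in children)


# Illustrative numbers only: both splits keep the parent's overall rate of 0.30.
parent = Node(size=4000, event_rate=0.30)
children_below = [Node(2500, 0.24), Node(1500, 0.40)]  # all stay under 0.6
children_cross = [Node(3200, 0.20), Node(800, 0.70)]   # one child crosses 0.6

print(split_changes_action(parent, children_below))  # no child changes the action
print(split_changes_action(parent, children_cross))  # one child changes the action
print(large_enough(children_cross))                  # but that child has only 800 records
```

The second split changes who you would act on, yet the group it isolates falls short of the 1,000-record size, which is exactly the kind of trade-off you would weigh against your objectives.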

 

Think of the Decision Tree node properties as options for building/pruning/stopping the growth of your tree.  At any given point, some of these options might be having an impact on the splitting while others are not.    In the end, it takes very little time to fit a variety of tree models so that you can evaluate each in light of your business objectives to make the best choice for your situation.  
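Before fitting those candidate trees, some rough arithmetic on the numbers in the question (5.2 million records, 3% prevalence) can help narrow the range of minimum leaf sizes worth trying. The Python sketch below is not a SAS rule or an Enterprise Miner feature. It assumes a leaf whose event rate equals the overall prevalence and uses the ordinary binomial standard error; the function name and the list of candidate sizes are my own.

```python
import math

TOTAL_RECORDS = 5_200_000  # sample size from the question
PREVALENCE = 0.03          # overall event rate from the question


def leaf_summary(min_leaf_size, prevalence=PREVALENCE, total=TOTAL_RECORDS):
    """Rough figures for a leaf of exactly min_leaf_size records at the overall prevalence."""
    expected_events = min_leaf_size * prevalence
    # Binomial standard error of the estimated event rate in such a leaf.
    std_error = math.sqrt(prevalence * (1 - prevalence) / min_leaf_size)
    # Upper bound on the number of leaves if every leaf were at the minimum size.
    max_leaves = total // min_leaf_size
    return expected_events, std_error, max_leaves


for size in (10, 50, 100, 1_000, 10_000):
    events, se, leaves = leaf_summary(size)
    print(f"min leaf {size:>6}: ~{events:.1f} expected events, "
          f"SE of rate {se:.4f}, at most {leaves} leaves")
```

A leaf of 10 records expects well under one event at this prevalence, so its estimated rate is mostly noise. How small a standard error you need depends on how finely you have to distinguish leaf probabilities for your action plan.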

 

Hope this helps!

Doug
