PiratDrogowy
Calcite | Level 5

Hi,

In the SAS Enterprise Miner Help I've found how to approximate the CHAID and CART methods using the Decision Tree node, but there is nothing about the C4.5 algorithm.

How can I simulate the C4.5 algorithm using the Decision Tree node?

I would be grateful for any help.

1 ACCEPTED SOLUTION

DougWielenga
SAS Employee

In short, it was a design decision not to provide a specific setting for C4.5, for several reasons. Tree performance can only be hindered by limiting splits on interval inputs to two-way splits. A paper in the early 1990s compared C4.5 using an interval input with C4.5 using that same input discretized into 10 or so values. Trees with the discretized variable were better because of the bias towards categorical (hence multi-way) splits. We also opted to exclude the main C4.5 splitting criterion, Gain Ratio, which is the ENTROPY reduction divided by another factor (the entropy of the split itself) in an attempt to avoid generating too many branches from a categorical input.
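For readers who have not seen Gain Ratio written out, here is a minimal Python sketch of the criterion as it is usually defined: information gain divided by the split information. The function names and the toy data are made up for illustration. This is not SAS code and not what Enterprise Miner computes.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (base 2) of a list of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def gain_ratio(feature, labels):
    """Information gain of a multiway split on `feature`, divided by split information."""
    n = len(labels)
    # Group the target labels by feature value: one branch per value.
    branches = {}
    for x, y in zip(feature, labels):
        branches.setdefault(x, []).append(y)
    # Entropy reduction achieved by the split.
    gain = entropy(labels) - sum(len(b) / n * entropy(b) for b in branches.values())
    # Split information: entropy of the branch sizes, which grows with the branch count.
    split_info = entropy(feature)
    return gain / split_info if split_info > 0 else 0.0


# Hypothetical toy data: both inputs separate the target perfectly,
# but the four-way split is penalized for using more branches.
labels = ["yes", "yes", "no", "no"]
two_way = ["a", "a", "b", "b"]
four_way = ["a", "b", "c", "d"]
print(gain_ratio(two_way, labels), gain_ratio(four_way, labels))
```

On this toy data both splits have the same information gain, yet the four-way split gets half the Gain Ratio of the two-way split. That is the penalty on many-branch categorical splits described above.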

Regarding settings in Enterprise Miner: 

   + Split search:  EXHAUSTIVE only kicks in when x and y both have more than 2 nominal categories.  In that case, C4.5 makes a multiway split: it starts with one branch for each x value and then merges branches using Gain Ratio.  Reducing the number of branches works differently with the C4.5 Gain Ratio than with the CHAID approach.  I don't think setting EXHAUSTIVE to anything special will help the comparison.

   + Node Sample:  PERFORMANCE NODESAMPLE=ALL;   We are strongly considering getting rid of the NODESAMPLE option at some point in the future. 

   + Subtree:  Use the best assessed subtree for ASE or Misclassification.  The C4.5 author calls this 'Error based pruning'.  I believe the C4.5 default is 'pessimistic pruning', which SAS does not offer.

   + P-values adjustment:  C4.5 does not use a criterion with p-values.
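To give a feel for what a pessimistic error estimate looks like, here is a small Python sketch of the normal-approximation upper bound that textbooks commonly use to describe C4.5-style pruning. The function name is hypothetical, and z = 0.69 is assumed to correspond to the 25% confidence level usually quoted as the C4.5 default. The original C4.5 code may compute its bound differently, and none of this is available as an Enterprise Miner setting.

```python
import math


def pessimistic_error(errors, n, z=0.69):
    """Upper confidence bound on a leaf's error rate (normal approximation).

    errors: misclassified training cases in the leaf
    n:      total training cases in the leaf
    z:      normal deviate; 0.69 is assumed here for a 25% confidence level
    """
    f = errors / n  # observed (optimistic) training error rate
    z2 = z * z
    numerator = f + z2 / (2 * n) + z * math.sqrt(f / n - f * f / n + z2 / (4 * n * n))
    return numerator / (1 + z2 / n)


# A leaf with 2 errors out of 6 cases: the estimate is pushed above the observed 0.33.
print(round(pessimistic_error(2, 6), 2))
```

A subtree would be replaced by a leaf when the leaf's pessimistic estimate is no worse than the combined estimate of its branches. Small leaves are penalized most, because the bound widens as n shrinks.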

In short, there are several areas of concern with the C4.5 approach, which is why it is not fully represented in SAS Enterprise Miner.  I hope this is helpful.

 

Cordially,
Doug


