Building models with SAS Enterprise Miner, SAS Factory Miner, SAS Visual Data Mining and Machine Learning or just with programming

How to approximate C4.5 algorithm in SAS EM 6.2 with the Decision Tree node?

Hi,

In the SAS Enterprise Miner Help I've found how to approximate the CHAID and CART methods using the Decision Tree node, but there is nothing about the C4.5 algorithm.

How can I simulate the C4.5 algorithm using the Decision Tree node?

I would be grateful for any help.


Accepted Solutions
Solution
a week ago
SAS Employee
Posts: 121

Re: How to approximate C4.5 algorithm in SAS EM 6.2 with the Decision Tree node?

In short, it was a design decision to avoid having a specific setting for C4.5, for several reasons. Tree performance can only be hindered by limiting splits on interval inputs to two-way splits. A paper in the early 1990s compared C4.5 using an interval input with C4.5 using that same input discretized into 10 or so values. Trees with the discretized variable were better because of the bias towards categorical (hence multi-way) splits. We also opted to exclude the main C4.5 splitting criterion, Gain Ratio, which is the ENTROPY reduction divided by another factor (the split information) in an attempt to avoid generating too many branches from a categorical input.
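To make the Gain Ratio idea concrete, here is a minimal Python sketch (not SAS code, and not something Enterprise Miner runs). It assumes the textbook definition: the entropy reduction (information gain) of a split divided by the split information, which is the entropy of the branch sizes. The function names and the toy data are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

def entropy(items):
    """Shannon entropy (base 2) of a list of discrete values."""
    n = len(items)
    return -sum((c / n) * math.log2(c / n) for c in Counter(items).values())

def gain_and_gain_ratio(values, labels):
    """Return (information gain, gain ratio) for splitting on `values`,
    with one branch per distinct value."""
    n = len(labels)
    branches = {}
    for v, y in zip(values, labels):
        branches.setdefault(v, []).append(y)
    # Weighted entropy of the target after the split.
    remainder = sum(len(ys) / n * entropy(ys) for ys in branches.values())
    info_gain = entropy(labels) - remainder
    # Split information: entropy of the branch sizes (the penalty factor).
    split_info = entropy(values)
    ratio = info_gain / split_info if split_info > 0 else 0.0
    return info_gain, ratio

# Toy data (hypothetical): a balanced binary target.
labels = ["yes", "yes", "yes", "no", "no", "no", "no", "yes"]
two_way = ["a", "a", "a", "a", "b", "b", "b", "b"]   # 2 branches
many_way = list(range(8))                            # 8 branches, one row each

ig_two, gr_two = gain_and_gain_ratio(two_way, labels)
ig_many, gr_many = gain_and_gain_ratio(many_way, labels)
print(f"two-way : gain={ig_two:.4f}  gain ratio={gr_two:.4f}")
print(f"8-way   : gain={ig_many:.4f}  gain ratio={gr_many:.4f}")
```

On this toy data the raw entropy reduction strongly favors the 8-way split because every branch is pure, while dividing by the split information shrinks that advantage. This is the kind of bias towards multi-way splits that the penalty factor is meant to counteract.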

Regarding settings in Enterprise Miner: 

   + Split search:  EXHAUSTIVE only kicks in when x and y both have more than 2 nominal categories.  In that case, C4.5 makes a multiway split: it starts with one branch for each x value and then merges branches using Gain Ratio.  Reducing the number of branches works differently with the C4.5 Gain Ratio than with the CHAID approach.  I don't think setting EXHAUSTIVE to anything special will help the comparison.

   + Node Sample:  PERFORMANCE NODESAMPLE=ALL;   We are strongly considering getting rid of the NODESAMPLE option at some point in the future. 

   + Subtree:  Use the best assessed subtree for ASE or Misclassification.  The C4.5 author calls this 'Error based pruning'.  I believe the C4.5 default is 'pessimistic pruning', which SAS does not offer.

   + P-values adjustment:  C4.5 does not use a criterion with p-values.

To sum up, there are several areas of concern with the C4.5 approach, which is why it is not fully represented in SAS Enterprise Miner.  I hope this is helpful.

 

Cordially,
Doug


☑ This topic is solved.
