YG1992
Obsidian | Level 7

Hello everyone,

 

I am trying to use a SAS Code node with proc hpsplit to tune the hyperparameters of decision trees in SAS Enterprise Miner. This works, and my code so far is as follows:

 

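%macro DTStudy (maxbranch=2, maxdepth=5, minleafsize=20);

  /* Keep loop counters and work variables local to the macro */
  %local i j k branchTries depthTries leafsizeTries thisBranch thisDepth thisLeafsize;
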
  %let branchTries = %sysfunc(countw(&maxbranch.));
  %let depthTries = %sysfunc(countw(&maxdepth.));
  %let leafsizeTries = %sysfunc(countw(&minleafsize.));

  %do i = 1 %to &branchTries.;
    %do j = 1 %to &depthTries.;
      %do k = 1 %to &leafsizeTries.;

      %let thisBranch = %sysfunc(scan(&maxbranch.,&i));
      %let thisDepth = %sysfunc(scan(&maxdepth.,&j));
      %let thisLeafsize = %sysfunc(scan(&minleafsize.,&k));

       proc hpsplit data=&em_import_data
         maxbranch=&thisBranch. maxdepth=&thisDepth. nsurrogates=4 minleafsize=&thisLeafsize. mincatsize=2 assignmissing = similar;
         input %EM_INTERVAL_INPUT /level=interval;
         target %EM_TARGET / level=nominal;
         criterion gini;
         /*prune misc / N <= 6;*/
         partition fraction(validate=0.3 seed=12345);
         prune ase;
         score out=score_b&thisBranch._d&thisDepth._l&thisLeafsize.;
       run;

    /* Add the value of branchToTry/depthToTry/leafsizeToTry for these fit stats */    

       data score_b&thisBranch._d&thisDepth._l&thisLeafsize.;
         length branchToTry $ 8;
         length depthToTry $ 8;
         length leafsizeToTry $ 8;

         set score_b&thisBranch._d&thisDepth._l&thisLeafsize.;
         branchToTry = "&thisBranch.";
         depthToTry = "&thisDepth.";
         leafsizeToTry = "&thisLeafsize.";

       run;

    /* Append to the single cumulative fit statistics table */   
        
       proc append 
         base=fitStats data=score_b&thisBranch._d&thisDepth._l&thisLeafsize.;
       run;

      %end;
    %end;
  %end;
%mend DTStudy;

%DTStudy(maxbranch=2 3 4 5, maxdepth=5 10 12 15, minleafsize=10 20 30 40);

%em_register(type=Data,key=fitStats);

data &em_user_fitStats;

    set fitStats;

run;

%em_report(viewType=data,key=fitStats,autodisplay=y);

My question is: is there an option that sets the minimum number of observations a node must contain before it is considered for further splitting? For example, the Decision Tree node has a "Split Size" option for this. As far as I know, it is a relatively important hyperparameter when the sample size is large (which is my situation).

 

I would really appreciate it if anyone could give me some advice and you are also welcome to discuss this problem with me. Thank you very much.

Accepted Solutions
MikeStockstill
SAS Employee

Unfortunately, the procedure does not have such an option.  However, to a degree, you are getting a similar end result by using the MINLEAFSIZE option that you cited.  The parent (assuming that it is not the root) cannot exist unless it contains at least MINLEAFSIZE observations, and if it does exist, then it cannot be split unless each of its children contains at least MINLEAFSIZE observations.  So in that sense, the size of the parent node is bounded from below before a split is eligible.
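
As a rough illustration of that bound: if every child must hold at least MINLEAFSIZE observations, a node can only be split into k children when it holds at least k x MINLEAFSIZE observations. The sketch below simply tabulates that product for the MINLEAFSIZE values in the %DTStudy call from the question. The data set name implied_split_size and the variables branches and min_parent_size are made up for illustration, and the result is only the implied lower bound, not a setting the procedure exposes.

```sas
/* Illustration only: implied lower bound on the size of a node  */
/* that is eligible to be split, given MINLEAFSIZE.              */
/* Data set and variable names are hypothetical.                 */
data implied_split_size;
  do minleafsize = 10, 20, 30, 40;   /* values from the %DTStudy call */
    do branches = 2 to 5;            /* number of children in the split */
      /* each child needs at least minleafsize observations */
      min_parent_size = branches * minleafsize;
      output;
    end;
  end;
run;

proc print data=implied_split_size noobs;
run;
```

So with MINLEAFSIZE=20, for example, a binary split already requires a parent of at least 40 observations.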

 

If that functionality is important for you to have in a future release, then please visit the online SASware Ballot page at http://support.sas.com/ballot to suggest your ideas.

 

Have a good new year! 

YG1992
Obsidian | Level 7

Thanks for your quick reply! In fact I think what you said about MINLEAFSIZE is reasonable.

By the way, I would like to mention other situations which happened during testing:

(1) When I use the HP Tree node in SAS EM with the same maxbranch, maxdepth, minleafsize and mincategoricalsize settings as the Decision Tree node, the former always performs somewhat worse than the latter. I cannot find the reason.

 

(2) When I use a SAS Code node with proc hpsplit to generate trees, the results always differ from those I get by using the HP Tree node directly, even with the same settings. For example,

      proc hpsplit data=&em_import_data
         maxbranch=5 maxdepth=10 nsurrogates=4 minleafsize=20 mincatsize=5 assignmissing = similar;
         input %EM_INTERVAL_INPUT /level=interval;
         target %EM_TARGET / level=nominal;
         criterion gini;
         partition fraction(validate=0.3 seed=12345);
         prune ase;
         score out=score_b5_d5_l10_&thisCat.;
       run;

In the end I got a tree with a validation AUC of around 0.645. But with the same hyperparameter settings, prune option and partition seed as above, I get a tree with a validation AUC of 0.705 through the HP Tree node. This is the most confusing part to me, since the HP Tree node uses the same proc hpsplit internally.

 

 

I hope that you can give me some hints. Thanks very much.

MikeStockstill
SAS Employee

It is possible that the HP Tree node is using a different setting, or different data.  Try these steps to see whether more evidence is displayed.

 

1) Add this statement to the Enterprise Miner Project Start Code:

 

     options mprint symbolgen;

   

    Then click Run Now, and then OK.

 

2) Add a new HP Tree node to your diagram, and run it.

 

3) Add a new SAS Code node to your diagram, add your procedure code, and run it.

 

 

The node log in the new HP Tree node has additional code details that might show you more specifics about the HPSPLIT invocation that it uses, so that you can compare that invocation to yours.  Be sure to compare the %EM_INTERVAL_INPUT and %EM_TARGET resolutions to verify that there are no variable-name differences somehow.  Check to make sure that the same data is used.
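
For the SAS Code node side of that comparison, a minimal sketch along these lines writes the resolved values to the log so that they can be set next to what the HP Tree node log shows. It uses only the macro variable and macros that already appear in the code earlier in this thread, and it assumes that %EM_INTERVAL_INPUT and %EM_TARGET resolve to plain variable lists, as they do in the INPUT and TARGET statements. The NOTE text and the macro variable name nobs are made up for illustration.

```sas
/* Sketch for a SAS Code node: write what the node actually resolves to the log. */
%put NOTE: Import data     = &em_import_data;
%put NOTE: Interval inputs = %EM_INTERVAL_INPUT;
%put NOTE: Target          = %EM_TARGET;

/* Row count of the data set that the procedure will read. */
/* nobs is a hypothetical macro variable name.             */
proc sql noprint;
  select count(*) into :nobs trimmed
  from &em_import_data;
quit;

%put NOTE: Rows in import data = &nobs;
```

If the data set name, the variable lists, or the row count differ from what the HP Tree node log reports, that difference is the first thing to chase.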

 

Often a comparison like the one above surfaces some difference that was previously overlooked.  What to suggest next depends on what you find up to that point.

 

 
