BenjaminD
Calcite | Level 5

For Random Forest in SAS, what percentage of test data is used for the out-of-bag data?

How does Random Forest in SAS handle missing values?

Does a macro exist that will sweep through the parameters used in PROC HPFOREST?

Thank you,

Ben DeKoven


2 REPLIES
BrettWujek
SAS Employee

Hey Ben - a few years ago I posted a tip about studying the hyperparameters of random forests and SVMs; it includes some macro code you may find useful for your own studies. If you have SAS Viya, and SAS Visual Data Mining and Machine Learning in particular, you have access to a much better built-in mechanism for tuning hyperparameters, called autotuning, which uses optimization techniques to drive the exploration of model configurations.
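The macro from that tip is not reproduced in this thread. As a rough sketch of what a simple sweep over one PROC HPFOREST option could look like, the macro below loops over a list of VARS_TO_TRY= values and saves the fit statistics of each run. The macro name `sweep_hpforest`, its parameter, the SAMPSIO.HMEQ sample data, the chosen inputs, and the `fit_` output data set names are all assumptions made for illustration.

```sas
/* Hypothetical sweep macro; NOT the macro from the tip mentioned above. */
/* Assumes the SAMPSIO.HMEQ sample data set is available.                */
%macro sweep_hpforest(vars_to_try_list=2 4 6);
   %local i v;
   %let i = 1;
   /* Use a blank as the only delimiter when scanning the list */
   %let v = %scan(&vars_to_try_list, &i, %str( ));
   %do %while (%length(&v) > 0);
      proc hpforest data=sampsio.hmeq maxtrees=100 seed=12345
                    vars_to_try=&v;
         target bad / level=binary;
         input loan mortdue value yoj clage / level=interval;
         input reason job / level=nominal;
         /* Keep each run's fit statistics in its own data set */
         ods output fitstatistics=fit_&i;
      run;
      %let i = %eval(&i + 1);
      %let v = %scan(&vars_to_try_list, &i, %str( ));
   %end;
%mend sweep_hpforest;

%sweep_hpforest(vars_to_try_list=2 4 6)
```

The same loop pattern could be nested to sweep more than one option, at the cost of one forest fit per combination.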

 

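You can control the in-bag fraction (and thus, indirectly, the out-of-bag fraction) using the INBAGFRACTION= option of PROC HPFOREST, as answered in this post. The out-of-bag observations for a tree are the training observations left out of that tree's in-bag sample, so they come from the training data rather than from a separate test set.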
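As a minimal sketch of setting that option, the call below asks each tree to train on 70% of the training observations, which leaves roughly the remaining 30% out of bag for that tree. The SAMPSIO.HMEQ sample data, the inputs, and the specific values of MAXTREES=, SEED=, and INBAGFRACTION= are assumptions chosen only for illustration.

```sas
/* Sketch only: data set, variables, and option values are assumed. */
proc hpforest data=sampsio.hmeq maxtrees=100 seed=12345
              inbagfraction=0.7;
   target bad / level=binary;
   input loan mortdue value / level=interval;
   input reason job / level=nominal;
run;
```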

 

For missing value handling, see the following options from the PROC HPFOREST documentation. The "Handling Missing Values" section of the documentation gives a complete explanation of how missing values are handled.

 

MINCATSIZE=n

specifies the minimum number of observations that a given nominal input category must have in order to use the category in a split search. Categorical values that appear in fewer than n observations are handled as if they were missing. The categories that occur in fewer than n observations are merged into the pseudo category for missing values for the purpose of finding a split. The policy for assigning such observations to a branch is the same as the policy for assigning missing values to a branch. The default value of n is 5.

MINUSEINSEARCH=n

specifies a threshold for utilizing missing values in the split search when MISSING=USEINSEARCH is specified as the missing value policy. If the number of observations in which the splitting variable has missing values in a node is greater than or equal to n, then PROC HPFOREST initiates the USEINSEARCH policy for missing values. See the section Handling Missing Values for a more complete explanation. The default value of n is 1.

MISSING=USEINSEARCH | BIGBRANCH

specifies how the training procedure handles an observation with missing values. If MISSING=USEINSEARCH and the number of training observations in the node is more than n, where n is the value of the MINUSEINSEARCH= option, then the missing value is used as a separate, legitimate value in the test of association and the split search. If MISSING=BIGBRANCH, observations with a missing value of the candidate variable are omitted from the test of association and split search in that node. A splitting rule will assign such an observation to the branch containing the most observations among those used in the split search. See the section Handling Missing Values for a more complete explanation. By default, MISSING=USEINSEARCH.
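To show where these three options go, here is a hedged sketch that sets them on the PROC HPFOREST statement. The SAMPSIO.HMEQ sample data, the inputs, and the values 10 and 20 are assumptions for illustration, not recommendations.

```sas
/* Sketch only: data set, variables, and option values are assumed. */
proc hpforest data=sampsio.hmeq maxtrees=100 seed=12345
              missing=useinsearch   /* treat missing as its own value in the split search */
              minuseinsearch=10     /* threshold n for initiating USEINSEARCH in a node   */
              mincatsize=20;        /* rarer nominal categories are handled as missing    */
   target bad / level=binary;
   input loan mortdue value / level=interval;
   input reason job / level=nominal;
run;
```

Replacing MISSING=USEINSEARCH with MISSING=BIGBRANCH in the same call would switch to the other policy described above.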

 

 

Hope this helps.

Brett

 

 



BenjaminD
Calcite | Level 5

Hello Brett,

 

This is very helpful information.

 

Thank you,

Ben DeKoven
