Random Forest

BenjaminD — Tue, 09 Oct 2018 14:09:38 GMT

For Random Forest in SAS, what percentage of the training data is used for the out-of-bag data?

How does Random Forest in SAS handle missing values?

Does a macro exist that will sweep through the parameters used in PROC HPFOREST?

Thank you,

Ben DeKoven

Re: Random Forest

BrettWujek — Tue, 09 Oct 2018 16:49:25 GMT

Hey Ben, a few years ago I posted a tip about studying the hyperparameters of random forests and SVMs. It includes some macro code you may find useful for your own studies. If you have SAS Viya, and SAS Visual Data Mining and Machine Learning in particular, you have access to a much better built-in mechanism for tuning hyperparameters, called autotuning, which uses optimization techniques to drive the exploration of model configurations.
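As a rough idea of what such a sweep can look like (this is a minimal sketch, not the macro from the tip mentioned above), the loop below refits the forest once per candidate MAXTREES value and saves each fit-statistics table. The macro name sweep_maxtrees, the candidate values, the output data set names fit_<n>, and the use of the SAMPSIO.HMEQ sample data are all assumptions made for illustration.

```sas
/* Hypothetical sweep macro: one PROC HPFOREST run per MAXTREES value. */
%macro sweep_maxtrees(values=50 100 200);
   %local i n;
   %do i = 1 %to %sysfunc(countw(&values, %str( )));
      /* Pick the i-th candidate, using a blank as the only delimiter. */
      %let n = %scan(&values, &i, %str( ));

      proc hpforest data=sampsio.hmeq maxtrees=&n seed=12345;
         target bad / level=binary;
         input loan mortdue value yoj derog delinq clage ninq clno debtinc
               / level=interval;
         input reason job / level=nominal;
         /* Keep the fit statistics (including out-of-bag error) per run. */
         ods output FitStatistics=fit_&n;
      run;
   %end;
%mend sweep_maxtrees;

%sweep_maxtrees(values=50 100 200);
```

The same pattern should work for other options by swapping MAXTREES= for the option you want to study. If the candidate values contain decimal points, keep the explicit blank delimiter in %SCAN, since a period is one of its default delimiters.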

You can control the in-bag fraction (and thus, indirectly, the out-of-bag fraction) using the INBAGFRACTION= option for HPFOREST, as answered in this post.

For missing value handling, check out these options in the doc. A complete explanation of how missing values are handled is found here.
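For example, here is a minimal sketch of setting the in-bag fraction explicitly. The value 0.6, the seed, and the SAMPSIO.HMEQ sample data are illustrative assumptions. Assuming each tree's in-bag sample is drawn without replacement, a fraction f in the bag leaves roughly 1 - f of the training observations out of bag for that tree, so 0.6 here would leave about 40% out of bag.

```sas
/* Illustrative values only: 60% of training observations in the bag */
/* for each tree, so roughly 40% are out of bag for that tree.       */
proc hpforest data=sampsio.hmeq maxtrees=100 inbagfraction=0.6 seed=12345;
   target bad / level=binary;
   input loan mortdue value yoj derog delinq clage ninq clno debtinc
         / level=interval;
   input reason job / level=nominal;
run;
```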

MINCATSIZE=n

specifies the minimum number of observations that a given nominal input category must have in order to use the category in a split search. Categorical values that appear in fewer than n observations are handled as if they were missing. The categories that occur in fewer than n observations are merged into the pseudo category for missing values for the purpose of finding a split. The policy for assigning such observations to a branch is the same as the policy for assigning missing values to a branch. The default value of n is 5.

MINUSEINSEARCH=n

specifies a threshold for utilizing missing values in the split search when MISSING=USEINSEARCH is specified as the missing value policy. If the number of observations in which the splitting variable has missing values in a node is greater than or equal to n, then PROC HPFOREST initiates the USEINSEARCH policy for missing values. See the section Handling Missing Values for a more complete explanation. The default value of n is 1.

MISSING=USEINSEARCH | BIGBRANCH

specifies how the training procedure handles an observation with missing values. If MISSING=USEINSEARCH and the number of training observations in the node is more than n, where n is the value of the MINUSEINSEARCH= option, then the missing value is used as a separate, legitimate value in the test of association and the split search. If MISSING=BIGBRANCH, observations with a missing value of the candidate variable are omitted from the test of association and split search in that node. A splitting rule will assign such an observation to the branch containing the most observations among those used in the split search. See the section Handling Missing Values for a more complete explanation. By default, MISSING=USEINSEARCH.
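To show where these three options go, here is a minimal sketch that sets all of them on the PROC HPFOREST statement. The threshold values 10 and 20 and the SAMPSIO.HMEQ sample data are assumptions chosen only for illustration, not recommendations.

```sas
/* Illustrative settings for the missing value options described above. */
proc hpforest data=sampsio.hmeq maxtrees=100 seed=12345
              missing=useinsearch  /* treat missing as its own value in the split search */
              minuseinsearch=20    /* threshold n for applying the USEINSEARCH policy    */
              mincatsize=10;       /* nominal levels rarer than this are handled as missing */
   target bad / level=binary;
   input loan mortdue value yoj derog delinq clage ninq clno debtinc
         / level=interval;
   input reason job / level=nominal;
run;
```

Switching to MISSING=BIGBRANCH instead would leave observations with a missing candidate variable out of the split search and send them to the largest branch, as described above; MINUSEINSEARCH= applies only when MISSING=USEINSEARCH.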

Hope this helps.

Brett

Re: Random Forest

BenjaminD — Tue, 09 Oct 2018 17:41:38 GMT

Hello Brett,

This is very helpful information.

Thank you,

Ben DeKoven
