03-06-2017 04:28 PM
Hello, I'm running the following steps across multiple independent iterations for testing purposes: (1) run HPForest against the same input dataset with the same parameters, including the SEED value, (2) save the binary score code, and (3) score the same target dataset with that binary score code. In this case the target variable is categorical. I notice that the predicted class probabilities are almost always different across runs, and that the final class prediction, based on the maximum class probability, can occasionally differ as well. I would have expected identical results each time, given that the input dataset, all parameters, and the random seed are fixed. I'm running 9.04.01M3P062415 on WX64_SV. Has anyone else seen this behavior? Is this expected? Thanks!
03-13-2017 02:21 PM
Yes, I would expect different results. The random forest is an ensemble model of many decision trees. Each tree is built on a randomly selected subset of observations (rows) AND at each node, only a randomly selected subset of variables is available for splitting.
03-13-2017 02:34 PM
If the observations are being read in parallel, then the order of observations in memory is somewhat randomized: whichever thread happens to be fastest at the moment gets its block of observations placed first in HPFOREST memory. The Out-Of-Bag sample is chosen by in-memory observation number, and different Out-Of-Bag samples produce different trees. To check this theory, you can set THREADS=1 on the PERFORMANCE statement.
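A minimal sketch of that test follows. The dataset and variable names (work.train, y, x1-x10) are hypothetical placeholders; SEED= on the PROC statement and THREADS= on the PERFORMANCE statement are the options being exercised:

```sas
/* Sketch: force single-threaded data reading so the in-memory row
   order -- and therefore the Out-Of-Bag sampling -- is deterministic.
   Dataset and variable names below are hypothetical. */
proc hpforest data=work.train seed=12345;
   target y / level=nominal;       /* categorical target, as in the question */
   input x1-x10 / level=interval;
   performance threads=1;          /* one thread => stable observation order */
run;
```

If two runs of this single-threaded version produce identical probabilities, that points to parallel row ordering as the source of the variation.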
When HPFOREST runs on a cluster of machines (MPP mode, that is), reproducibility is foiled by the system that assigns machine numbers to HPFOREST nondeterministically. HPFOREST folds the machine number into some random choices so that different machines do not make identical decisions.
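If the grid is not actually needed for the test, one way to rule out the machine-number effect is to force single-machine (SMP) execution. NODES=0 on the PERFORMANCE statement is an assumption here; check the PERFORMANCE statement documentation for your release. Dataset and variable names are again hypothetical:

```sas
/* Sketch: run locally (SMP) to avoid the machine-number randomness
   described above. NODES=0 is an assumption -- verify against the
   PERFORMANCE statement documentation for your SAS release. */
proc hpforest data=work.train seed=12345;
   target y / level=nominal;
   input x1-x10 / level=interval;
   performance nodes=0 threads=1;  /* single machine, single thread */
run;
```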