Hi,
"Surrogate Rule=" interacts with Missing Value option in either DT or the GB node. When surrogate rule =0 (default in the GB node), Missing value's default kicks in, which is "Use in Search" for both DT and the GB nodes. "Use in Search" does not mean carrying the missing value forward. Instead, it imputes it, implicitly and legitimately, with rules like "grouping it with the branch that maximizes the worth of split". Now, when you set surrogate rule=2, in both DT and GB nodes, you tell EM to override the "Missing Value" default (="Use in Search"). When surrogating kicks in (data driven, no guarantee it will), AND it cannot find non-missing surrogates up to 2 levels, you get a missing value assigned to the non-leaf node (in the 'middle stream'). That missing status in DT does not create problem. If the DT gives you a terminal node that has missing value in it, so what? In (most) GB cases where max branches are relatively large (assuming other stopping rules do not stop spliting prematurely), say 100, a non-leaf node born at, say, 7th branch/level, carrying missing value, may get great chance to be re-surrogated-2 successfully at depth 78 for example (although carrying missing values down the path from branch 7 all the way down to 78 may very well be deemed as analytically unacceptable). However, the chance (see you only give us max depth=10) is the missing value gets to the very bottom layer of the trees.
Again, this does not pose a problem for DT. However, GB needs to build a loss function to re-iterate. How are we supposed to build a loss function with a missing input? We don't put rules in the GB engine to tell it to drop the observation because its loss is missing (that would be a sample violation). So the GB halts without a model. The software developer, however, cannot flag it as an error, because from the design and product quality perspectives there is no error: you told it to surrogate 2. Surrogating is best applied when the tree is deep, where the surrogates are most similar to the missing spot and give better info quality, or when you have a great 'visual' of the surroundings of the missing value. Surrogate rule= is retained today due to historical usage of GB. Historically GB was often used for universes with transparent/small research data, not the industrial or noisy data we often have today. --Jason Xin
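A small Python sketch of why a missing value is fatal for the loss function. This is generic arithmetic under my own assumptions (squared error loss, made-up numbers), not the GB node's internals: a single missing score makes that observation's residual missing, and the aggregate loss becomes missing with it.

```python
import math


def mean_squared_error(y_true, y_pred):
    """Average squared error; any NaN in the inputs propagates to the result."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical scores from one boosting iteration; the second observation
# reached the bottom of the tree still carrying a missing value.
y_true = [1.0, 0.0, 1.0]
y_pred = [0.8, float("nan"), 0.6]

residuals = [t - p for t, p in zip(y_true, y_pred)]
loss = mean_squared_error(y_true, y_pred)

print("residual for obs 2 is missing:", math.isnan(residuals[1]))
print("overall loss is missing:", math.isnan(loss))
```

With a missing loss there is nothing for the next iteration to fit, and dropping the observation is the one thing the engine is not allowed to do on its own.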