Tree-based Imputation in SAS Model Studio


For the Stable 2021.1.4 release (August 2021) and the LTS 2021.2 release (November 2021) of SAS Model Studio, tree-based imputation has been added to the Imputation node.

In the tree-based imputation method, imputation of missing values for an input variable, such as variable x1, is accomplished by training a decision tree that uses all other input variables in the data to predict the value of x1. The model-predicted value is then the imputed value for x1. For an interval variable, a regression tree is trained. For a class variable, a classification tree is trained.
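The idea can be sketched outside of SAS in a few lines of Python. This is only an illustration of the concept, not the algorithm Model Studio runs: it grows a depth-1 regression tree (a single split) for an interval variable, whereas the Imputation node grows and prunes deeper trees. The data values, the function names, and the IMP_ column prefix are assumptions made for the example.

```python
from statistics import mean


def fit_stump(rows, target):
    """Fit a depth-1 regression tree for `target` using every other column."""
    predictors = [name for name in rows[0] if name != target]
    best = None  # (sse, predictor, cut, left_mean, right_mean)
    for name in predictors:
        values = sorted({row[name] for row in rows})
        # Candidate cut points are midpoints between adjacent distinct values.
        for low, high in zip(values, values[1:]):
            cut = (low + high) / 2
            left = [row[target] for row in rows if row[name] <= cut]
            right = [row[target] for row in rows if row[name] > cut]
            left_mean, right_mean = mean(left), mean(right)
            sse = sum((v - left_mean) ** 2 for v in left)
            sse += sum((v - right_mean) ** 2 for v in right)
            if best is None or sse < best[0]:
                best = (sse, name, cut, left_mean, right_mean)
    # Assumes at least one predictor has two or more distinct values.
    _, name, cut, left_mean, right_mean = best
    return lambda row: left_mean if row[name] <= cut else right_mean


def impute_with_tree(rows, target):
    """Train on rows where `target` is known, then predict where it is missing."""
    complete = [row for row in rows if row[target] is not None]
    predict = fit_stump(complete, target)
    new_column = "IMP_" + target  # illustrative name for the generated column
    return [
        {**row, new_column: row[target] if row[target] is not None else predict(row)}
        for row in rows
    ]


# Hypothetical data: x1 has two missing values; x2 and x3 are the other inputs.
data = [
    {"x1": 10.0, "x2": 1.0, "x3": 5.0},
    {"x1": 12.0, "x2": 2.0, "x3": 1.0},
    {"x1": 30.0, "x2": 8.0, "x3": 4.0},
    {"x1": 32.0, "x2": 9.0, "x3": 2.0},
    {"x1": None, "x2": 1.5, "x3": 3.0},
    {"x1": None, "x2": 8.5, "x3": 3.0},
]
imputed = impute_with_tree(data, "x1")
```

In this toy data the best single split is on x2, so each missing x1 receives the mean x1 of the rows on its side of that split, and the original x1 column is left untouched.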

In a Model Studio project where the Imputation node has been added to a pipeline, you can specify the default imputation method for Class Inputs and Interval Inputs. In each of the two Default method selectors, choose “Decision tree” to apply tree-based imputation to all class inputs, all interval inputs, or both.

A handful of options, located under the Decision Tree Options group, is available to control the splitting and pruning of the decision trees. The first sub-group of these, Splitting Options, controls the splitting of the trees.

Splitting options:

Classification tree splitting criterion (Class inputs). Default value: Chi-square.
Possible values:
- CHAID
- Chi-square
- Entropy
- Gini
- Information gain ratio
Regression tree splitting criterion (Interval inputs). Default value: F test.
Possible values:
- CHAID
- F test
- Variance
Bonferroni (Bonferroni correction). Default: Not selected.
Maximum number of branches. Default value: 2
Maximum depth. Default value: 5
Minimum leaf size. Default value: 5
Missing values (how missing values are handled). Default value: Use in search.
Possible values:
- Largest branch
- Most correlated branch
- Separate branch
- Use in search

The second sub-group, Pruning Options, controls the pruning of the trees.

Pruning options:

Subtree method (pruning method). Default value: Cost complexity.
Possible values:
- Cost complexity
- None (Pruning is not performed)
- Reduced error
Create validation from training data. Default: Selected.
Validation proportion (proportion of training data). Default value: 0.3

Pruning is performed when the Subtree method is set to “Cost complexity” or “Reduced error”. When the option Create validation from training data is selected, a portion of the training data is held out for pruning. With the default Validation proportion of 0.3, 70% of the training data is used for training the trees and 30% is used for pruning them. When the option is deselected, the validation data is used for pruning, provided the input data is partitioned to include validation data. If the data is not partitioned, pruning is disabled because no validation data is available for pruning the trees. Note: Creating validation from the training data for pruning is recommended, even if the data is partitioned, so that the validation data is reserved for the Supervised Learning nodes.
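The rules in the preceding paragraph can be summarized as a small decision function. The Python sketch below is only a restatement of those rules for illustration; the function name, argument names, and returned labels are hypothetical, and the row counts are approximate because the actual sampling is done by the software.

```python
def pruning_data_plan(n_train, create_validation=True,
                      validation_proportion=0.3,
                      has_validation_partition=False):
    """Describe which rows grow the trees and which rows prune them."""
    if create_validation:
        # Hold out a share of the training rows for pruning.
        n_prune = round(n_train * validation_proportion)
        return {"grow_rows": n_train - n_prune,
                "prune_source": "training holdout",
                "pruning_enabled": True}
    if has_validation_partition:
        # All training rows grow the trees; the validation partition prunes them.
        return {"grow_rows": n_train,
                "prune_source": "validation partition",
                "pruning_enabled": True}
    # No holdout and no validation partition: nothing to prune with.
    return {"grow_rows": n_train,
            "prune_source": None,
            "pruning_enabled": False}


default_plan = pruning_data_plan(1000)
partition_plan = pruning_data_plan(1000, create_validation=False,
                                   has_validation_partition=True)
no_prune_plan = pruning_data_plan(1000, create_validation=False)
```

With the defaults and 1,000 training rows, about 700 rows grow the trees and about 300 are held out for pruning.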

After running an Imputation node in which you have specified the Decision tree method, open the node results to view the Imputed Variables Summary report. In this report, variables with Method=TREE are those imputed with tree-based imputation. The Imputed Variable column contains the names of the generated columns that hold the imputed values. The original input variables are left unchanged but are set to rejected by default, so they are not propagated to downstream nodes.

When you specify “Decision tree” as the default method for interval inputs or class inputs, all inputs of that type are imputed with tree-based imputation. How, then, do you apply the Decision tree method only to individual variables? An imputation method for an individual input is normally specified on the Data tab or in the Manage Variables node. For the Decision tree method, however, this capability is not available prior to the Stable 2021.2.3 release (January 2022) or the LTS 2022.1 release (May 2022). In earlier releases, you can achieve the same result with a SAS Code node that precedes the Imputation node.

In the Code editor for the SAS Code node, on the Training Code pane, enter the line of code below for each individual input variable, substituting the variable name. Save and close the editor.

%dmcas_metachange(name=<variableName>, impute=TREE)
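When many inputs need the Decision tree method, typing one macro call per variable is tedious. As a convenience, the lines can be generated with a short script and pasted into the Training Code pane. The Python helper below is a hypothetical aid, not part of Model Studio, and the variable names are placeholders; it only reproduces the macro call exactly as shown above.

```python
def metachange_lines(variable_names):
    """Build one %dmcas_metachange call per input variable."""
    return [f"%dmcas_metachange(name={name}, impute=TREE)"
            for name in variable_names]


# Placeholder variable names; substitute the inputs from your own project.
lines = metachange_lines(["x1", "x2"])
print("\n".join(lines))
```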

In the options for the Imputation node, you do not need to specify a default method for class or interval inputs; the value can be set to “(none)”. If a default method is specified, it is applied only to inputs that do not have an imputation method specified elsewhere (Data tab, Manage Variables node, or SAS Code node).
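The precedence described here can be written as a simple lookup. The Python sketch below only illustrates the rule; the function and the method labels are hypothetical, and it assumes that an input with no per-variable method and a default of “(none)” is left unimputed.

```python
def resolve_method(variable, per_variable_methods, default_method="(none)"):
    """Return the imputation method that applies to one input, or None."""
    # A method set on the variable itself wins over the node default.
    method = per_variable_methods.get(variable, default_method)
    return None if method == "(none)" else method


per_variable = {"x1": "TREE"}  # for example, set through the SAS Code node

x1_method = resolve_method("x1", per_variable, default_method="Mean")
x2_method = resolve_method("x2", per_variable, default_method="Mean")
x2_no_default = resolve_method("x2", per_variable)
```

Here x1 keeps its own TREE setting, x2 falls back to the node default when one is given, and x2 gets no method when the default is “(none)”.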

As an example, two inputs are given the Decision tree method, and the default methods for both Class and Interval inputs are set to “(none)”. After running the pipeline, the Imputed Variables Summary report in the Imputation node results verifies that the two inputs were imputed with the Decision tree (TREE) method.

Also, the Imputation node results contain the Node Score Code, which can be downloaded. This contains the generated tree-based imputation score code for the two variables.

In contrast to the more traditional “brute-force” imputation methods, such as Mean, Mode, and Median, tree-based imputation uses a decision tree model to predict the imputed values, which makes it an important addition to SAS Model Studio’s set of imputation methods.
