Overview
A new node, Feature Machine, has been added to Model Studio in SAS Visual Data Mining and Machine Learning 8.5. After analyzing your data, it automatically generates an entire set of transformed features for export to downstream nodes. It generates features in three steps. First, it explores the data, grouping input variables into categories that share the same statistical profile. This profile draws on many variable attributes, including cardinality, coefficient of variation, entropy, qualitative variation, skewness, kurtosis, missingness, and outlier percentage. Next, the node screens the input variables to identify those that should be excluded from feature generation or transformed in a specific way. Finally, the variables that survive screening are used to generate features, based on the exploration groupings and the transformation policies that you select. In this article, I describe the following functionality and features of the new node:
Input Variable Screening
Transformation Policies
Feature Generation and Selection
Feature Transformations
Feature Machine Results
Input Variable Screening
The input variable screening process scans each input for several data quality issues that undermine its usefulness as a predictive modeling input. If an input exceeds the threshold for one or more of these issues, it is either excluded from further feature generation or flagged for the Group Rare feature transformation. Several options let you control the screening for these individual data quality issues (a Python sketch of several of the screening statistics follows the list):
Coefficient of variation – Identify interval variables that have a low coefficient of variation (close to constant value). These variables are excluded from feature processing. Enabled by default.
Group rare levels – Identify nominal variables that have rare levels. These variables are transformed by rare level grouping. Enabled by default.
Leakage percent threshold – Identify variables that have a very high level of information about the target (leakage variables). Variables that exceed your specified threshold (target entropy reduction) are excluded from feature processing. Default=90.
Mutual information threshold – Identify variables that have a low level of information about the target (not informative). Variables that are below your specified threshold are excluded from feature processing. Default=0.05.
Redundancy threshold – Identify variables that are redundant (highly correlated). If the Symmetric Uncertainty for two variables exceeds your specified threshold, the variable that has less information about the target is excluded from feature processing. Default=1, which effectively disables redundancy screening.
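Below is a minimal Python sketch of several of these screening statistics, assuming discrete (nominal or pre-binned) variables for the entropy-based measures; the node's actual estimators and binning are internal to SAS and may differ.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mutual_info_score

def entropy(s: pd.Series) -> float:
    """Shannon entropy (in nats) of a discrete variable."""
    p = s.value_counts(normalize=True)
    return float(-(p * np.log(p)).sum())

def coefficient_of_variation(x: pd.Series) -> float:
    """Std/mean of an interval variable; values near 0 mean near constant."""
    return float(x.std() / abs(x.mean()))

def target_entropy_reduction(x: pd.Series, y: pd.Series) -> float:
    """I(X;Y) / H(Y): the fraction of target entropy explained by X.
    Values near 1 suggest target leakage."""
    return mutual_info_score(x, y) / entropy(y)

def symmetric_uncertainty(a: pd.Series, b: pd.Series) -> float:
    """SU = 2 * I(A;B) / (H(A) + H(B)), bounded in [0, 1];
    high SU between two inputs indicates redundancy."""
    return 2 * mutual_info_score(a, b) / (entropy(a) + entropy(b))
```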
Transformation Policies
There are seven transformation policies available for selection. The features generated for each policy are designed to treat the data issue ascribed to that policy. Policies flagged with an asterisk are enabled by default.
Cardinality - Treatment of high cardinality*
Entropy - Treatment of low entropy
Kurtosis - Treatment of high kurtosis
Missingness - Treatment of missing values*
Outliers - Treatment of outliers
Qualitative variation - Treatment of low indices of qualitative variation
Skewness - Treatment of high skewness*
Feature Generation and Selection
Multiple features can be generated per input variable, with the type and number of features determined by the transformation policies that are selected. The name of each feature encodes the transformation pipeline applied to produce it: the transformation name (see the list of feature transformations in the section below), followed by an underscore and the input variable name. When Feature Selection is enabled (the default), all the features for an input variable are ranked using the Symmetric Uncertainty statistic, and the top-ranked features per input are selected and output from the node. When it is disabled, all generated features are output from the node. For Feature Selection, you specify the number of selected features per input with the Number of features per input option (Default=2).
The value that you specify for this option is compared against the feature rank values to determine the selected features. Because features are ranked by their SU values, a tie (two or more features with the same SU value) can result in more features being selected for an input than the number you specified. For example, suppose the generated features for input variable AGE are ranked such that the third, fourth, and fifth features are tied at Feature Rank 3. If the specified number of features per input is 1, only the first feature is kept. If it is 2, the first two features are kept. However, if it is 3, the first five features are kept, since features 3, 4, and 5 all share Rank 3. The Where clause for selecting the features is: Where Feature Rank <= Number of features per input. The sketch below illustrates this tie-aware selection.
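Here is a short Python sketch of that selection rule, using hypothetical feature names and SU values for input variable AGE (the node computes these internally):

```python
import pandas as pd

# Hypothetical ranking of five generated features for input variable AGE.
features = pd.DataFrame({
    "feature": ["f1_AGE", "f2_AGE", "f3_AGE", "f4_AGE", "f5_AGE"],
    "su":      [0.41,     0.38,     0.30,     0.30,     0.30],
})

# Dense ranking gives tied SU values the same Feature Rank (1, 2, 3, 3, 3).
features["feature_rank"] = (features["su"]
                            .rank(method="dense", ascending=False)
                            .astype(int))

n_per_input = 3
selected = features[features["feature_rank"] <= n_per_input]
print(selected)  # all five rows survive, since ranks 3, 3, 3 tie at the cutoff
```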
Feature Transformations
Listed below is the set of available feature transformations, grouped by transformation policy. This superset of feature transformations is the source for the naming of all features. For additional information on these transformations, follow this link: https://go.documentation.sas.com/?docsetId=casactml&docsetTarget=casactml_datasciencepilot_details23.htm&docsetVersion=8.5&locale=en
Cardinality (Nominal to Interval transformations)
hc_tar_mean – Mean target encoding
hc_tar_min – Minimum target encoding
hc_tar_max – Maximum target encoding
hc_tar_frq_rat – Frequency ratio target encoding
hc_tar_woe – Weight of evidence target encoding
hc_tar_evt_prob – Event probability target encoding
hc_lbl_cnt – Level count rank
hc_cnt – Level count
hc_cnt_log – Level count followed by Log transformation
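As an illustration, here is a minimal Python sketch of two of these cardinality transformations for a binary 0/1 target; how the node smooths estimates or handles unseen levels is not covered here, and the output column names are illustrative only:

```python
import pandas as pd

df = pd.DataFrame({"city":   ["A", "A", "B", "C", "C", "C"],
                   "target": [1,   0,   1,   0,   1,   1]})

# hc_tar_mean: replace each level with the mean of the target for that level.
df["hc_tar_mean_city"] = df.groupby("city")["target"].transform("mean")

# hc_cnt: replace each level with its frequency count.
df["hc_cnt_city"] = df.groupby("city")["city"].transform("count")
```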
Entropy, Qualitative variation
grp_rare1 – Mode imputation and group rare levels
grp_rare2 – Missing level and group rare levels
lchehi_lab – Label encoding
lcnhenhi_grp_rare – Group rare levels
lcnhenhi_rtree5 – Five-bin regression tree binning
lcnhenhi_rtree10 – Ten-bin regression tree binning
lcnhenhi_dtree5 – Five-bin decision tree binning
lcnhenhi_dtree10 – Ten-bin decision tree binning
lcnhenhi_woe5 – Five-bin Weight of Evidence binning
lcnhenhi_woe10 – Ten-bin Weight of Evidence binning
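A minimal sketch of rare-level grouping, assuming a 5% frequency cutoff and a "_RARE_" label for the grouped level (the node's actual cutoff and grouped-level name are internal):

```python
import pandas as pd

def group_rare(s: pd.Series, min_frac: float = 0.05) -> pd.Series:
    """Merge levels whose relative frequency falls below min_frac."""
    freq = s.value_counts(normalize=True)
    rare = freq[freq < min_frac].index
    return s.where(~s.isin(rare), "_RARE_")

s = pd.Series(["A"] * 60 + ["B"] * 38 + ["C", "D"])
print(group_rare(s).value_counts())  # C and D collapse into _RARE_
```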
Kurtosis
hk_yj – Yeo-Johnson power transformations with parameters -2, -1, 0, 1, 2
hk_dtree_disct5 – Five-bin decision tree binning
hk_dtree_disct10 – Ten-bin decision tree binning
hk_rtree_disct5 – Five-bin regression tree binning
hk_rtree_disct10 – Ten-bin regression tree binning
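For hk_yj, a sketch using SciPy's Yeo-Johnson implementation at the node's fixed parameters; the candidate-feature names here are hypothetical:

```python
import numpy as np
from scipy.stats import yeojohnson

x = np.array([-3.0, -1.0, 0.0, 2.0, 10.0])  # Yeo-Johnson handles negatives

# One candidate feature per fixed lambda (names are illustrative only).
candidates = {f"hk_yj{lam}": yeojohnson(x, lmbda=lam)
              for lam in (-2, -1, 0, 1, 2)}
```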
Missingness
cpy_int_med_imp – Median imputation
cpy_nom_mode_imp_lab – Mode imputation and Label encoding
cpy_nom_miss_lev_lab – Missing level and Label encoding
miss_ind – Missing indicator
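A brief sketch of the two interval-variable cases:

```python
import numpy as np
import pandas as pd

x = pd.Series([4.0, np.nan, 7.0, np.nan, 5.0])

cpy_int_med_imp = x.fillna(x.median())   # median imputation -> 4, 5, 7, 5, 5
miss_ind = x.isna().astype(int)          # missing indicator  -> 0, 1, 0, 1, 0
```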
Outliers
ho_winsor - Winsorization
ho_quan_disct5 – Five-bin quantile binning
ho_quan_disct10 – Ten-bin quantile binning
ho_dtree_disct5 – Five-bin decision tree binning
ho_dtree_disct10 – Ten-bin decision tree binning
ho_rtree_disct5 – Five-bin regression tree binning
ho_rtree_disct10 – Ten-bin regression tree binning
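A sketch of ho_winsor and ho_quan_disct5; the 5th/95th percentile winsorization limits are an assumption, since the node's cutoffs are internal:

```python
import numpy as np
import pandas as pd

x = pd.Series(np.random.default_rng(1).lognormal(size=1000))  # heavy right tail

lo, hi = x.quantile([0.05, 0.95])
ho_winsor = x.clip(lower=lo, upper=hi)          # cap extreme values

ho_quan_disct5 = pd.qcut(x, q=5, labels=False)  # five equal-frequency bins
```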
Skewness
hs_bc – Box-Cox power transformations with parameters -2, -1, 0, 1, 2
hs_dtree_disct5 – Five-bin decision tree binning
hs_dtree_disct10 – Ten-bin decision tree binning
hs_rtree_disct5 – Five-bin regression tree binning
hs_rtree_disct10 – Ten-bin regression tree binning
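For hs_bc, a sketch at the node's fixed parameters; note that Box-Cox requires strictly positive values (whether and how the node shifts nonpositive data first is not shown here):

```python
import numpy as np
from scipy.stats import boxcox

x = np.array([0.5, 1.0, 2.0, 8.0, 40.0])  # strictly positive

# One candidate per fixed lambda; lambda = 0 reduces to the log transform.
candidates = {lam: boxcox(x, lmbda=lam) for lam in (-2, -1, 0, 1, 2)}
```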
Kurtosis, Outliers, Skewness (Low or Medium-rated values)
nhoks_nloks_pow, nhoks_nloks_log – Tukey's ladder of power transformations with parameters -2, -1, -0.5, 0, 0.5, 1, 2
nhoks_nloks_dtree5 – Five-bin decision tree binning
nhoks_nloks_dtree10 – Ten-bin decision tree binning
nhoks_nloks_rtree5 – Five-bin regression tree binning
nhoks_nloks_rtree10 – Ten-bin regression tree binning
Kurtosis, Outliers, Skewness (Low-rated values)
all_l_oks_dtree5 – Five-bin decision tree binning
all_l_oks_dtree10 – Ten-bin decision tree binning
all_l_oks_rtree5 – Five-bin regression tree binning
all_l_oks_rtree10 – Ten-bin regression tree binning
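Supervised tree binning (the dtree/rtree transformations that recur throughout the groups above) can be sketched as fitting a shallow tree on the single input and using its leaf assignments as bins. This is one plausible reading of the technique, not SAS's documented implementation:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
x = rng.normal(size=(500, 1))                                   # single input
y = (x[:, 0] + rng.normal(scale=0.5, size=500) > 0).astype(int) # nominal target

# Five-bin decision tree binning: at most five leaves = at most five bins.
tree = DecisionTreeClassifier(max_leaf_nodes=5).fit(x, y)
bins = tree.apply(x)  # leaf index for each row is its bin label

# For an interval target, DecisionTreeRegressor gives the rtree variants.
```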
Feature Machine Results
After running Feature Machine, open the results from the node's pop-up menu. When Feature Selection is enabled, the Selected Features report is displayed. It lists the selected features that are output by the node, sorted by Input Variable, Feature Rank, and Feature. Downstream nodes receive only these features. The Description column describes each feature: the input variable, followed by a colon and the data quality issue, followed by a hyphen and the transformation method. Consider the example description "AGE: Not high (outlier, kurtosis, skewness) - power(2) + impute(median)". Expanded, this means: the feature is for input variable AGE; it addresses the case where one or more of outlier percentage, kurtosis, and skewness has a medium value but none is high; and it is transformed by squaring the variable (power of 2) and imputing missing values with the median.
When Feature Selection is disabled, the Generated Features report is displayed instead. This contains the list of all generated features, which are output by the node for input into downstream nodes. These are sorted by Input Variable and Feature.
The Output report is always displayed. It contains the Generated Features print output, a listing of all features generated by Feature Machine, even when Feature Selection is enabled. When Feature Selection is disabled, it contains the same information as the Generated Features report.
Summary
In this article, I have given an overview of the new Feature Machine node in Model Studio on SAS Visual Data Mining and Machine Learning 8.5, explaining its functionality and how it works. Here are the main points:
With this node, you automatically generate transformed features that address different data issues that negatively impact predictive modeling.
The number and type of these features are based upon the transformation policies that you select.
Variable screening excludes from feature generation any variables that exceed certain data quality thresholds.
With the Feature Selection option, you control how many features are selected per input variable. Selected features are the top N ranked features, which are exported to downstream nodes.