
Automating the Feature Engineering Task in Model Studio

Started 09-19-2024
Modified 09-19-2024

Feature engineering is a crucial step in the machine learning pipeline. It involves transforming, combining, or creating new features from existing data. Effective feature engineering can significantly enhance a model's ability to learn from data, leading to better performance and accuracy.

 

The Feature Machine Node

 

Feature engineering is, along with data exploration, the most time-consuming task in model building, largely because of its iterative nature. The Feature Machine node in Model Studio alleviates this problem by automating the task through parallel, multi-flow generation of features. It uses the featureMachine action, which drives a policy-based, scalable feature transformation and generation engine that explores the space of possible transformations and executes the potentially effective transformation and generation operators.

 

How the Feature Machine Node Works

 

This node uses a three-step process to generate features; the three steps are discussed below:

  • Explore Data: First, in the Explore data step, the node applies the explorationPolicy parameter, which specifies how the data are grouped together. The input variables are grouped into categories that share the same statistical profile. This profile uses many variable attributes, including cardinality, coefficient of variation, entropy, qualitative variation, skewness, kurtosis, missingness, and outlier percentage. For example, interval variables that have comparable missing rate, skewness, kurtosis, and outlier values are put in the same group.
  • Screen Variables: In this step, the node applies the screenPolicy parameter to identify messy variables. These variables are either excluded from feature generation or transformed in a specific way.
  • Feature Transformation and Generation: Finally, the variables that survive the screening process are used to generate features. The transformationPolicy parameter determines the number and types of feature transformation and generation operators that are applied to generate the features.
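The statistical profiling that drives the grouping in the Explore Data step can be sketched in plain Python. This is an illustration of the idea only, not SAS code: the node's actual explorationPolicy profile also covers entropy, qualitative variation, outlier percentage, and more.

```python
from statistics import mean, pstdev

def profile(values):
    """Compute a simple statistical profile for one variable.

    Illustrative only: the node's actual explorationPolicy profile also
    covers entropy, qualitative variation, outlier percentage, and more.
    """
    present = [v for v in values if v is not None]
    missingness = 1 - len(present) / len(values)
    mu, sigma = mean(present), pstdev(present)
    # Population skewness and excess kurtosis of the observed values.
    skew = mean(((v - mu) / sigma) ** 3 for v in present) if sigma else 0.0
    kurt = mean(((v - mu) / sigma) ** 4 for v in present) - 3 if sigma else 0.0
    return {
        "missingness": round(missingness, 3),
        "skewness": round(skew, 3),
        "kurtosis": round(kurt, 3),
        "cardinality": len(set(present)),
    }

# Interval variables with comparable profiles would land in the same group.
income = [10, 12, 11, 13, 500, None, 12, 11]
print(profile(income))
```

Variables whose profiles look alike (for example, similar missing rates and comparable skewness) would then be grouped and treated with the same policy.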

A key feature of contemporary data sets is their high dimensionality combined with a low signal-to-noise ratio, due to the presence of numerous variables that may not be relevant for downstream analysis. Consequently, transforming variables to enhance model performance becomes a crucial aspect of the predictive modeling process. However, the high dimensionality makes it impractical to explore and transform each variable individually. To address this challenge, practitioners often tackle data-quality issues iteratively, processing the variables in a single pass for each issue. For instance, you might first address variables with a high rate of missing data and then tackle those with significant skewness. However, this approach limits how effectively the available remedies can be applied, because the issue you choose to address determines which transformations are applied to inputs that exhibit varying levels of these data-quality issues.

 

Transformation Policy

 

Seven transformation policies are available for selection in the node; they are listed below:

  • Cardinality - specifies whether to generate features that include transformations for the treatment of high cardinality. This option is enabled by default.
  • Entropy - indicates whether to generate features that include transformations for the treatment of low entropy.
  • Kurtosis - indicates whether to generate features that include transformations for the treatment of high kurtosis.
  • Missingness - specifies whether to generate features that include transformations for the treatment of missing values. This option is enabled by default.
  • Outliers - indicates whether to generate features that include transformations for the treatment of outliers.
  • Qualitative variation - indicates whether to generate features that include transformations for the treatment of low indices of qualitative variation.
  • Skewness - specifies whether to generate features that include transformations for the treatment of high skewness. This option is enabled by default.

After identifying data quality issues, one can use an appropriate transformation technique from the large pool of feature transformations to address the transformation policies in the preceding list. The transformation techniques include Box-Cox transformation, decision tree binning, MDLP binning, median imputation, missing indicator, regression tree binning, and target encoding, among many others.
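A few of these transformations are simple enough to sketch directly. The following Python snippet is illustrative only, not the node's implementation: it shows median imputation, a missing indicator, and a log transform standing in for skewness remedies such as Box-Cox.

```python
import math
from statistics import median

def impute_median(values):
    """Replace missing values with the median of the observed values."""
    med = median(v for v in values if v is not None)
    return [med if v is None else v for v in values]

def missing_indicator(values):
    """Binary flag recording where the original value was missing."""
    return [1 if v is None else 0 for v in values]

def log_transform(values):
    """log(1 + x): one common remedy for right-skewed, non-negative data
    (the node's pool also includes Box-Cox, binning, and others)."""
    return [math.log1p(v) for v in values]

raw = [3.0, None, 7.0, 5.0, None, 400.0]
imputed = impute_median(raw)        # [3.0, 6.0, 7.0, 5.0, 6.0, 400.0]
flags = missing_indicator(raw)      # [0, 1, 0, 0, 1, 0]
feature = log_transform(imputed)    # tames the 400.0 outlier
```

Note how one input variable can yield several candidate features (an imputed copy, a missingness flag, a skewness-corrected copy), which is exactly why the node generates multiple features per input.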

Several features from each input variable can be generated, with the type and quantity of these features being dictated by the chosen transformation policies. The name of each feature indicates the specific transformation steps applied to generate it.

 

Input Variable Screening

 

Screening options help screen out noise variables and variables that need special transformation; this screening is part of the data exploration step in the workflow. For input variable screening, the settings in the screenPolicy parameter result in one of the following three actions:

  1. Remove variables - recommended for variables that have significant data-quality issues and thus are most likely to have a minimal impact on model building.
  2. Transform and keep variables - recommended for variables that have some, but not very significant, data-quality issues. Thus, the variables are first transformed and then used for model building.
  3. Keep variables - recommended for variables that do not have significant data-quality issues. Thus, these variables pass to the next stage as they are.
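The three-way decision above can be sketched as a simple rule. The thresholds below are hypothetical, chosen only to illustrate the remove / transform-and-keep / keep outcomes; the node's screenPolicy derives its decisions from the full statistical profile.

```python
def screen_variable(missing_rate, abs_skewness):
    """Return one of the three screening actions for a variable.

    The thresholds are hypothetical and exist only to illustrate the
    three outcomes; the node's screenPolicy uses the full profile.
    """
    if missing_rate > 0.5:                           # severe issue
        return "remove"
    if missing_rate > 0.0 or abs_skewness > 2.0:     # fixable issue
        return "transform and keep"
    return "keep"                                    # clean variable

print(screen_variable(0.7, 0.1))   # remove
print(screen_variable(0.1, 3.0))   # transform and keep
print(screen_variable(0.0, 0.5))   # keep
```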

The noise dimensions that the node considers include the following:

 

  • Coefficient of variation - identifies interval variables that have a low coefficient of variation (close to constant value). These variables are excluded from feature processing. This property is enabled by default.
  • Group rare levels - identifies nominal variables that have rare levels. These variables are transformed by rare level grouping. This property is enabled by default.
  • Leakage percent threshold - identifies variables that have a very high level of information about the target (leakage variables). Variables that exceed this threshold (target entropy reduction) are excluded from feature processing. The default value is 90.
  • Mutual information threshold - identifies variables that have a low level of information about the target (not informative). Variables that fall below this threshold are excluded from feature processing. The default value is 0.05.
  • Redundancy threshold - identifies variables that are redundant (highly correlated). If the symmetric uncertainty coefficient, a measure of nominal association for two variables, exceeds this threshold, the variable that has less information about the target is excluded from feature processing. The default value is 1. With this default value, redundancy screening is not enabled. You can enable this property by specifying a value less than 1.
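The symmetric uncertainty coefficient used for redundancy screening (and, later in this article, for feature ranking) is defined as SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), where H is Shannon entropy and I is mutual information; it ranges from 0 (independent) to 1 (fully redundant). A minimal Python sketch:

```python
import math
from collections import Counter

def entropy(xs):
    """Shannon entropy (bits) of a sequence of nominal values."""
    n = len(xs)
    return -sum((c / n) * math.log2(c / n) for c in Counter(xs).values())

def symmetric_uncertainty(x, y):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), ranging from 0 to 1."""
    hx, hy = entropy(x), entropy(y)
    hxy = entropy(list(zip(x, y)))      # joint entropy H(X, Y)
    mi = hx + hy - hxy                  # mutual information I(X; Y)
    return 2 * mi / (hx + hy) if hx + hy else 0.0

a = ["low", "low", "high", "high"]
b = ["no", "no", "yes", "yes"]        # perfectly redundant with a
c = ["no", "yes", "no", "yes"]        # carries no information about a
print(symmetric_uncertainty(a, b))    # 1.0 -> one of the pair is dropped
print(symmetric_uncertainty(a, c))    # 0.0
```

Under the default redundancy threshold of 1, even the perfectly redundant pair above would survive, which is why you must lower the threshold to enable this screening.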

Demonstration

 

Next, I will run a demonstration that illustrates using the Feature Machine node to automatically generate a set of transformed features. These features can then be used for model building.

In the demonstration, I will be using a data set (commsdata) from a fictitious telecommunications company that seeks to determine which customers might be likely to churn. The data set contains a reasonable amount of data that describes consumer behavior. The input variables include demographic variables, variables that describe product usage and type, billing data, and customer service/call center information. The main goal is to use these input variables, or transformed features, to train supervised (or semi-supervised) learning models that predict the churn event. This demonstration, however, is limited to generating the set of transformed features.

I will be using Model Studio to generate transformed features before building models. I am assuming that you are already familiar with the steps required to create a pipeline in Model Studio. If not, check out Build Models with SAS Model Studio | SAS Viya Quick Start Tutorial. I start with a blank template, that is, a pipeline that has only a Data node, and add a Data Exploration node. To add a Data Exploration node, right-click the Data node and select Add child node -> Miscellaneous -> Data Exploration. In the Data Exploration properties pane, change the Variable selection criterion to Screening. This will help me understand the data issues. Next, I click Run Pipeline. The pipeline should resemble the following:

 

[Image: Model Studio pipeline with the Data and Data Exploration nodes]


 

When the pipeline finishes running, right-click the Data Exploration node and select Results.

 

[Image: Data Exploration node results]

 

Click the Expand button on the Suspicious Interval Variables window and examine the data quality issues of the interval variables. It clearly shows that several variables have high skewness and kurtosis.

[Image: Suspicious Interval Variables table]

 

Scroll down in the Data Exploration Results window to examine the Missing Values bar chart and validate that quite a few variables have missing values.

 

[Image: Missing Values bar chart]

 

This exploration clearly reveals the data quality issues and the need for appropriate transformations before these variables are used for model building. Instead of manually trying various transformations to address the identified issues, I will use the Feature Machine node, which can automatically generate new features by applying the necessary variable transformations.

Right-click the Data node and select Add child node -> Data Mining Preprocessing -> Feature Machine.

In the Feature Machine node options panel, keep the default settings for Transformation Policy and Input Variable Screening. By default, Cardinality, Missingness, and Skewness are selected under the Transformation Policy property to fix the data quality issues. These settings indicate whether to generate features that include transformations for the data issues related to the treatment of high cardinality, missing values, and high skewness.

Similarly, the settings under the Input Variable Screening property control whether the variables are removed from feature generation or transformed and kept for feature generation.

Again, in the node options panel, note that Feature Selection is enabled by default to subset the list of multiple generated features.

All the features for an input variable are ranked using the symmetric uncertainty (SU) coefficient, and the top-ranked features (per input) are selected and output from the node. When Feature Selection is not enabled, all generated features are output from the node. You specify the number of selected features per input by using the Number of features per input property (the default is 2); this value is compared to the ranking values to determine the selected features. If the ranking of features results in a tie (that is, two or more features have the same SU value), more features can be selected for an input than specified. For any input variable that has a feature in the output, the input variable itself is dropped (rejected in metadata) by default.
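The tie behavior can be illustrated with a small, hypothetical helper that keeps the top-k features per input and extends the cut to include ties (the function and data names here are invented for the example):

```python
def select_features(features, k=2):
    """Keep the k top-ranked features per input variable, including ties.

    `features` is a hypothetical mapping: input name -> list of
    (feature_name, su) pairs, illustrating the tie behavior described
    in the surrounding text.
    """
    selected = {}
    for var, feats in features.items():
        ranked = sorted(feats, key=lambda f: f[1], reverse=True)
        if len(ranked) > k:
            cutoff = ranked[k - 1][1]                       # SU of the k-th feature
            ranked = [f for f in ranked if f[1] >= cutoff]  # ties survive the cut
        selected[var] = [name for name, _ in ranked]
    return selected

feats = {"Est_HH_Income": [("f1", 0.40), ("f2", 0.40), ("f3", 0.25)]}
print(select_features(feats, k=1))   # tie at rank 1 -> both f1 and f2 kept
```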

 

[Image: pipeline with the Feature Machine node added]

 

Click Run Pipeline and examine the results.

 

[Image: Feature Machine node results]

 

Expand the Selected Features table.

 

[Image: Selected Features table]

 

This table is displayed when Feature Selection is enabled. It contains the list of selected features (sorted by input variable, feature rank, and feature) that are output by the node; downstream nodes receive only these features. The Description column, which describes the feature, consists of the input variable followed by a colon and the data quality issue, which is followed by a hyphen and the transformation method.

Let's examine the first feature description from the Selected Features output table: Est_HH_Income: Not high (outlier, kurtosis, skewness) - power(-1) + impute(median). This can be deciphered as follows: the feature is for the input variable Est_HH_Income (Estimated HH Income). It addresses the data quality case where one or more of the statistical measures for outliers, kurtosis, and skewness have a medium value, but none of them have a high value. It is transformed by taking the inverse and imputing the median value. Recall the Suspicious Interval Variables table from the Data Exploration node results, which clearly indicated that the Est_HH_Income variable has high skewness and kurtosis. The Feature Machine node not only identified the data quality issues but also applied the appropriate transformations to address them.

Close the Selected Features table and expand the Output window.
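Assuming the naming convention just described (input, colon, issue, hyphen, plus-separated transformations), a small illustrative parser can split a description into its parts. The function name is invented for this sketch.

```python
def parse_feature_description(desc):
    """Split a description of the form
    '<input>: <data-quality issue> - <transform> + <transform> + ...'.

    Illustrative parser that assumes the naming convention holds; it is
    not part of the Feature Machine node itself.
    """
    var, rest = desc.split(":", 1)
    issue, transforms = rest.rsplit(" - ", 1)
    return {
        "input": var.strip(),
        "issue": issue.strip(),
        "transforms": [t.strip() for t in transforms.split("+")],
    }

desc = ("Est_HH_Income: Not high (outlier, kurtosis, skewness)"
        " - power(-1) + impute(median)")
print(parse_feature_description(desc))
```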

 

[Image: Feature Machine node Output report]

 

The Output report contains all features that the Feature Machine node generates, regardless of whether Feature Selection is enabled. In the table above, which lists the generated features, the first and second features for the input variable Est_HH_Income are tied at Feature Rank 1. In this example, whether the specified number of features per input is 1 or 2, the first and second features are kept. If you examine the table further, you see that the fourth and fifth features are tied at Feature Rank 4. So if a user specifies the number of features per input as 4, the first through fifth features are kept because features 4 and 5 share Feature Rank 4.

When Feature Selection is not enabled, the Generated Features report is displayed instead. This report contains the list of all generated features, which are output by the node for input into downstream nodes.

Conclusions

In summary, feature engineering plays a key role in enhancing model performance, but it requires considerable time and effort because of the need for data understanding, domain expertise, iterative refinement, and extensive testing. The Feature Machine node helps analysts speed up this time-intensive preprocessing step.

To learn more on automated feature engineering, consider taking this course: Advanced Machine Learning Using SAS® Viya®

 

 

Find more articles from SAS Global Enablement and Learning here.

