Feature engineering is a crucial step in the machine learning pipeline. It involves transforming, combining, or creating new features from existing data. Effective feature engineering can significantly enhance a model's ability to learn from data, leading to better performance and accuracy.
Feature engineering, in addition to data exploration, is the most time-consuming task in model building. This is due to the iterative nature of the task. The Feature Machine node in Model Studio alleviates this problem by automating the task through a parallel, multi-flow generation of features. It uses the featureMachine action that drives a policy-based, scalable feature transformation and generation engine that explores the feature transformation and generation space and executes the potentially effective transformation and generation operators.
This node uses a three-step process to generate features, as discussed below:
A key feature of contemporary data sets is their high dimensionality combined with a low signal-to-noise ratio, due to the presence of numerous variables that may not be relevant for downstream analysis. Consequently, transforming variables to enhance model performance becomes a crucial aspect of the predictive modeling process. However, the high dimensionality makes it impractical to explore and transform each variable individually. To address this challenge, practitioners often tackle data quality issues iteratively, addressing one issue across all variables in a single pass. For instance, you might first address variables with a high rate of missing data and then tackle those with significant skewness. However, this method limits the effective application of available solutions for most data-quality issues. The policies you select determine which transformations are applied to inputs that exhibit various levels of these data issues.
Seven transformation policies are available for selection in the node and are listed below:
After identifying data quality issues, you can choose an appropriate transformation technique from a large pool of feature transformations to address the transformation policies in the preceding list. The transformation techniques include Box-Cox transformation, decision tree binning, MDLP binning, median imputation, missing indicator, regression tree binning, and target encoding, among many others.
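For intuition, a few of these general techniques can be sketched in plain Python. This is an illustrative sketch only, not the node's internal implementation; the function names are hypothetical:

```python
import statistics

def median_impute(values):
    """Replace missing values (None) with the median of the observed values."""
    observed = [v for v in values if v is not None]
    med = statistics.median(observed)
    return [med if v is None else v for v in values]

def missing_indicator(values):
    """Create a 0/1 flag marking which rows were originally missing."""
    return [1 if v is None else 0 for v in values]

def target_encode(categories, target):
    """Replace each category level with the mean of the target for that level."""
    sums, counts = {}, {}
    for c, y in zip(categories, target):
        sums[c] = sums.get(c, 0.0) + y
        counts[c] = counts.get(c, 0) + 1
    means = {c: sums[c] / counts[c] for c in sums}
    return [means[c] for c in categories]

income = [52.0, None, 61.0, 48.0, None]
print(median_impute(income))      # missing entries filled with the median, 52.0
print(missing_indicator(income))  # [0, 1, 0, 0, 1]
```

The point of combining an imputation with a missing indicator is that the model retains both a usable numeric value and the information that the value was originally absent.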
Several features from each input variable can be generated, with the type and quantity of these features being dictated by the chosen transformation policies. The name of each feature indicates the specific transformation steps applied to generate it.
Screening options identify noise variables and variables that need special transformation; this is part of the data exploration step in the workflow. For input variable screening, the settings in the screenPolicy parameter result in one of the following three steps:
The noise dimensions that the node considers include the following:
Next, I will run a demonstration that illustrates using the Feature Machine node to automatically generate a set of transformed features. These features can then be used for model building.
In the demonstration, I will be using a data set (commsdata) from a fictitious telecommunications company that seeks to determine which customers might be likely to churn. The data set contains a reasonable amount of data that describes consumer behavior. The input variables include demographic variables, variables that describe product usage and type, billing data, and customer service/call center information. The main goal is to use these input variables or transformed features to train supervised (or semi-supervised) learning models that predict the churn event. This demonstration, however, is limited to generating the set of transformed features.
I will be using Model Studio to generate transformed features before building models. I am assuming that you are already familiar with the steps required to create a pipeline in Model Studio. If not, check out Build Models with SAS Model Studio | SAS Viya Quick Start Tutorial. I start with a blank template (that is, a pipeline that contains only a Data node) and add a Data Exploration node. To add a Data Exploration node, right-click the Data node and select Add child node -> Miscellaneous -> Data Exploration. In the Data Exploration properties pane, change the Variable selection criterion to Screening; this will help me understand the data issues. Next, I click Run Pipeline. The pipeline should resemble the following:
When the pipeline finishes running, right-click the Data Exploration node and select Results.
Click the Expand button on the Suspicious Interval Variables window and examine the data quality issues of the interval variables. It clearly shows that several variables have high skewness and kurtosis.
Scroll down in the Data Exploration Results window to examine the Missing Values bar chart and validate that quite a few variables have missing values.
This exploration clearly reveals the data quality issues and the need for appropriate transformations before using these features for model building. Instead of manually trying various transformations to address the identified issues, I will use the Feature Machine node, which can automatically generate new features by applying the necessary variable transformations.
Right-click the Data node and select Add child node -> Data Mining Preprocessing -> Feature Machine.
In the Feature Machine node options panel, keep the default settings for Transformation Policy and Input Variable Screening. By default, Cardinality, Missingness, and Skewness are selected under the Transformation Policy property to fix the data quality issues. These settings indicate whether to generate features that include transformations for the data issues related to the treatment of high cardinality, missing values, and high skewness.
Similarly, the settings under the Input Variable Screening property control whether the variables are removed from feature generation or transformed and kept for feature generation.
Again, in the node options panel, note that Feature Selection is enabled by default to subset the list of multiple generated features.
All the features for an input variable are ranked using the symmetric uncertainty (SU) coefficient, and the top-ranked features (per input) are selected and output from the node. When Feature Selection is not enabled, all generated features are output from the node. For Feature Selection, you specify the number of selected features per input by using the Number of features per input property. (The default is 2.) This value is compared to the ranking values to determine the selected features. If the ranking of features results in a tie (that is, two or more features have the same SU value), more features can be selected for an input than specified. For any input variable that has a feature that is output, the input variable is dropped (rejected in metadata) by default.
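For intuition, symmetric uncertainty is the mutual information between a feature and the target, normalized by the sum of their entropies, so it always falls between 0 and 1. A minimal Python sketch for discrete data (illustrative only, not the node's implementation):

```python
from collections import Counter
from math import log2

def entropy(xs):
    """Shannon entropy of a discrete sample, in bits."""
    n = len(xs)
    return -sum(c / n * log2(c / n) for c in Counter(xs).values())

def symmetric_uncertainty(x, y):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), bounded in [0, 1]."""
    hx, hy = entropy(x), entropy(y)
    hxy = entropy(list(zip(x, y)))   # joint entropy H(X, Y)
    mi = hx + hy - hxy               # mutual information I(X; Y)
    denom = hx + hy
    return 0.0 if denom == 0 else 2 * mi / denom

# A feature identical to the target scores 1; an uninformative one scores 0.
target  = [0, 0, 1, 1]
perfect = [0, 0, 1, 1]
noise   = [0, 1, 0, 1]
print(symmetric_uncertainty(perfect, target))  # 1.0
print(symmetric_uncertainty(noise, target))    # 0.0
```

The normalization is what makes SU suitable for ranking features against one another: unlike raw mutual information, it does not automatically favor features with many levels.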
Click Run Pipeline and examine results.
Expand the Selected Features table.
This table is displayed when Feature Selection is enabled. This table contains the list of selected features (sorted by input variable, feature rank, and feature) that are output by the node. Downstream nodes receive only these features. The Description column, which describes the feature, includes the input variable followed by a colon and the data quality issue, which is followed by a hyphen and the transformation method. Let's examine the first feature description from the Selected Features output table, Est_HH_Income: Not high (outlier, kurtosis, skewness) - power(-1) + impute(median). This can be deciphered as follows: This feature is for the input variable Est_HH_Income (Estimated HH Income). It addresses the data quality where one or more of the statistical measures for outliers, kurtosis, and skewness have a medium value, but none of them have a high value. It is transformed by taking the inverse and imputing the median value. Refer to the Suspicious Interval Variables table from the Data Exploration node results, which clearly indicated that the Est_HH_Income variable has high skewness and kurtosis. The Feature Machine node not only identified the data quality issues but also applied the appropriate transformation to address them. Close the Selected Features table and expand the Output window.
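As an aside, the transformation named in that description, power(-1) + impute(median), can be sketched in Python. This is one plausible reading (transform first, then impute in the transformed space); the node's exact order of operations may differ, and the helper name is hypothetical:

```python
import statistics

def power_then_impute(values, p=-1):
    """Apply x**p to observed values, then fill missings with the median
    of the transformed values -- an illustrative reading of the feature
    description 'power(-1) + impute(median)'."""
    transformed = [v ** p if v is not None else None for v in values]
    observed = [v for v in transformed if v is not None]
    med = statistics.median(observed)
    return [med if v is None else v for v in transformed]

print(power_then_impute([2.0, None, 4.0]))  # [0.5, 0.375, 0.25]
```

The inverse (power of -1) compresses a long right tail, which is why it is a reasonable response to high skewness and kurtosis in a variable like estimated household income.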
The Output report contains all features that the Feature Machine node generates, regardless of whether Feature Selection is enabled. In this table, which lists the generated features, the first and second features for the input variable Est_HH_Income are tied at Feature Rank 1. In this example, if the specified number of features per input is 1, the first and second features are kept. If the specified number of features per input is 2, then, again, the first and second features are kept. If you examine the table further, you see that the fourth and fifth features are tied at Feature Rank 4. So if you specify the number of features per input as 4, the first through fifth features are kept because features 4 and 5 share Feature Rank 4.
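The tie-handling rule described above can be sketched in Python (a hypothetical helper, not the node's code):

```python
def select_top_features(scored, k):
    """Keep the top-k features by SU score for one input variable,
    extending past k when features tie at the cutoff rank.

    `scored` is a list of (feature_name, su_score) pairs.
    """
    ordered = sorted(scored, key=lambda p: p[1], reverse=True)
    if k >= len(ordered):
        return [name for name, _ in ordered]
    cutoff = ordered[k - 1][1]              # score of the k-th feature
    # Any feature tied with the k-th score is also kept.
    return [name for name, s in ordered if s >= cutoff]

features = [("f1", 0.90), ("f2", 0.90), ("f3", 0.75),
            ("f4", 0.60), ("f5", 0.60)]
print(select_top_features(features, 1))  # ['f1', 'f2'] -- tie at rank 1
print(select_top_features(features, 4))  # all five -- f4 and f5 tie at rank 4
```

This mirrors the worked example in the Output report: asking for 1 feature still keeps two because of the tie at rank 1, and asking for 4 keeps five because of the tie at rank 4.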
When Feature Selection is not enabled, the Generated Features report is displayed instead. This report contains the list of all generated features, which are output by the node for input into downstream nodes.
In summary, feature engineering plays a key role in enhancing model performance, but it requires considerable time and effort because of the need for understanding the data, domain expertise, iterative refinement, and extensive testing. The Feature Machine node can help analysts speed up this time-intensive preprocessing step.
To learn more about automated feature engineering, consider taking this course: Advanced Machine Learning Using SAS® Viya®
Find more articles from SAS Global Enablement and Learning here.