4 ways to classify feature engineering in SAS Viya

In the analytic lifecycle, feature engineering is an essential step that comes before model building. It involves creating and refining features (or inputs) to build better predictive models. Typically, you start with a set of features in your data. Sometimes they need careful extraction, or you need statistical or data mining techniques to reduce them to a more manageable and useful form. Other times, you need to go out and get additional data (and features) to augment what you have. Whatever the process, the goal of feature engineering is almost always to build better models: models that not only have high predictive power but also generalize (that is, predict well on new or unseen data).

This post categorizes feature engineering into four groups and explains how those techniques can be implemented and compared next to one another using SAS Visual Data Mining and Machine Learning 8.3 in SAS Viya. Note that the terms feature and input are used interchangeably in this post.

Building and extracting good features requires experience and domain knowledge about the problem and the data you are working with. The four groups used to classify feature engineering techniques are:

Constructing new features from a combination of one or more existing features
Selecting key features using supervised or unsupervised techniques
Clustering features into groups
Extracting new features from existing features

Let us look at each in more detail ...

[1] Constructing new features is closer to data preparation: you might apply a numerical transformation to a feature, aggregate two or more features into a single feature, or decompose a single feature into several for better representation. Data massaging techniques such as the log transformation, binning of interval inputs, and level encoding of high-cardinality categorical inputs also fall in this category. These methods can be implemented in SAS Studio using SAS DATA step code, or in Model Studio using the Transformations node or the SAS Code node.
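To make the three techniques named above concrete, here is a minimal sketch in plain Python. It is not SAS DATA step code and not what the Transformations node runs. The function names, the `income` and `city` columns, and the choice of equal-width bins and frequency encoding are all assumptions made for illustration.

```python
import math
from collections import Counter


def log_transform(values):
    """Apply log(1 + x) to a non-negative, right-skewed interval input."""
    return [math.log1p(v) for v in values]


def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins equal-width bins (0-based index)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0] * len(values)  # constant input: everything lands in bin 0
    # The maximum value would fall just past the last bin, so clamp it.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def frequency_encode(levels):
    """Replace each categorical level with its relative frequency."""
    counts = Counter(levels)
    total = len(levels)
    return [counts[level] / total for level in levels]


# Hypothetical data, invented for this sketch.
income = [0.0, 10.0, 20.0, 30.0, 100.0]
city = ["NYC", "LA", "NYC", "SF", "NYC"]

log_income = log_transform(income)
income_bin = equal_width_bins(income, 4)
city_freq = frequency_encode(city)
```

Frequency encoding is only one way to encode a high-cardinality input; it is used here because it needs no target variable.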

[2] Selecting key features involves reducing the number of features by selecting a subset that explains the most variance in the inputs (unsupervised) or the most variance in the target (supervised). These methods preserve interpretability and can be implemented using the VARREDUCE procedure in SAS Studio or the Variable Selection node in Model Studio. The Variable Selection node provides additional functionality: it lets you perform an unsupervised selection followed by a supervised selection, or combine unsupervised and supervised selections using a combination criterion (such as selected by at least one method, selected by a majority, or selected by all).
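The combination criterion can be read as a vote across selection methods. The sketch below illustrates that idea in plain Python; it is not the Variable Selection node's implementation, and the function name, the criterion strings, and the feature names are hypothetical.

```python
from collections import Counter


def combine_selections(selections, criterion):
    """Combine feature sets chosen by several selection methods.

    selections: list of sets, one per method.
    criterion:  "at_least_one", "majority", or "all" (names assumed here).
    """
    votes = Counter(feature for chosen in selections for feature in set(chosen))
    n_methods = len(selections)
    if criterion == "at_least_one":
        needed = 1
    elif criterion == "majority":
        needed = n_methods // 2 + 1  # strictly more than half
    elif criterion == "all":
        needed = n_methods
    else:
        raise ValueError(f"unknown criterion: {criterion}")
    return sorted(f for f, count in votes.items() if count >= needed)


# Hypothetical outputs of one unsupervised and two supervised methods.
unsupervised = {"age", "income", "tenure"}
supervised_a = {"income", "tenure", "region"}
supervised_b = {"income", "balance"}
methods = [unsupervised, supervised_a, supervised_b]

print(combine_selections(methods, "at_least_one"))
print(combine_selections(methods, "majority"))
print(combine_selections(methods, "all"))
```

A stricter criterion keeps fewer features: here "all" keeps only `income`, while "at_least_one" keeps every feature any method picked.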

[3] Clustering features into groups divides features into disjoint clusters, where features within a cluster are highly similar or correlated. You can then either keep one feature from each cluster or compute the first principal component of each cluster using principal component analysis (PCA). With PCA, you lose interpretability because the cluster is represented by its first principal component rather than by one of the original features. This technique can be implemented using the GVARCLUS procedure in SAS Studio or the Variable Clustering node in Model Studio.
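As a rough illustration of "one representative per cluster", the sketch below groups features greedily by absolute Pearson correlation and keeps the first member of each group. This greedy rule is a simple stand-in chosen for brevity; it is not the algorithm used by the GVARCLUS procedure or the Variable Clustering node. The threshold and the column names are assumptions.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists (non-constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def cluster_features(columns, threshold):
    """Greedy grouping: a feature joins the first cluster whose seed
    (first member) it correlates with at |r| >= threshold; otherwise it
    starts a new cluster. Each feature ends up in exactly one cluster."""
    clusters = []
    for name, values in columns.items():
        for cluster in clusters:
            seed = cluster[0]
            if abs(pearson(values, columns[seed])) >= threshold:
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters


# Hypothetical inputs: two near-duplicate pairs.
columns = {
    "height_cm": [150.0, 160.0, 170.0, 180.0, 190.0],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "visits": [3.0, 1.0, 4.0, 1.0, 5.0],
    "visits_last_year": [6.0, 2.0, 8.0, 2.0, 10.0],
}

clusters = cluster_features(columns, threshold=0.8)
representatives = [cluster[0] for cluster in clusters]
```

Keeping an original feature as the representative preserves interpretability, which is the trade-off against the per-cluster principal component described above.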

[4] Extracting new features means transforming the features so that a smaller number of latent features represents most of the variance in the data. This method is very useful if your data has a large number of inputs. Because new features are generated, you lose interpretability. The techniques that fall in this category are listed below; they can be implemented using the Feature Extraction node in Model Studio or the PCA, RPCA, and NNET procedures in SAS Studio.

Principal component analysis (PCA), singular value decomposition (SVD), and robust PCA: use interval inputs only and capture linearity in inputs.
Autoencoder: uses all inputs and captures non-linearity in inputs.
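To show what "fewer latent features represent most of the variance" means for the linear case, here is a small sketch of ordinary PCA that extracts only the first principal component using power iteration. It is written in plain Python for illustration and is not how the PCA procedure or the Feature Extraction node is implemented. The data and function names are hypothetical.

```python
import math


def center(rows):
    """Subtract each column's mean so every input has mean zero."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]


def covariance(centered):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def first_component(cov, iterations=200):
    """Power iteration: returns (unit direction, variance along it)."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    cov_v = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
    eigenvalue = sum(v[i] * cov_v[i] for i in range(d))
    return v, eigenvalue


# Hypothetical data: three interval inputs that move almost together.
rows = [
    [1.0, 2.1, -0.9],
    [2.0, 3.9, -2.1],
    [3.0, 6.2, -3.0],
    [4.0, 7.8, -4.1],
    [5.0, 10.1, -4.9],
]

centered = center(rows)
cov = covariance(centered)
direction, variance_pc1 = first_component(cov)

# The new latent feature: each row projected onto the first component.
scores = [sum(r[j] * direction[j] for j in range(len(direction))) for r in centered]

total_variance = sum(cov[i][i] for i in range(len(cov)))
explained_ratio = variance_pc1 / total_variance
```

In this made-up data one latent feature carries nearly all of the variance of three inputs, but `scores` is a weighted mix of all three columns, which is why interpretability is lost.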

In summary, the following table provides an overview of the feature engineering capabilities available in SAS Visual Data Mining and Machine Learning.

[Table image: Feature engineering in SAS Visual Data Mining and Machine Learning]

Try out various feature engineering techniques in Model Studio

The Model Studio application makes it easier to identify the techniques that work best for your data and the problem at hand. It lets you quickly configure and compare multiple techniques by dragging and dropping nodes. In Figure 2 below, Variable Clustering, Feature Extraction using PCA, Feature Extraction using Autoencoder, Variable Selection, and Feature Extraction using Robust PCA are used for feature engineering, and each is followed by the same gradient boosting model so that you can compare which technique works best. Based on your business needs, the Model Comparison node at the end can be used to choose the champion using the fit statistics of the gradient boosting model, or to choose one of the top performers when the feature engineering technique must remain interpretable.

[Figure 2: Feature engineering in Model Studio]
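The champion choice described above can be sketched as a small selection rule. This is only an illustration of the logic, not the Model Comparison node's behavior. The `Pipeline` class, the use of misclassification rate as the fit statistic, and every number below are invented placeholders, not measured results.

```python
from dataclasses import dataclass


@dataclass
class Pipeline:
    name: str
    misclassification_rate: float  # assumed fit statistic; lower is better
    interpretable: bool            # does the technique keep original features?


def choose_champion(pipelines, require_interpretable=False):
    """Pick the pipeline with the best fit statistic, optionally
    restricted to interpretable feature engineering techniques."""
    eligible = [p for p in pipelines
                if p.interpretable or not require_interpretable]
    if not eligible:
        raise ValueError("no pipeline satisfies the interpretability requirement")
    return min(eligible, key=lambda p: p.misclassification_rate)


# Placeholder values only; Variable Clustering is marked interpretable on
# the assumption that one original feature is kept per cluster.
pipelines = [
    Pipeline("Variable Selection", 0.20, True),
    Pipeline("Variable Clustering", 0.21, True),
    Pipeline("Feature Extraction (PCA)", 0.19, False),
    Pipeline("Feature Extraction (Autoencoder)", 0.18, False),
]

best_overall = choose_champion(pipelines)
best_interpretable = choose_champion(pipelines, require_interpretable=True)
```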

Automated feature engineering in Model Studio

One of the great strengths of Model Studio is the support for sharing and reusing different project components. In the latest release (SAS Visual Data Mining and Machine Learning 8.3), Model Studio ships with an automated feature engineering template that can be used as a good starting point for automating the feature engineering process. Refer to the following posts from @Funda_SAS for additional details on using this template.

Watch the demo

Lastly, this post is also available as a video since it was presented as a super demo at the 2018 SAS Global Forum.

Stefan_Stoyanov · 07-06-2020

A link to the demo video on YouTube: https://www.youtube.com/watch?v=D9_jXt_sdX8&feature=youtu.be 🙂
