In the analytic lifecycle, feature engineering is a critical step that precedes model building. It involves creating and refining features (or inputs) to build better predictive models. Typically, you start with a set of features in your data; sometimes they need careful extraction, or you need to use statistical or data mining techniques to transform them into a more manageable and useful form. Other times, you need to go out and get additional data (and features) to augment what you have. Whatever the process, the goal of feature engineering is almost always to build better models: models that not only have high predictive power but also generalize well (that is, predict well on new or unseen data).
This post categorizes feature engineering into four groups and explains how those techniques can be implemented and compared next to one another using SAS Visual Data Mining and Machine Learning 8.3 in SAS Viya. Note that the terms feature and input are used interchangeably in this post.
Building and extracting good features requires experience and domain knowledge about the problem and the data you are working with. The four groups used to classify feature engineering techniques are:

1. Constructing new features
2. Selecting key features
3. Clustering features into groups
4. Extracting new features

Let us look at each in more detail.
Constructing new features is more akin to data preparation: you might apply a numerical transformation to a feature, aggregate two or more features into a single feature, or decompose a single feature into many for better representation. Data massaging techniques, such as the log transformation, binning of interval inputs, or level encoding of high-cardinality categorical inputs, also fall into this category. These methods can be implemented in SAS Studio using SAS DATA step code, or in Model Studio using the Transformations node or the SAS Code node.
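To make the ideas concrete outside of SAS, here is a minimal Python sketch of two of these constructions, a log transformation and equal-width binning of an interval input. The data and column name are hypothetical, chosen only for illustration:

```python
import numpy as np

# Hypothetical right-skewed interval input, e.g. household income
income = np.array([12000, 25000, 40000, 85000, 250000], dtype=float)

# Numerical transformation: log1p (log(1 + x)) compresses the long right tail
log_income = np.log1p(income)

# Binning: divide the input's range into 3 equal-width intervals
edges = np.linspace(income.min(), income.max(), 4)
income_bin = np.digitize(income, edges[1:-1])  # bin labels 0, 1, 2
```

The binned version trades precision for robustness to outliers, while the log transform keeps the input interval-valued but reduces skewness.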
Selecting key features involves reducing the number of features by selecting a subset that explains the most variance in the inputs (unsupervised) or the most variance in the target (supervised). These methods preserve interpretability and can be implemented using the VARREDUCE procedure in SAS Studio or the Variable Selection node in Model Studio. The Variable Selection node provides additional functionality by allowing you to perform an unsupervised selection followed by a supervised selection, or to combine unsupervised and supervised selections using a combination criterion (such as selected by at least one, selected by majority, or selected by all).
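The combination idea can be sketched in a few lines of Python. This is not how the Variable Selection node is implemented; it is a toy illustration with hypothetical inputs, using a variance threshold as the unsupervised pass and correlation with the target as the supervised pass:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
# Hypothetical inputs: x0 is informative, x1 is noise, x2 is nearly constant
x0 = rng.normal(size=n)
x1 = rng.normal(size=n)
x2 = 5.0 + rng.normal(scale=1e-3, size=n)
y = 3 * x0 + rng.normal(scale=0.5, size=n)

X = np.column_stack([x0, x1, x2])
names = ["x0", "x1", "x2"]

# Unsupervised pass: keep inputs whose variance exceeds a threshold
unsup = {names[j] for j in range(3) if X[:, j].var() > 0.01}

# Supervised pass: keep inputs correlated with the target
sup = {names[j] for j in range(3)
       if abs(np.corrcoef(X[:, j], y)[0, 1]) > 0.3}

# Combination criteria, analogous to "selected by all" / "at least one"
selected_by_all = unsup & sup
selected_by_at_least_one = unsup | sup
```

Here the nearly constant input fails the unsupervised pass, the noise input fails the supervised pass, and only the informative input survives the "selected by all" criterion.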
Clustering features into groups divides features into disjoint clusters such that features within a cluster are highly similar or correlated. You can then either keep one feature from each cluster or compute the first principal component of each cluster using principal component analysis (PCA). When using PCA, you lose interpretability because the cluster is represented by its first principal component rather than by an original feature. This technique can be implemented using the GVARCLUS procedure in SAS Studio or the Variable Clustering node in Model Studio.
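As a rough sketch of the idea (not the GVARCLUS algorithm), the following Python example greedily groups columns by absolute correlation and then builds one latent feature per cluster from its first principal component. The data and the 0.8 threshold are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300
base1 = rng.normal(size=n)
base2 = rng.normal(size=n)
# Two hypothetical groups of strongly correlated inputs
X = np.column_stack([
    base1, base1 + 0.1 * rng.normal(size=n),   # group A
    base2, base2 + 0.1 * rng.normal(size=n),   # group B
])

# Greedy clustering: group columns whose |correlation| exceeds 0.8
corr = np.abs(np.corrcoef(X, rowvar=False))
clusters, assigned = [], set()
for j in range(X.shape[1]):
    if j in assigned:
        continue
    members = [k for k in range(X.shape[1])
               if k not in assigned and corr[j, k] > 0.8]
    assigned.update(members)
    clusters.append(members)

# Cluster representative: first principal component of each cluster
reps = []
for members in clusters:
    sub = X[:, members] - X[:, members].mean(axis=0)
    _, _, vt = np.linalg.svd(sub, full_matrices=False)
    reps.append(sub @ vt[0])  # one latent feature per cluster
```

Keeping an original column from each cluster instead of `reps` would preserve interpretability, as noted above.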
Extracting new features involves transforming the features so that fewer latent features represent most of the variance in the data. This method is very useful when your data has a large number of inputs. Because new features are generated, you lose interpretability. The techniques that fall into this category are listed below; they can be implemented using the Feature Extraction node in Model Studio or with the PCA, RPCA, and NNET procedures in SAS Studio.
| Technique | Characteristics |
| --- | --- |
| PCA and Robust PCA | Use interval inputs only and capture linearity in the inputs. |
| Autoencoder | Uses all inputs and captures non-linearity in the inputs. |
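For intuition, here is a minimal Python sketch of linear feature extraction via PCA (computed from the SVD of centered data), on hypothetical low-rank data where two latent factors drive ten inputs. It illustrates the point above: a few extracted features can represent most of the variance:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, k = 500, 10, 2
# Hypothetical data with low-rank structure: 2 latent factors drive 10 inputs
latent = rng.normal(size=(n, k))
loadings = rng.normal(size=(k, p))
X = latent @ loadings + 0.1 * rng.normal(size=(n, p))

# PCA via SVD of the centered data matrix
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = (s ** 2) / (s ** 2).sum()  # variance ratio per component

# Keep only the first two principal components as new latent features
scores = Xc @ Vt[:2].T  # shape (n, 2): 10 inputs reduced to 2 features
```

An autoencoder would replace the linear projection `Xc @ Vt[:2].T` with a trained nonlinear encoder, which is what lets it capture non-linearity at the cost of more computation.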
In summary, the following table provides an overview of the feature engineering capabilities available in SAS Visual Data Mining and Machine Learning.
The Model Studio application makes it easy to identify the techniques that work best for your data and the problem at hand, enabling you to quickly configure and compare multiple techniques by dragging and dropping nodes. In Figure 2 below, Variable Clustering, Feature Extraction using PCA, Feature Extraction using Autoencoder, Variable Selection, and Feature Extraction using Robust PCA are used for feature engineering; each is followed by the same gradient boosting model to compare which technique works best. Based on your business needs, the Model Comparison node at the end can be used to choose the champion using the fit statistics of the gradient boosting model, or to choose one of the top performers when the feature engineering technique must remain interpretable.
One of the great strengths of Model Studio is the support for sharing and reusing different project components. In the latest release (SAS Visual Data Mining and Machine Learning 8.3), Model Studio ships with an automated feature engineering template that can be used as a good starting point for automating the feature engineering process. Refer to the following posts from @Funda_SAS for additional details on using this template.
Lastly, this post is also available as a video since it was presented as a super demo at the 2018 SAS Global Forum.