The automated feature engineering template is the newest addition to the collection of templates in SAS Model Studio on SAS Visual Data Mining and Machine Learning 8.3. For a gentle introduction to feature engineering and why automation is desirable, see my recent blog post "Automate your feature engineering".
By using automated feature transformation and extraction techniques, this new template automatically creates newly engineered features. The idea is to automatically learn a set of features (from potentially noisy, raw data) that can be useful in supervised learning tasks without the need to handcraft new features.
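This template creates new features in three steps: it level-encodes the high-cardinality variables, builds new feature sets with three automated feature engineering techniques, and compares the resulting feature sets by using gradient boosting models.
A more detailed explanation of the three steps is as follows: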
The first node in Step 1 is the SAS Code High Cardinality node. The SAS Code node is my favorite in Model Studio because it lets experienced programmers incorporate their own code into a pipeline for customized tasks while still using the point-and-click environment of Model Studio for everything else. You can see the content of this node by clicking it and then opening the Code Editor. The following is contained in this SAS Code node:
This SAS Code node first identifies the high-cardinality variables (nominal variables that have too many unique levels) as nominal input variables that have between 20 and 1,000 levels. You can modify this range by updating the values of minlevels and maxlevels. Then it specifies a numeric transformation, level encoding (TRANSFORM=LEVELENCODE), for those variables only. Note that the SAS code only records the level encoding transformation in the metadata. To apply the transformation, you need to run a Transformations node (a data preprocessing node). The role of the connected Transformations node is therefore simply to carry out the level encoding that is already specified in the metadata, which is why level encoding is not specified again in the Transformations node.
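As a rough illustration of the selection logic described above, the following Python sketch (not the SAS code in the node) flags nominal columns whose number of distinct levels falls within a range and records a level-encoding request for them, to be applied in a later step. The function name `flag_high_cardinality`, the dictionary used as "metadata", and the inclusive treatment of both bounds are assumptions made for this example.

```python
def flag_high_cardinality(nominal_columns, min_levels=20, max_levels=1000):
    """Return {column_name: "LEVELENCODE"} for high-cardinality columns.

    nominal_columns: dict mapping a column name to its list of values.
    Bounds are treated as inclusive here, which is an assumption.
    """
    requested = {}
    for name, values in nominal_columns.items():
        n_levels = len(set(values))  # number of distinct levels
        if min_levels <= n_levels <= max_levels:
            # Only record the request; applying it is a separate step,
            # mirroring the metadata-then-Transformations-node split.
            requested[name] = "LEVELENCODE"
    return requested


columns = {
    "zip_code": [str(10000 + i) for i in range(50)],   # 50 levels
    "gender": ["F", "M"] * 25,                          # 2 levels
    "customer_id": [str(i) for i in range(2000)],       # 2,000 levels
}
metadata = flag_high_cardinality(columns)
```

In this toy input only `zip_code` falls inside the range, so it is the only column flagged; `gender` has too few levels and `customer_id` too many.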
Level encoding converts nominal features to numeric features. It is especially useful for high-cardinality variables, which are often taxing for machine learning algorithms in terms of computing resources. It is also the simplest nominal-to-numeric transformation: it sorts the levels of a nominal variable alphabetically and then assigns consecutive integers (starting from 1) to the levels in that order. More sophisticated nominal-to-numeric transformations, including target-based transformations (e.g., weight of evidence [WOE]) and other similar transformations, are available as part of the “Data Preprocess Action Set” in SAS® Viya®.
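The encoding rule just described can be sketched in a few lines of Python. This is an illustration of the idea, not the SAS implementation: the function name `level_encode` is hypothetical, Python's default string ordering (case-sensitive, by code point) stands in for "alphabetical order", and missing values are not handled.

```python
def level_encode(values):
    """Map each nominal level to its 1-based rank in sorted order."""
    levels = sorted(set(values))                      # distinct levels, sorted
    mapping = {level: rank for rank, level in enumerate(levels, start=1)}
    encoded = [mapping[v] for v in values]            # replace level by its rank
    return encoded, mapping


encoded, mapping = level_encode(["pear", "apple", "fig", "apple"])
# mapping is {"apple": 1, "fig": 2, "pear": 3}; encoded is [3, 1, 2, 1]
```

Because the numbers reflect only alphabetical position, the encoding imposes an arbitrary order on the levels; tree-based learners such as gradient boosting can still split on it, which is one reason the approach is workable in this template.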
Step 2 consists of the following nodes, which apply three different automated feature engineering techniques:
Transformations – Best: This node uses the Transformations node to specify the “Best” transformation for all interval variables. For each interval variable, this method compares single-variable transformations (such as inverse transformation, standardization, centering, and log transformation) based on a ranking criterion (such as correlation with the target) and selects the transformation that ranks highest. For a more detailed explanation of this method, see “Best transformation – a new feature in Model Studio 8.3”.
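To make the "compare and pick the highest-ranked transformation" idea concrete, here is a minimal Python sketch. It is not how Model Studio implements the method: the candidate set (`identity`, `inverse`, `log`, `square_root`), the use of absolute Pearson correlation as the ranking criterion, and the names `pearson` and `best_transformation` are all assumptions for illustration. The candidates shown require strictly positive inputs.

```python
import math


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0


# Hypothetical candidate set; all of these assume x > 0.
CANDIDATES = {
    "identity": lambda x: x,
    "inverse": lambda x: 1.0 / x,
    "log": math.log,
    "square_root": math.sqrt,
}


def best_transformation(xs, target):
    """Score every candidate by |correlation with target| and pick the top one."""
    scores = {}
    for name, fn in CANDIDATES.items():
        transformed = [fn(x) for x in xs]
        scores[name] = abs(pearson(transformed, target))
    best = max(scores, key=scores.get)
    return best, scores


xs = [1.0, 2.0, 4.0, 8.0, 16.0]
target = [0.0, 1.0, 2.0, 3.0, 4.0]   # equals log2(x), so "log" should win
best, scores = best_transformation(xs, target)
```

Note that standardization and centering are affine, so they would tie with `identity` under a correlation criterion; that is why they are left out of this particular sketch.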
Feature Extraction – PCA: This node uses the Feature Extraction preprocessing node to specify the Automatic feature extraction technique for the interval input variables. If the total number of interval input variables is less than or equal to 500, automatic feature extraction is equivalent to principal component analysis; otherwise, it is equivalent to singular value decomposition (SVD).
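The sketch below illustrates the two ideas in this description: the 500-variable rule for choosing the method, and what a principal component is. It is a simplified stand-in, not the algorithm used by the Feature Extraction node. The names `automatic_method` and `first_principal_component` are hypothetical, and power iteration on the sample covariance matrix is used here only because it fits in a few lines of plain Python; it recovers just the first component.

```python
import math


def automatic_method(n_interval_inputs, threshold=500):
    """Rule stated in the text: PCA up to 500 interval inputs, SVD beyond."""
    return "PCA" if n_interval_inputs <= threshold else "SVD"


def first_principal_component(rows, iterations=200):
    """Return (direction, scores) of the first principal component."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration: repeatedly multiply by cov and renormalize.
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:          # all inputs constant; nothing to extract
            break
        v = [x / norm for x in w]
    # The new feature is the projection of each centered row onto v.
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores


rows = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]   # points on y = 2x
direction, scores = first_principal_component(rows)
```

For these points, which lie exactly on a line, the direction comes out proportional to (1, 2) and the single score column carries all of the variation in the two original inputs.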
Feature Extraction – Autoencoder: This node specifies the autoencoder feature extraction technique for creating new features. This technique uses all the input variables (interval and nominal) for feature extraction. An autoencoder is an unsupervised learning technique whose objective is to learn a set of features that can be used to reconstruct the input data. Briefly, a neural network is trained with its target neurons set equal to its input neurons. The network has multiple layers with a bottleneck in the middle layer, so it is forced to learn a reduced-dimensional internal representation of the inputs before reconstructing them in the output layer. In this template, the middle hidden layer is set to 10 neurons, which means that ten new features are created.
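The bottleneck idea can be shown with a deliberately tiny example. The sketch below trains a linear autoencoder (no activation functions, one hidden layer) with full-batch gradient descent in plain Python; the target of the network is its own input, and the hidden-layer values are the extracted features. This is far simpler than the network the node trains: the function name `train_linear_autoencoder`, the learning rate, the epoch count, and the single-neuron bottleneck are all assumptions chosen so the example stays short.

```python
import random


def train_linear_autoencoder(rows, n_hidden=1, lr=0.02, epochs=2000, seed=0):
    """Train x -> h = enc*x -> x_hat = dec*h to reconstruct x; return encoder."""
    rng = random.Random(seed)
    n, d = len(rows), len(rows[0])
    enc = [[rng.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(n_hidden)]
    dec = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(d)]

    def encode(x):
        return [sum(enc[m][j] * x[j] for j in range(d)) for m in range(n_hidden)]

    def decode(h):
        return [sum(dec[i][m] * h[m] for m in range(n_hidden)) for i in range(d)]

    def mean_loss():
        # Mean squared reconstruction error over all rows.
        total = 0.0
        for x in rows:
            x_hat = decode(encode(x))
            total += sum((x_hat[i] - x[i]) ** 2 for i in range(d))
        return total / n

    initial_loss = mean_loss()
    for _ in range(epochs):
        g_enc = [[0.0] * d for _ in range(n_hidden)]
        g_dec = [[0.0] * n_hidden for _ in range(d)]
        for x in rows:
            h = encode(x)
            err = [xh - xi for xh, xi in zip(decode(h), x)]
            for i in range(d):
                for m in range(n_hidden):
                    g_dec[i][m] += 2 * err[i] * h[m] / n
            for m in range(n_hidden):
                back = sum(2 * err[i] * dec[i][m] for i in range(d))
                for j in range(d):
                    g_enc[m][j] += back * x[j] / n
        # Gradient step, in place so that encode/decode see the new weights.
        for i in range(d):
            for m in range(n_hidden):
                dec[i][m] -= lr * g_dec[i][m]
        for m in range(n_hidden):
            for j in range(d):
                enc[m][j] -= lr * g_enc[m][j]
    return encode, initial_loss, mean_loss()


# Three inputs that all vary along one direction, so one hidden neuron suffices.
rows = [(t, 2 * t, -t) for t in (-1.0, -0.5, 0.0, 0.5, 1.0)]
encode, loss_before, loss_after = train_linear_autoencoder(rows)
new_features = [encode(x) for x in rows]   # one extracted feature per row
```

With `n_hidden=10` and nonlinear layers the same principle yields the ten new features mentioned above; the extracted features are simply the bottleneck activations for each observation.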
To compare the performance of the five feature sets (three automatically engineered sets, the original set with level encoding for the high-cardinality variables, and the original set without level encoding), each set is used as input to the gradient boosting algorithm. We chose gradient boosting because it is a very effective supervised learning algorithm that often outperforms other algorithms in predictive accuracy. Automatic hyperparameter tuning (autotuning) is turned on to find good hyperparameter settings for each gradient boosting model, so that the comparison between feature sets is fairer and less dependent on the hyperparameters. However, keep in mind that autotuning comes with an additional computing cost. If this step takes too long to run, you can change the autotuning settings, or simply turn autotuning off and use the default hyperparameter settings.
The Model Comparison node, at the bottom of the preceding figure, compares the five gradient boosting models and reports performance based on various metrics for the existing partitions of the data (e.g., training, validation, and test sets).
To use this template more effectively, you first need to increase the maximum class level in Model Studio (the default value is 20) so that high-cardinality variables are included in the analysis. You can increase this value only when you create a new project, by clicking the Advanced settings. The next figure shows the Maximum class level increased to 1,000.
It is important to remember that using this template does not guarantee that one of the newly created feature sets will perform better than the original features, because every data set is unique and these techniques will not work equally well on all of them. Instead, the goal of this template is to show an example that you can follow to create different automatically engineered feature sets by using the many other nodes and tools provided in Model Studio, and to test their performance in a similar way with minimal effort. Automation through these templates allows you to try some simple ideas first and check whether there is any value in your data sets before you invest more time, and it can help your company make quick decisions at lower cost.