A new node, Feature Machine, has been added to Model Studio on SAS Visual Data Mining and Machine Learning 8.5. After analyzing your data, it automatically generates a set of transformed features for export to downstream nodes. It generates features in three steps. First, it explores the data and groups input variables into categories that share the same statistical profile. This profile draws on many variable attributes, including cardinality, coefficient of variation, entropy, qualitative variation, skewness, kurtosis, missingness, and outlier percentage. Next, the node screens input variables to identify those to be excluded from feature generation or to be transformed in a specific way. Finally, the variables that survive screening are used to generate features, based on the exploration groupings and as required by the transformation policies that you have selected. In this article, I describe the following functionality and features of this new node:
The input variable screening process scans each input for several data quality issues that undermine its usefulness as a predictive modeling input. If an input exceeds the threshold for one or more of these issues, it is either excluded from further feature generation or identified for the Group Rare feature transformation. Several options are available that allow you to control the screening for these individual data quality issues:
There are seven transformation policies available for selection. The features generated for each policy are designed to treat the data issue ascribed to that policy. Policies flagged with an asterisk are enabled by default.
Multiple features can be generated per input variable, with the type and number of features determined by the transformation policies that are selected. The name of each feature defines the transformation pipeline that's applied for that feature: a feature is named by appending the input variable name (with an underscore) to the transformation name (see the list of feature transformations in the section below). When Feature Selection is enabled (the default), all the features for an input variable are ranked using the Symmetric Uncertainty (SU) statistic, and the top-ranked features (per input) are selected and output from the node. When it is disabled, all generated features are output from the node. For Feature Selection, you specify the number of selected features per input with the Number of features per input option (Default=2).
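For readers unfamiliar with the ranking statistic, Symmetric Uncertainty between two discrete variables is commonly defined as SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), where H is Shannon entropy and I is mutual information. The Python sketch below illustrates that textbook definition only. It is not the node's implementation, and it assumes the feature and target values are already discrete (how Feature Machine bins interval features is not covered here). The function names are hypothetical.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def symmetric_uncertainty(x, y):
    """Textbook SU: 2 * I(X; Y) / (H(X) + H(Y)), a value between 0 and 1."""
    h_x, h_y = entropy(x), entropy(y)
    if h_x + h_y == 0:
        # Both variables are constant, so there is no information to share.
        return 0.0
    h_xy = entropy(list(zip(x, y)))      # joint entropy H(X, Y)
    mutual_info = h_x + h_y - h_xy       # I(X; Y)
    return 2.0 * mutual_info / (h_x + h_y)


# Hypothetical toy data: a feature identical to the target, and one unrelated to it.
target = [0, 0, 1, 1]
su_identical = symmetric_uncertainty([0, 0, 1, 1], target)
su_unrelated = symmetric_uncertainty([0, 1, 0, 1], target)
```

A value near 1 means the feature and the target carry almost the same information, and a value near 0 means they share almost none, which is why a higher SU corresponds to a better feature rank.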
The value that you specify for the Number of features per input option is compared against the feature rank values to determine the selected features. If the ranking produces a tie (two or more features have the same SU value), more features might be selected for an input than the number specified. In the table below, which lists the generated features for Input Variable AGE, the third, fourth, and fifth features are tied at Feature Rank 3. For this example, if the specified number of features per input is 1, the first feature is kept. If the specified number of features is 2, the first two features are kept. However, if the specified number of features is 3, the first five features are kept, since features 3, 4, and 5 all have Rank 3. The Where clause for selecting the features is: Where Feature Rank <= Number of features per input.
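The effect of ties on the Where clause can be reproduced with a short Python sketch. The feature names and SU values here are hypothetical placeholders, and the sketch assumes that tied SU values share a rank and that ranks are assigned densely (1, 2, 3, 3, 3). The source confirms only that features 3, 4, and 5 share Rank 3, so the numbering after a tie is an assumption.

```python
def rank_features(su_by_feature):
    """Assign ranks (1 = highest SU); features with equal SU share a rank."""
    distinct = sorted(set(su_by_feature.values()), reverse=True)
    rank_of = {su: position + 1 for position, su in enumerate(distinct)}
    return {name: rank_of[su] for name, su in su_by_feature.items()}


def select_features(ranks, n_per_input):
    """Apply: Where Feature Rank <= Number of features per input."""
    return [name for name, rank in ranks.items() if rank <= n_per_input]


# Hypothetical SU values for five features generated from AGE.
su_values = {
    "feature_1_AGE": 0.30,
    "feature_2_AGE": 0.25,
    "feature_3_AGE": 0.20,
    "feature_4_AGE": 0.20,
    "feature_5_AGE": 0.20,
}

ranks = rank_features(su_values)
kept_1 = select_features(ranks, 1)  # only the top feature
kept_2 = select_features(ranks, 2)  # the top two features
kept_3 = select_features(ranks, 3)  # all five, because three are tied at rank 3
```

Because the filter compares ranks rather than counting rows, a tie at the cutoff rank always admits every tied feature.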
Listed below is the set of available feature transformations, grouped by transformation policy. This superset of feature transformations is the source for the naming of all features. For additional information on these transformations, follow this link: https://go.documentation.sas.com/?docsetId=casactml&docsetTarget=casactml_datasciencepilot_details23...
Cardinality (Nominal to Interval transformations)
Entropy, Qualitative variation
Kurtosis
Missingness
Outliers
Skewness
Kurtosis, Outliers, Skewness (Low or Medium-rated values)
Kurtosis, Outliers, Skewness (Low-rated values)
After running Feature Machine, open the results from the node's context (pop-up) menu. When Feature Selection is enabled, the Selected Features report is displayed. This contains the list of selected features that are output by the node, sorted by Input Variable, Feature Rank, and Feature. Downstream nodes receive only these features. The Description column, which describes the feature, includes the input variable followed by a colon and the data quality issue, which is then followed by a hyphen and the transformation method. The first feature in the example below has "AGE: Not high (outlier, kurtosis, skewness) - power(2) + impute(median)". The expanded meaning: this feature is for input variable AGE. It addresses the data quality issue where one or more of Outlier, Kurtosis, and Skewness has a medium value, but none is high. It's transformed by squaring the value and imputing the median.
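If you export the report and want to work with the Description column programmatically, the layout described above (variable, colon, issue, hyphen, transformation) can be split apart with a few string operations. This is a minimal Python sketch with a hypothetical function name. It assumes every description follows the same layout as the AGE example, and that individual transformation steps are joined by a plus sign, which is inferred from that single example.

```python
def parse_description(description):
    """Split 'VAR: issue - step + step' into its three parts."""
    # The text before the first ': ' is the input variable name.
    variable, _, rest = description.partition(": ")
    # The first ' - ' separates the data quality issue from the transformation.
    issue, _, transformation = rest.partition(" - ")
    # Assumed convention: individual steps are joined with '+'.
    steps = [step.strip() for step in transformation.split("+")]
    return {"variable": variable, "issue": issue, "steps": steps}


parsed = parse_description(
    "AGE: Not high (outlier, kurtosis, skewness) - power(2) + impute(median)"
)
# parsed["variable"] is "AGE"
# parsed["issue"] is "Not high (outlier, kurtosis, skewness)"
# parsed["steps"] is ["power(2)", "impute(median)"]
```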
When Feature Selection is disabled, the Generated Features report is displayed instead. This contains the list of all generated features, which are output by the node for input into downstream nodes. These are sorted by Input Variable and Feature.
Always displayed is the Output report, which contains the Generated Features print output. This is a listing of all features generated by Feature Machine, even when Feature Selection is enabled. When Feature Selection is disabled, this contains the same information as the Generated Features report.
In this article, I have given an overview of the new Feature Machine node in Model Studio on SAS Visual Data Mining and Machine Learning 8.5, explaining its functionality and how it works. Here are the main points: