
New Feature Machine node in Model Studio on SAS Visual Data Mining and Machine Learning 8.5

Started ‎09-18-2020 by
Modified ‎12-19-2019 by
Views 5,091


A new node has been added to Model Studio on SAS Visual Data Mining and Machine Learning 8.5 which, after analyzing your data, automatically generates an entire set of transformed features for export to downstream nodes.  It generates features in three steps.  First, it explores the data, grouping input variables into categories that share the same statistical profile.  This profile uses many variable attributes, including cardinality, coefficient of variation, entropy, qualitative variation, skewness, kurtosis, missingness, and outlier percentage.  Next, the node screens the input variables to identify variables to be excluded from feature generation, or to be transformed in a specific way.  Finally, the variables that survive the screening process are used to generate features, based upon the exploration groupings and the transformation policies that you have selected.  In this article, I describe the following functionality and features of this new node:

  1. Input Variable Screening
  2. Transformation Policies
  3. Feature Generation and Selection
  4. Feature Transformations
  5. Feature Machine Results



Input Variable Screening

The input variable screening process scans each input for several data quality issues that would undermine its usefulness as a predictive modeling input.  If an input exceeds the threshold for one or more of these issues, the input is excluded from further feature generation, or it's identified for the Group Rare feature transformation.  Several options allow you to control the screening for each of these data quality issues: 

  1. Coefficient of variation – Identify interval variables that have a low coefficient of variation (close to constant value). These variables are excluded from feature processing.  Enabled by default.
  2. Group rare levels – Identify nominal variables that have rare levels.  These variables are transformed by rare level grouping.  Enabled by default.
  3. Leakage percent threshold – Identify variables that have a very high level of information about the target (leakage variables). Variables that exceed your specified threshold (target entropy reduction) are excluded from feature processing.  Default=90.
  4. Mutual information threshold – Identify variables that have a low level of information about the target (not informative). Variables that are below your specified threshold are excluded from feature processing.  Default=0.05.
  5. Redundancy threshold – Identify variables that are redundant (highly correlated). If the Symmetric Uncertainty for two variables exceeds your specified threshold, the variable that has less information about the target is excluded from feature processing.  Default=1.  Redundancy screening is not enabled with this default value.
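The node performs this screening internally, and SAS does not publish the exact formulas. As a rough illustration only, the information-based screens (mutual information, leakage, and Symmetric Uncertainty redundancy) can be sketched in Python for discrete variables; the function names and threshold handling below are my own simplifications, not the node's implementation:

```python
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy (bits) of a discrete variable."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def mutual_information(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))

def symmetric_uncertainty(x, y):
    """SU = 2*I(X;Y) / (H(X) + H(Y)); ranges from 0 to 1."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0
    return 2 * mutual_information(x, y) / (hx + hy)

def screen(inputs, target, mi_threshold=0.05, leakage=0.90):
    """Drop inputs that are uninformative (MI below threshold) or that
    leak the target (target entropy reduction above the leakage threshold)."""
    h_target = entropy(target)
    kept = {}
    for name, col in inputs.items():
        mi = mutual_information(col, target)
        reduction = mi / h_target if h_target > 0 else 0.0
        if mi < mi_threshold or reduction > leakage:
            continue  # screened out of feature processing
        kept[name] = col
    return kept
```

A near-copy of the target would be screened as leakage, while a constant input falls below the mutual information threshold.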




Transformation Policies

There are seven transformation policies available for selection.  The features generated for each policy are designed to treat the data issue ascribed to that policy.  Policies flagged with an asterisk are enabled by default. 

  1. Cardinality - Treatment of high cardinality*
  2. Entropy - Treatment of low entropy
  3. Kurtosis - Treatment of high kurtosis
  4. Missingness - Treatment of missing values*
  5. Outliers - Treatment of outliers
  6. Qualitative variation - Treatment of low indices of qualitative variation
  7. Skewness - Treatment of high skewness*




Feature Generation and Selection

Multiple features can be generated per input variable, with the type and number of features determined by the transformation policies that are selected.  The name of each feature defines the transformation pipeline that's applied for that feature.  A feature is named by appending the input variable name (with an underscore) to the transformation name (see the list of feature transformations in the section below).  When Feature Selection is enabled (by default), all the features for an input variable are ranked using the Symmetric Uncertainty statistic, and the top-ranked features (per input) are selected and output from the node.  When disabled, all generated features are output from the node.  For Feature Selection, you specify the number of selected features per input with the Number of features per input option (Default=2). 




The value that you specify for this option is compared against the feature rank values to determine the selected features.  If the ranking of features results in a tie (two or more features have the same SU value), this may result in more features being selected for an input than specified.  In the table below, which lists the generated features for Input Variable AGE, the third, fourth, and fifth features are tied at Feature Rank 3.  For this example, if the specified number of features per input is 1, the first feature is kept.  If the specified number of features is 2, the first two features are kept.  However, if the specified number of features is 3, the first five features are kept, since features 3, 4, and 5 all have Rank 3.  The Where clause for selecting the features is:  Where Feature Rank <= Number of features per input.
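The tie-inclusive selection described above can be sketched in Python. Dense ranking (tied SU values share a rank) reproduces the AGE example; the function names are illustrative, not part of the product:

```python
def rank_features(su_by_feature):
    """Dense-rank features by Symmetric Uncertainty: highest SU gets
    Feature Rank 1, and tied SU values share the same rank."""
    distinct = sorted(set(su_by_feature.values()), reverse=True)
    rank_of = {su: r + 1 for r, su in enumerate(distinct)}
    return {f: rank_of[su] for f, su in su_by_feature.items()}

def select_features(su_by_feature, n_per_input=2):
    """Keep features Where Feature Rank <= Number of features per input."""
    ranks = rank_features(su_by_feature)
    return sorted(f for f, r in ranks.items() if r <= n_per_input)
```

With three features tied at rank 3, asking for 3 features per input selects all five, just as in the AGE example.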





Feature Transformations

Listed below is the set of available feature transformations, grouped by transformation policy.  This superset of feature transformations is the source for the naming of all features.  For additional information on these transformations, follow this link:  


Cardinality (Nominal to Interval transformations) 

  1. hc_tar_mean – Mean target encoding
  2. hc_tar_min – Minimum target encoding
  3. hc_tar_max – Maximum target encoding
  4. hc_tar_frq_rat – Frequency ratio target encoding
  5. hc_tar_woe – Weight of evidence target encoding
  6. hc_tar_evt_prob – Event probability target encoding
  7. hc_lbl_cnt – Level count rank
  8. hc_cnt – Level count
  9. hc_cnt_log – Level count followed by Log transformation 
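To illustrate what these target encodings compute, here is a minimal Python sketch of mean target encoding (hc_tar_mean) and Weight of Evidence encoding (hc_tar_woe) for a binary target. The smoothing constant is my own addition to avoid division by zero; SAS's exact formulas may differ:

```python
from collections import defaultdict
from math import log

def mean_target_encode(levels, target):
    """hc_tar_mean-style: replace each nominal level with the mean
    of the target over the rows having that level."""
    sums, counts = defaultdict(float), defaultdict(int)
    for lvl, y in zip(levels, target):
        sums[lvl] += y
        counts[lvl] += 1
    enc = {lvl: sums[lvl] / counts[lvl] for lvl in counts}
    return [enc[lvl] for lvl in levels]

def woe_encode(levels, target, smooth=0.5):
    """hc_tar_woe-style: log of the (smoothed) event rate over the
    non-event rate for each level, for a 0/1 target."""
    ev, nev = defaultdict(float), defaultdict(float)
    tot_ev = sum(target)
    tot_nev = len(target) - tot_ev
    for lvl, y in zip(levels, target):
        ev[lvl] += y
        nev[lvl] += 1 - y
    enc = {lvl: log(((ev[lvl] + smooth) / (tot_ev + smooth)) /
                    ((nev[lvl] + smooth) / (tot_nev + smooth)))
           for lvl in set(levels)}
    return [enc[lvl] for lvl in levels]
```

Levels associated with the event get positive WOE values; levels associated with the non-event get negative values.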


Entropy, Qualitative variation 

  1. grp_rare1 – Mode imputation and group rare levels 
  2. grp_rare2 – Missing level and group rare levels
  3. lchehi_lab – Label encoding
  4. lcnhenhi_grp_rare – Group rare levels
  5. lcnhenhi_rtree5 – Five-bin regression tree binning
  6. lcnhenhi_rtree10 – Ten-bin regression tree binning
  7. lcnhenhi_dtree5 – Five-bin decision tree binning
  8. lcnhenhi_dtree10 – Ten-bin decision tree binning
  9. lcnhenhi_woe5 – Five-bin Weight of Evidence binning
  10. lcnhenhi_woe10 – Ten-bin Weight of Evidence binning 
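A minimal Python sketch of rare-level grouping (the grp_rare-style transforms), assuming a simple minimum-count rule; the actual threshold logic used by the node may differ:

```python
from collections import Counter

def group_rare_levels(levels, min_count=5, other="_RARE_"):
    """Collapse levels seen fewer than min_count times into a single
    catch-all level, so rare levels don't fragment downstream models."""
    counts = Counter(levels)
    return [lvl if counts[lvl] >= min_count else other for lvl in levels]
```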



Kurtosis 

  1. hk_yj – Yeo-Johnson power transformations with parameters -2, -1, 0, 1, 2
  2. hk_dtree_disct5 – Five-bin decision tree binning
  3. hk_dtree_disct10 – Ten-bin decision tree binning
  4. hk_rtree_disct5 – Five-bin regression tree binning
  5. hk_rtree_disct10 – Ten-bin regression tree binning 
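The hk_yj transform applies Yeo-Johnson at each listed parameter value. Below is a minimal Python sketch using the standard Yeo-Johnson definition (which, unlike Box-Cox, accepts negative inputs); the helper names are illustrative:

```python
from math import log

def yeo_johnson(x, lam):
    """Yeo-Johnson power transform for a single value."""
    if x >= 0:
        return log(x + 1) if lam == 0 else ((x + 1) ** lam - 1) / lam
    return -log(1 - x) if lam == 2 else -(((1 - x) ** (2 - lam) - 1) / (2 - lam))

def hk_yj_features(values, lams=(-2, -1, 0, 1, 2)):
    """One transformed column per lambda, mirroring hk_yj's parameter grid."""
    return {f"yj_{lam}": [yeo_johnson(v, lam) for v in values] for lam in lams}
```

With lambda = 1 the transform is the identity, so the parameter grid always includes the untransformed variable as one candidate.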



Missingness 

  1. cpy_int_med_imp – Median imputation
  2. cpy_nom_mode_imp_lab – Mode imputation and Label encoding
  3. cpy_nom_miss_lev_lab – Missing level and Label encoding
  4. miss_ind – Missing indicator 
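These missingness treatments can be sketched in Python; here None stands in for a missing value, and the function name is illustrative:

```python
from statistics import median

def impute_median(values):
    """cpy_int_med_imp-style: fill missing interval values with the median
    of the observed values; miss_ind-style: a 0/1 missingness flag."""
    observed = [v for v in values if v is not None]
    med = median(observed)
    filled = [med if v is None else v for v in values]
    indicator = [1 if v is None else 0 for v in values]
    return filled, indicator
```

Keeping the indicator alongside the imputed column preserves the information that a value was missing, which can itself be predictive.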



Outliers 

  1. ho_winsor – Winsorization
  2. ho_quan_disct5 – Five-bin quantile binning
  3. ho_quan_disct10 – Ten-bin quantile binning
  4. ho_dtree_disct5 – Five-bin decision tree binning
  5. ho_dtree_disct10 – Ten-bin decision tree binning
  6. ho_rtree_disct5 – Five-bin regression tree binning
  7. ho_rtree_disct10 – Ten-bin regression tree binning 
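Winsorization and quantile binning from this group can be sketched in Python; the percentile handling here uses simple order statistics and an equal-frequency rule, which may differ from the node's exact method:

```python
def winsorize(values, lower_pct=0.05, upper_pct=0.95):
    """ho_winsor-style: clamp extreme values to empirical percentiles
    so outliers can't dominate downstream models."""
    s = sorted(values)
    lo = s[int(lower_pct * (len(s) - 1))]
    hi = s[int(upper_pct * (len(s) - 1))]
    return [min(max(v, lo), hi) for v in values]

def quantile_bin(values, n_bins=5):
    """ho_quan_disct-style: assign each value to one of n_bins
    equal-frequency bins (bin ids 0 .. n_bins-1)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for pos, i in enumerate(order):
        bins[i] = min(pos * n_bins // len(values), n_bins - 1)
    return bins
```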



Skewness 

  1. hs_bc – Box-Cox power transformations with parameters -2, -1, 0, 1, 2
  2. hs_dtree_disct5 – Five-bin decision tree binning
  3. hs_dtree_disct10 – Ten-bin decision tree binning
  4. hs_rtree_disct5 – Five-bin regression tree binning
  5. hs_rtree_disct10 – Ten-bin regression tree binning 


Kurtosis, Outliers, Skewness (Low or Medium-rated values) 

  1. nhoks_nloks_pow, nhoks_nloks_log – Tukey's ladder of power transformations with parameters -2, -1, -0.5, 0, 0.5, 1, 2
  2. nhoks_nloks_dtree5 – Five-bin decision tree binning
  3. nhoks_nloks_dtree10 – Ten-bin decision tree binning
  4. nhoks_nloks_rtree5 – Five-bin regression tree binning
  5. nhoks_nloks_rtree10 – Ten-bin regression tree binning 


Kurtosis, Outliers, Skewness (Low-rated values)  

  1. all_l_oks_dtree5 – Five-bin decision tree binning
  2. all_l_oks_dtree10 – Ten-bin decision tree binning
  3. all_l_oks_rtree5 – Five-bin regression tree binning
  4. all_l_oks_rtree10 – Ten-bin regression tree binning 



Feature Machine Results

After running Feature Machine, open the results from the node's context pop-up menu.  When Feature Selection is enabled, the Selected Features report is displayed.  This contains the list of selected features, sorted by Input Variable, Feature Rank, and Feature, that are output by the node.  Downstream nodes will receive only these features.  The Description column, which describes the feature, includes the input variable followed by a colon and the data quality issue, which is then followed by a hyphen and the transformation method. The first feature in the example below has "AGE: Not high (outlier, kurtosis, skewness) - power(2) + impute(median)".  The expanded meaning:  This feature is for input variable AGE.  It addresses the data quality issue where one or more of Outlier, Kurtosis, and Skewness has a medium value, but none is high.  It's transformed by taking the square and imputing the median value.




When Feature Selection is disabled, the Generated Features report is displayed instead.  This contains the list of all generated features, which are output by the node for input into downstream nodes.  These are sorted by Input Variable and Feature. 




Always displayed is the Output report, which contains the Generated Features print output.  This is a listing of all features generated by Feature Machine, even when Feature Selection is enabled.  When Feature Selection is disabled, this contains the same information as the Generated Features report. 






In this article, I have given an overview of the new Feature Machine node in Model Studio on SAS Visual Data Mining and Machine Learning 8.5, explaining its functionality and how it works.  Here are the main points:

  • With this node, you automatically generate transformed features which address different data issues that negatively impact predictive modeling.
  • The number and type of these features are based upon the transformation policies that you select.
  • Variable screening is performed to exclude, from feature generation, variables that exceed certain data assessment thresholds.
  • With the Feature Selection option, you control how many features are selected per input variable.  Selected features are the top N ranked features, which are exported to downstream nodes.

