
Automated Feature Engineering in SAS Model Studio on SAS Viya

Started ‎07-09-2021 by
Modified ‎07-09-2021 by

Properly engineering and selecting your features is as important as choosing and tuning your models.  Good feature engineering can vastly improve your model results. But who has time for this? Automated feature engineering and selection available with SAS Visual Data Mining and Machine Learning is your friend!

 

The Data Science Pilot action set includes three actions that help with feature engineering and selection:

 

  • featureMachine
  • generateShadowFeatures
  • selectFeatures

 

be_1_image001.png


 

The Feature Machine Node

The feature machine node in Model Studio transforms features to improve data quality and model accuracy. These transformations can fix:

 

  • high cardinality
  • low entropy
  • high kurtosis
  • missing values
  • outliers
  • low indices of qualitative variation
  • high skewness

 

Cardinality, missingness, and skewness are selected by default. You can leave those selected or deselect them, and add any combination of the others.

 

be_2_image003.png

 

featureMachine Action

The featureMachine action works by:

 

  • Exploring the feature transformation and generation space
  • Executing the potentially effective transformation and generation operators

 

It is highly efficient because it does all this without generating temporary data tables. The featureMachine action composes sequences of operators into feature transformation and generation processes, and then executes them. The processes are as follows:

 

  • Missing indicator
  • Mode imputation and group rare
  • Missing level and group rare
  • Median imputation
  • Mode imputation and label encoding
  • Missing level and label encoding
  • Yeo-Johnson transformation and median imputation
  • Box-Cox transformation
  • Quantile binning with missing bins
  • Decision (classification) tree binning
  • Regression tree binning
  • MDLP binning
  • Target encoding
  • Date, time, datetime transformations
  • Interaction features

 

Missing indicator generates binary missing indicator features from the input variables. It applies to variables whose missing rates:

 

  • exceed the value of the missingIndicatorPercent parameter and
  • are lower than the value of the missingPercent subparameter of the screenPolicy parameter
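
To make the two-threshold rule concrete, here is a minimal sketch in plain Python (not SAS code). The keyword names mirror missingIndicatorPercent and missingPercent, but the values 5 and 50 are placeholders picked for illustration, not the action's defaults, and the function names are hypothetical.

```python
def missing_rate(values):
    """Percentage of entries that are missing (None)."""
    return 100.0 * sum(v is None for v in values) / len(values)

def missing_indicator(values, missing_indicator_percent=5.0, missing_percent=50.0):
    """Return a 0/1 indicator column, or None if the variable does not qualify.

    Both thresholds are illustrative assumptions, not SAS defaults.
    """
    rate = missing_rate(values)
    if missing_indicator_percent < rate < missing_percent:
        return [1 if v is None else 0 for v in values]
    return None

income = [52.0, None, 48.5, None, 61.0, 39.9, None, 55.2, 47.1, 50.3]
indicator = missing_indicator(income)  # 30% missing, so the variable qualifies
```

A variable with no missing values, or one that is mostly missing, falls outside the band and gets no indicator in this sketch.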

 

 

be_3_image005.png

 

Mode imputation and group rare applies to nominal variables that have very low missing rates. It imputes missing values with the mode. It also combines rare levels into a separate group called _OTHER_.
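
As a rough illustration of the idea (not the action's actual code), the sketch below imputes the mode and then folds rare levels into _OTHER_. The 10% rarity cutoff and the function name are assumptions made for this example.

```python
from collections import Counter

def mode_impute_group_rare(values, rare_fraction=0.1):
    """Impute None with the mode, then group rare levels as _OTHER_.

    rare_fraction is an illustrative cutoff, not a SAS default.
    """
    observed = [v for v in values if v is not None]
    mode = Counter(observed).most_common(1)[0][0]
    imputed = [mode if v is None else v for v in values]
    counts = Counter(imputed)
    n = len(imputed)
    return [v if counts[v] / n >= rare_fraction else "_OTHER_" for v in imputed]

colors = ["red"] * 6 + ["blue"] * 3 + [None, "teal", "plum"]
cleaned = mode_impute_group_rare(colors)
```

Here the single missing value becomes red, and teal and plum (one row each out of twelve) are grouped together.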

 

Missing level and label encoding applies to nominal variables whose missing rate makes them ineligible for mode imputation. It creates a new missing level and transforms the resulting variable by using the label encoding transformation. Label encoding assigns each label a unique integer based on alphabetical ordering.
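
Here is a small Python sketch of that two-step idea. The name of the missing level (_MISSING_) and the choice to start the codes at 0 are assumptions for illustration; the action may label and number levels differently.

```python
def missing_level_label_encode(values, missing_label="_MISSING_"):
    """Turn None into its own level, then map levels to integers alphabetically.

    missing_label and the zero-based codes are illustrative assumptions.
    """
    filled = [missing_label if v is None else v for v in values]
    mapping = {label: code for code, label in enumerate(sorted(set(filled)))}
    return [mapping[v] for v in filled], mapping

medals = ["gold", None, "bronze", "silver", "gold"]
codes, mapping = missing_level_label_encode(medals)
```

Because the encoding follows sort order, the integers carry no meaning beyond identifying the level.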

 

be_4_image007.png

 

Median imputation applies to all interval variables, except those that are converted to binary missing indicator variables.

 

be_5_image009.png

 

Mode imputation and label encoding applies to nominal variables that have very low missing rates. It imputes missing values using the mode. It then transforms the resulting variable using the label encoding transformation, which assigns each label a unique integer based on alphabetical ordering.

 

Missing level and group rare applies to nominal variables whose missing rate makes them ineligible for mode imputation. It creates a new missing level and combines rare levels into a separate group called _OTHER_.

 

Yeo-Johnson transformation and median imputation is for interval variables with significant kurtosis.

 

be_6_image011.png

 

It first transforms the variable using the Yeo-Johnson transformation and then imputes missing values using the median.
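
For readers who want to see the arithmetic, the following sketch implements the standard Yeo-Johnson formula and then fills missing values with the median of the transformed values. The value of lam is supplied by hand here as an assumption; how the action chooses it is not covered in this article.

```python
import math
import statistics

def yeo_johnson(y, lam):
    """Yeo-Johnson power transformation of a single value."""
    if y >= 0:
        if abs(lam) < 1e-12:
            return math.log1p(y)
        return ((y + 1.0) ** lam - 1.0) / lam
    if abs(lam - 2.0) < 1e-12:
        return -math.log1p(-y)
    return -(((1.0 - y) ** (2.0 - lam) - 1.0) / (2.0 - lam))

def yeo_johnson_then_median(values, lam):
    """Transform observed values, then fill missing ones with the median."""
    transformed = [None if v is None else yeo_johnson(v, lam) for v in values]
    median = statistics.median(t for t in transformed if t is not None)
    return [median if t is None else t for t in transformed]

# lam = 0.5 is an arbitrary illustrative choice.
result = yeo_johnson_then_median([-1.0, 0.0, None, 3.0, 8.0], lam=0.5)
```

Unlike Box-Cox, the Yeo-Johnson formula is defined for zero and negative values, which is why the example can include -1.0.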

 

be_7_image013.png

 

Box-Cox transformation is for interval variables with significant skewness.

 

be_8_image015.png

 

It applies the Box-Cox transformation, then imputes missing values using the median.
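
The sketch below shows the standard Box-Cox formula followed by median imputation, plus a simple skewness measure to see the effect. It is an illustration only: lam is chosen by hand, and the helper names are hypothetical.

```python
import math
import statistics

def box_cox(x, lam):
    """Box-Cox power transformation; defined only for positive values."""
    if x <= 0:
        raise ValueError("Box-Cox needs strictly positive values")
    if abs(lam) < 1e-12:
        return math.log(x)
    return (x ** lam - 1.0) / lam

def skewness(xs):
    """Population skewness: third central moment over variance ** 1.5."""
    mean = sum(xs) / len(xs)
    m2 = sum((x - mean) ** 2 for x in xs) / len(xs)
    m3 = sum((x - mean) ** 3 for x in xs) / len(xs)
    return m3 / m2 ** 1.5

def box_cox_then_median(values, lam):
    """Transform observed values, then fill missing ones with the median."""
    transformed = [None if v is None else box_cox(v, lam) for v in values]
    median = statistics.median(t for t in transformed if t is not None)
    return [median if t is None else t for t in transformed]

raw = [1.0, 4.0, None, 9.0, 100.0]
observed = [v for v in raw if v is not None]
result = box_cox_then_median(raw, lam=0.5)  # lam picked for illustration
```

On this toy column the transformed values are less right-skewed than the raw ones, and a lam closer to 0 (the log transform) reduces the skewness further.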

 

be_9_image017.png

 

Quantile binning with missing bins applies to interval variables with significant:

  • skewness
  • kurtosis, or
  • outliers
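
A minimal Python sketch of quantile binning with a dedicated missing bin follows. The number of bins and the convention that bin 0 holds missing values are assumptions for this example, not documented behavior of the action.

```python
import bisect
import statistics

def quantile_bins(values, n_bins=4):
    """Assign each value to a quantile bin; missing values get bin 0.

    n_bins and the bin numbering are illustrative assumptions.
    """
    observed = [v for v in values if v is not None]
    cuts = statistics.quantiles(observed, n=n_bins)  # n_bins - 1 cut points
    # Observed values land in bins 1..n_bins based on the cut points.
    return [0 if v is None else 1 + bisect.bisect_right(cuts, v) for v in values]

amounts = [1, 2, 3, 4, None, 5, 6, 7, 1000]
bins = quantile_bins(amounts)
```

Notice that the outlier 1000 simply shares the top bin with 7, which is how binning tames extreme values.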

be_10_image019.png

 

Decision (classification) tree binning and regression tree binning are used for both interval and nominal input variables. Like quantile binning, they address variables with significant skewness, kurtosis, or outliers. They are also commonly applied to high-cardinality nominal variables. The featureMachine action uses a classification decision tree for classification problems and a regression tree for regression problems.

 

MDLP binning applies a binning algorithm based on the minimum description length principle (MDLP). MDLP is a top-down supervised discretization technique. The conditions for applying it are similar to those for decision tree binning.
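
To give a feel for how MDLP decides whether a cut is worth making, here is a sketch of a single step of the Fayyad-Irani style criterion: find the cut with the best information gain, then accept it only if the gain exceeds an MDLP threshold. A full discretizer would recurse on each side of an accepted cut. This is a textbook-style illustration, and the action's implementation may differ.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def mdlp_split(values, labels):
    """Return the best cut point if the MDLP criterion accepts it, else None."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    all_labels = [y for _, y in pairs]
    base = entropy(all_labels)
    best = None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary between equal values
        left, right = all_labels[:i], all_labels[i:]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / n
        gain = base - weighted
        if best is None or gain > best[0]:
            best = (gain, i, left, right)
    if best is None:
        return None
    gain, i, left, right = best
    k, k1, k2 = len(set(all_labels)), len(set(left)), len(set(right))
    delta = math.log2(3 ** k - 2) - (k * base - k1 * entropy(left) - k2 * entropy(right))
    threshold = (math.log2(n - 1) + delta) / n
    if gain > threshold:
        return (pairs[i - 1][0] + pairs[i][0]) / 2
    return None

x = [1, 2, 3, 4, 10, 11, 12, 13]
clean_cut = mdlp_split(x, [0, 0, 0, 0, 1, 1, 1, 1])
noisy_cut = mdlp_split(x, [0, 1, 0, 1, 0, 1, 0, 1])
```

When the classes separate cleanly, the cut between 4 and 10 is accepted. When the labels alternate, no cut earns its description length, so the variable is left unsplit.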

 

Target encoding is for variables with a large number of distinct values (high cardinality). For regression problems, target encoding operators include:

 

  • mean
  • minimum
  • maximum
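
As a quick illustration in Python, each level of a nominal variable can be replaced by a summary statistic of the target within that level. The function and variable names below are hypothetical, and in practice the lookup table would be learned on training data only.

```python
import statistics
from collections import defaultdict

def target_encode(levels, targets, stat):
    """Replace each level with stat(target values observed for that level)."""
    groups = defaultdict(list)
    for level, y in zip(levels, targets):
        groups[level].append(y)
    table = {level: stat(ys) for level, ys in groups.items()}
    return [table[level] for level in levels]

region = ["A", "B", "A", "C", "B", "A"]
price = [10.0, 20.0, 14.0, 30.0, 24.0, 12.0]

mean_encoded = target_encode(region, price, statistics.mean)
min_encoded = target_encode(region, price, min)
max_encoded = target_encode(region, price, max)
```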

 

For classification problems, target encoding operators include:

 

  • frequency ratio
  • event probability
  • weight of evidence
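
Two of these operators are easy to sketch in Python. Event probability is the share of events within a level. Weight of evidence is computed here as the log of the level's share of all events divided by its share of all non-events, which is one common convention; sign and smoothing conventions vary, so treat this as an assumption rather than the action's exact formula. Frequency ratio is left out because this article does not define it.

```python
import math
from collections import Counter

def classification_encodings(levels, targets, event=1):
    """Per-level event probability and weight of evidence (WOE)."""
    n_by_level = Counter(levels)
    events_by_level = Counter(l for l, y in zip(levels, targets) if y == event)
    total_events = sum(events_by_level.values())
    total_nonevents = len(levels) - total_events
    table = {}
    for level, n in n_by_level.items():
        ev = events_by_level[level]
        non = n - ev
        # WOE is undefined when a level has no events or no non-events.
        woe = (math.log((ev / total_events) / (non / total_nonevents))
               if ev and non else None)
        table[level] = {"event_probability": ev / n, "woe": woe}
    return table

segment = ["A"] * 4 + ["B"] * 4
churned = [1, 1, 1, 0, 1, 0, 0, 0]
encodings = classification_encodings(segment, churned)
```

In this toy data, segment A holds three of the four events, so its weight of evidence is positive, and segment B mirrors it with a negative value.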

 

Label count and input count encodings (and their log transformations) are applicable to both regression and classification problems.

Date, time, datetime transformations are specific to date, time, and datetime variables.

 

be_11_image021.png

 

The action applies operators that each extract a specific piece of information from these variables. The operators include year, month, day, day of the week, day of the month, leap year (binary indicator), and weekend (binary indicator), among others.
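
In Python, the same kind of extraction looks like the sketch below. The feature names and the Monday = 1 weekday numbering are choices made for this example; SAS date functions may number weekdays differently.

```python
import calendar
from datetime import date

def date_features(d):
    """Extract simple calendar features from a date."""
    return {
        "year": d.year,
        "month": d.month,
        "day_of_month": d.day,
        "day_of_week": d.isoweekday(),  # 1 = Monday ... 7 = Sunday
        "is_leap_year": int(calendar.isleap(d.year)),
        "is_weekend": int(d.isoweekday() >= 6),
    }

friday = date_features(date(2021, 7, 9))
leap_day = date_features(date(2024, 2, 29))
```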

 

Interaction features are not generated by default. If the interaction subparameter of the transformationPolicy parameter is set to True, interaction features are generated from variable pairs with strong interactions. The operators include various target encoding operators, e.g.,

 

  • mean
  • weight of evidence
  • event probability
  • decision tree and regression tree binning

 

These operators act on the binned crossproduct (frequency table) of the interacting variables.

 

generateShadowFeatures Action

The generateShadowFeatures action creates shadow features, which are used to select relevant features. A shadow feature contains values drawn at random from the original feature. The action generates them by inverse sampling:

 

  • Using the empirical cumulative distribution for continuous variables
  • Using the empirical frequency distribution for nominal variables

 

This is accomplished in a single pass through the data. The nProbes parameter lets you specify the number of shadow features to generate per variable.
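
The sketch below shows one way to picture inverse sampling in Python. It is an illustration of the idea only: the n_probes argument mirrors the nProbes parameter, while the seed argument, the function names, and the in-memory sorting are assumptions that do not reflect the action's single-pass implementation.

```python
import random
from collections import Counter

def shadow_interval(values, rng):
    """Inverse-sample from the empirical CDF of an interval variable."""
    ordered = sorted(v for v in values if v is not None)
    n = len(ordered)
    # A uniform draw u in [0, 1) maps to the order statistic at index floor(u * n).
    return [ordered[min(int(rng.random() * n), n - 1)] for _ in values]

def shadow_nominal(values, rng):
    """Sample levels in proportion to their observed frequencies."""
    counts = Counter(v for v in values if v is not None)
    levels = sorted(counts)
    weights = [counts[level] for level in levels]
    return rng.choices(levels, weights=weights, k=len(values))

def generate_shadow_features(column, n_probes, nominal=False, seed=0):
    """Build n_probes shadow copies of one column."""
    rng = random.Random(seed)
    sampler = shadow_nominal if nominal else shadow_interval
    return [sampler(column, rng) for _ in range(n_probes)]

ages = [23, 35, None, 41, 29, 52, 35, 60]
plans = ["basic", "pro", "basic", None, "basic", "pro", "basic", "team"]
age_shadows = generate_shadow_features(ages, n_probes=3)
plan_shadows = generate_shadow_features(plans, n_probes=2, nominal=True)
```

Each shadow column keeps the marginal distribution of its source but has no real relationship with the target, which is what makes it useful as a baseline when judging whether an original feature is relevant.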

 

selectFeatures Action 

The selectFeatures action filters out features using a user-specified criterion (statistic). All the correlation statistics available in the exploreCorrelation action are also available here. If the user picks a filter criterion that is not applicable to any of the inputs, then the default statistic, mutual information, is used.
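
To show what filtering on mutual information can look like, here is a small Python sketch for discrete features and a discrete target. Keeping the top_k highest-scoring features is an assumption made for this example; the action's actual selection rule and options are not described in this article.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

def select_features(features, target, top_k):
    """Rank features by mutual information with the target and keep top_k."""
    scores = {name: mutual_information(col, target) for name, col in features.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_k], scores

target = [0, 0, 1, 1, 0, 1, 0, 1]
features = {
    "noise": [0, 1, 0, 1, 0, 0, 1, 1],   # independent of the target
    "signal": [0, 0, 1, 1, 0, 1, 0, 1],  # identical to the target
}
selected, scores = select_features(features, target, top_k=1)
```

The feature that matches the target scores ln 2, the maximum possible for a balanced binary target, while the independent feature scores zero and is dropped.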

 

be_12_image023.png

 

Conclusion

So what is stopping you? Jump in and take advantage of these automated SAS tools and improve your feature engineering!

 

be_13_image025.png

 

Find more articles from SAS Global Enablement and Learning here.
