Automated Feature Engineering in SAS Model Studio on SAS Viya

Properly engineering and selecting your features is as important as choosing and tuning your models. Good feature engineering can vastly improve your model results. But who has time for this? Automated feature engineering and selection available with SAS Visual Data Mining and Machine Learning is your friend!

The Data Science Pilot Action Set has three actions to help you out:

featureMachine
generateShadowFeatures
selectFeatures

The Feature Machine Node

The feature machine node in Model Studio transforms features to improve data quality and improve model accuracy. These transformations can fix:

high cardinality
low entropy
high kurtosis
missing values
outliers
low indices of qualitative variation
high skewness

Cardinality, missingness, and skewness are selected by default. You can leave those selected or deselect them, and add any combination of the others.

featureMachine Action

The featureMachine action works by:

Exploring the feature transformation and generation space
Executing the potentially effective transformation and generation operators

It is efficient because it does all of this without generating temporary data tables. The featureMachine action composes sequences of operators into the feature transformation and generation processes that it executes. The processes are as follows:

Missing indicator
Mode imputation and group rare
Missing level and group rare
Median imputation
Mode imputation and label encoding
Missing level and label encoding
Yeo-Johnson transformation and median imputation
Box-Cox transformation
Quantile binning with missing bins
Decision (classification) tree binning
Regression tree binning
MDLP binning
Target encoding
Date, time, datetime transformations
Interaction features

Missing indicator generates binary missing indicator features from the input variables. It applies to variables whose missing rates:

exceed the value of the missingIndicatorPercent parameter and
are lower than the value of the missingPercent subparameter of the screenPolicy parameter
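The two-threshold rule above can be sketched in plain Python. This is only an illustration of the logic, not the action's implementation: the function names, the threshold values, and the sample data are hypothetical, and treating both parameters as percentages is an assumption.

```python
from typing import List, Optional, Sequence


def missing_rate_percent(values: Sequence[Optional[float]]) -> float:
    """Percentage of entries that are missing (represented here as None)."""
    if not values:
        return 0.0
    return 100.0 * sum(v is None for v in values) / len(values)


def needs_missing_indicator(
    values: Sequence[Optional[float]],
    missing_indicator_percent: float,  # stands in for missingIndicatorPercent
    screen_missing_percent: float,     # stands in for screenPolicy's missingPercent
) -> bool:
    """True when the missing rate lies strictly between the two thresholds."""
    rate = missing_rate_percent(values)
    return missing_indicator_percent < rate < screen_missing_percent


def missing_indicator(values: Sequence[Optional[float]]) -> List[int]:
    """Binary feature: 1 where the input is missing, 0 otherwise."""
    return [1 if v is None else 0 for v in values]


# Hypothetical column with 3 of 10 values missing (30%).
income = [52.0, None, 61.5, None, 48.0, 75.0, None, 58.0, 66.0, 49.5]

if needs_missing_indicator(income, 25.0, 90.0):
    income_missing = missing_indicator(income)
```

With thresholds of 25 and 90, the 30% missing rate qualifies, so the indicator column is created.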

Mode imputation and group rare applies to nominal variables that have a very low missing rate. It imputes missing values with the mode and combines rare levels into a separate group called _OTHER_.
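A minimal sketch of mode imputation followed by rare-level grouping, in plain Python. The rare_fraction cutoff is a hypothetical parameter chosen for illustration; the article does not say how the action decides that a level is rare.

```python
from collections import Counter
from typing import List, Optional, Sequence


def mode_impute_and_group_rare(
    values: Sequence[Optional[str]],
    rare_fraction: float = 0.1,   # hypothetical cutoff, not a SAS default
    other_label: str = "_OTHER_",
) -> List[str]:
    """Fill missing values with the mode, then pool rare levels into one group."""
    observed = [v for v in values if v is not None]
    mode = Counter(observed).most_common(1)[0][0]
    imputed = [mode if v is None else v for v in values]

    # Recount after imputation so the mode's share includes the filled rows.
    counts = Counter(imputed)
    n = len(imputed)
    return [v if counts[v] / n >= rare_fraction else other_label for v in imputed]


# Hypothetical column: 12 "red", 6 "blue", 1 "teal", and 1 missing value.
colors = ["red"] * 12 + ["blue"] * 6 + ["teal"] + [None]
cleaned = mode_impute_and_group_rare(colors)
```

Here the missing value becomes "red", and "teal" (1 of 20 rows) falls below the cutoff and is relabeled _OTHER_.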

Missing level and label encoding applies to nominal variables whose missing rate makes them ineligible for mode imputation. It creates a new missing level and transforms the resulting variable by using the label encoding transformation. Label encoding assigns each label a unique integer based on alphabetical ordering.
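The missing-level and label-encoding steps can be sketched as follows. The name of the missing level and the choice to start the integer codes at 0 are assumptions made for this example.

```python
from typing import Dict, List, Optional, Sequence, Tuple


def missing_level_and_label_encode(
    values: Sequence[Optional[str]],
    missing_label: str = "_MISSING_",  # hypothetical name for the new level
) -> Tuple[List[int], Dict[str, int]]:
    """Turn missing values into their own level, then encode levels as integers."""
    filled = [missing_label if v is None else v for v in values]
    # Alphabetical ordering of the distinct levels determines the codes.
    mapping = {level: code for code, level in enumerate(sorted(set(filled)))}
    return [mapping[v] for v in filled], mapping


fruit = ["pear", None, "apple", "fig", "apple"]
codes, mapping = missing_level_and_label_encode(fruit)
```

Because Python sorts the underscore before lowercase letters, the missing level receives code 0 here, followed by apple, fig, and pear.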

Median imputation applies to all interval variables, except those that are converted to binary missing indicator variables.

Mode imputation and label encoding applies to nominal variables that have a very low missing rate. It imputes missing values using the mode, then transforms the resulting variable using the label encoding transformation, which assigns each label a unique integer based on alphabetical ordering.

Missing level and group rare applies to nominal variables whose missing rate makes them ineligible for mode imputation. It creates a new missing level and combines rare levels into a separate group called _OTHER_.

Yeo-Johnson transformation and median imputation is for interval variables with significant kurtosis. It first transforms the variable using the Yeo-Johnson transformation and then imputes missing values using the median.
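For reference, here is a small sketch of the standard Yeo-Johnson power transformation followed by median imputation. The lambda value is passed in by hand, which is an assumption for illustration; the article does not describe how the action chooses it.

```python
import math
from statistics import median
from typing import List, Optional, Sequence


def yeo_johnson(y: float, lam: float) -> float:
    """Standard Yeo-Johnson transformation of a single value."""
    if y >= 0:
        if abs(lam) < 1e-12:
            return math.log1p(y)
        return ((y + 1.0) ** lam - 1.0) / lam
    if abs(lam - 2.0) < 1e-12:
        return -math.log1p(-y)
    return -(((1.0 - y) ** (2.0 - lam)) - 1.0) / (2.0 - lam)


def yeo_johnson_then_median_impute(
    values: Sequence[Optional[float]], lam: float
) -> List[float]:
    """Transform the observed values, then fill missing ones with the median."""
    transformed = [None if v is None else yeo_johnson(v, lam) for v in values]
    fill = median(t for t in transformed if t is not None)
    return [fill if t is None else t for t in transformed]


# Hypothetical data with negative values, a large value, and one missing entry.
balance = [-3.0, 0.0, 1.5, None, 40.0]
result = yeo_johnson_then_median_impute(balance, lam=0.0)
```

Unlike Box-Cox, Yeo-Johnson accepts zero and negative inputs, which is why it can be applied before any shifting of the data.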

Box-Cox transformation is for interval variables with significant skewness. It applies the Box-Cox transformation, then imputes missing values using the median.
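The sketch below shows the standard Box-Cox formula and its effect on a right-skewed sample. As with the previous example, lambda is supplied by hand here; how the action selects it is not covered in this article, and the sample data is made up.

```python
import math
from statistics import mean, median
from typing import List, Optional, Sequence


def box_cox(y: float, lam: float) -> float:
    """Standard Box-Cox transformation; defined only for positive values."""
    if y <= 0:
        raise ValueError("Box-Cox requires strictly positive input")
    if abs(lam) < 1e-12:
        return math.log(y)
    return (y ** lam - 1.0) / lam


def box_cox_then_median_impute(
    values: Sequence[Optional[float]], lam: float
) -> List[float]:
    """Transform the observed values, then fill missing ones with the median."""
    transformed = [None if v is None else box_cox(v, lam) for v in values]
    fill = median(t for t in transformed if t is not None)
    return [fill if t is None else t for t in transformed]


def skewness(values: Sequence[float]) -> float:
    """Population skewness: third central moment over variance to the 3/2."""
    m = mean(values)
    m2 = mean((v - m) ** 2 for v in values)
    m3 = mean((v - m) ** 3 for v in values)
    return m3 / m2 ** 1.5


# Hypothetical right-skewed sample: each value doubles the previous one.
raw = [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0]
logged = [box_cox(v, 0.0) for v in raw]  # lambda = 0 is the log transform
```

The raw sample has clearly positive skewness, while the transformed values are evenly spaced and therefore symmetric.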

Quantile binning with missing bins applies to interval variables with significant:

skewness
kurtosis, or
outliers
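Quantile binning with a dedicated bin for missing values might look like the sketch below. The number of bins and the convention that bin 0 holds the missing values are assumptions made for this example.

```python
from bisect import bisect_right
from statistics import quantiles
from typing import List, Optional, Sequence


def quantile_bin_with_missing(
    values: Sequence[Optional[float]], n_bins: int = 4
) -> List[int]:
    """Assign bin 0 to missing values and bins 1..n_bins by quantile."""
    observed = [v for v in values if v is not None]
    # n_bins - 1 cut points that split the observed values into equal-count bins.
    cuts = quantiles(observed, n=n_bins, method="inclusive")
    return [0 if v is None else 1 + bisect_right(cuts, v) for v in values]


# Hypothetical column with one missing value and one extreme outlier.
spend = [1, 2, 3, 4, 5, 6, 7, 8, None, 1000]
bins = quantile_bin_with_missing(spend)
```

The outlier 1000 simply lands in the top bin alongside 7 and 8, which is how binning limits the influence of extreme values.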

Decision (classification) tree binning and Regression tree binning are used for both interval and nominal input variables. Like quantile binning, they address variables with significant skewness, kurtosis, or outliers. They are also commonly applied to high cardinality nominal variables. The featureMachine action uses a classification decision tree for classification problems and a regression tree for regression problems.

MDLP binning applies a binning algorithm based on the minimum description length principle (MDLP). MDLP is a top-down supervised discretization technique. The conditions for applying it are similar to those for decision tree binning.
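To make "top-down supervised discretization" concrete, here is a compact sketch of the classic Fayyad and Irani formulation of MDLP: pick the cut that minimizes class entropy, accept it only if the information gain passes the MDLP test, then recurse on each side. Whether the action uses exactly this stopping rule is an assumption, and the sample data is made up.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple


def entropy(labels: Sequence[str]) -> float:
    """Shannon entropy (base 2) of the class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def _split(pairs: List[Tuple[float, str]]) -> List[float]:
    """Recursively find accepted cut points in value-sorted (value, label) pairs."""
    n = len(pairs)
    labels = [lab for _, lab in pairs]
    base = entropy(labels)

    best = None  # (weighted entropy after the split, split index)
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot cut between identical values
        left, right = labels[:i], labels[i:]
        cond = (len(left) * entropy(left) + len(right) * entropy(right)) / n
        if best is None or cond < best[0]:
            best = (cond, i)
    if best is None:
        return []

    cond, i = best
    left, right = labels[:i], labels[i:]
    gain = base - cond
    k, k1, k2 = len(set(labels)), len(set(left)), len(set(right))
    delta = math.log2(3 ** k - 2) - (
        k * base - k1 * entropy(left) - k2 * entropy(right)
    )
    # MDLP acceptance test: keep the cut only if the gain pays for its cost.
    if gain <= (math.log2(n - 1) + delta) / n:
        return []
    cut = (pairs[i - 1][0] + pairs[i][0]) / 2
    return _split(pairs[:i]) + [cut] + _split(pairs[i:])


def mdlp_cut_points(values: Sequence[float], labels: Sequence[str]) -> List[float]:
    """Cut points chosen by MDLP for one interval variable and a class target."""
    return _split(sorted(zip(values, labels)))


# Hypothetical data: low values belong to class "a", high values to class "b".
x = [1, 2, 3, 4, 5, 6, 11, 12, 13, 14, 15, 16]
y = ["a"] * 6 + ["b"] * 6
cuts = mdlp_cut_points(x, y)
```

On this cleanly separated sample the procedure keeps a single cut between 6 and 11. On a sample whose labels alternate with no pattern, it keeps none.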

Target encoding is for variables with a large number of distinct values (high cardinality). For regression problems, target encoding operators include:

mean
minimum
maximum

For classification problems, target encoding operators include:

frequency ratio
event probability
weight of evidence
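As an illustration of two of the classification operators, the sketch below computes event probability and weight of evidence per level. The weight-of-evidence sign convention (events over non-events) and the smoothing constant that guards against empty cells are assumptions, since conventions differ between tools.

```python
import math
from collections import defaultdict
from typing import Dict, Sequence


def target_encode(
    levels: Sequence[str],
    targets: Sequence[int],   # 1 = event, 0 = non-event
    smoothing: float = 0.5,   # hypothetical guard against log(0)
) -> Dict[str, Dict[str, float]]:
    """Per-level event probability and weight of evidence."""
    events: Dict[str, int] = defaultdict(int)
    nonevents: Dict[str, int] = defaultdict(int)
    for level, t in zip(levels, targets):
        if t == 1:
            events[level] += 1
        else:
            nonevents[level] += 1

    total_e = sum(events.values())
    total_n = sum(nonevents.values())
    encoding = {}
    for level in set(levels):
        e, n = events[level], nonevents[level]
        share_of_events = (e + smoothing) / (total_e + smoothing)
        share_of_nonevents = (n + smoothing) / (total_n + smoothing)
        encoding[level] = {
            "event_probability": e / (e + n),
            "weight_of_evidence": math.log(share_of_events / share_of_nonevents),
        }
    return encoding


# Hypothetical data: level A is mostly events, level B is mostly non-events.
region = ["A", "A", "A", "A", "B", "B", "B", "B"]
churn = [1, 1, 1, 0, 1, 0, 0, 0]
encoding = target_encode(region, churn)
```

Level A has an event probability of 0.75 and a positive weight of evidence, while level B mirrors it with 0.25 and a negative value.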

Label count and input count encodings (and their log transformations) are applicable to both regression and classification problems.

Date, time, datetime transformations are specific to date, time, and datetime variables. Operators that extract a specific piece of information from these variables are applied. The operators include year, month, day, day of the week, day of the month, leap year (binary indicator), weekend (binary indicator), and so on.
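The kind of extraction these operators perform can be sketched with Python's standard library. The feature names and the ISO day-of-week numbering (1 = Monday) are choices made for this example; the numbering used by the action may differ.

```python
import calendar
from datetime import date
from typing import Dict


def date_features(d: date) -> Dict[str, int]:
    """Extract simple calendar features from a date."""
    return {
        "year": d.year,
        "month": d.month,
        "day_of_month": d.day,
        "day_of_week": d.isoweekday(),          # 1 = Monday ... 7 = Sunday
        "is_leap_year": int(calendar.isleap(d.year)),
        "is_weekend": int(d.isoweekday() >= 6),  # Saturday or Sunday
    }


leap_day = date_features(date(2024, 2, 29))  # a Thursday in a leap year
saturday = date_features(date(2023, 7, 1))   # a Saturday in a common year
```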

Interaction features are not generated by default. If the interaction subparameter of the transformationPolicy parameter is set to True, interaction features are generated from variable pairs with strong interactions. The operators include various target encoding and binning operators, for example:

mean
weight of evidence
event probability
decision tree and regression tree binning

These operators act on the binned crossproduct (frequency table) of the interacting variables.
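A minimal sketch of one such operator, mean target encoding on the crossproduct of two variables, is shown below. It assumes both inputs are already nominal or binned, and it does not attempt to show how the action detects which pairs interact strongly. The data is made up.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence, Tuple


def interaction_mean_encoding(
    a: Sequence[Hashable], b: Sequence[Hashable], target: Sequence[float]
) -> List[float]:
    """Replace each (a, b) cell of the crossproduct with the mean target in that cell."""
    sums: Dict[Tuple[Hashable, Hashable], float] = defaultdict(float)
    counts: Dict[Tuple[Hashable, Hashable], int] = defaultdict(int)
    for x, y, t in zip(a, b, target):
        sums[(x, y)] += t
        counts[(x, y)] += 1
    cell_mean = {cell: sums[cell] / counts[cell] for cell in counts}
    return [cell_mean[(x, y)] for x, y in zip(a, b)]


# Hypothetical data where the target depends on the combination of a and b,
# not on either variable alone.
a = [0, 0, 1, 1, 0, 0, 1, 1]
b = [0, 1, 0, 1, 0, 1, 0, 1]
target = [10, 20, 20, 10, 12, 18, 22, 8]
feature = interaction_mean_encoding(a, b, target)
```

Neither a nor b alone separates high targets from low ones in this sample, but the combined feature does.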

generateShadowFeatures Action

The generateShadowFeatures action creates shadow features, which are used to select relevant features. A shadow feature contains values of the original feature chosen at random. The action generates these values by inverse sampling:

using the empirical cumulative distribution for continuous variables
using the empirical frequency distribution for nominal variables

This is accomplished in a single pass through the data. The nProbes parameter lets you specify the number of shadow features to generate per variable.
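The idea of inverse sampling from an empirical distribution can be sketched as follows. This is not the action's single-pass algorithm; the function names, the n_probes argument (standing in for nProbes), and the sample data are all hypothetical.

```python
import random
from bisect import bisect_right
from collections import Counter
from itertools import accumulate
from typing import List, Sequence


def shadow_continuous(values: Sequence[float], rng: random.Random) -> List[float]:
    """Draw u ~ Uniform(0, 1) and invert the empirical cumulative distribution."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[min(n - 1, int(rng.random() * n))] for _ in values]


def shadow_nominal(values: Sequence[str], rng: random.Random) -> List[str]:
    """Draw levels in proportion to their empirical frequencies."""
    counts = Counter(values)
    levels = sorted(counts)
    cumulative = list(accumulate(counts[level] for level in levels))
    total = cumulative[-1]
    last = len(levels) - 1
    return [
        levels[min(last, bisect_right(cumulative, rng.random() * total))]
        for _ in values
    ]


def shadow_probes(values: Sequence[float], n_probes: int, seed: int = 0) -> List[List[float]]:
    """Generate several independent shadow copies of one continuous variable."""
    rng = random.Random(seed)
    return [shadow_continuous(values, rng) for _ in range(n_probes)]


ages = [23.0, 35.0, 41.0, 52.0, 67.0]
plans = ["basic", "basic", "plus", "basic", "pro"]
rng = random.Random(42)
age_shadow = shadow_continuous(ages, rng)
plan_shadow = shadow_nominal(plans, rng)
probes = shadow_probes(ages, n_probes=3)
```

Each shadow column follows the same marginal distribution as its original but has no real relationship to the target, so it offers a baseline against which the original feature can be compared.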

selectFeatures Action

The selectFeatures action filters out features using a user-specified criterion (statistic). All the correlation statistics available in the exploreCorrelation action are also available here. If the user picks a filter criterion that is not applicable to any of the inputs, the default statistic, mutual information, is used.
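To show what filtering on mutual information means, here is a small sketch for nominal inputs and a nominal target. The threshold value and the function names are hypothetical, and the action's own estimator and cutoff logic may differ.

```python
import math
from collections import Counter
from typing import Dict, Hashable, List, Sequence, Tuple


def mutual_information(x: Sequence[Hashable], y: Sequence[Hashable]) -> float:
    """Mutual information (in nats) between two discrete variables."""
    n = len(x)
    joint = Counter(zip(x, y))
    px, py = Counter(x), Counter(y)
    return sum(
        (c / n) * math.log((c / n) / ((px[a] / n) * (py[b] / n)))
        for (a, b), c in joint.items()
    )


def select_features(
    features: Dict[str, Sequence[Hashable]],
    target: Sequence[Hashable],
    threshold: float,  # hypothetical cutoff on the statistic
) -> Tuple[List[str], Dict[str, float]]:
    """Keep the features whose mutual information with the target meets the threshold."""
    scores = {name: mutual_information(col, target) for name, col in features.items()}
    kept = [name for name, score in scores.items() if score >= threshold]
    return kept, scores


# Hypothetical data: "signal" mirrors the target, "noise" is independent of it.
target = [0, 0, 1, 1, 0, 0, 1, 1]
features = {
    "signal": [0, 0, 1, 1, 0, 0, 1, 1],
    "noise": [0, 1, 0, 1, 0, 1, 0, 1],
}
kept, scores = select_features(features, target, threshold=0.1)
```

The informative feature scores ln 2 (about 0.69 nats) and is kept, while the independent one scores 0 and is filtered out.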

Conclusion

So what is stopping you? Jump in and take advantage of these automated SAS tools and improve your feature engineering!
