Properly engineering and selecting your features is as important as choosing and tuning your models. Good feature engineering can vastly improve your model results. But who has time for this? Automated feature engineering and selection available with SAS Visual Data Mining and Machine Learning is your friend!
The Data Science Pilot action set has three actions to help you out: featureMachine, generateShadowFeatures, and selectFeatures.
The Feature Machine Node
The feature machine node in Model Studio transforms features to improve data quality and model accuracy. Each transformation targets a specific data issue. Cardinality, missingness, and skewness are selected by default. You can leave those selected or deselect them, and add any combination of the others.
featureMachine Action
The featureMachine action works by:
It is highly efficient because it does all this without generating temporary data tables. The featureMachine action composes a sequence of operators to construct the feature transformation and generation processes that it executes. The processes are as follows:
Missing indicator generates binary missing indicator features from the input variables. It applies to variables whose missing rates:
Mode imputation and group rare applies to nominal variables that have very low missing rates. It imputes missing values with the mode and combines rare levels into a separate group called _OTHER_.
Missing level and label encoding applies to nominal variables whose missing rate makes them ineligible for mode imputation. It creates a new missing level and transforms the resulting variable by using the label encoding transformation. Label encoding assigns each label a unique integer based on alphabetical ordering.
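As a rough illustration of the idea (not the featureMachine implementation), the sketch below gives missing values their own level and then numbers the levels in alphabetical order. The function name, the `_MISSING_` level name, and the choice to start codes at 0 are assumptions made for this example.

```python
def label_encode_with_missing_level(values, missing_level="_MISSING_"):
    """Give missing values their own level, then number levels alphabetically."""
    # Assumption: None marks a missing value; "_MISSING_" is a made-up level name.
    filled = [missing_level if v is None else v for v in values]
    levels = sorted(set(filled))  # alphabetical (code point) ordering
    codes = {level: i for i, level in enumerate(levels)}
    return [codes[v] for v in filled], codes


colors = ["red", None, "blue", "green", "blue"]
encoded, codes = label_encode_with_missing_level(colors)
print(encoded)  # integer code for each row
print(codes)    # level -> integer mapping
```

Because the codes follow sort order, they carry no meaning beyond identifying the level. That is usually fine for tree-based models but can mislead models that treat the codes as magnitudes.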
Median imputation applies to all interval variables, except those that are converted to binary missing indicator variables.
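To make the two treatments of a missing interval value concrete, here is a minimal sketch in plain Python. It assumes `None` marks a missing value; the function names and the sample data are invented for illustration and are not part of the action set.

```python
from statistics import median


def missing_indicator(values):
    """Return 1 where the value is missing, 0 otherwise."""
    return [1 if v is None else 0 for v in values]


def median_impute(values):
    """Replace missing values with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


income = [42.0, None, 55.0, 61.0, None, 48.0]  # made-up sample
flags = missing_indicator(income)
imputed = median_impute(income)
print(flags)
print(imputed)
```

The indicator keeps the information that a value was missing, while median imputation keeps the variable numeric and is robust to outliers.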
Mode imputation and label encoding applies to nominal variables that have very low missing rates. It imputes missing values with the mode, then transforms the resulting variable using label encoding, which assigns each label a unique integer based on alphabetical ordering.
Missing level and group rare applies to nominal variables whose missing rate makes them ineligible for mode imputation. It creates a new missing level and combines rare levels into a separate group called _OTHER_.
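The nominal operators above combine two simple steps: handle the missing values, then collapse rare levels. The sketch below shows mode imputation followed by rare-level grouping. The 10% rarity threshold (`min_share`) is a hypothetical value chosen for the example; the source does not state the threshold that featureMachine uses.

```python
from collections import Counter


def mode_impute(values):
    """Replace missing values (None) with the most frequent observed level."""
    mode = Counter(v for v in values if v is not None).most_common(1)[0][0]
    return [mode if v is None else v for v in values]


def group_rare(values, min_share=0.10, other="_OTHER_"):
    """Collapse levels whose share of rows is below min_share into one group."""
    counts = Counter(values)
    n = len(values)
    return [other if counts[v] / n < min_share else v for v in values]


levels = ["a"] * 10 + ["b"] * 8 + ["c"] + [None]  # made-up sample
grouped = group_rare(mode_impute(levels))
print(Counter(grouped))
```

For the missing-level variant, the first step would map `None` to a new level instead of to the mode before grouping.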
Yeo-Johnson transformation and median imputation is for interval variables with significant kurtosis. It first transforms the variable using the Yeo-Johnson transformation and then imputes missing values using the median.
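For readers who have not met it, the Yeo-Johnson transformation is a power transformation that, unlike Box-Cox, also accepts zero and negative values. The sketch below applies the standard formula with a fixed `lam` and then imputes with the median. In practice the power parameter is estimated from the data; how featureMachine chooses it is not covered here, so the fixed value is an assumption.

```python
import math
from statistics import median


def yeo_johnson(y, lam):
    """Standard Yeo-Johnson power transformation of a single value."""
    if y >= 0:
        return math.log1p(y) if lam == 0 else ((y + 1) ** lam - 1) / lam
    return -math.log1p(-y) if lam == 2 else -((1 - y) ** (2 - lam) - 1) / (2 - lam)


def transform_then_impute(values, lam):
    """Transform observed values, then fill missing ones with the median."""
    transformed = [None if v is None else yeo_johnson(v, lam) for v in values]
    fill = median(t for t in transformed if t is not None)
    return [fill if t is None else t for t in transformed]


# lam = 1 leaves values unchanged, which makes the imputation step easy to see.
result = transform_then_impute([-2.0, None, 0.0, 6.0], lam=1.0)
print(result)
```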
Box-Cox transformation is for interval variables with significant skewness. It applies the Box-Cox transformation, then imputes missing values using the median.
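A comparable sketch for Box-Cox follows. The standard Box-Cox formula is defined only for strictly positive values, and the power parameter `lam` is fixed here purely for illustration (an assumption; it is normally estimated from the data).

```python
import math
from statistics import median


def box_cox(y, lam):
    """Standard Box-Cox power transformation; defined for y > 0 only."""
    if y <= 0:
        raise ValueError("Box-Cox requires strictly positive values")
    return math.log(y) if lam == 0 else (y ** lam - 1) / lam


def box_cox_then_impute(values, lam):
    """Transform observed values, then fill missing ones with the median."""
    transformed = [None if v is None else box_cox(v, lam) for v in values]
    fill = median(t for t in transformed if t is not None)
    return [fill if t is None else t for t in transformed]


# With lam = 0.5 the right tail is pulled in: 1, 4, 9 become 0, 2, 4.
result = box_cox_then_impute([1.0, 4.0, None, 9.0], lam=0.5)
print(result)
```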
Quantile binning with missing bins applies to interval variables with significant skewness, kurtosis, or outliers.
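Quantile binning can be sketched in a few lines: compute cut points from the observed values so that each bin holds roughly the same number of rows, and reserve a separate bin for missing values. The number of bins and the use of bin 0 for missing values are assumptions made for this example.

```python
from bisect import bisect_right
from statistics import quantiles


def quantile_bin(values, n_bins=4):
    """Assign each value to a quantile bin; bin 0 is reserved for missing."""
    observed = [v for v in values if v is not None]
    cuts = quantiles(observed, n=n_bins)  # n_bins - 1 cut points
    bins = [0 if v is None else 1 + bisect_right(cuts, v) for v in values]
    return bins, cuts


# The outlier (1000) lands in the top bin instead of stretching the scale.
amounts = [1, 2, 3, None, 4, 5, 6, 7, 1000]
bins, cuts = quantile_bin(amounts)
print(bins)
```

Because only rank matters, an extreme value has no more influence than any other value in the top bin.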
Decision (classification) tree binning and regression tree binning are used for both interval and nominal input variables. Like quantile binning, they address variables with significant skewness, kurtosis, or outliers. They are also commonly applied to high-cardinality nominal variables. The featureMachine action uses a classification tree for classification problems and a regression tree for regression problems.
MDLP binning applies a binning algorithm based on the minimum description length principle (MDLP). MDLP is a top-down supervised discretization technique. The conditions for applying it are similar to those for decision tree binning.
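To give a feel for what "supervised" and "minimum description length" mean here, the sketch below evaluates a single cut point using the Fayyad-Irani MDLP criterion: it picks the boundary with the highest information gain and accepts it only if the gain exceeds the MDL threshold. A full MDLP discretizer applies this test recursively to each side of an accepted cut; this sketch stops after one level, and it is not claimed to match the SAS implementation.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_mdlp_cut(xs, ys):
    """Return the best cut point if it passes the MDLP test, else None."""
    pairs = sorted(zip(xs, ys))
    n = len(pairs)
    labels = [y for _, y in pairs]
    base = entropy(labels)
    best = None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary between equal values
        left, right = labels[:i], labels[i:]
        gain = base - len(left) / n * entropy(left) - len(right) / n * entropy(right)
        if best is None or gain > best[0]:
            best = (gain, i)
    if best is None:
        return None
    gain, i = best
    left, right = labels[:i], labels[i:]
    k, k1, k2 = len(set(labels)), len(set(left)), len(set(right))
    delta = math.log2(3 ** k - 2) - (k * base - k1 * entropy(left) - k2 * entropy(right))
    threshold = (math.log2(n - 1) + delta) / n
    return (pairs[i - 1][0] + pairs[i][0]) / 2 if gain > threshold else None


clean_cut = best_mdlp_cut([1, 2, 3, 4, 10, 11, 12, 13], [0, 0, 0, 0, 1, 1, 1, 1])
noisy_cut = best_mdlp_cut([1, 2, 3, 4], [0, 1, 0, 1])
print(clean_cut, noisy_cut)
```

The first call finds a clean separation between the two classes and accepts the cut; the second finds no cut worth its description cost and returns `None`.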
Target encoding is for variables with a large number of distinct values. For regression problems, target encoding operators include:
For classification problems, target encoding operators include:
Label count and input count encodings (and their log transformations) are applicable to both regression and classification problems.
Date, time, and datetime transformations are specific to date, time, and datetime variables. They apply operators that extract a specific piece of information from these variables, such as year, month, day, day of the week, day of the month, leap year (binary indicator), and weekend (binary indicator), among others.
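The date operators are straightforward to picture. The sketch below extracts a handful of them with Python's standard library. The feature names and the ISO day-of-week numbering (1 = Monday through 7 = Sunday) are choices made for this example and may differ from what featureMachine produces.

```python
import calendar
from datetime import date


def date_features(d):
    """Extract simple calendar features from a date."""
    return {
        "year": d.year,
        "month": d.month,
        "day_of_month": d.day,
        "day_of_week": d.isoweekday(),  # ISO numbering: 1 = Monday ... 7 = Sunday
        "is_leap_year": int(calendar.isleap(d.year)),
        "is_weekend": int(d.isoweekday() >= 6),
    }


leap_day = date_features(date(2024, 2, 29))
saturday = date_features(date(2023, 7, 1))
print(leap_day)
print(saturday)
```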
Interaction features are not generated by default. If the interaction subparameter of the transformationPolicy parameter is set to True, interaction features are generated from variable pairs with strong interactions. The operators include various target encoding operators, which act on the binned cross product (frequency table) of the interacting variables.
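Target and count encodings, whether applied to one variable or to a crossed pair, boil down to replacing each level with a statistic computed over the rows that share it. The sketch below uses the mean of the target and the row count (with an optional log) on the cross product of two nominal variables. The variable names and data are made up, the inputs are assumed to be already binned, and no smoothing or out-of-fold scheme is applied, so this naive version would leak the target if used as is for model training.

```python
import math
from collections import Counter, defaultdict


def mean_target_encode(keys, target):
    """Replace each key with the mean target value of rows sharing that key."""
    sums, counts = defaultdict(float), defaultdict(int)
    for k, t in zip(keys, target):
        sums[k] += t
        counts[k] += 1
    return [sums[k] / counts[k] for k in keys]


def count_encode(keys, log=False):
    """Replace each key with its row count, or the natural log of that count."""
    counts = Counter(keys)
    return [math.log(counts[k]) if log else counts[k] for k in keys]


region = ["N", "N", "S", "S", "S", "N"]
channel = ["web", "store", "web", "web", "store", "web"]
sales = [10.0, 20.0, 30.0, 50.0, 45.0, 30.0]

pair = list(zip(region, channel))  # cross product of the two variables
pair_mean = mean_target_encode(pair, sales)
pair_count = count_encode(pair)
print(pair_mean)
print(pair_count)
```

Passing a single column instead of `pair` gives the ordinary, non-interaction version of the same encodings.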
generateShadowFeatures Action
generateShadowFeatures uses shadow features to select relevant features. A shadow feature contains values of the original feature chosen at random. The action generates these values by inverse sampling:
• Using empirical cumulative distribution for continuous variables
• Using empirical frequency distribution for nominal variables
This is accomplished in a single pass through the data. The nProbes parameter lets you specify the number of shadow features to generate per variable.
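The inverse sampling idea can be sketched as follows: draw a uniform random number and map it back through the empirical distribution of the original variable. The resulting shadow feature has the same marginal distribution as the original but no relationship with the target. This sketch sorts the data for clarity, so it is not the single-pass implementation described above; the function names and the `n_probes` argument (mirroring nProbes) are illustrative assumptions.

```python
import random
from bisect import bisect_left
from collections import Counter


def shadow_interval(values, rng):
    """Sample from the empirical CDF: floor(u * n) indexes an order statistic."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[min(int(rng.random() * n), n - 1)] for _ in values]


def shadow_nominal(values, rng):
    """Sample levels in proportion to their empirical frequencies."""
    counts = Counter(values)
    levels = sorted(counts)
    cumulative, running = [], 0
    for level in levels:
        running += counts[level]
        cumulative.append(running / len(values))
    last = len(levels) - 1
    return [levels[min(bisect_left(cumulative, rng.random()), last)] for _ in values]


def make_shadows(values, n_probes, sampler, seed=0):
    """Generate n_probes shadow copies of one variable."""
    rng = random.Random(seed)
    return [sampler(values, rng) for _ in range(n_probes)]


ages = [23, 35, 35, 41, 58, 62]
colors = ["red", "red", "blue", "green", "red", "blue"]
age_shadows = make_shadows(ages, n_probes=2, sampler=shadow_interval)
color_shadows = make_shadows(colors, n_probes=2, sampler=shadow_nominal)
print(age_shadows)
print(color_shadows)
```

A common way to use such features is to compare each real feature's importance with that of its shadows: a feature that scores no better than random copies of itself is a candidate for removal.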
selectFeatures Action
The selectFeatures action filters out features using a user-specified criterion (statistic). All the correlation statistics available in the exploreCorrelation action are also available here. If the user picks a filter criterion that is not applicable to any of the inputs, the default statistic, mutual information, is used.
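To illustrate filtering by mutual information, the sketch below scores each nominal feature against a nominal target and keeps the top-ranked ones. The `select_features` helper and its `top_k` argument are hypothetical and do not reflect the selectFeatures action's actual parameters or selection rule.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in nats) between two discrete variables."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log((c * n) / (px[x] * py[y])) for (x, y), c in pxy.items()
    )


def select_features(features, target, top_k):
    """Rank features by mutual information with the target; keep the top_k."""
    scores = {name: mutual_information(col, target) for name, col in features.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_k], scores


target = [0, 0, 1, 1, 0, 1, 0, 1]
features = {
    "signal": ["a", "a", "b", "b", "a", "b", "a", "b"],  # tracks the target exactly
    "noise": ["x", "y", "x", "y", "y", "x", "x", "y"],   # independent of the target
}
kept, scores = select_features(features, target, top_k=1)
print(kept, scores)
```

Mutual information is zero when a feature is independent of the target and grows as the feature becomes more informative, which makes it a reasonable default when no other statistic applies.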
Conclusion
So what is stopping you? Jump in and take advantage of these automated SAS tools and improve your feature engineering!