Class Variable One-Hot Encoding in SAS Visual Data Mining and Machine Learning


You are building a pipeline in SAS Visual Data Mining and Machine Learning (VDMML) and want to one-hot encode your class variables (also known as creating dummy variables or class level indicators) so that the individual class levels are available for analytical modeling. This tip describes how to do that.

I start by illustrating the one-hot encoding process with mocked-up data, then show how one-hot encoding is accomplished in SAS Visual Data Mining and Machine Learning. Finally, I describe how the class level variables are generated and the options available for controlling that process.

Class Variable One-Hot Encoding - Mocked-up Data

The data I'm demonstrating here is mocked-up demographic data. This small sample of data has four class variables with two levels each. Below are the variables and their levels:

Gender: Female, Male
Race: Black, White
Marital: No, Yes
Edlevel: College, High School

When running the One-Hot Encoding routine, the class variables are broken out into their individual levels, as follows:

gender --> gender_Female, gender_Male

race --> race_Black, race_White

marital --> marital_No, marital_Yes

edlevel --> edlevel_College, edlevel_High school

Individual class level variables are generated and populated with values of 0 or 1. The resulting data now contains 8 additional variables, two for each of the four class variables.

Below is the typical DATA step score code that would be generated for the Gender and Edlevel class variables:
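A minimal sketch of what such score code can look like follows. It is not the exact code that Model Studio generates: the data set names (work.demo, work.demo_encoded), the sample rows, and the underscore in edlevel_High_School are assumptions made for illustration.

```sas
/* Hypothetical mocked-up input; the rows are illustrative only */
data work.demo;
   length gender $6 edlevel $11;
   infile datalines dsd;          /* comma-delimited, so "High School" keeps its blank */
   input gender $ edlevel $;
   datalines;
Female,College
Male,High School
Female,High School
;
run;

/* One indicator per class level: a comparison evaluates to 1 when true, 0 otherwise */
data work.demo_encoded;
   set work.demo;
   gender_Female       = (gender  = 'Female');
   gender_Male         = (gender  = 'Male');
   edlevel_College     = (edlevel = 'College');
   edlevel_High_School = (edlevel = 'High School');   /* assumed name; blank replaced by underscore */
run;
```

Each row ends up with exactly one indicator set to 1 per class variable, provided the original value matches one of the listed levels.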

Class Variable One-Hot Encoding - SAS Visual Data Mining and Machine Learning

Log in to Model Studio (SAS Visual Data Mining and Machine Learning) and create a project, selecting your desired data. The example illustrated here uses home equity data, in which five of the input variables are class variables.

SAS Model Studio 8.x on SAS Viya 3, SAS Model Studio on SAS Viya 2020.1.1 - 2021.1.1: On the Pipelines tab for the project, add a SAS Code node to the Data node in the pipeline (in the 8.2 release of Model Studio, it was called the Code node). The SAS Code node is found in the Miscellaneous group. Click the SAS Code node.

Note the "Train only data" checkbox in the properties for the node. In the Model Studio 8.2 release, this checkbox is not functional. Ordinarily, if the data is partitioned and you only want the training data, selecting the "Train only data" checkbox would provide just the training data to this node, with the end result of excluding class levels that are not in the training data. There is still a way to use only your training data in the 8.2 release, which I address later in this article.

Note the "Code editor" Open button in the properties for the node. The Code editor provides the ability to enter and run any custom SAS code against your source data. Click the Open button. Copy and paste the provided SAS code into the SAS Code entry window (or Training Code entry window in later releases) that opens up. The GitHub link to this code is provided at the end of this article.

This code is required for Model Studio 8.1, 8.2, and 8.3 releases. Starting with the Model Studio 8.4 (Viya 3.4) release, all that's necessary is to enter the macro invocation code below. Documentation on the parameters for macro %dmcas_classlevs is provided near the end of this article.

%Include "&sourcefolder.&dm_dsep.dmcas_classlevs.sas";
%dmcas_classlevs(p_dummyvarrole=INTERVAL, p_maxnamelen=32)

Click the Save icon and the Close button to exit the SAS Code editor.

SAS Model Studio on SAS Viya 2021.1.2 and later: One-hot encoding has been incorporated into the Transformations node, eliminating the need for the SAS Code node. On the Pipelines tab for the project, add a Transformations node to the Data node in the pipeline. The Transformations node is found in the Data Mining Preprocessing group. Click the Transformations node and, within the Class Inputs group, select "One-hot encoding" for the Default class inputs method.

Run the pipeline. When the pipeline finishes executing, right-click the node (SAS Code or Transformations) and select Results. Two items of note in the results:

The Node Score Code contains the SAS code used to generate the Class Level Indicator variables:

The Output contains the Class Variable Mapping information for all Class input variables:

Exit the Results by clicking the Close button.

The purpose of generating the Class Level Indicator variables is to replace the original Class variables, using the indicator variables as inputs in downstream nodes of the pipeline. As such, the original Class variables still exist in the data, but they are flagged as rejected and are not used by succeeding nodes. To illustrate, add a Decision Tree modeling node (found in the Supervised Learning group) after the node that was just run (SAS Code or Transformations), and run the pipeline to execute this node.

When the run completes, open the results for the Decision Tree node and expand the Variable Importance item. It lists Class Level Indicator variables rather than the original Class variables.

Class Variable One-Hot Encoding - Additional Details

The source code generates the class level indicators (values of 0 or 1) for all class variables identified in metadata. The class level indicator variable names are derived as <ClassVariableName>_<ClassLevel>. If a derived variable name is longer than the maximum name length (32 by default), the class level part of the name is trimmed so that the full name fits within the maximum. Note that SAS supports variable names no longer than 32 bytes. Any duplicates among the generated class level names are resolved by using the generic name _CLASSLEVn (_CLASSLEV1, _CLASSLEV2, etc.) for the duplicates.
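The trimming rule can be sketched in a short DATA step. This illustrates the rule as described above, not the macro's actual implementation: the variable names (varname, level, room, dummyname) and the sample class level are hypothetical, and the _CLASSLEVn duplicate resolution is not shown.

```sas
data _null_;
   length varname $32 level $64 dummyname $32;
   maxlen  = 32;                                  /* plays the role of p_maxnamelen */
   varname = 'edlevel';
   level   = 'Postgraduate_Professional_Degree';  /* hypothetical long class level */

   /* characters left for the level after the variable name and the underscore */
   room = maxlen - length(varname) - 1;

   /* keep the whole level if it fits, otherwise only its first ROOM characters */
   dummyname = cats(varname, '_', substrn(level, 1, room));

   put dummyname=;                                /* writes the derived name to the log */
run;
```

Only the class level part is shortened; the class variable name and the underscore are kept intact, so the derived name never exceeds maxlen characters as long as the variable name itself leaves room for the underscore.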

The source code defines the SAS macro %dmcas_classlevs. Among the parameters defined on this macro, p_trainonly allows you to run the node with just the training data in Model Studio 8.2. The macro call with all three parameters (and their default values) is shown below.

%dmcas_classlevs(p_dummyvarrole=INTERVAL, p_maxnamelen=32, p_trainonly=NO)

Macro parameter descriptions:

p_dummyvarrole: Specifies the level for the Class Level variables. Possible values are INTERVAL or BINARY. If INTERVAL, the Class Level variables are populated with numeric values 0 or 1, and the variables have Level of INTERVAL. If BINARY, they are populated with character values '0' or '1', and the variables have a Level of BINARY. Defaults to INTERVAL if blank.

p_maxnamelen: Specifies the maximum variable name length for the generated Class Level variables. Currently SAS supports a variable name length no greater than 32 bytes. Defaults to 32 if blank.

p_trainonly: Specifies whether all data is used for determining the Class variable levels, or just the Training data. Possible values are YES or NO. If YES, Training data is used to determine the Class variable levels in the data. If NO, all data is used. Defaults to NO if blank.
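As an example of non-default settings, the call below asks for BINARY indicators and derives the class levels from the training data only. It assumes the macro has already been defined in the SAS Code node, either by pasting the source code or through the %Include statement shown earlier.

```sas
/* Character '0'/'1' indicators with Level BINARY; class levels determined from training data only */
%dmcas_classlevs(p_dummyvarrole=BINARY, p_maxnamelen=32, p_trainonly=YES)
```

With p_trainonly=YES, class levels that appear only in the validation or test partitions do not get their own indicator variables.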

Class Variable One-Hot Encoding - Where Can I Get the Code?

The SAS program to generate Class Level Indicator variables can be accessed from the GitHub repository:

Download the SAS code (GitHub)