Tip: How to include the SAS Code node in SAS® Visual Data Mining and Machine Learning's Model Studio

The SAS Code node can be a very powerful tool to include in your Model Studio pipeline in a SAS Visual Data Mining and Machine Learning project. It allows you to insert SAS code into your pipeline to tailor the data mining and machine learning process to your needs. The code editor inside the SAS Code node provides macros and macro variables that represent different elements of your project programmatically in your code. A few examples are the name of the target, lists of variables (interval inputs, class inputs, and so on), the names of data sets and CAS tables, and score code files. And the syntax coloring and auto-complete features of the code editor are nice bonuses!

The SAS Code node is listed in the Miscellaneous section of nodes, and there are several ways you can incorporate it into your pipeline:

as a Data Mining Preprocessing node that creates score code and/or modifies the metadata passed to a subsequent node;
as a Supervised Learning node that creates either score code or a scored data set containing predictions, so that the model can be assessed, compared with other predictive models, and explained via model interpretability methods;
as a terminal node for data summary, visualization, and similar tasks.

To demonstrate each of these, I will point to examples in our GitHub repository.

Files on GitHub

Data Mining Preprocessing node

Let’s first look at creating score code and/or modifying the metadata to be used in a subsequent Data Mining Preprocessing or Supervised Learning node.

In a Model Studio pipeline, the data you are using in your project can only be modified by creating new variables via score code; you can’t modify the data directly. This is true for tasks from feature engineering to filtering or subsetting your data. You can also change metadata for any variable in your project data other than your target. Metadata is information about the variables in your data, including each variable’s role, measurement level, order (for class variables), transformation, imputation, and filtering/replacement limits. This information is all represented in a data set that has one row per variable. You can make changes to the metadata in the Data tab, in a Manage Variables node, or in the SAS Code node as shown in the examples below.

One common preprocessing task that can be performed in the SAS Code node is excluding a subset of observations from training. In the subset_data folder of the repository, there is example code that you can use in your SAS Code node that writes out the score code to create a filter_flag variable with values 1 (indicating the observations to be filtered out, that is, excluded) and 0 (for observations to include for training).

The subset_data example excludes observations based on the values of one of the inputs, though this can be tweaked for filtering in other ways to suit your needs. Then the dmcas_metaChange macro is used to set the role of this variable to be FILTER and its level to be BINARY. This tells subsequent nodes to treat this variable as a filter based on each observation’s value (1 or 0) as described above. Note that the dmcas_metaChange macro can also be used to set the level, transformation, imputation, etc. for your variables.
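
As a rough sketch of what such a node might contain (this is not the repository code itself): the block below assumes a hypothetical interval input named INCOME and an arbitrary cutoff, and it assumes that the score code is written to the file represented by the dm_file_scorecode macro variable. The dmcas_metaChange call uses the NAME=, ROLE=, and LEVEL= parameters to set the role and level described above.

```sas
/* Write DATA step score code that creates the filter flag.            */
/* Assumptions: INCOME is a hypothetical interval input, and 100000 is */
/* an arbitrary cutoff chosen only for illustration.                   */
data _null_;
   file "&dm_file_scorecode.";
   put 'if INCOME > 100000 then filter_flag = 1;';   /* 1 = exclude from training */
   put 'else filter_flag = 0;';                      /* 0 = keep for training     */
run;

/* Tell subsequent nodes to treat the new variable as a binary filter. */
%dmcas_metaChange(name=filter_flag, role=FILTER, level=BINARY);
```

To filter on a different condition, only the IF expression written by the PUT statements needs to change; the metadata call stays the same.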

For you advanced SAS programmers, an alternative to using the dmcas_metaChange macro is to write code directly to the file represented by the macro variable dm_file_deltacode to modify the metadata data set. This lets you change the metadata for particular variables programmatically via a SAS DATA step, as the log_transform_for_skewed_inputs example on GitHub does. That example sets the transformation (represented by the TRANSFORM column of the metadata) to Log for interval inputs with a skewness over a certain threshold; the transformation would then be applied in a subsequent Transformations node.
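
A minimal sketch of the delta-code approach, under several assumptions of mine: WORK.SKEWED_VARS is a hypothetical data set with one row per variable to transform (with the variable name in a NAME column), the delta code consists of DATA step statements that are applied to the metadata data set, and the metadata stores the variable name in NAME and the Log transformation as the value 'LOG'. Check the GitHub example for the exact values it writes.

```sas
/* Assumption: WORK.SKEWED_VARS is a hypothetical data set listing the  */
/* interval inputs whose skewness exceeded your chosen threshold.       */
data _null_;
   set work.skewed_vars;
   file "&dm_file_deltacode.";
   length stmt $200;
   /* Build one metadata-update statement per listed variable, such as: */
   /*    if upcase(NAME) = 'INCOME' then TRANSFORM = 'LOG';             */
   /* The value 'LOG' is an assumption about how Log is encoded.        */
   stmt = cats("if upcase(NAME) = '", upcase(name), "' then TRANSFORM = 'LOG';");
   put stmt;
run;
```

Generating the statements from a data set, rather than typing one dmcas_metaChange call per variable, is what makes this approach useful when the list of variables is only known after some computation.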

The SAS file in the class_level_indicators folder is another example of code for creating new variables, this time for inputs, and setting their metadata. SAS analytics intrinsically handle class or categorical variables, typically by creating dummy variables or class-level indicators under the covers (also known as one-hot encoding), but there could be other reasons for wanting to work with these class-level indicators. For example, you might be using the Open Source Code node and need to perform one-hot encoding of your class inputs for Python or R. This code would go into a SAS Code node in the Data Mining Preprocessing lane of your pipeline (added after either the Data node or another Data Mining Preprocessing node). By default, it sets the original inputs that are being encoded as rejected, and the new class-level indicators are used as inputs in their place.

Note that if your data is partitioned and you only want to use the training data for creating your score code or basing decisions for any metadata changes, you can check the “Training data only” property of the SAS Code node. Then the dm_data macro variable will represent only the training partition of your data.

Supervised Learning node

Now let’s explore using the SAS Code node as a Supervised Learning node.

The proc_samples folder of the GitHub repository contains a few examples for building a predictive model in a SAS Code node. If your SAS code generates score code for the predictive model, you can move the SAS Code node to the Supervised Learning group by right-clicking on it and selecting Move > Supervised Learning. After you have entered your code and saved it in the node, assessment and model comparison are performed automatically when you run the node.

The score code can be in one of two formats: a DATA step score code file or an analytic store. The proc_logselect.sas file shows how to use the dm_file_scorecode macro variable to represent your DATA step score code file, and proc_gradboost.sas shows how to use the dm_data_rstore macro variable to represent your analytic-store file. When you run the SAS Code node with the code from either of these examples (note that you must use a binary target for the LOGSELECT example), you will see the assessment statistics that are provided to you in the results of the node: lift plots, ROC plots, and fit statistics for a class target, or prediction plots and fit statistics for an interval target.
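
For the DATA step score code case, the pattern might look roughly like the sketch below. It is not a copy of proc_logselect.sas. The names &dm_dec_target, %dm_class_input, and %dm_interval_input are the target and input-list macros as I recall them from Model Studio; treat them as assumptions and confirm them in the macro list of the code editor.

```sas
/* Fit a logistic regression on the project data (binary target).      */
/* Assumption: the macro names below match those in your code editor.  */
proc logselect data=&dm_data;
   class &dm_dec_target %dm_class_input;
   model &dm_dec_target = %dm_class_input %dm_interval_input;
   /* Write DATA step score code to the file that Model Studio reads;  */
   /* PCATALL requests predicted probabilities for all target levels.  */
   code file="&dm_file_scorecode." pcatall;
run;
```

For an analytic store, the idea is the same, except that the procedure saves its model state to the location represented by dm_data_rstore instead of writing a score code file.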

The model will also be included in the results of the Model Comparison node for evaluating its performance against other supervised learning models. Additionally, you can include reports for explaining model inputs and/or predictions at a cluster level for your model by selecting the various model interpretability properties that are available in the SAS Code node as in other Supervised Learning nodes.

Terminal node

Finally, you might want to use the SAS Code node purely for data exploration, summary, and/or visualization, as a terminal node.

You can use your favorite SAS procedures to create ODS output or use the dmcas_report macro to create your own plots of data sets that are in the SAS library represented by the dm_lib macro variable. This enables you to create tailored reports in your results such as bar charts, series plots, pie charts, scatter plots, and tables. The SAS code provided in the cluster_profiling folder of the GitHub repository has examples for creating several of these report types.
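
As a hedged illustration of the dmcas_report pattern: the sketch below summarizes the target with PROC FREQ, saves the counts in the dm_lib library, and requests a table and a bar chart of them. The report parameter names (dataset=, reportType=, category=, response=, description=) are as I recall them from the SAS Code node documentation, and &dm_dec_target is assumed to hold the target name, so verify both before relying on this.

```sas
/* Count the observations in each target level and save the result     */
/* in the library that the node uses for report data sets.             */
proc freq data=&dm_data noprint;
   tables &dm_dec_target / out=&dm_lib..target_freq;
run;

/* Show the counts as a table in the node results.                     */
%dmcas_report(dataset=target_freq, reportType=Table,
              description=%nrbquote(Target level counts));

/* Show the same counts as a bar chart (COUNT is created by PROC FREQ). */
/* Assumption: BarChart takes category= and response= parameters.       */
%dmcas_report(dataset=target_freq, reportType=BarChart,
              category=&dm_dec_target, response=COUNT,
              description=%nrbquote(Target distribution));
```

Note that the data set is referenced in the macro by name only, because the macro looks for it in the library represented by dm_lib.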

I hope these examples help you get started with using the SAS Code node to perform custom analyses and other tasks that extend the functionality of Model Studio.

More resources

SAS Code node examples
SAS Code details