Tips for Using the Open-Source Code Node as an Open-Source User


The Open-Source Code Node (OSCN) in Model Studio allows Python and R models to be incorporated alongside SAS models in a single pipeline, so that Python, R, and SAS models can be trained, scored, and evaluated side by side in one interface. The OSCN is not designed to replace standard integrated development environments (IDEs) such as Jupyter Notebook or RStudio, so it is expected (and even encouraged!) that open-source users first write their code in their preferred IDE and then copy and paste it into the OSCN. However, the OSCN is server-based and has its own functional requirements, so scripts written in a desktop IDE need some adjustments when they are moved across.

This article explains how the OSCN works and offers best practices, tips, and tricks for open-source users working with it.

How the Open-Source Code Node Works

Let's start by understanding how the OSCN works - with two key concepts:

Data Movement

Firstly, data exists in memory on Viya as a CAS table. When an OSCN is run, the data is moved out of CAS and sent to the open-source environment as a DataFrame or a CSV file. This is necessary because SAS Viya does not execute Python or R itself; it requires a Python/R environment to be configured on the server (see the documentation for how to configure one). After successful execution, the results can be sent back to Viya for model assessment and evaluation.

What does this mean?

1. Data leaves CAS as a CAS table and is submitted to the open-source environment as either a DataFrame or a CSV file. Because a DataFrame is the default, environment administrators should ensure that pandas is installed. This behavior is controlled by the "Generate data frame" node option, which is checked by default.

2. Because data has to move from CAS to the open-source environment, the node by default samples only 10,000 observations to avoid large, potentially time-consuming data transfers. This setting can be changed under "Data Sample" in the node options.
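One way to confirm how much data actually reached the script is to print the shape of the input DataFrame at the top of the code. The sketch below is illustrative only: it builds a small stand-in for dm_inputdf so that it can run outside Model Studio, and the helper name report_sample is hypothetical.

```python
import pandas as pd

# Stand-in for the frame the node provides; inside the OSCN, dm_inputdf
# already exists and this line would be removed.
dm_inputdf = pd.DataFrame({"age": [34, 51, 29], "bad": [0, 1, 0]})

DEFAULT_SAMPLE_SIZE = 10_000  # default "Data Sample" size described above


def report_sample(df, sample_size=DEFAULT_SAMPLE_SIZE):
    """Return a one-line summary of how many rows the script received."""
    rows, cols = df.shape
    note = " (matches the default sample size)" if rows == sample_size else ""
    return f"Received {rows} rows and {cols} columns{note}"


print(report_sample(dm_inputdf))
```

If the reported row count equals the default sample size, it is worth checking whether the full table was intended and adjusting the node option if so.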

Variables & Variable Naming

Secondly, the OSCN relies on specific variable names in order to execute successfully and return results to Viya. For example, model scores must be stored in a variable named dm_scoreddf to be returned to Viya for model comparison. Therefore, when moving code written in an IDE into Model Studio, rename all relevant variables to match what the node requires.

Generally, the variables to rename to ensure the OSCN works are:

Input DataFrame to dm_inputdf
Input training partition to dm_traindf (if the data is partitioned)
Output DataFrame containing the scores to dm_scoreddf
Column names of the output DataFrame to the names held in dm_predictionvar
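The renaming above can be sketched as follows. This is a minimal illustration, not the node's own code: the stand-in data, the dm_dec_target name for the target variable, and the prediction column names are assumptions made so the example runs outside Model Studio, and the "model" is just the event rate of the training partition. Check the documentation for the exact variable names and column order your project expects.

```python
import pandas as pd

# Stand-ins so the sketch runs outside Model Studio. Inside the OSCN these
# are supplied by the node and must not be redefined.
dm_inputdf = pd.DataFrame({
    "income": [52, 38, 61, 45],
    "bad": [0, 1, 0, 1],
})
dm_traindf = dm_inputdf.iloc[:3]          # hypothetical training partition
dm_dec_target = "bad"                     # assumed name of the target variable
dm_predictionvar = ["P_bad1", "P_bad0"]   # hypothetical prediction columns

# Placeholder "model": the event rate observed in the training partition.
event_rate = float(dm_traindf[dm_dec_target].mean())

# Score every row of the input frame and store the result under the
# name the node expects, using the expected column names.
dm_scoreddf = pd.DataFrame(
    {
        dm_predictionvar[0]: [event_rate] * len(dm_inputdf),
        dm_predictionvar[1]: [1.0 - event_rate] * len(dm_inputdf),
    },
    index=dm_inputdf.index,
)
print(dm_scoreddf.head())
```

In a real script, the placeholder would be replaced by a fitted model's predicted probabilities; the key point is that the result has one row per row of dm_inputdf and is assigned to dm_scoreddf.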

For an exhaustive list of variable names and explanations, please refer to the documentation. In the node user interface, the OSCN variables are also listed on the left, and hovering over a variable shows what it is used for.

Tips for Using the Open-Source Code Node

Now that we understand the basics of how the OSCN works, here are some tips and tricks for using it. It's important to note that, because of how the OSCN works, it handles a couple of data manipulation steps for you, which makes writing code easier in some respects.

Together with the functional requirements above, these tips form the basis of a checklist that any open-source user can follow when moving code from their IDE into Model Studio.

Loading Data

Since the data is already held in memory on Viya and passed to the node, there is no need to load it into a DataFrame yourself using pandas, as is generally required when working in an IDE.

Specifically, statements such as the following are no longer required:

import pandas as pd
dm_inputdf = pd.read_csv('/home/user/inputdata.csv')

Since the input DataFrame is already defined as dm_inputdf and the pandas module itself is imported automatically, statements like the ones above should be removed in order for the code to work in the OSCN.
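If you would rather keep a single script that runs both in your IDE and in the OSCN, one possible pattern is to load a local copy only when dm_inputdf has not already been provided. The sketch below is an assumption-laden illustration: it assumes the node-provided frame is visible as a global name, the helper get_input_frame is hypothetical, and an in-memory CSV stands in for a local file.

```python
import io

import pandas as pd

# Hypothetical stand-in for a local CSV file used during IDE development.
LOCAL_CSV = io.StringIO("age,income,bad\n34,52000,0\n51,38000,1\n")


def get_input_frame(namespace, local_source):
    """Return the node-provided frame if present, else load the local copy."""
    if "dm_inputdf" in namespace:
        # Running inside the node: use the frame that was passed in.
        return namespace["dm_inputdf"]
    # Running in an IDE: fall back to reading the local data.
    return pd.read_csv(local_source)


input_frame = get_input_frame(globals(), LOCAL_CSV)
print(input_frame.shape)
```

Removing the loading statements entirely, as described above, remains the simplest option; the guard is only a convenience for development.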

Input Data Definition

Since the DataFrame is defined automatically, you can start manipulating the data right away. Specifically, the input DataFrame is created automatically from the metadata set in the Data tab of Model Studio and, if specified in the project settings, from the partitioning as well.

Every time an OSCN is executed, pre-processing code runs automatically to define the input DataFrame that is sent to the Python/R environment. This code can be viewed in the Results tab of a successfully run OSCN.

Opening it shows how the input DataFrame is defined: which variables are designated as class versus interval, what the target variable is, and so on.

As a user, you still have full control over manipulating the input data and creating whatever you wish, but there is added convenience because the definition of your input data, and even the train/test sample, is done for you. Changes made to the metadata in the Data tab change the input DataFrame definition, and the partition can be adjusted in the project settings.
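Because the metadata is passed into the script, features can be selected from the node's variable lists instead of hard-coded column names, so the code keeps working when roles change in the Data tab. The following is a minimal sketch under stated assumptions: the data is a stand-in, and dm_class_input and dm_dec_target are names assumed here for the class inputs and the target; confirm them against the variable list in the node.

```python
import pandas as pd

# Stand-ins so the sketch runs outside Model Studio.
dm_inputdf = pd.DataFrame({
    "income": [52.0, 38.0, 61.0],
    "region": ["north", "south", "north"],
    "bad": [0, 1, 0],
})
dm_interval_input = ["income"]   # interval inputs from the Data tab metadata
dm_class_input = ["region"]      # assumed name for the class inputs
dm_dec_target = "bad"            # assumed name for the target variable

# Build the feature matrix from the metadata lists rather than
# hard-coding column names, one-hot encoding the class inputs.
features = pd.get_dummies(
    dm_inputdf[dm_interval_input + dm_class_input],
    columns=dm_class_input,
)
target = dm_inputdf[dm_dec_target]
print(features.columns.tolist())
```

If a variable is later rejected or its level is changed in the Data tab, the lists change with it and the feature matrix follows without edits to the script.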

Error Diagnosis

What happens if your code doesn't execute successfully and you need to see the error log? The Results tab in Model Studio shows any output from the Python/R interpreter, including errors.

To find it, right-click the Log tab:

The error message should state "Encountered error code 1 when executing [the open-source] program". Click on that error message:

This will move the view further down the log. At this point, scroll the log up until you see the open-source output. This is commonly identified by looking for text containing "SASJavaExec", which is the process SAS Viya uses to communicate with the open-source environment.

In this example, the log shows a NameError: the variable dm_interval_input was misspelt in the code.
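To illustrate this kind of failure, the sketch below reproduces a misspelt variable name in plain Python. The misspelling dm_intervl_input is hypothetical and chosen only for this example; a NameError is raised at run time, when the undefined name is first used, which is why it only shows up in the log after the node has started executing.

```python
# Stand-in for the list the node provides.
dm_interval_input = ["age", "income"]


def count_interval_inputs():
    # Deliberate, hypothetical typo: 'dm_intervl_input' is never defined.
    return len(dm_intervl_input)


try:
    count_interval_inputs()
except NameError as err:
    # The message names the undefined variable, which is the text to
    # look for in the node log.
    error_text = str(err)
    print("NameError:", error_text)
```

The message includes the undefined name, so comparing it against the variable list in the node usually reveals the typo quickly.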

Prepending Python Code

For Python code only, Model Studio has an option to prepend custom Python code. That is, if there is Python code you wish to run before every Python script (for example, printing the Python version information), this can be configured in the project settings as well:
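As an illustration of the kind of snippet that could be prepended, the following prints the interpreter version and location so that they appear in every node's log. It uses only the Python standard library; the variable name version_line is arbitrary.

```python
import platform
import sys

# Record which interpreter executed the node, which helps when several
# Python environments are configured on the server.
version_line = f"Python {platform.python_version()} at {sys.executable}"
print(version_line)
```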

Conclusion

My goal with this blog post was to provide help for any open-source user using the OSCN for the first time, or any others working with the OSCN. Any feedback or additional suggestions are much appreciated! Thanks for reading!