Step-by-step guide for using Open-Source models in SAS Visual Forecasting

SAS Visual Forecasting (VF) can distribute open-source code (Python and R) to run in parallel in the cloud, on the same nodes where SAS Viya is installed. This lets you scale forecasting processes developed in open source (OS) to millions of series and move away from ungoverned, inconsistent, and error-prone processes. To understand in more detail how the parallelization is achieved, you can have a look at the paper here. This article is a technical end-to-end guide on how to set up SAS Visual Forecasting to integrate with OS, how to develop and incorporate OS code inside SAS VF code, and finally how to achieve this integration from the UI environment, where you can compare OS and SAS models at a series-by-series level while also taking advantage of the interactive exploration capabilities of SAS VF’s UI. The steps are as follows:

Deployment and Configuration
1. Configure open-source languages
2. Install packages
3. Configure EXTLANG
Model Development
1. Write the open-source code file
2. Integrate the open-source code in PROC TSMODEL
Integration in Model Studio
1. Modify existing VF forecast code node in UI
2. Customize the node

I. Deployment and Configuration

Admins should perform the steps below:

Step 1 - Open-Source languages configuration

In SAS Viya 4, the Kubernetes administrator must make Python and R accessible in the environment via persistent volumes. Further details on integrating the Python/R environment can be found in the deployment files:

Configuring Python for Viya 4: ($deploy/sas-bases/examples/sas-open-source-config/python/README.md)
Configuring R for Viya 4: ($deploy/sas-bases/examples/sas-open-source-config/r/README.md)

Step 2 - Install packages

Install the packages required for your project on the Python or R volumes created in the previous step. When installing packages for the first time, consider popular pre-configured environments that cover most forecasting needs, so that data scientists can start developing directly in the environment provided to them. The idea, however, is for data scientists to first experiment with new OS forecasting packages locally on a small scale. When they find an OS algorithm that looks promising, they pass the required package to the admins for further validation prior to pushing it to the server. In this way, a good balance of governance and control is achieved between data scientists and IT.

Step 3 - Configure EXTLANG for external languages execution in CAS

The external languages (EXTLANG) package provides objects that enable the integration of external-language programs into SAS environments. The EXTLANG package supports Python (versions 2.6.6–2.7.7 and 3.3 and higher) and R (versions 3.2.5 and higher). Finally, the objects in this package enable you to specify which variables should be shared between the two environments. To configure EXTLANG:

Prepare the XML configuration file that enables Python and R integration in CAS. This configuration file allows you to:

Specify paths to the external languages’ executables
Enable loading source code from files (with the diskAllowlist setting)

<EXTLANG version="1.0" mode="ANARCHY" allowAllUsers="ALLOW">
   <DEFAULT scratchDisk="/tmp" diskAllowlist="/">
       <LANGUAGE name="R" interpreter="/R/R-4.0.2/lib64/R/bin/Rscript"> 
           <ENVIRONMENT name="LD_LIBRARY_PATH" value="/R/R-4.0.2/lib64"> 
           </ENVIRONMENT> 
       </LANGUAGE>
       <LANGUAGE name="PYTHON3" interpreter="/python/miniconda3/envs/forecast_scb/bin/python"> </LANGUAGE>
   </DEFAULT>
</EXTLANG>

For more details about the options, see External Languages Access Control Configuration.
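
Before deploying the file, it can help to confirm that it parses and that each LANGUAGE entry points where you expect. The following is a minimal sketch in Python using only the standard library. The helper name list_interpreters is hypothetical, and the embedded XML simply repeats the example configuration from this step.

```python
import xml.etree.ElementTree as ET

# Same content as the example configuration in this step.
EXTLANG_XML = """<EXTLANG version="1.0" mode="ANARCHY" allowAllUsers="ALLOW">
   <DEFAULT scratchDisk="/tmp" diskAllowlist="/">
       <LANGUAGE name="R" interpreter="/R/R-4.0.2/lib64/R/bin/Rscript">
           <ENVIRONMENT name="LD_LIBRARY_PATH" value="/R/R-4.0.2/lib64">
           </ENVIRONMENT>
       </LANGUAGE>
       <LANGUAGE name="PYTHON3" interpreter="/python/miniconda3/envs/forecast_scb/bin/python"> </LANGUAGE>
   </DEFAULT>
</EXTLANG>"""


def list_interpreters(xml_text):
    """Return {language name: interpreter path} for every LANGUAGE element."""
    root = ET.fromstring(xml_text)  # raises ParseError if the XML is malformed
    return {
        lang.get("name"): lang.get("interpreter")
        for lang in root.iter("LANGUAGE")
    }


interpreters = list_interpreters(EXTLANG_XML)
for name, path in interpreters.items():
    print(f"{name}: {path}")
```

When run on a CAS node, the same loop could also call os.path.exists on each path to confirm that the interpreter is visible from that node.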

Set the SAS_EXTLANG_SETTINGS environment variable in kustomization.yml in $deploy/site-config/sas-open-source-config/python OR $deploy/site-config/sas-open-source-config/r to point to the path of the XML file you just created. The path must be accessible to the CAS controller(s) and all CAS workers.

configMapGenerator:
- name: sas-open-source-config-r
  literals:
  #- DM_RHOME=/R/R-4.0.2/lib64/R/
  - SAS_EXTLANG_SETTINGS=/R/R-4.0.2/sas/extlang_config-r-and-python.xml
  #- SAS_EXT_LLP_R=/R/R-4.0.2/lib64/R/lib/

Note: You only need to set SAS_EXTLANG_SETTINGS for EXTLANG to work. The other variables affect different SAS products.
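
Because the XML file must be reachable from the CAS controller and every worker, a small check that can be run on each node may save debugging time later. The sketch below is one possible way to do this in Python. The function name extlang_settings_status is hypothetical, and it only tests that the variable is set and that the file is readable, not that CAS has actually picked it up.

```python
import os


def extlang_settings_status(environ=None):
    """Report whether SAS_EXTLANG_SETTINGS is set and points to a readable file."""
    environ = os.environ if environ is None else environ
    path = environ.get("SAS_EXTLANG_SETTINGS")
    if not path:
        return "SAS_EXTLANG_SETTINGS is not set"
    if not os.path.isfile(path):
        return f"{path} does not exist on this node"
    if not os.access(path, os.R_OK):
        return f"{path} exists but is not readable"
    return "ok"


# Check the environment of the current process.
print(extlang_settings_status())
```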

II. Model development

Data scientists should perform the steps below:

1. Write the open-source code file

We start the model development by writing the open-source code we want to use for the forecast. Here we use the Prophet algorithm in Python.

Important note: this code will be executed one time for each time series in our dataset. SAS Visual Forecasting will handle the distribution of the processing.

Data inputs:

Y: Our dependent variable array of values (for one time series)
DS: the date/datetime values

Parameters:

NFOR: Number of steps in the time series
HORIZON: the number of steps to predict

Output:

PRED: array of predicted values (of size NFOR + HORIZON)

from prophet import Prophet
import numpy as np
import pandas as pd

# init DataFrame
df = pd.DataFrame({'ds': DS, 'y': Y}) 
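# convert SAS datetime values (seconds since 1960-01-01) to pandas timestamps;
# for SAS date values (days since 1960-01-01), use unit='D' instead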
df.ds = pd.to_timedelta(df.ds, unit='s') + pd.Timestamp('1960-1-1')

# Prophet Fit/Predict
m = Prophet()
m.fit(df.iloc[:(int(NFOR) - int(HORIZON))])
future = m.make_future_dataframe(periods=int(HORIZON))
forecast = m.predict(future)

# Output
PRED = np.array(forecast['yhat'])
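
Before handing the script to SAS VF, it can be useful to exercise its inputs and output locally. The sketch below is not the Prophet script: it swaps in a naive last-value forecast, chosen only so that the example runs without Prophet, and the helper names sas_to_datetime and run_contract are hypothetical. It follows the slicing used in the script (fit on the first NFOR - HORIZON values, then predict HORIZON further steps) and illustrates that SAS datetime values count seconds from 1 January 1960 while SAS date values count days.

```python
from datetime import datetime, timedelta

SAS_EPOCH = datetime(1960, 1, 1)  # origin of SAS date and datetime values


def sas_to_datetime(value, unit="seconds"):
    """Convert a SAS datetime (seconds) or SAS date (days) to a datetime."""
    if unit == "seconds":
        return SAS_EPOCH + timedelta(seconds=value)
    if unit == "days":
        return SAS_EPOCH + timedelta(days=value)
    raise ValueError(f"unsupported unit: {unit}")


def run_contract(Y, DS, NFOR, HORIZON):
    """Mimic the script's inputs and output with a naive last-value forecast."""
    n_fit = int(NFOR) - int(HORIZON)  # observations used for fitting
    history = list(Y[:n_fit])
    dates = [sas_to_datetime(v) for v in DS[:n_fit]]
    # Stand-in model: echo the history, then repeat the last observed value.
    pred = history + [history[-1]] * int(HORIZON)
    return dates, pred


# Ten daily steps, the last three of which are the forecast horizon.
DS = [i * 86400 for i in range(10)]  # SAS datetime values in seconds
# Placeholder NaN values for the horizon are an assumption of this sketch.
Y = [float(i) for i in range(7)] + [float("nan")] * 3
dates, PRED = run_contract(Y, DS, NFOR=10, HORIZON=3)
print(dates[-1], len(PRED))
```

Once the contract behaves as expected, the stand-in model can be replaced by the real fit and predict calls.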

2. Integrate the open-source code in PROC TSMODEL

The next step is to call this code file from within the SAS Visual Forecasting TSMODEL procedure, which will allow you to:

perform data preparation and accumulation before passing the data to the OS code, taking advantage of the automatic capabilities of the SAS VF procedure
distribute the execution of the OS algorithm across the distributed in-memory compute engine of SAS Viya

We can run this Python code from SAS with PROC TSMODEL or via the Time Series Processing action set, which is also callable from Python and R, using the included PYTHON2, PYTHON3, and R objects. Here we use PROC TSMODEL, since SAS code is needed to integrate it as a node in Model Studio. For this example, we use the PYTHON3 object, which allows us to interact with the Python interpreter specified in the <LANGUAGE name="PYTHON3"> section of the XML file. The first step is to initialize the object.

declare object py(PYTHON3);
rc = py.Initialize();

There are three ways to specify the open-source code you want to run from PROC TSMODEL:

Line by line

rc = py.PushCodeLine("import numpy as np");
rc = py.PushCodeLine("w = np.ones(7)/7");
rc = py.PushCodeLine("nans = np.empty(6) ; nans[:] = np.nan");
rc = py.PushCodeLine("y_p = np.concatenate((nans,Y))");
rc = py.PushCodeLine("MAVG = np.convolve(y_p, w, mode='valid')");

Using a file path

rc = py.PushCodeFile('/shared/python_mavg_code.py');

Using a CAS table that contains the code

rc = py.PushCodeFromTable(INEXTCODE_Object, Name);

The most convenient method is the second one, and we will use it in our example. It requires configuring the diskAllowlist setting in the EXTLANG configuration file so that the file system can be accessed (see the Deployment and Configuration part).
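
To make the line-by-line example concrete: the pushed lines compute a trailing 7-point moving average. The series is padded with six NaN values so that the result has the same length as Y, and its first six entries are NaN. The sketch below reproduces that calculation in plain Python, without NumPy, so it can be read and run on its own. The function name trailing_mavg is hypothetical, and results may differ from the NumPy version by floating-point rounding.

```python
import math


def trailing_mavg(y, window=7):
    """Trailing moving average; positions without a full window are NaN."""
    out = []
    for i in range(len(y)):
        if i < window - 1:
            out.append(math.nan)  # not enough history yet
        else:
            chunk = y[i - window + 1 : i + 1]
            out.append(sum(chunk) / window)
    return out


Y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
MAVG = trailing_mavg(Y)
print(MAVG)
```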

We also need to map the column names in the input dataset to the variable names used in the Python code, as shown in the following code snippet:

*mapping variables, parameters and columns names;
rc = py.AddVariable(Revenue, 'ALIAS', 'Y') ;
rc = py.AddVariable(SAS_DATE, 'ALIAS', 'DS') ;
rc = py.AddVariable(PRED, "READONLY", "FALSE") ;
rc = py.AddVariable(_LENGTH_, 'ALIAS', 'NFOR') ;
rc = py.AddVariable(_LEAD_,'ALIAS','HORIZON') ;

*load the python file;
rc = py.PushCodeFile('/files/python_prophet_code.py') ;

We will also declare two additional objects, OUTEXTLOG and OUTEXTVARSTATUS, to store execution logs and variable statuses, respectively.

declare object pylog(OUTEXTLOG) ;
rc = pylog.Collect(py, 'EXECUTION') ;
declare object pyvars(OUTEXTVARSTATUS) ;
rc = pyvars.collect(py) ;

This generates two output tables that contain valuable information for debugging. After execution, we check whether the code ran successfully: in the OUTEXTLOG table, all exit codes (_EXITCODE_) should be equal to 0. If there are execution errors, the logs are available in the _LOGTEXT_ column.

The UPDATED variable in the OUTEXTVARSTATUS table lets you verify that the variables were modified by the external-language program.
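
The exit-code check can be automated. The sketch below assumes that the OUTEXTLOG output has already been fetched into Python as a list of row dictionaries. The _EXITCODE_ and _LOGTEXT_ columns come from the description above, while the series_id key, the sample rows, and the function name failed_series are hypothetical.

```python
def failed_series(log_rows, key="series_id"):
    """Return {series key: log text} for rows whose exit code is not 0."""
    return {
        row[key]: row["_LOGTEXT_"]
        for row in log_rows
        if row["_EXITCODE_"] != 0
    }


# Hypothetical rows standing in for an exported OUTEXTLOG table.
log_rows = [
    {"series_id": "A", "_EXITCODE_": 0, "_LOGTEXT_": ""},
    {"series_id": "B", "_EXITCODE_": 1,
     "_LOGTEXT_": "ModuleNotFoundError: No module named 'prophet'"},
]

failures = failed_series(log_rows)
for series, text in failures.items():
    print(f"{series}: {text}")
```

An empty result means every series ran with exit code 0.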

III. Integration in Model Studio

Data scientists can also integrate OS in Model Studio, which is the UI environment for SAS Visual Forecasting. In that way, they take advantage of the automatic exploration capabilities of the UI and can automatically compare and select the best algorithm from SAS and OS for their forecasting needs at a series-by-series level. What is needed to achieve this is described below:

1. Modify an existing VF forecast node code

The first step is to create a forecasting pipeline and add a “Naïve Model” or an “Auto-Forecasting” node. We can then modify the code of this node via the “Open” Code Editor button. There are two options here. You can either develop a ‘pure’ open-source node where only the open-source code is run and then compare the overall results with SAS forecasting methodologies, or you can incorporate the OS code to compete directly with SAS algorithms at a series-by-series level.

For the second option, we will not discuss the code changes in detail, to keep this article digestible. This blog describes the process to follow to incorporate deep learning models into VF pipelines. The process for OS code is the same: instead of the deep learning part of the code, you incorporate your OS code and then pass the results to subsequent steps using the EXMSPEC object, exactly as described in that blog.
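
As an illustration of what a series-by-series comparison means, the sketch below picks, for each series, the model with the lowest holdout error. It is not how Model Studio implements model selection. The model names, the error values, and the function pick_champions are hypothetical, and mean absolute percentage error (MAPE) is just one possible criterion.

```python
def pick_champions(errors):
    """For each series, return the model name with the smallest error."""
    return {
        series: min(by_model, key=by_model.get)
        for series, by_model in errors.items()
    }


# Hypothetical holdout MAPE per series and model.
errors = {
    "store_1": {"SAS_ESM": 0.12, "OS_PROPHET": 0.09},
    "store_2": {"SAS_ESM": 0.07, "OS_PROPHET": 0.11},
}

champions = pick_champions(errors)
print(champions)
```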

2. Customize the node [Optional]

We can also develop our own custom OS nodes and make them available around the business to be applied in different use-cases in a consistent manner. For more information on how to do that please have a look at this resource.

An example of a packaged node is attached to this article.

Final Thoughts

In this step-by-step guide we saw how we can incorporate open-source time series algorithms in our forecasting processes using SAS VF. The benefits include:

Distribute open-source code to scale up to millions of series
Compare the performance with other algorithms and pick the most accurate
Use the interactive exploration capabilities of SAS VF to further analyse the results of OS and SAS algorithms

The process we described may seem long the first time, but once it is set up, it is straightforward to apply again, giving you a framework for incorporating new algorithms and enhancing your forecasting process in a robust and consistent way. Happy forecasting!

References

- Common Pitfalls in Using the EXTLANG Package

- System-Defined Macros: for a better understanding of the macros used in the code

- How to incorporate Recurrent Neural Networks in your SAS Visual Forecasting pipelines: the process of modifying default nodes

- Writing a Gradient Boosting Model Node for SAS® Visual Forecasting: explains how to customize a node's UI