Creating a ModelOps process for a TensorFlow model on Red Hat OpenShift using SAS Viya

This post is co-authored with my colleague Ivan Nardini

Machine learning adoption in business has grown sharply in recent years. However, many organizations find it difficult to get a return on their investment from data science initiatives. The causes are the operationalization barrier, the so-called ‘last mile problem’, and a lack of efficient model governance across the whole model lifecycle. The abundance of separate development and deployment frameworks makes it even harder to scale analytics across the whole company. This is where a ModelOps approach comes into play. It streamlines work across the entire model lifecycle, from experimentation to execution.

This article provides an example of a ModelOps process created for a large European oil and gas company. It shows how SAS can help operationalize a model in the customer’s environment of choice using SAS Model Manager and SAS Workflow Manager.

The architecture of the ModelOps process

In this process, we have three different environments:

The development environment, which uses the open-source TensorFlow library to train the model.
The governance environment, represented by SAS Viya’s Model Manager and Workflow Manager.
The production environment, with Red Hat OpenShift, hosted in the AWS cloud.

Figure 1. Architecture of the ModelOps process

All of these environments can run in the cloud and can use most of the popular tools for developing and deploying machine learning models.

Process stages

The process runs according to the following stages:

The Data Scientist runs TensorFlow model experiments in the development environment and tracks them using MLflow.
The Data Scientist registers the Champion model in SAS Model Manager with the SAS pzmm and sasctl libraries. The Champion model is subjected to a validation process. If it passes validation, the model is deployed on Red Hat OpenShift (OKD). The deployment process is orchestrated by SAS Workflow Manager. As a result, TensorFlow Serving images are deployed into the OKD project previously created by the IT Cluster Admin.
IT teams deploy a client application stack to simulate scoring requests. This includes a dedicated sidecar container for pushing logs directly to a backend. Logs are stored in a PostgreSQL database.
Logs are then read by SAS Model Manager’s built-in performance monitoring service, which sends notifications about the assessment results.
When the model starts underperforming, SAS Workflow Manager triggers automated retraining based on new data and sends a message in Microsoft Teams.
The Data Scientist receives the notification and reviews the retrained model in order to choose a new champion for the project, and then the model lifecycle loop continues.

Role of SAS in the process

SAS Viya acts as the key enabler for this ModelOps process because of two modules:

SAS Model Manager is a centralized repository for registering, validating, deploying, monitoring and versioning machine learning models. Its REST API provides secure and convenient access to all information and artifacts related to the registered models.
SAS Workflow Manager is the orchestrator of the process. It provides task automation (including deployment, recurring model quality assessment, notifications and retraining) and mediates between users and the process flow. For the end user, it simplifies complex procedures by providing a simple web form to work with while running parameterized SAS Job Execution definitions under the hood.
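
The REST access to the model repository mentioned above can be illustrated with a minimal sketch that builds, but does not send, a request looking up a Model Manager project by name. The host, token and project name are placeholders. The `/modelRepository/projects` path and the `filter` syntax are assumptions based on general SAS Viya REST conventions, so check them against the API reference for your release.

```python
from urllib.parse import quote
from urllib.request import Request


def build_project_lookup(host: str, token: str, project_name: str) -> Request:
    """Build a GET request that looks up a project by name (nothing is sent)."""
    # Assumed endpoint and filter syntax; verify against your SAS Viya release.
    query = quote(f"eq(name,'{project_name}')", safe="(),'")
    url = f"{host.rstrip('/')}/modelRepository/projects?filter={query}"
    return Request(
        url,
        headers={"Authorization": f"Bearer {token}", "Accept": "application/json"},
        method="GET",
    )


# Placeholder values for illustration only.
req = build_project_lookup("https://viya.example.com", "<token>", "Churn TF")
print(req.get_method(), req.full_url)
```

Passing the request to `urllib.request.urlopen` would send it. In practice, the sasctl library wraps this kind of call, so the sketch is only meant to show what happens at the HTTP level.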

ModelOps process represented in SAS Workflow Manager

This process is reflected in SAS Workflow Manager’s workflow definition.

Figure 2. ModelOps process represented in BPMN format of the workflow definition

SAS Workflow Manager has a simple and flexible graphical representation of the operationalization process. Using the graphical interface, you can create, modify and track your ModelOps processes. It aligns with the widely used Business Process Model and Notation (BPMN) 2.0 standard. Therefore, you are not limited to directed acyclic graphs (DAGs): you can also run cyclic processes and use waiting points, timers, branches and so on.

The key components of the workflow process are:

Service tasks – blocks with the gear symbol in the diagram. In service tasks, we can run different shell scripts wrapped in a SAS Job Execution form. This is an easy way to integrate a workflow process executed in SAS Viya with scripts from external frameworks, by running X commands and using other techniques. In this scenario, we used service tasks for notifications at each stage of the model lifecycle.
User tasks – blocks with the human figure. These tasks allow interaction between the user and the process. The user interface for this interaction is located on the ‘Tasks’ tab of Model Manager’s GUI. We can create questionnaires and use a role-based approach to map tasks to the right users or groups of users (such as developers or validators).
Gateways – blocks with the X sign. These define which path the process takes, based on process-specific variables called data objects. We can modify these data objects through user or service tasks.
Subprocesses – pale blocks in the diagram. They provide an option to organize the process flow to make it more transparent and easier to manage. A subprocess acts like a process in miniature.
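
To make the interplay between tasks, data objects and gateways more concrete, here is a small conceptual sketch in plain Python. It is not SAS Workflow Manager code: the class, the function names and the data object keys (`ks`, `approved`) are hypothetical and only mimic how tasks write data objects and how an exclusive gateway reads them to choose one path.

```python
from dataclasses import dataclass, field


@dataclass
class WorkflowState:
    """Hypothetical stand-in for the data objects of one workflow instance."""
    data_objects: dict = field(default_factory=dict)


def service_task_assess(state: WorkflowState, ks: float) -> None:
    # A service task writes its result into a data object.
    state.data_objects["ks"] = ks


def user_task_approve(state: WorkflowState, approved: bool) -> None:
    # A user task records a human decision in a data object.
    state.data_objects["approved"] = approved


def exclusive_gateway(state: WorkflowState, ks_threshold: float) -> str:
    """Pick exactly one outgoing path based on the current data objects."""
    if not state.data_objects.get("approved", False):
        return "reject"
    if state.data_objects.get("ks", 0.0) < ks_threshold:
        return "retrain"
    return "deploy"


state = WorkflowState()
user_task_approve(state, True)
service_task_assess(state, 0.45)
print(exclusive_gateway(state, ks_threshold=0.30))
```

With an approved model and a KS value above the threshold, the gateway in this sketch selects the deploy path. Changing either data object changes the route, which is the behaviour the X-shaped blocks express in the diagram.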

Deployment stage of the workflow process

The deployment stage is the first part of the workflow, in which the model is moved into production on OpenShift Kubernetes.

Figure 3. User interface for working with the workflow and service tasks on the SAS Viya platform

At the starting point, the workflow process receives all necessary parameters: the name or unique identifier of the SAS Model Manager project (to refer to the right model content), the communication channels for notifications from the workflow engine (in our example, a mailing service and Microsoft Teams) and the KPI that defines the acceptable model quality threshold (in our case, measured by the Kolmogorov-Smirnov statistic).
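
As a rough illustration of these start parameters, the following sketch groups them into one validated object. The class and field names are hypothetical, and the allowed channel names simply mirror the two channels used in this example.

```python
from dataclasses import dataclass

ALLOWED_CHANNELS = {"email", "teams"}  # assumption: the two channels from this example


@dataclass(frozen=True)
class WorkflowParams:
    """Hypothetical container for the parameters passed at workflow start."""
    project: str          # name or unique identifier of the Model Manager project
    channels: tuple       # notification channels
    ks_threshold: float   # minimum acceptable Kolmogorov-Smirnov statistic

    def __post_init__(self):
        if not self.project:
            raise ValueError("project name or identifier is required")
        unknown = set(self.channels) - ALLOWED_CHANNELS
        if unknown:
            raise ValueError(f"unknown notification channels: {sorted(unknown)}")
        if not 0.0 <= self.ks_threshold <= 1.0:
            raise ValueError("the KS statistic lies between 0 and 1")


params = WorkflowParams(project="Churn TF", channels=("email", "teams"), ks_threshold=0.3)
print(params)
```

Validating the parameters once at the start keeps later service tasks simple, since they can rely on the threshold and channels being well formed.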

The next stage is the ‘approve champion model’ user task, where the validator assesses the model before deployment into production. This is an example of combining human expertise with business rules within a single process flow. In this task, the user must set the tag of the image, which is a prerequisite for the next stage.

Key service tasks for the deployment stage

Prebuild TensorFlow Serving image
Build TensorFlow Serving image
Deploy TensorFlow Serving image on OpenShift

The prebuild service task validates the Champion model and downloads the model artifact to the server using the Model Repository REST service and an environment configuration YAML file.
The build service task builds the image in the local registry based on the model artifact and a temporary TensorFlow Serving container. It consists of the following steps:

Set up the environment
Download the TensorFlow Serving Docker image
Run a temporary TensorFlow Serving container
Copy the model into the temporary TensorFlow Serving container
Commit the Champion TensorFlow Serving image
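
The build steps above can be sketched as a list of Docker commands assembled in Python. This is not the actual service task script. It follows the pattern described in the TensorFlow Serving Docker guide, and the model directory, model name, image name and container name are placeholders. The environment setup step is omitted, and the final cleanup command is an addition for tidiness.

```python
import shlex


def build_serving_image_commands(model_dir, model_name, image,
                                 base="tensorflow/serving",
                                 container="serving_base"):
    """Return the Docker commands for the build steps (nothing is executed)."""
    return [
        # Download the TensorFlow Serving Docker image.
        ["docker", "pull", base],
        # Run a temporary container from it.
        ["docker", "run", "-d", "--name", container, base],
        # Copy the exported model into the container's model directory.
        ["docker", "cp", model_dir, f"{container}:/models/{model_name}"],
        # Commit the container as the Champion image, setting the served model name.
        ["docker", "commit", "--change", f"ENV MODEL_NAME {model_name}",
         container, image],
        # Remove the temporary container (cleanup, not one of the listed steps).
        ["docker", "rm", "-f", container],
    ]


cmds = build_serving_image_commands("./export/churn", "churn", "churn-serving:1")
for cmd in cmds:
    print(shlex.join(cmd))
```

Each command could be run with `subprocess.run(cmd, check=True)`. Keeping them as argument lists avoids shell quoting problems when model names or paths contain spaces.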

The deploy service task pushes the new Champion model serving image to the remote OpenShift Container Platform. The related steps are:

Log in to the OpenShift Registry using a hostname, port, username, and token
Tag the Docker image in the required format (<registry>/<project>/<imagestream>:<tag>)
Push the image to OpenShift using a service account
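
A minimal sketch of the deploy steps, again as Docker commands assembled in Python. The registry address, project, image stream, tag and user name are placeholders. Passing the token through `--password-stdin` instead of on the command line is a choice made for this sketch, not something stated in the original process.

```python
def image_reference(registry, project, imagestream, tag):
    """Compose <registry>/<project>/<imagestream>:<tag>."""
    for part in (registry, project, imagestream, tag):
        if not part:
            raise ValueError("registry, project, imagestream and tag are all required")
    return f"{registry}/{project}/{imagestream}:{tag}"


def push_commands(local_image, registry, project, imagestream, tag, user):
    """Return the login, tag and push commands (nothing is executed)."""
    target = image_reference(registry, project, imagestream, tag)
    return [
        # The token is expected on stdin, so it never appears in the process list.
        ["docker", "login", "-u", user, "--password-stdin", registry],
        ["docker", "tag", local_image, target],
        ["docker", "push", target],
    ]


target = image_reference("registry.example.com:5000", "modelops", "churn-serving", "1")
steps = push_commands("churn-serving:1", "registry.example.com:5000",
                      "modelops", "churn-serving", "1", user="builder")
print(target)
```

The registry value combines the hostname and port from the first step, and the tag is the one set by the validator in the ‘approve champion model’ user task.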

Figure 4. We receive Microsoft Teams notifications of the model deployment status as the workflow process progresses
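
A notification like the one in Figure 4 could be produced by a service task along the following lines. This is only a sketch: it assumes a Teams incoming webhook that accepts a simple JSON body with a `text` field, the webhook URL is a placeholder, and the request is built but not sent.

```python
import json
from urllib.request import Request


def build_teams_message(webhook_url: str, stage: str, status: str) -> Request:
    """Build a POST request carrying a plain-text status message (nothing is sent)."""
    payload = {"text": f"ModelOps workflow: stage '{stage}' finished with status {status}."}
    return Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder URL; a real webhook URL is generated by the Teams channel owner.
msg = build_teams_message("https://example.invalid/teams-webhook", "deploy", "SUCCESS")
print(json.loads(msg.data)["text"])
```

Sending the message is a matter of passing the request to `urllib.request.urlopen`.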

At the end of the deployment stage, our model is up and running and can be instantly used to support decision making.

Figure 5. Scoring process of the TensorFlow model on OpenShift.
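
For readers who have not used TensorFlow Serving, a scoring call against its REST API looks roughly like this. The sketch builds the request without sending it. The host, the model name `churn` and the feature values are placeholders, and port 8501 is the TensorFlow Serving REST default, which an OpenShift route would normally hide behind its own hostname.

```python
import json
from urllib.request import Request


def build_predict_request(host: str, model_name: str, instances, port: int = 8501) -> Request:
    """Build a TensorFlow Serving predict request in row format (nothing is sent)."""
    url = f"http://{host}:{port}/v1/models/{model_name}:predict"
    body = json.dumps({"instances": instances}).encode("utf-8")
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# One observation with three made-up feature values.
predict = build_predict_request("serving.example.com", "churn", [[0.1, 0.2, 0.3]])
print(predict.full_url)
```

When such a request is sent, TensorFlow Serving replies with a JSON document whose `predictions` field holds one entry per submitted instance.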

Production stage of the workflow process

As soon as the model is deployed, the workflow moves to the ‘production stage’ subprocess. This covers performance monitoring and retraining of the champion model deployed on OpenShift.

Figure 6. Detailed view of the ‘production stage’ subprocess

During the production stage, all data processed by the model is saved. Later on, this data is used to assess model performance and to retrain the model.
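
The idea of saving every scored record can be sketched as follows. The original setup stores logs in PostgreSQL, whereas this sketch uses an in-memory SQLite database from the Python standard library as a stand-in so that it runs anywhere. The table and column names are hypothetical.

```python
import sqlite3
from datetime import datetime, timezone

# Hypothetical schema; the real log layout is not described in this article.
SCHEMA = """
CREATE TABLE IF NOT EXISTS scoring_log (
    scored_at TEXT NOT NULL,
    model_version TEXT NOT NULL,
    score REAL NOT NULL,
    actual INTEGER
)
"""


def log_prediction(conn, model_version, score, actual=None):
    """Store one scored record; the actual outcome may arrive later."""
    conn.execute(
        "INSERT INTO scoring_log VALUES (?, ?, ?, ?)",
        (datetime.now(timezone.utc).isoformat(), model_version, score, actual),
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
log_prediction(conn, "v1", 0.82, 1)
log_prediction(conn, "v1", 0.15)
rows = conn.execute(
    "SELECT model_version, score, actual FROM scoring_log ORDER BY rowid"
).fetchall()
print(rows)
```

Keeping the predicted score next to the actual outcome, once it is known, is what makes the later performance assessment and retraining possible.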

We run recurring assessments of model performance using a defined timer object. We can set a flexible schedule for testing, for example once per day, week or month, depending on the business case.

If the model performance results are unsatisfactory (in our case, measured in terms of the KS statistic), the process triggers automatic retraining of the TensorFlow estimator using the new data. It also registers the new version of the model with the pzmm and sasctl libraries.
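
For reference, the check behind this decision can be sketched in a few lines. For a binary classifier, the KS statistic is the largest gap between the cumulative score distributions of events and non-events. The sketch below is a plain Python illustration, not the SAS Model Manager implementation, and it assumes that retraining is triggered when the KS statistic falls below the threshold.

```python
from bisect import bisect_right


def ks_statistic(scores, labels):
    """Largest gap between the score CDFs of events (1) and non-events (0)."""
    pos = sorted(s for s, y in zip(scores, labels) if y == 1)
    neg = sorted(s for s, y in zip(scores, labels) if y == 0)
    if not pos or not neg:
        raise ValueError("both classes must be present")
    best = 0.0
    for t in sorted(set(scores)):
        cdf_pos = bisect_right(pos, t) / len(pos)  # share of events scored <= t
        cdf_neg = bisect_right(neg, t) / len(neg)  # share of non-events scored <= t
        best = max(best, abs(cdf_pos - cdf_neg))
    return best


def needs_retraining(ks: float, threshold: float) -> bool:
    # Assumption: a KS value below the agreed threshold means underperformance.
    return ks < threshold


separated = ks_statistic([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1])
overlapping = ks_statistic([0.1, 0.1, 0.9, 0.9], [0, 1, 0, 1])
print(separated, overlapping, needs_retraining(overlapping, 0.3))
```

A model that separates the classes perfectly reaches a KS of 1, while identical score distributions give 0, so the threshold passed at workflow start sits somewhere between these extremes.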

Then the new model is validated manually in the ‘approve champion model’ user task, and the process continues in the ModelOps loop for as long as the business problem remains relevant.

Summary

This example shows how SAS can help Data Scientists operationalize their models in a code-agnostic way. SAS Model Manager and SAS Workflow Manager enable a ModelOps process for the machine learning development and deployment frameworks of choice. As a result, we reduce the time-to-market of analytical models and keep their performance stable over time. This is possible because of the openness of the SAS Viya platform, reflected in its wide range of built-in REST APIs and advanced orchestration capabilities.

Thanks for reading. Your feedback and questions are welcome!