Publish and Run a SAS Scoring Model In Azure Databricks

In SAS Viya, one can publish and run a SAS model in the following target data platforms:

Hadoop Cloud Services
Cloudera Data Platform
Databricks
Azure Synapse Analytics
Teradata

Each destination is unlocked when the corresponding SAS In-Database Technologies add-on has been licensed, or when the customer has licensed an offering that includes all of them, such as SAS Visual Data Science Decisioning.

This article focuses on one popular destination: Databricks on Azure.

Setup

I’ll be brief, since this is not the place to overwhelm you with all the installation details. The one prerequisite is to install the SAS Embedded Process in Databricks. The SAS documentation describes the steps.

The SAS Embedded Process is a lightweight SAS engine that is deployed on a cluster (here, a Databricks cluster) and takes advantage of the cluster infrastructure. In short, it can run SAS code in parallel on the cluster’s distributed data.

Publish the Model

Once we have deployed the SAS Embedded Process in Azure Databricks, we are almost ready. One additional piece of configuration is needed, and it follows directly from how the publishing framework works.

Unlike with Teradata or Hadoop, we are not strictly publishing a SAS model into Databricks. We publish it to a cloud object storage location, which the Databricks cluster then accesses. We therefore need to set up a mount point in DBFS (the Databricks File System) between Databricks and Azure Data Lake Storage (ADLS), so that Databricks can see the models SAS publishes.

This is depicted in the following figure:

In SAS Viya, to manage the entire process, we need an ADLS caslib for publishing the model and a Spark caslib for executing the model.

The figure covers Databricks on Azure only, but the process would be similar with Databricks on AWS, where an S3 caslib would be required to publish SAS models to S3.
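The DBFS mount itself is created on the Databricks side, typically from a notebook. The following Python sketch only builds the arguments for such a mount, assuming OAuth authentication with a service principal and assuming the container root is mounted at /mnt/adls (the mount point used later in this article). The helper name adls_mount_args and the _my-tenant-id_ placeholder are illustrative and not part of the original setup; in practice the secret would come from a Databricks secret scope rather than from plain text.

```python
def adls_mount_args(account, container, application_id, secret, tenant_id,
                    mount_point="/mnt/adls"):
    """Return keyword arguments for dbutils.fs.mount (hypothetical helper).

    Assumes an ADLS Gen2 container accessed with a service principal (OAuth).
    """
    source = f"abfss://{container}@{account}.dfs.core.windows.net/"
    extra_configs = {
        "fs.azure.account.auth.type": "OAuth",
        "fs.azure.account.oauth.provider.type":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        "fs.azure.account.oauth2.client.id": application_id,
        "fs.azure.account.oauth2.client.secret": secret,
        "fs.azure.account.oauth2.client.endpoint":
            f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
    }
    return {"source": source, "mount_point": mount_point,
            "extra_configs": extra_configs}


args = adls_mount_args(
    account="_my-storage-account_",
    container="_my-container_",
    application_id="_my-application-id_",
    secret="_my-application-secret_",
    tenant_id="_my-tenant-id_",  # assumption: the tenant is not shown in this article
)

# In a Databricks notebook, where `dbutils` is predefined:
# dbutils.fs.mount(**args)
```

With the container root mounted at /mnt/adls, a model published to /models in ADLS becomes visible to the cluster under dbfs:/mnt/adls/models.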

To publish a SAS model for consumption in Databricks, I only need an ADLS caslib. The code looks like the following:

caslib adls datasource=
   (
      srctype="adls",
      accountname="_my-storage-account_",
      filesystem="_my-container_",
      applicationid="_my-application-id_",
      resource="https://storage.azure.com/", 
      dnssuffix="dfs.core.windows.net"
   ) subdirs libref=adls ;

proc scoreaccel sessref=mysession ;
   publishmodel
      target=filesystem
      caslib="adls"
      password="_my-application-secret_"
      modelname="01_gradboost_astore"
      storetables="spark.gradboost_store"
      modeldir="/models" 
      replacemodel=yes ;
quit ;

We first create a caslib pointing to an ADLS location. Then we can use the SCOREACCEL procedure and the publishmodel statement to publish a SAS model to ADLS. The target=filesystem option indicates we are publishing to an object storage caslib. Here we publish an ASTORE-based model (storetables option) and we specify where in ADLS we want it to be created (modeldir option). We could also publish DATA STEP-based models instead of ASTORE-based ones.

Behind the scenes, PROC SCOREACCEL PUBLISHMODEL calls the modelPublishing.publishModelExternal CAS action. You can call the CAS action directly if you prefer, or if you have to because you work from another client such as Python or Lua.

In the Azure portal, we should be able to see the publishing result. Indeed, a .is file (itemstore) has been created:

By the way, it is worth mentioning that the model published in ADLS is data platform-agnostic. It can be used by both Databricks and Azure Synapse Analytics.

Run the Model

To run a SAS model in Databricks, we need a Spark caslib. As noted earlier, Databricks must also be able to see the contents of the ADLS location through the DBFS mount.

We are now ready to start a continuous Spark session of the SAS Embedded Process. This is preferred over the default behavior, in which the SAS Embedded Process starts and stops at every call.

caslib spark datasource=
   (
      srctype="spark",
      platform=databricks,
      driverClass="com.simba.spark.jdbc.Driver",
      classpath="/azuredm/access-clients/spark/SparkJDBC42.jar",
      url="_my-jdbc-uri_",
      schema="default",
      bulkload=no,
      username="token",
      password="_my-databricks-token_",
      authtoken="_my-databricks-token_",
      clusterid="_my-databricks-clusterid_",
      resturl="_my-url-to-databricks-server-hostname_",
      server="_my-databricks-server-hostname_",
      hadoopJarPath="/azuredm/access-clients/spark/jars/sas"
   ) libref=spark ;

proc cas ;
   sparkEmbeddedProcess.startSparkEP caslib="spark" ;
quit ;

We can run the model now:

proc scoreaccel sessref=mysession ;
   runmodel 
      target=databricks
      caslib="spark"
      modelname="01_gradboost_astore" 
      modeldir="dbfs:/mnt/adls/models" 
      intable="hmeq_prod"
      outtable="hmeq_prod_out_astore"
      forceoverwrite=yes ;
quit ;

We use the SCOREACCEL procedure and the runmodel statement to run a SAS model in Databricks. We specify which Spark input table we want to score and which Spark output table we want to create (intable and outtable options). The modeldir option specifies the Databricks File System (DBFS) path, under the mount point, that maps to the ADLS location. This is where Databricks finds the SAS model previously published in ADLS.
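The relationship between the modeldir used at publish time ("/models") and the one used at run time ("dbfs:/mnt/adls/models") can be spelled out with a small Python sketch. It assumes the root of the ADLS container is mounted at /mnt/adls; the helper name dbfs_modeldir is illustrative only.

```python
import posixpath


def dbfs_modeldir(mount_point, adls_modeldir):
    """Map a modeldir inside the mounted container to its DBFS URI.

    Assumes the container root is what was mounted at `mount_point`.
    """
    joined = posixpath.join(mount_point, adls_modeldir.lstrip("/"))
    return "dbfs:" + posixpath.normpath(joined)


# Publish-time modeldir from this article, seen through the /mnt/adls mount.
print(dbfs_modeldir("/mnt/adls", "/models"))  # dbfs:/mnt/adls/models
```

If the mount had been created on a subdirectory of the container instead of its root, the run-time path would change accordingly.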

Behind the scenes, PROC SCOREACCEL RUNMODEL calls the modelPublishing.runModelExternal CAS action. You can call the CAS action directly if you prefer, or if you have to because you work from another client such as Python or Lua.

Once you are done with the execution of all your models, you can stop the Spark continuous session:

proc cas ;
   sparkEmbeddedProcess.stopSparkEP caslib="spark" ;
quit ;
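When driving CAS from Python, the start/stop pairing can be wrapped so that the continuous session is stopped even if a scoring step fails. This is a minimal sketch, not the author's method: it assumes a SWAT-style connection object exposing retrieve(action_name, **params), and the function name spark_ep_session is hypothetical. The action names and the caslib parameter are the ones used in the PROC CAS calls in this article.

```python
from contextlib import contextmanager


@contextmanager
def spark_ep_session(conn, caslib="spark"):
    """Start a continuous SAS Embedded Process session and always stop it."""
    conn.retrieve("sparkEmbeddedProcess.startSparkEP", caslib=caslib)
    try:
        yield conn
    finally:
        # Runs on normal exit and when the body raises an exception.
        conn.retrieve("sparkEmbeddedProcess.stopSparkEP", caslib=caslib)


# Usage sketch (needs the `swat` package and a reachable CAS server):
# import swat
# conn = swat.CAS("_my-cas-host_", 5570)        # hypothetical host and port
# conn.loadactionset("sparkEmbeddedProcess")    # assumption: load the action set first
# with spark_ep_session(conn):
#     ...  # run one or more models here
```

Any errors raised inside the with block still propagate to the caller; the wrapper only guarantees that the stop action is attempted.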

Thanks for reading.