In SAS Viya, you can publish and run a SAS model in a number of target data platforms.
Each destination is unlocked when the corresponding SAS In-Database Technologies add-on has been licensed, or when the customer has licensed an offering that includes all of them, such as SAS Visual Data Science Decisioning.
In this article, we will focus on one popular destination: Databricks on Azure.
I’ll be brief on installation, since this is not the place to overwhelm you with all the details. The one prerequisite is that the SAS Embedded Process must be installed in Databricks. You can check the SAS documentation for the steps.
The SAS Embedded Process is a lightweight SAS engine that is deployed on a cluster (here a Databricks cluster) and takes advantage of the cluster infrastructure. Basically, it can run SAS code in parallel on the cluster’s distributed data.
Once we have deployed the SAS Embedded Process in Azure Databricks, we are almost ready. One additional piece of configuration is needed, and it follows directly from how the overall publishing framework works.
Unlike with Teradata or Hadoop, we are not strictly publishing a SAS model into Databricks, but rather into a cloud object storage location that the Databricks cluster then accesses. Thus, we need to set up a mount point in the Databricks File System (DBFS) that points to Azure Data Lake Storage (ADLS), so that Databricks can see the models SAS publishes.
This is depicted in the following figure:
In SAS Viya, to manage the entire process, we need an ADLS caslib for publishing the model and a Spark caslib for executing the model.
This picture covers Databricks on Azure only, but the process would be similar with Databricks on AWS. An S3 caslib would then be required to publish SAS models in S3.
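For the AWS variant, the publishing caslib might look like the sketch below. It is not taken from a tested Databricks on AWS setup: the caslib name s3, the libref, and every underscore-wrapped value are placeholders, and the exact set of options you need depends on how your bucket and credentials are configured.

```sas
/* Hypothetical S3 caslib playing the role of the ADLS caslib on AWS.       */
/* All underscore-wrapped values are placeholders, not real settings.       */
caslib s3 datasource=
   (
      srctype="s3",
      accesskeyid="_my-access-key-id_",
      secretaccesskey="_my-secret-access-key_",
      region="_my-region_",
      bucket="_my-bucket_",
      objectpath="/",
      usessl=true
   )
   subdirs libref=s3 ;
```

The rest of the flow would presumably stay the same, with the Databricks cluster reading the published models through a DBFS mount on the S3 bucket instead of on ADLS.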
To publish a SAS model for consumption in Databricks, I only need an ADLS caslib. The code looks like the following:
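    /* Assumption: no CAS session exists yet; skip this line if mysession is already running. */
    cas mysession ;

    /* Caslib pointing to the ADLS container that will hold the published models. */
    caslib adls datasource=
       (
          srctype="adls",
          accountname="_my-storage-account_",
          filesystem="_my-container_",
          applicationid="_my-application-id_",
          resource="https://storage.azure.com/",
          dnssuffix="dfs.core.windows.net"
       )
       subdirs libref=adls ;

    /* Publish the ASTORE-based model to the /models directory of the ADLS caslib. */
    proc scoreaccel sessref=mysession ;
       publishmodel
          target=filesystem
          caslib="adls"
          password="_my-application-secret_"
          modelname="01_gradboost_astore"
          storetables="spark.gradboost_store"
          modeldir="/models"
          replacemodel=yes ;
    quit ;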
We first create a caslib pointing to an ADLS location. Then we can use the SCOREACCEL procedure and the publishmodel statement to publish a SAS model to ADLS. The target=filesystem option indicates we are publishing to an object storage caslib. Here we publish an ASTORE-based model (storetables option) and we specify where in ADLS we want it to be created (modeldir option). We could also publish DATA STEP-based models instead of ASTORE-based ones.
Behind the scenes, PROC SCOREACCEL PUBLISHMODEL calls the modelPublishing.publishModelExternal CAS action. You can use the CAS action if you prefer or if you have to (Python, Lua, etc.).
In the Azure portal, we should be able to see the publishing result. Indeed, a .is file (itemstore) has been created:
By the way, it is worth mentioning that the model published in ADLS is data platform-agnostic. It can be used by both Databricks and Azure Synapse Analytics.
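If you would rather check the result from SAS than from the Azure portal, a file listing on the caslib is one option. This is a minimal sketch that assumes the CAS session and the adls caslib from the publishing step are still active. Whether the listing shows the contents of the models directory or only the directory itself may depend on your caslib settings.

```sas
/* Assumes the "adls" caslib created for publishing is still assigned. */
proc casutil incaslib="adls" ;
   /* List what SAS can see in the ADLS container behind the caslib. */
   list files ;
quit ;
```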
To run a SAS model in Databricks, we need a SPARK caslib. And as a reminder of what I said earlier, we also need Databricks to be able to see the contents of the ADLS location (DBFS mount).
We are ready to start a Spark continuous session of the SAS Embedded Process. This is preferable to the default behavior, in which the SAS Embedded Process starts and stops at every call.
    /* Caslib pointing to the Databricks cluster through the Spark JDBC driver. */
    caslib spark datasource=
       (
          srctype="spark",
          platform=databricks,
          driverClass="com.simba.spark.jdbc.Driver",
          classpath="/azuredm/access-clients/spark/SparkJDBC42.jar",
          url="_my-jdbc-uri_",
          schema="default",
          bulkload=no,
          username="token",
          password="_my-databricks-token_",
          authtoken="_my-databricks-token_",
          clusterid="_my-databricks-clusterid_",
          resturl="_my-url-to-databricks-server-hostname_",
          server="_my-databricks-server-hostname_",
          hadoopJarPath="/azuredm/access-clients/spark/jars/sas"
       )
       libref=spark ;

    /* Start the continuous session of the SAS Embedded Process on the cluster. */
    proc cas ;
       sparkEmbeddedProcess.startSparkEP caslib="spark" ;
    quit ;
We can run the model now:
    /* Score the Spark table hmeq_prod with the model published in ADLS. */
    proc scoreaccel sessref=mysession ;
       runmodel
          target=databricks
          caslib="spark"
          modelname="01_gradboost_astore"
          modeldir="dbfs:/mnt/adls/models"
          intable="hmeq_prod"
          outtable="hmeq_prod_out_astore"
          forceoverwrite=yes ;
    quit ;
We use the SCOREACCEL procedure and the runmodel statement to run a SAS model in Databricks. We specify which Spark input table we want to score and which Spark output table we want to create (intable and outtable). The modeldir option specifies the Databricks File System (DBFS) mount point to the ADLS location. This is where Databricks can find the SAS model previously published in ADLS.
Behind the scenes, PROC SCOREACCEL RUNMODEL calls the modelPublishing.runModelExternal CAS action. You can use the CAS action if you prefer or if you have to (Python, Lua, etc.).
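To take a quick look at the scored rows from SAS, one option is to load the Spark output table into CAS and print a few observations. This is a sketch under a few assumptions: the spark caslib and its libref from above are still assigned, the output table is small enough to load comfortably over JDBC (bulkload is off in the caslib definition), and reusing the same name for the in-memory table is acceptable in your environment.

```sas
/* Load the scored Spark table into CAS memory, keeping the same table name. */
proc casutil incaslib="spark" outcaslib="spark" ;
   load casdata="hmeq_prod_out_astore" casout="hmeq_prod_out_astore" replace ;
quit ;

/* The libref=spark option on the caslib statement exposes the in-memory table to SAS procedures. */
proc print data=spark.hmeq_prod_out_astore(obs=10) ;
run ;
```

For large output tables, it is probably better to inspect the results directly in Databricks and leave the data where it is.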
Once you are done with the execution of all your models, you can stop the Spark continuous session:
    /* Stop the continuous session of the SAS Embedded Process. */
    proc cas ;
       sparkEmbeddedProcess.stopSparkEP caslib="spark" ;
    quit ;
Thanks for reading.