
Publish and Run a SAS Scoring Model in Azure Synapse Analytics


In my last post, I wrote about publishing and running a SAS scoring model in Azure Databricks. Let’s focus now on scoring in Azure Synapse Analytics. The overall process is quite similar.  

 

Setup

 

As in Azure Databricks, we need to install the SAS Embedded Process in Azure Synapse Analytics. The deployment steps are documented here.

 

As a reminder, the SAS Embedded Process is a lightweight SAS engine that is deployed on a cluster (here, a Spark pool) and takes advantage of the cluster infrastructure: it can run SAS code in parallel on the cluster's distributed data.
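
The code examples below assume a CAS session named mysession already exists (it is referenced by sessref=). If you need to create one, a minimal sketch looks like this; the session name and the timeout option are illustrative:

/* Start a CAS session named mysession for the publish and run steps below */
cas mysession sessopts=(timeout=1800) ;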

 

Publish the Model

 

To score data in Azure Synapse, we first need to publish the model to ADLS (Azure Data Lake Storage), which Azure Synapse accesses behind the scenes.

 

Indeed, when you create an Azure Synapse workspace, you are asked to link an ADLS Gen2 filesystem (blob container) to the workspace. This ADLS container is where the SAS models will be published.

 

nir_post_78_01_adls_linked.png


 

The overall publishing and running process is depicted below:

 

nir_post_78_02_process.png

 

We need an ADLS caslib to publish a SAS model to ADLS and a Spark caslib to run it in Azure Synapse.

 

In addition, because Azure Synapse combines both data lake (Spark) and data warehouse (SQL Server) capabilities, we might need additional caslibs to manipulate or view data, as sketched after this list:

 

  • An SQL Server caslib that will access data lake (Spark) data
    • Indeed, the Spark caslib defined above for running models in Synapse does not yet support accessing Spark data, so an SQL Server caslib is needed to read it.
  • An SQL Server caslib that will access data warehouse (SQL Server) data
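
For illustration, these two additional caslibs might look like the following. This is only a sketch: the srctype="sqlserver" options shown, the endpoint names, and the credentials are placeholders to adapt to your environment.

/* SQL Server caslib to view data lake (Spark) data through the
   Synapse serverless SQL endpoint (placeholder values) */
caslib sqlspark datasource=
   (
      srctype="sqlserver",
      server="**synapse-workspace**-ondemand.sql.azuresynapse.net",
      database="default",
      schema="dbo",
      username="**my-user**",
      password="**my-password**"
   ) libref=sqlspark ;

/* SQL Server caslib to access data warehouse (dedicated SQL pool) data */
caslib sqldw datasource=
   (
      srctype="sqlserver",
      server="**synapse-workspace**.sql.azuresynapse.net",
      database="sqlpool",
      schema="dbo",
      username="**my-user**",
      password="**my-password**"
   ) libref=sqldw ;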

 

To publish a SAS model for consumption in Azure Synapse, I only need an ADLS caslib. This is exactly the same step as for Databricks (check out the previous blog for more details). The ADLS storage account that we publish to must be the one linked to your Azure Synapse workspace. The code looks like the following:

 

/* ADLS caslib used to publish models to the storage linked to the workspace */
caslib adls datasource=
   (
      srctype="adls",
      accountname="**my-storage-account**",
      filesystem="**my-container**",
      applicationid="**my-application-id**",
      resource="https://storage.azure.com/",
      dnssuffix="dfs.core.windows.net"
   ) subdirs libref=adls ;

/* Publish the ASTORE model to the /models directory in ADLS */
proc scoreaccel sessref=mysession ;
   publishmodel
      target=filesystem
      caslib="adls"
      password="**my-application-secret**"
      modelname="01_gradboost_astore"
      storetables="spark.gradboost_store"
      modeldir="/models"
      replacemodel=yes ;
quit ;

 


Run the Model

 

To run a SAS model in Azure Synapse, we need a Spark caslib. This caslib simply acts as a placeholder for the connection details of the Spark pool in Synapse.

 

Then we start a continuous session of the SAS Embedded Process on the Spark pool, specifying how many resources we want to allocate to the Spark session.

 

/* Used for running models in Synapse */
caslib spark datasource=
   (
      srctype="spark",
      platform=synapse,
      username="**my-application-id**",
      password="**my-application-secret**",
      server="**synapse-workspace**.dev.azuresynapse.net",
      schema="sqlpool",
      hadoopJarPath="/azuredm/access-clients/spark/jars/sas",
      resturl="**livy-rest-url**",
      bulkload=no
   ) libref=spark ;

/* Start the SAS Embedded Process */
proc cas ;
   sparkEmbeddedProcess.startSparkEP caslib="spark" trace=false
                        executorInstances=4
                        executorCores=4
                        executorMemory=56
                        driverMemory=32 ;
quit ;

 

Now we can run the model:

 

/* Run the model */
proc scoreaccel sessref=mysession ;
   runmodel 
      target=synapse
      caslib="spark"
      modelname="01_gradboost_astore" 
      modeldir="/models" 
      intable="hmeq_spark"
      schema="default"
      outtable="hmeq_spark_astore"
      outschema="default"
      forceoverwrite=yes ;
quit ;
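
Once the model has run, you may want to inspect the scored output from SAS. Because the Spark caslib does not give access to Spark data, one option is to load the output table into CAS through an SQL Server caslib, such as the illustrative sqlspark caslib sketched earlier:

/* Load the scored Spark table into CAS for inspection
   (sqlspark is the illustrative serverless SQL caslib from above) */
proc casutil ;
   load casdata="hmeq_spark_astore" incaslib="sqlspark"
        casout="hmeq_scored_check" outcaslib="casuser" replace ;
   list tables incaslib="casuser" ;
quit ;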

 

Scoring data in Synapse is very flexible in terms of input and output data objects. As depicted in the following figure, you can take several routes to score data in Synapse from SAS:

 

nir_post_78_03_input_output.png

 

The following options drive the type of source/target data structure accessed:

  • intable/outtable for Spark tables (single-part names)
  • indataset/outdataset for Spark datasets
  • intable/outtable with a two-part name (for example, "dbo.hmeq_sql") for SQL tables; the first part is the SQL database schema name (see the sketch after this list)
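
For example, to score an SQL pool table into another SQL pool table, you would use two-part names and drop the Spark schema options. This is only a sketch; the dbo schema and the table names are placeholders:

/* Score an SQL table into an SQL table using two-part names (placeholders) */
proc scoreaccel sessref=mysession ;
   runmodel
      target=synapse
      caslib="spark"
      modelname="01_gradboost_astore"
      modeldir="/models"
      intable="dbo.hmeq_sql"
      outtable="dbo.hmeq_sql_astore"
      forceoverwrite=yes ;
quit ;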

 

You can even interact with the Spark session before or after the model execution, for pre- or post-processing:

 

/* Load a filtered Spark table into a Spark dataset */
proc cas ;
   sparkEmbeddedProcess.executeProgram caslib="spark"
      program="var dsin = spark.table(""default.hmeq_spark"").where($""REASON"" === ""DebtCon"");" ;
quit ;
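
Post-processing works the same way. For instance, assuming the model was run with outdataset= into a hypothetical Spark dataset named dsout, you could persist it as a Spark table:

/* Save a scored Spark dataset (hypothetical name: dsout) as a Spark table */
proc cas ;
   sparkEmbeddedProcess.executeProgram caslib="spark"
      program="dsout.write.mode(""overwrite"").saveAsTable(""default.hmeq_scored"");" ;
quit ;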

 

The program option accepts user-written Scala code. Once you have finished running all your models, you can stop the Spark continuous session:

 

proc cas ;
   sparkEmbeddedProcess.stopSparkEP caslib="spark" ;
quit ;

 

Thanks for reading.

 

Find more articles from SAS Global Enablement and Learning here.

