Decision makers in cloud-savvy organisations continuously seek ways to reduce the unpredictability of cloud consumption costs while maximising their return on investment. Those investing in solutions based on the Lakehouse architecture, such as Databricks, anticipate cost savings and productivity gains from this unified storage solution for their cloud-migrated data. After all, Delta Lake storage, which underpins the Lakehouse architecture, combines the strengths of both Data Lakes and Data Warehouses, leveraging the inherent flexibility of the cloud.
However, what if certain business operations require processing the data in Lakehouse, using an application that is not co-located with it? Does this mean the data must be transferred to the remote application? This scenario would introduce concerns about network latency, performance, and egress cost. Alternatively, should organisations consider replacing these remote applications with new, co-located ones? This approach also seems impractical, as it would be very disruptive and counterproductive to the anticipated business objectives.
Some recently published blog posts by my colleagues on SAS Communities (see reference links below) describe how SAS can harness the analytical power of data stored in Databricks. For example, Cecily Hoffritz discussed in her blog how the user-friendly interface of SAS Viya enables more users within an organisation to use the Databricks Lakehouse and participate in data analysis, decision-making, and innovation, regardless of their technical background. In another example, Patric Hamilton explained in his blog how SAS Data Quality can be applied to the Databricks Lakehouse for Entity Resolution, enhancing the effectiveness and accuracy of data-driven decisions. In this blog, I will continue the theme and describe how organisations can extract even more value from their investment in Databricks through the "In-Database processing" capabilities of SAS.
As the name suggests, "SAS In-Database" processing allows processing to happen inside the database, using its resources much more efficiently and effectively. Examples of SAS In-Database processing with Databricks include Implicit and Explicit SQL Pass-Through, In-Database Procedures and In-Database Model Scoring facilities. In-Database Scoring for Databricks leverages the Massively Parallel Processing (MPP) architecture of the database and allows processing to happen locally at the database level. Only the final result, if required at all, travels across the network. The scoring function remains in the SAS language and is executed by a lightweight SAS engine (the SAS Embedded Process for Spark) deployed within the Databricks cluster. Deployment of the SAS Embedded Process is a one-time task for the administrator.
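To illustrate the pass-through style of In-Database processing mentioned above, here is a minimal sketch of explicit SQL pass-through using the SAS/ACCESS Interface to Spark. The server, token, path and table names are placeholders, and the exact connection options may vary with your SAS Viya release, so treat this as an outline rather than copy-paste code:

```sas
/* Hypothetical connection values; replace with your own workspace details */
proc sql;
   connect to spark (
      platform="databricks"
      server="myserver.azuredatabricks.net"
      user="token"
      password="authentication-token"
      httpPath="my-http-path"
      schema="my-schema"
   );

   /* The inner query runs inside Databricks; only the aggregated
      result set is returned to SAS */
   create table work.summary as
   select * from connection to spark (
      select product, sum(revenue) as total_revenue
      from sales
      group by product
   );

   disconnect from spark;
quit;
```

With implicit pass-through, the same effect is achieved by pointing a LIBNAME at the schema and letting SAS generate the Spark SQL behind the scenes; explicit pass-through, as sketched here, gives full control over the query that executes in-database.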
Once the SAS Embedded Process has been deployed, users can develop models using SAS Studio (by writing SAS code), SAS Model Studio or SAS Intelligent Decisioning (by creating visual pipelines). SAS In-Database scoring supports publishing and running models on the following Databricks platforms:
- Databricks on AWS
- Databricks on Azure
The following steps and example code publish a model to a Spark table and run it in Azure Databricks:
/* Step 1: Start a CAS session and define a caslib pointing at Databricks */
cas mysess;
caslib myspark datasource=(
   srcType="spark",
   platform="databricks",
   server="myserver.azuredatabricks.net",
   userName="token",
   password="authentication-token",
   clusterId="cluster-id",
   jobManagementUrl="https://nnnnn.cloud.databricks.net/",
   httpPath="my-http-path",
   schema="my-schema"
);

/* Step 2: Publish the model (analytic store and score code) to Databricks */
proc scoreaccel sessref=mysess;
   publishmodel
      exttype=databricks
      caslib="myspark"
      modelname="mymodel"
      storefiles="/myfiles/mystore.ast"
      programfile="/myfiles/myprogram.sas"
      modeldatabase="mydatabase";
run;
quit;

/* Step 3: Start the SAS Embedded Process on the Databricks cluster */
proc cas;
   sparkEmbeddedProcess.startSparkEP caslib="myspark";
run;
quit;

/* Step 4: Run the published model in-database; scoring executes in Spark
   and only the output table name crosses the network */
proc scoreaccel sessref=mysess;
   runmodel
      exttype=spark
      caslib="myspark"
      modelname="mymodel"
      modeldatabase="mydatabase"
      intable="mytable"
      outtable="mytable_out";
run;
quit;

/* Step 5: Stop the SAS Embedded Process when scoring is complete */
proc cas;
   sparkEmbeddedProcess.stopSparkEP caslib="myspark";
run;
quit;
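After the run completes, the scored output remains in Databricks. If you want to inspect it from SAS, one option is to load the result table into CAS, along these lines (the table and caslib names follow the example above, and the CASUSER libref is assumed to be assigned to your CAS session, as it typically is in SAS Studio):

```sas
/* Load the scored result table from Databricks into CAS for inspection */
proc casutil incaslib="myspark" outcaslib="casuser";
   load casdata="mytable_out" casout="scored_results" replace;
run;

/* Preview the first few scored rows */
proc print data=casuser.scored_results (obs=5);
run;
quit;
```

Note that this step is optional and pulls data out of the Lakehouse; for large tables you would normally keep the scored output in Databricks and query only the summaries you need.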
SAS In-Database processing enables organisations to extend the value of their data stored in a Databricks cluster by executing programs where the data resides, so that it is analysed without crossing the database boundary. This gives them the benefits of improved performance, productivity and governance, while avoiding data egress costs.