Decision makers in cloud-savvy organisations continuously seek ways to reduce the unpredictability of cloud consumption costs while maximising their return on investment. Those investing in solutions based on the Lakehouse architecture, such as Databricks, anticipate cost savings and productivity gains from this unified storage solution for their cloud-migrated data. After all, Delta Lake storage, which underpins the Lakehouse architecture, combines the strengths of both Data Lakes and Data Warehouses, leveraging the inherent flexibility of the cloud.
However, what if certain business operations require processing the data in Lakehouse, using an application that is not co-located with it? Does this mean the data must be transferred to the remote application? This scenario would introduce concerns about network latency, performance, and egress cost. Alternatively, should organisations consider replacing these remote applications with new, co-located ones? This approach also seems impractical, as it would be very disruptive and counterproductive to the anticipated business objectives.
Some recently published blog posts by my colleagues on SAS Communities (see reference links below) describe how SAS can harness the analytical power of data stored in Databricks. For example, Cecily Hoffritz discussed in her blog how the user-friendly interface of SAS Viya enables more users within an organisation to use the Databricks Lakehouse and participate in data analysis, decision-making, and innovation, regardless of their technical background. In another example, Patric Hamilton explained in his blog how SAS Data Quality can be applied to the Databricks Lakehouse for Entity Resolution, enhancing the effectiveness and accuracy of data-driven decisions. In this blog, I will continue the theme and describe how organisations can extract even more value from their investment in Databricks through the "In-Database processing" capabilities of SAS.
As the name suggests, "SAS In-Database" processing allows processing to happen inside the database, using its resources much more efficiently and effectively. Examples of SAS In-Database processing with Databricks include Implicit and Explicit SQL Pass-Through, In-Database Procedures and In-Database Model Scoring facilities. In-Database Scoring for Databricks leverages the Massively Parallel Processing (MPP) architecture of the database and allows processing to happen locally at the database level. Only the final result, if required at all, travels across the network. The scoring function remains in the SAS language and is executed by a lightweight SAS engine (the SAS Embedded Process for Spark) deployed within the Databricks cluster. Deployment of the SAS Embedded Process is a one-time task for the administrator.
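To illustrate the pass-through style of In-Database processing mentioned above, here is a minimal sketch of explicit SQL pass-through using the SAS/ACCESS Interface to Spark. The server, token, path and table names are placeholders, and the exact connection options may vary with your SAS Viya release, so treat this as an outline rather than copy-paste code:

```sas
/* Hypothetical connection values; replace with your own workspace details */
proc sql;
   connect to spark (
      platform="databricks"
      server="myserver.azuredatabricks.net"
      user="token"
      password="authentication-token"
      httpPath="my-http-path"
      schema="my-schema"
   );

   /* The inner query runs inside Databricks; only the aggregated
      result set is returned to SAS */
   create table work.summary as
   select * from connection to spark (
      select product, sum(revenue) as total_revenue
      from sales
      group by product
   );

   disconnect from spark;
quit;
```

With implicit pass-through, the same effect is achieved by pointing a LIBNAME at the schema and letting SAS generate the Spark SQL behind the scenes; explicit pass-through, as sketched here, gives full control over the query that executes in-database.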
Once the SAS Embedded Process has been deployed, users can develop models using SAS Studio (by writing SAS code), SAS Model Studio or SAS Intelligent Decisioning (by creating visual pipelines). SAS In-Database scoring supports publishing and running models on the following Databricks platforms:
- Databricks on AWS
- Databricks on Azure
The following steps and example code publish a model to a Spark table and run it in Azure Databricks:
/* Step 1: Start a CAS session and define a caslib pointing at Databricks */
cas mysess;
caslib myspark datasource=(
   srcType="spark",
   platform="databricks",
   server="myserver.azuredatabricks.net",
   userName="token",
   password="authentication-token",
   clusterId="cluster-id",
   jobManagementUrl="https://nnnnn.cloud.databricks.net/",
   httpPath="my-http-path",
   schema="my-schema"
);

/* Step 2: Publish the model (analytic store and score code) to Databricks */
proc scoreaccel sessref=mysess;
   publishmodel
      exttype=databricks
      caslib="myspark"
      modelname="mymodel"
      storefiles="/myfiles/mystore.ast"
      programfile="/myfiles/myprogram.sas"
      modeldatabase="mydatabase";
run;
quit;

/* Step 3: Start the SAS Embedded Process on the Databricks cluster */
proc cas;
   sparkEmbeddedProcess.startSparkEP caslib="myspark";
run;
quit;

/* Step 4: Run the published model in-database; scoring executes in Spark
   and only the output table name crosses the network */
proc scoreaccel sessref=mysess;
   runmodel
      exttype=spark
      caslib="myspark"
      modelname="mymodel"
      modeldatabase="mydatabase"
      intable="mytable"
      outtable="mytable_out";
run;
quit;

/* Step 5: Stop the SAS Embedded Process when scoring is complete */
proc cas;
   sparkEmbeddedProcess.stopSparkEP caslib="myspark";
run;
quit;
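After the run completes, the scored output remains in Databricks. If you want to inspect it from SAS, one option is to load the result table into CAS, along these lines (the table and caslib names follow the example above, and the CASUSER libref is assumed to be assigned to your CAS session, as it typically is in SAS Studio):

```sas
/* Load the scored result table from Databricks into CAS for inspection */
proc casutil incaslib="myspark" outcaslib="casuser";
   load casdata="mytable_out" casout="scored_results" replace;
run;

/* Preview the first few scored rows */
proc print data=casuser.scored_results (obs=5);
run;
quit;
```

Note that this step is optional and pulls data out of the Lakehouse; for large tables you would normally keep the scored output in Databricks and query only the summaries you need.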
SAS In-Database processing enables organisations to extend the value of their data stored in a Databricks cluster by executing programs where the data resides, so that it is analysed without crossing the database boundary. This gives them the benefits of improved performance, productivity and governance, while avoiding data egress costs.