Azure Databricks is a data and AI service jointly developed by Databricks and Microsoft for data analytics users. It is optimized for Azure data lakes and provides an interactive workspace for setting up the environment and collaborating among data scientists. Azure Databricks uses the Spark engine to process data.
SAS Viya users can access the Azure Databricks workspace and data tables using the JDBC data connector. At present, there is no dedicated SAS Data Connector to Databricks; a new SAS Data Connector engine for Databricks is scheduled for the Aug-Sept release. In the meantime, SAS Viya users can use the Data Connector to JDBC to access Azure Databricks data tables. Note that users can only read data from Azure Databricks using the JDBC data connector.
This article is about accessing an Azure Databricks data table from a SAS Viya 4 (CAS) environment.
The following diagram describes SAS Viya (CAS) environment access to an Azure Databricks database table.
Before you can access a data table from Azure Databricks, you need to have or create an Azure Databricks workspace. The Databricks workspace is the entry point for external applications to access Databricks objects and data. A Databricks workspace user credential is required to connect to the Spark cluster from an external application.
The following screenshot shows the Azure Databricks workspace and the user credentials used to access the Spark cluster.
With the Databricks workspace in place, you can create a Spark cluster to process data ingested from Azure storage.
The following screenshot shows the creation of a Spark cluster under the Azure Databricks workspace.
The Azure Databricks Spark cluster connection information is available on the cluster configuration tab.
Third-party applications can access Databricks tables using the JDBC driver, which is available at the following link.
Databricks JDBC Driver download
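The driver download is a ZIP archive; the JAR file inside must be extracted to a location the CAS server can read, which is the same path used later in the classpath= option of the CASLIB statement. The following is a minimal sketch; the archive file name is a hypothetical placeholder, so use the actual file name from the download page.
Python Code:
import zipfile

#Hypothetical archive name; replace with the actual downloaded file name.
#Extract the Simba Spark JDBC JAR files to the directory CAS will use as classpath=.
with zipfile.ZipFile("SimbaSparkJDBCDriver.zip") as z:
    z.extractall("/mnt/myazurevol/config/access-clients/JDBC")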
With the Spark cluster in place in the Azure Databricks workspace, you can ingest data into the Spark cluster from ADLS2 storage or Databricks File System (DBFS) files. The Databricks workspace has a notebook editor to run Python code that interacts with the Spark cluster. The following Python statements ingest data from a JSON file into the Spark cluster and display the data from the Spark data frame.
Python Code:
#Read a sample data file (iot_devices.json) from the Databricks DBFS location.
df = spark.read.json("dbfs:/databricks-datasets/iot/iot_devices.json")
#Create a temporary view on the Spark data frame "df".
df.createOrReplaceTempView('source')
#Display the top 10 rows from the source file.
print('Displaying top 10 rows: ')
display(spark.sql('SELECT * FROM source LIMIT 10'))
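The same read can also target ADLS2 storage directly instead of DBFS. The following is a minimal sketch; the container and storage account names are hypothetical placeholders, and the cluster must already be configured with credentials for the storage account.
Python Code:
#Hypothetical ADLS2 path; replace the container and storage account names with your own.
adls_path = "abfss://mycontainer@mystorageacct.dfs.core.windows.net/iot/iot_devices.json"
df_adls = spark.read.json(adls_path)
#Display the top 10 rows read from ADLS2.
display(df_adls.limit(10))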
Before the data can be accessed from external applications, you need to write the Spark data frame to a Databricks table. The following Python statements and screenshot show the data written to a Databricks table and made available to external applications.
Python Code:
#Write a permanent table to share with other users and applications.
permanent_table_name = "iot_device"
df.write.format("parquet").saveAsTable(permanent_table_name)
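As a quick sanity check, you can query the new table by name through Spark SQL to confirm that it is registered in the metastore and visible to other sessions.
Python Code:
#Verify the permanent table is registered and count its rows.
display(spark.sql("SELECT COUNT(*) AS row_count FROM iot_device"))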
With the Azure Databricks workspace, Spark cluster, database table, and JDBC driver in place, you can use the following code to serially load CAS from the Azure Databricks table. The Azure Databricks workspace token (key) is used as the password to authenticate to the environment.
Code:
/* Note: macro variable values in quotes generate errors, so keep them without quotes. */
%let MYDBRICKS=adb-7060859955656306.6.azuredatabricks.net;
%let MYPWD=dapiaa66843abadb51775a9dd7858d6980aa-2;
%let MYHTTPPATH=sql/protocolv1/o/7060859955656306/0210-155120-shop163;
%let MYUID=token;
CAS mySession SESSOPTS=( CASLIB=casuser TIMEOUT=99 LOCALE="en_US" metrics=true);
caslib jdcaslib dataSource=(srctype='jdbc',
   url="jdbc:spark://&MYDBRICKS:443/default;transportMode=http;ssl=1;httpPath=&MYHTTPPATH;AuthMech=3;UID=&MYUID;PWD=&MYPWD",
   class="com.simba.spark.jdbc.Driver",
   classpath="/mnt/myazurevol/config/access-clients/JDBC",
   schema="default");
proc casutil outcaslib="jdcaslib" incaslib="jdcaslib" ;
load casdata="iot_device" casout="iot_device" replace;
list tables;
quit;
CAS mySession TERMINATE;
Log extract:
Result Output:
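If you prefer to drive CAS from Python rather than SAS code, the same serial load can be run with the SAS SWAT package. The following is a minimal sketch, assuming the jdcaslib caslib has already been defined as shown above; the CAS host name, port, and credentials are hypothetical placeholders.
Python Code:
import swat

#Connect to the CAS server (host, port, and credentials are placeholders).
conn = swat.CAS('cas-server.example.com', 5570, 'viya_user', 'viya_password')

#Load the Databricks table through the existing JDBC caslib.
conn.loadtable('iot_device', caslib='jdcaslib',
               casout={'name': 'iot_device', 'caslib': 'jdcaslib', 'replace': True})

#Fetch a few rows to verify the load, then end the CAS session.
print(conn.CASTable('iot_device', caslib='jdcaslib').head())
conn.terminate()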
Important Link: What is Azure Databricks?
Related Article: Accessing Azure Databricks from SAS 9.4
Find more articles from SAS Global Enablement and Learning here.