Accessing Azure Databricks from SAS 9.4

1 Like

Azure Databricks is the jointly developed data and AI service from Databricks and Microsoft for data analytics users. Azure Databricks is optimized for Azure data lakes and provides an interactive workspace to set up the environment and collaborate amongst the data scientist. Azure Databricks uses the SPARK engine to process the data. What is Azure Databricks?

SAS 9.4 users can access the Azure Databricks workspace and data tables using JDBC Driver. At present, there is no dedicated SAS/ACCESS Interface to the Databricks database. A new SAS/ACCESS engine for Databricks is schedule for the Aug-Sept release. In the meantime, users can use SAS/ACCESS to JDBC interface to access the Azure Databricks data table. Users can only read data from Azure Databricks using the JDBC Access engine. This post is about accessing the Azure Databricks data table from SAS 9.4 environment.

Pre-requisites

SAS/ACCESS Interface to JDBC at SAS Compute server
JDBC Jar file available at SAS Compute server
Access token from Azure Databricks Workspace
SPARK cluster JDBC URL information
SPARK cluster configured to access storage account ADLS2 storage
SPARK Data-frame saved as a table to share with SAS applications

Data access path

The following picture describes the SAS 9.4 environment access to the Azure Databricks database table.

Select any image to see a larger version.
Mobile users: To view the images, select the "Full" version at the bottom of the page.

Azure Databricks workspace setup

Before you can access the data table from Azure Databricks, you need to have or create the Azure Databricks workspace. The Databricks workspace is the entry point for external applications to access the objects and data from Databricks. The Databricks workspace user credential is required to connect to the SPARK cluster from an external application.

The following screen describes the Azure Databricks Workspace and user credentials to access the SPARK cluster.

Azure Databricks SPARK cluster at workspace

With Databricks workspace in place, you can create a SPARK cluster to process data ingested from Azure storage. The following screen describes the creation of the SPARK cluster under Azure Databricks Workspace

Azure Databricks SPARK cluster connection information is available at the cluster configuration tab.

Azure Databricks JDBC driver

The third-party application can access to Databricks table using the JDBC driver. The JDBC Driver is available at the following link.

Databricks JDBC Driver download

Ingest data into Azure Databricks SPARK cluster

With the SPARK cluster in place at Azure Databricks workspace, you can ingest data into the SPARK cluster from ADLS2 storage or Databricks File system files. The Databricks workspace has a Notebook editor to run Python code to interact with the SPARK cluster. The following Python statement ingests data from a JSON file to the SAPRK cluster and displays the data from the SPARK data frame.

Python Code:

#Read a sample data file (iot_devices.json) from Databricks DBFS location.
df = spark.read.json("dbfs:/databricks-datasets/iot/iot_devices.json")

#Create temporary view on Spark Data Frame "DF"
df.createOrReplaceTempView('source')

#Display top 10 ros from the source file.
print('Displaying top 10 rows: ')
display(spark.sql('SELECT * FROM source LIMIT 10'))

Before the data can be accessed from external applications, you need to write the SPARK data frame into a Databricks table. The following Python statement and screenshot describe the data written to Databricks table and available for external application.

Python Code:

#Write a parmanent table to share with other users and application.
permanent_table_name = "iot_device"
df.write.format("parquet").saveAsTable(permanent_table_name)

Access to the Azure Databricks table from SAS 9.4

With Azure Databricks Workspace, SPARK Cluster, database table, and JDBC driver in place, you can use the following code to read data from the Azure Databricks table. The Azure Databricks Workspace token (key) is used as the password to authenticate to the environment.

Code:

options sastrace=',,,d' sastraceloc=saslog ;

/* Note : variable value in quotes generate errors, So keep it without quotes. */
%let MYDBRICKS=adb-7060859955656306.6.azuredatabricks.net;
%let MYPWD=dapiaa66843abadb51775a9dd7858d6980aa-2;
%let MYHTTPPATH=sql/protocolv1/o/7060859955656306/0210-155120-shop163;
%let MYUID=token;


libname dbricks jdbc driverclass="com.simba.spark.jdbc.Driver"           
url="jdbc:spark://&MYDBRICKS:443/default;transportMode=http;ssl=1;httpPath=&MYHTTPPATH;AuthMech=3;UID=&MYUID;PWD=&MYPWD"
  classpath="/mnt/myazurevol/config/access-clients/JDBC/"
  schema="default" ;

Proc SQL outobs=20;;
select * from dbricks.iot_device ;
run;
quit;

Log extract :

.....
..............
78
79   options sastrace=',,,d' sastraceloc=saslog ;
80
81   /* Note : variable value in quotes generate errors, So keep it without quotes. */
82   %let MYDBRICKS=adb-7060859955656306.6.azuredatabricks.net;
83   %let MYPWD=dapiaa66843abadb51775a9dd7858d6980aa-2;
84   %let MYHTTPPATH=sql/protocolv1/o/7060859955656306/0210-155120-shop163;
85
86   %let MYUID=token;
87
88
89   libname dbricks jdbc driverclass="com.simba.spark.jdbc.Driver"
90      url="jdbc:spark://&MYDBRICKS:443/default;transportMode=http;ssl=1;httpPath=&MYHTTPPATH;AuthMech=3;UID=&MYUID;PWD=&MYPWD"
91          classpath="/mnt/myazurevol/config/access-clients/JDBC/"
92              schema="default" ;
NOTE: Libref DBRICKS was successfully assigned as follows:
      Engine:        JDBC
      Physical Name:
      jdbc:spark://adb-7060859955656306.6.azuredatabricks.net:443/default;transportMode=http;ssl=1;httpPath=sql/protocolv1/o/7060859
      955656306/0210-155120-shop163;AuthMech=3;UID=token;PWD=dapiaa66843abadb51775a9dd7858d6980aa-2
93
94
95
96   Proc SQL outobs=20;
96 !                    ;
97   select * from dbricks.iot_device ;
  24 1612975267 no_name 0 SQL
JDBC_1: Prepared: on connection 0 25 1612975267 no_name 0 SQL
SELECT * FROM default.IOT_DEVICE WHERE 0=1 26 1612975267 no_name 0 SQL
  27 1612975267 no_name 0 SQL
  28 1612975267 no_name 0 SQL
JDBC_2: Executed: on connection 0 29 1612975267 no_name 0 SQL
SELECT * FROM default.IOT_DEVICE 30 1612975267 no_name 0 SQL
  31 1612975267 no_name 0 SQL
WARNING: Statement terminated early due to OUTOBS=20 option.
98   run;
NOTE: PROC SQL statements are executed immediately; The RUN statement has no effect.
99   quit;
..........
.................

Result Output:

Important Link: What is Azure Databricks ?

Find more articles from SAS Global Enablement and Learning here.

michelvankessel · ‎07-19-2022

What if a Proxy server is used? How would the connection string look like?

M9LHxLPLc55P · ‎02-27-2023

Hello, everyone!

I tried to use the example, however I get this error:

ERROR: Error trying to establish connection: java.lang.ClassNotFoundException: com.simba.spark.jdbc.Driver
ERROR: Error in the LIBNAME statement.

Can you guide me to solve it?

Thank you!

SASKiwi · ‎02-27-2023

@M9LHxLPLc55P - Any time you post an error you need to provide the statement that caused it otherwise we are just guessing what you are trying. It looks like you don't have the required JDBC driver installed or correctly configure. I suggest you talk to your SAS administrator regarding your SAS JDBC setup.

M9LHxLPLc55P · ‎02-27-2023

OK.

Code:

%let databricks=adb-4967661705947417.17.azuredatabricks.net;
%let path=sql/protocolv1/o/4967661705947417/0829-194118-waged92;

%let uid=token;
%let password=dapi3c0ae34f045b8197ac9a132a38822e6e-2;

%let port=443;
%let database=default;

libname dbricks jdbc driverclass="com.simba.spark.jdbc.Driver"
url="jdbc:spark://&databricks:&port/&database;transportMode=http;ssl=1;httpPath=&path;AuthMech=3;UID=&uid;PWD=&password"
classpath="/SASDATA/THIRDPARTY/CLIENT/Azure/Databricks"
schema="default" ;

Log:

1
2 %let databricks=abc-123456789.89.azuredatabricks.net;
3 %let path=sql/protocolv1/o/123456789/12345-67890-w4g3d92;
4
5 %let uid=token;
6 %let password=password;
7
8 %let port=443;
9 %let database=default;
10
11
12 libname dbricks jdbc driverclass="com.simba.spark.jdbc.Driver"
13 url="jdbc:spark://&databricks:&port/&database;transportMode=http;ssl=1;httpPath=&path;AuthMech=3;
13 ! UID=&uid;PWD=&password"
14 classpath="/SASDATA/THIRDPARTY/CLIENT/Azure/Databricks"
15 schema="default" ;
ERROR: Error trying to establish connection: java.lang.ClassNotFoundException: com.simba.spark.jdbc.Driver
ERROR: Error in the LIBNAME statement.
16

NOTE: SAS Institute Inc., SAS Campus Drive, Cary, NC USA 27513-2414
NOTE: The SAS System used:
real time 1.81 seconds
user cpu time 0.04 seconds
system cpu time 0.04 seconds
memory 6293.03k
OS Memory 6820.00k
Timestamp 02/27/2023 01:26:55 PM
Step Count 0 Switch Count 8
Page Faults 0
Page Reclaims 5973
Page Swaps 0
Voluntary Context Switches 1374
Involuntary Context Switches 2
Block Input Operations 3368
Block Output Operations 72

UttamKumar · ‎02-28-2023

@M9LHxLPLc55P - What is the Databricks JDBC driver Jar file name you are using?

If you have latest JDBC jar file ( DatabricksJDBC42.jar ) file under class-path location use the following code to create LIBNAME statement. Notice the driverclass= and url= variable, it's using databricks (not spark) .

libname dbricks jdbc driverclass="com.databricks.client.jdbc.Driver"
url="jdbc:databricks://&databricks:&port/&database;transportMode=http;ssl=1;httpPath=&path;AuthMech=3;UID=&uid;PWD=&password"
classpath="/SASDATA/THIRDPARTY/CLIENT/Azure/Databricks"
schema="default" ;

The original code listed in this article works with the old version of JDBC driver (SparkJDBC42.jar) jar file. The "com.simba.spark.jdbc.Driver " is/was available in this jar file but not in the latest jar file.

Regards,

-Uttam

M9LHxLPLc55P · ‎03-22-2023

Hello, everyone!

I have a query about the JDBC configuration for Databricks in Azure.

To reach the cloud, a range of IP addresses is being used:

11.12.13.0/25

And with this the /etc/hosts is configured:

11.12.13.0   adb-000111222333444555.666.azuredatabricks.net
11.12.13.1   adb-000111222333444555.666.azuredatabricks.net
11.12.13.2   adb-000111222333444555.666.azuredatabricks.net
11.12.13.3   adb-000111222333444555.666.azuredatabricks.net
...

On the other hand, this is the SAS code defined to establish the connection:

options sastrace    = ',,,d'
        sastraceloc = saslog;

%let drvrclss = com.databricks.client.jdbc.Driver;
%let dtbrcks = adb-000111222333444555.666.azuredatabricks.net;
%let prt      = 443;
%let pthw1    = sql/protocolv1/o/111222333555777/0123-98765-waged92;
%let pthw2    = sql%2Fprotocolv1%2Fo%2F111222333555777%2F0123-98765-waged92;
%let uid      = token;
%let pwd      = dapi0a1b2c3d4e5f6g7h8i9j-0;
%let clsspth = /opt/sas/client/azure_databricks/;

libname dbricks jdbc driverclass = "&drvrclss"
                     url         = "jdbc:spark://&dtbrcks:&prt/default;transportMode=http;ssl=1;httpPath=&pthb1;AuthMech=3;UID=&uid;PWD=&pwd"
                     classpath   = "&clsspth"
                     schema      = "default" ;

However, I get this error:

ERROR: Error trying to establish connection: Could not open connection to
       jdbc:spark://adb-000111222333444555.666.azuredatabricks.net:443/default;transportMode=http;ssl=1;httpPath=sql/protocolv1/o/111222333555777/0123-98765-waged92;AuthMech=3;UID=token;PWD=dapi0a1b2c3d4e5f6g7h8i9j-0. The JDBC connection string
       (URI) could not be handled by the driver.
ERROR: Error in the LIBNAME statement.

Is there something wrong with the libname configuration?

PS: As classpath I consider the path where is in jar file SparkJDBC42.jar