With the SAS Viya 2022.1.3 release, users can access an Azure Databricks Spark (version 3 onwards) cluster using the CData Spark data connector. The CData data connector is a JDBC-based driver provided by CData Software and ships with the standard Viya deployment. It enables users to read data tables from an Azure Databricks database into the SAS Compute server and CAS.
The CData data connector supports serial and multi-node data transfer between CAS and the Azure Databricks database.
This post describes how SAS Viya users can access an Azure Databricks Spark (version 3) database table using the CData data connector.
For information on creating an Azure Databricks workspace, token, and Spark cluster, refer to the post CAS Accessing Azure DataBricks SPARK cluster.
An Azure Databricks SPARK cluster with connection information.
You can use the following statements on the SAS Compute server to list the CData drivers available under the CLASSPATH. The LIBNAME statement uses a dummy URL; the trace output written to the log includes the available JDBC driver JAR files.
Code:
options sastrace="d,,," sastraceloc=saslog nostsuffix ;
libname mytest jdbc url="jdbc:mytest" ;
Log extract:
....
..............
79
JDBC: attempting JDBC connection: jdbc:mytest
/opt/sas/viya/home/SASFoundation/lib/access/hadoop/access-hadoop-hivehelper-2.1.21.jar
/opt/sas/viya/home/lib64/accessclients/jdbc/cdata.jdbc.googledrive.jar
/opt/sas/viya/home/lib64/accessclients/jdbc/cdata.jdbc.databricks.jar
/opt/sas/viya/home/lib64/accessclients/jdbc/cdata.jdbc.youtubeanalytics.jar
/opt/sas/viya/home/lib64/accessclients/jdbc/cdata.jdbc.sparksql.jar
/opt/sas/viya/home/lib64/accessclients/jdbc/cdata.jdbc.apachehive.jar
/opt/sas/viya/home/lib64/accessclients/jdbc/cdata.jdbc.odata.jar
/opt/sas/viya/home/lib64/accessclients/jdbc/cdata.jdbc.onedrive.jar
/opt/sas/viya/home/lib64/accessclients/jdbc/cdata.jdbc.twitter.jar
/opt/sas/viya/home/lib64/accessclients/jdbc/cdata.jdbc.facebook.jar
/opt/sas/viya/home/lib64/accessclients/jdbc/cdata.jdbc.googleanalytics.jar
80 options sastrace="d,,," sastraceloc=saslog nostsuffix ;....
..............
.................
With the Azure Databricks workspace token, Spark 3 cluster, and CData driver in place, you can use the SPARK engine in a LIBNAME statement to access Azure Databricks tables. The Azure Databricks workspace token (key) serves as the password for authenticating to the Databricks environment.
Code:
/* Note: quoting these macro variable values generates errors, so assign them without quotes. */
%let MYDBRICKS=adb-7078320493216227.7.azuredatabricks.net;
%let MYPWD=XXXXXXXXXXXXXXXXXXXXXXXXXXXXX;
%let MYHTTPPATH=sql/protocolv1/o/7078320493216227/0923-184743-ig29dref;
%let MYDRIVERCLASS="cdata.jdbc.databricks.DatabricksDriver";
%let MYURL="jdbc:databricks:Server=&MYDBRICKS;HTTPPath=&MYHTTPPATH;Database=default;QueryPassthrough=true;Token=&MYPWD" ;
libname CdtSpark spark driverClass=&MYDRIVERCLASS url=&MYURL ;
Proc SQL outobs=20;
select * from CdtSpark.baseball_prqt ;
run;quit;
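Because QueryPassthrough=true is set in the connection URL, you can also push SQL directly to the Databricks cluster with explicit SQL pass-through. A minimal sketch, reusing the CdtSpark libref and baseball_prqt table from the example above (the selected columns and WHERE clause are illustrative):

```sas
/* Explicit SQL pass-through: the query executes on the Databricks
   cluster, and only the result rows return to the SAS Compute server. */
proc sql;
   connect using CdtSpark;                /* reuse the existing SPARK libref */
   select * from connection to CdtSpark
      ( select name, team, salary         /* illustrative column names */
        from baseball_prqt
        where salary > 1000 );
   disconnect from CdtSpark;
quit;
```

This pattern is useful when you want Databricks, rather than the SAS Compute server, to do the filtering and aggregation.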
Log extract:
....
..............
0 /* Note : variable value in quotes generate errors, So keep it without quotes. */
81 %let MYDBRICKS=adb-7078320493216227.7.azuredatabricks.net;
82 %let MYPWD=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX;
83 %let MYHTTPPATH=sql/protocolv1/o/7078320493216227/0923-184743-ig29dref;
84 %let MYDRIVERCLASS="cdata.jdbc.databricks.DatabricksDriver";
85
86 %let MYURL="jdbc:databricks:Server=&MYDBRICKS;HTTPPath=&MYHTTPPATH;Database=default;QueryPassthrough=true;Token=&MYPWD" ;
87
88 libname CdtSpark spark driverClass=&MYDRIVERCLASS url=&MYURL ;
NOTE: Libref CDTSPARK was successfully assigned as follows:
Engine: SPARK
Physical Name:
jdbc:databricks:Server=adb-7078320493216227.7.azuredatabricks.net;HTTPPath=sql/protocolv1/o/7078320493216227/0923-184743-ig29d
ref;Database=default;QueryPassthrough=true;Token=XXXXXXXXXXXXXXXXXXXXXXXXXXXXX
89
90 Proc SQL outobs=20;
91 select * from CdtSpark.baseball_prqt ;
WARNING: Statement terminated early due to OUTOBS=20 option.
92 run;quit;
NOTE: PROC SQL statements are executed immediately; The RUN statement has no effect.
NOTE: The PROCEDURE SQL printed pages 1-2.
NOTE: PROCEDURE SQL used (Total process time):
real time 2.49 seconds
cpu time 0.10 seconds
93
....
..............
Result Output:
With the Azure Databricks workspace token, Spark 3 cluster, and CData driver in place, you can use the Spark data connector to create a CASLIB and load CAS from an Azure Databricks table using either the serial or the multi-node method.
Serial Load
Code:
/* Note: quoting these macro variable values generates errors, so assign them without quotes. */
%let MYDBRICKS=adb-7078320493216227.7.azuredatabricks.net;
%let MYPWD=XXXXXXXXXXXXXXXXXXXXXXXXXXXXX;
%let MYHTTPPATH=sql/protocolv1/o/7078320493216227/0923-184743-ig29dref;
%let MYDRIVERCLASS="cdata.jdbc.databricks.DatabricksDriver";
%let MYURL="jdbc:databricks:Server=&MYDBRICKS;httpPath=&MYHTTPPATH;Database=default;QueryPassthrough=true;Token=&MYPWD" ;
CAS mySession SESSOPTS=( CASLIB=casuser TIMEOUT=99 LOCALE="en_US" metrics=true);
caslib Cdtspkcaslib dataSource=(srctype='spark',
url=&MYURL,
driverclass=&MYDRIVERCLASS,
BULKLOAD=NO,
schema="default" );
proc casutil outcaslib="Cdtspkcaslib" incaslib="Cdtspkcaslib" ;
list files ;
quit;
/* Load CAS from DataBricks database table */
proc casutil outcaslib="Cdtspkcaslib" incaslib="Cdtspkcaslib" ;
load casdata="iot_device" casout="iot_device" replace;
list tables;
quit;
CAS mySession TERMINATE;
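A table loaded this way is session-scoped: it disappears when the CAS session terminates. If other sessions or reports need the data, you can promote it to global scope before terminating. A minimal sketch using the caslib and table from the example above:

```sas
/* Promote the session-scope CAS table to global scope so that
   other CAS sessions can access it after this session ends. */
proc casutil incaslib="Cdtspkcaslib" outcaslib="Cdtspkcaslib";
   promote casdata="iot_device" casout="iot_device";
quit;
```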
Log extract:
....
..............
90 caslib Cdtspkcaslib dataSource=(srctype='spark',
91 url=&MYURL,
92 driverclass=&MYDRIVERCLASS,
93 BULKLOAD=NO,
94 schema="default" );
NOTE: Executing action 'table.addCaslib'.
NOTE: 'CDTSPKCASLIB' is now the active caslib.
NOTE: Cloud Analytic Services added the caslib 'CDTSPKCASLIB'.
NOTE: Action 'table.addCaslib' used (Total process time):
NOTE: real time 0.010392 seconds
NOTE: cpu time 0.023097 seconds (222.26%)
NOTE: total nodes 3 (12 cores)
NOTE: total memory 94.08G
NOTE: memory 2.19M (0.00%)
NOTE: Action to ADD caslib CDTSPKCASLIB completed for session MYSESSION.
....
..............
100 /* Load CAS from DataBricks database table */
101 proc casutil outcaslib="Cdtspkcaslib" incaslib="Cdtspkcaslib" ;
NOTE: The UUID '145cf014-660f-a340-a0a6-e7a6abf83318' is connected using session MYSESSION.
102 load casdata="iot_device" casout="iot_device" replace;
NOTE: Executing action 'table.loadTable'.
NOTE: Performing serial LoadTable action using SAS Data Connector to Spark.
NOTE: Cloud Analytic Services made the external data from iot_device available as table IOT_DEVICE in caslib Cdtspkcaslib.
NOTE: Action 'table.loadTable' used (Total process time):
....
..............
Result Output:
Multi-Node Load
Code:
/* Note: quoting these macro variable values generates errors, so assign them without quotes. */
%let MYDBRICKS=adb-7078320493216227.7.azuredatabricks.net;
%let MYPWD=XXXXXXXXXXXXXXXXXXXXXXXXXXXXX;
%let MYHTTPPATH=sql/protocolv1/o/7078320493216227/0923-184743-ig29dref;
%let MYDRIVERCLASS="cdata.jdbc.databricks.DatabricksDriver";
%let MYURL="jdbc:databricks:Server=&MYDBRICKS;httpPath=&MYHTTPPATH;Database=default;QueryPassthrough=true;Token=&MYPWD" ;
CAS mySession SESSOPTS=( CASLIB=casuser TIMEOUT=99 LOCALE="en_US" metrics=true);
caslib Cdtspkcaslib dataSource=(srctype='spark',
url=&MYURL,
driverclass=&MYDRIVERCLASS,
BULKLOAD=NO,
schema="default" ,numreadnodes=10 );
proc casutil outcaslib="Cdtspkcaslib" incaslib="Cdtspkcaslib" ;
list files ;
quit;
/* Load CAS from DataBricks database table */
proc casutil outcaslib="Cdtspkcaslib" incaslib="Cdtspkcaslib" ;
load casdata="baseball_prqt" casout="baseball_prqt" options=(sliceColumn="salary") replace;
list tables;
quit;
CAS mySession TERMINATE;
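To confirm how the multi-node load distributed rows across the CAS workers, you can query the table details (run this before terminating the session). A sketch using the table from the example above:

```sas
/* Report per-node details for the loaded CAS table; with a successful
   multi-node load the rows are spread across the worker nodes. */
proc cas;
   table.tableDetails / name="baseball_prqt" caslib="Cdtspkcaslib" level="node";
run;
quit;
```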
Log extract:
....
..............
NOTE: The CAS statement request to update one or more session options for session MYSESSION completed.
89
90 caslib Cdtspkcaslib dataSource=(srctype='spark',
91 url=&MYURL,
92 driverclass=&MYDRIVERCLASS,
93 BULKLOAD=NO,
94 schema="default" ,numreadnodes=10 );
NOTE: Executing action 'table.addCaslib'.
NOTE: 'CDTSPKCASLIB' is now the active caslib.
NOTE: Cloud Analytic Services added the caslib 'CDTSPKCASLIB'.
....
..............
100 /* Load CAS from DataBricks database table */
101 proc casutil outcaslib="Cdtspkcaslib" incaslib="Cdtspkcaslib" ;
NOTE: The UUID 'b70c08f1-37c3-6c4c-9114-541d032709f6' is connected using session MYSESSION.
102 load casdata="baseball_prqt" casout="baseball_prqt" options=(sliceColumn="salary") replace;;
NOTE: Executing action 'table.loadTable'.
NOTE: Performing serial LoadTable action using SAS Data Connector to Spark.
WARNING: The value of numReadNodes(10) exceeds the number of available worker nodes(2). The load will proceed with numReadNodes=2.
NOTE: Cloud Analytic Services made the external data from baseball_prqt available as table BASEBALL_PRQT in caslib Cdtspkcaslib.
....
..............
Result Output:
Important Links:
Accessing Social Media data (and more) from SAS Viya with embedded CData JDBC drivers
CAS Accessing Azure DataBricks SPARK cluster
Loading CAS from Azure Databricks Delta table
Find more articles from SAS Global Enablement and Learning here.