
Accessing Google Dataproc With SAS Viya


It has been almost 8 years since Hadoop reached its peak of interest in web searches. Spark then took over, and Hadoop/Spark services are still widely used at many companies.

 

[Figure: Google Trends chart showing web-search interest in Hadoop over time]

 

SAS continually works to support a wide range of data providers, and Google Dataproc was recently added to the data sources SAS Viya can work with.

 

What is Google Dataproc?

 

According to Google, Dataproc is "a fully managed and highly scalable service for running Apache Hadoop, Apache Spark, and 30+ open source tools and frameworks."

 

Dataproc is very accessible: you can provision a ready-to-use cluster in about 90 seconds.
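For a sense of how simple provisioning is, here is a minimal sketch using the Google Cloud CLI; the cluster name, region, and worker count are placeholders, not values from this article:

gcloud dataproc clusters create <cluster-name> --region=<region> --num-workers=2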

 

When did SAS Viya start supporting Google Dataproc?

 

Hadoop (Hive) support for Google Dataproc was introduced in SAS Viya Stable Release 2022.11 through the included CData JDBC driver for Apache Hive.

 

What are the configuration steps?

 

The following steps have been tested with a Google Dataproc cluster set up on Compute Engine. Not discussed here are the steps required to allow network communication between SAS Viya and Google Dataproc (these depend heavily on your setup).

 

  1. Collect the configuration and JAR files

This step is recommended if you want SAS to access HDFS under the covers, but it is not mandatory. Bulkload and bulkunload capabilities are unlocked when SAS can use HDFS for staging data during read and write operations. HDFS access requires SAS to have certain configuration and JAR files, which must be collected beforehand.

 

To do this task, we will need to run the Hadoop Tracer Script on the Google Dataproc cluster’s name node.

 

The Hadoop Tracer Script can be downloaded from http://ftp.sas.com/techsup/download/blind/access/hadooptracer.zip. It has been recently updated to work with Google Dataproc.

 

We need to copy this ZIP file to the Dataproc name node and unzip it. The archive contains a Python script and a required JSON file.
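As a sketch, assuming the name node is a reachable Compute Engine instance (the instance name and zone are placeholders), the copy and extraction could look like this:

gcloud compute scp ./hadooptracer.zip <name-node>:~/ --zone=<zone>
gcloud compute ssh <name-node> --zone=<zone>
unzip ./hadooptracer.zip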

 

The Hadoop Tracer Script makes use of the strace tool, which needs to be installed prior to running the script (for example, "sudo apt-get install strace -y" on the Debian image used by Google Dataproc).

 

We are ready to run the script using the following command:

 

python ./hadooptracer_py --filterby=latest --postprocess --jsonfile ./driver.json --jars ./jars --conf ./conf --logfile ./tracer.log

 

This script can take about 15 minutes to run. At the end, you should have a collection of configuration and JAR files in the ./jars and ./conf sub-folders on the Dataproc name node.

 

  2. Make the files available to SAS Compute Server and CAS

 

The collected files need to be available to both SAS Compute Server and CAS. Therefore, you need to copy them from the Dataproc name node to a location that is accessible to both, such as a Kubernetes Persistent Volume or an NFS mount.
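As an illustration, assuming the files should end up under /dataproc on a share mounted by both Compute and CAS (the target path, instance name, and zone are placeholders):

gcloud compute scp --recurse <name-node>:~/jars <name-node>:~/conf /dataproc/ --zone=<zone>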

 

Once done, you are ready to use Google Dataproc from SAS Viya.  

 

How to connect to Google Dataproc from SAS Viya?

 

To access Google Dataproc from SAS Compute, you first need to set two environment variables pointing to the Hadoop configuration and JAR files collected earlier (again, recommended but not mandatory). You also need the IP address of the Dataproc cluster's name node. Then you can use the Hadoop LIBNAME engine along with the CData Apache Hive JDBC driver included in SAS Viya:

 

option set=SAS_HADOOP_JAR_PATH="/dataproc/jars" ; 
option set=SAS_HADOOP_CONFIG_PATH="/dataproc/conf" ;
libname dataproc hadoop user="hdfs" password="pw" read_method=hdfs 
   driverclass="cdata.jdbc.apachehive.ApacheHiveDriver"
   uri="jdbc:apachehive:Server=<ip-address>;QueryPassthrough=True;Database=default;DefaultColumnSize=1024" ;

 

From CAS, the CASLIB statement looks similar to the Hadoop LIBNAME statement:

 

caslib casdp datasource=(srctype="hadoop"
   username="hdfs" password="pw"
   uri="jdbc:apachehive:Server=<ip-address>;QueryPassthrough=True;Database=default;DefaultColumnSize=1024"
   hadoopJarPath="/dataproc/jars"
   hadoopConfigDir="/dataproc/conf") libref=casdp ;

 

The hadoopJarPath= and hadoopConfigDir= options replace the two environment variables used in SAS Compute.
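To validate the caslib, here is a short sketch assuming the same hypothetical mytable and an active CAS session:

proc casutil incaslib="casdp" outcaslib="casdp" ;
   list files ;                              /* show the Hive tables the caslib can see */
   load casdata="mytable" casout="mytable" ; /* load the table into CAS memory */
quit ;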

 

Now you are ready to read, process, and write Google Dataproc data from SAS Viya.
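For instance, a simple data step can write a SAS table back to Hive; cars_copy is a hypothetical target table, and with the configuration files in place the engine can stage the data through HDFS:

data dataproc.cars_copy ;   /* create a new Hive table from a sample SAS table */
   set sashelp.cars ;
run ;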

 

Special thanks to my colleague Bill Oliver for his help on this topic.  

 

Thanks for reading.

 

Find more articles from SAS Global Enablement and Learning here.
