
Accessing Google Dataproc With SAS Viya


It has been almost eight years since Hadoop reached its peak in web-search interest. Spark then took over, and Hadoop/Spark services are still widely used at numerous companies.

 

[Figure: Google Trends chart comparing web-search interest in Hadoop and Spark over time]


 

SAS continually expands its support for a wide range of data providers, and Google Dataproc has recently been added to the data sources SAS Viya can work with.

 

What is Google Dataproc?

 

According to Google, Dataproc is "a fully managed and highly scalable service for running Apache Hadoop, Apache Spark, and 30+ open source tools and frameworks."

 

Dataproc is very accessible and you can provision a ready-to-use cluster in 90 seconds.  
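For illustration, such a cluster can be provisioned from the gcloud CLI. A minimal sketch, assuming the Google Cloud SDK is installed and configured; the cluster name, region, machine type, and worker count below are placeholders, not values from this article:

# Provision a small Dataproc cluster on Compute Engine (placeholder values)
gcloud dataproc clusters create sas-demo-cluster \
    --region=us-central1 \
    --master-machine-type=n1-standard-4 \
    --num-workers=2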

 

When did SAS Viya start supporting Google Dataproc?

 

Hadoop (Hive) support for Google Dataproc was introduced in SAS Viya Stable Release 2022.11 through the included CData JDBC driver for Apache Hive.

 

What are the configuration steps?

 

The following steps have been tested with a Google Dataproc cluster set up on Compute Engine. The steps required to allow network communication between SAS Viya and Google Dataproc are not discussed here, as they depend heavily on your setup.

 

  1. Collect the configuration and JAR files

This step is recommended, but not mandatory, if you want to access HDFS under the covers. Bulk load and bulk unload capabilities are unlocked when SAS can use HDFS to stage data during read and write operations, and HDFS access requires SAS to have certain configuration and JAR files, which must be collected beforehand.

 

To do this task, we will need to run the Hadoop Tracer Script on the Google Dataproc cluster’s name node.

 

The Hadoop Tracer Script can be downloaded from http://ftp.sas.com/techsup/download/blind/access/hadooptracer.zip. It has been recently updated to work with Google Dataproc.

 

We need to copy this ZIP file to the Dataproc cluster's name node and unzip it. The archive contains a Python script and a required JSON file.

 

The Hadoop Tracer Script makes use of the strace tool, which needs to be installed prior to running the script (for example, "sudo apt-get install strace -y" on the Debian image used by Google Dataproc).
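A minimal shell sketch of these staging steps, assuming the gcloud CLI; the instance name (my-cluster-m mirrors Dataproc's default master-node naming) and zone are placeholders:

# Copy the tracer archive to the name node (placeholder instance name and zone)
gcloud compute scp ./hadooptracer.zip my-cluster-m:~/ --zone=us-central1-a
# Log on to the name node, unzip the archive, and install strace
gcloud compute ssh my-cluster-m --zone=us-central1-a
unzip ./hadooptracer.zip
sudo apt-get install strace -y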

 

We are ready to run the script using the following command:

 

python ./hadooptracer_py --filterby=latest --postprocess --jsonfile ./driver.json --jars ./jars --conf ./conf --logfile ./tracer.log

 

This script can take about 15 minutes to run. At the end, you should have a collection of configuration and JAR files in the ./jars and ./conf relative sub-folders on the Dataproc name node.

 

  2. Make the files available to SAS Compute Server and CAS

 

The collected files need to be available to Compute and CAS. Therefore, you need to copy them from the Dataproc name node to a location that will be available to both, such as a Kubernetes Persistent Volume or an NFS mount.
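A minimal sketch of that copy, assuming the collected files sit under the home directory on the name node and that /mnt/viya-share is a shared path mounted by both SAS Compute and CAS; all names and paths are placeholders:

# Pull the collected files off the name node to a shared location (placeholder names)
gcloud compute scp --recurse my-cluster-m:~/jars /mnt/viya-share/dataproc/jars --zone=us-central1-a
gcloud compute scp --recurse my-cluster-m:~/conf /mnt/viya-share/dataproc/conf --zone=us-central1-a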

 

Once done, you are ready to use Google Dataproc from SAS Viya.  

 

How do you connect to Google Dataproc from SAS Viya?

 

To access Google Dataproc from SAS Compute, you first need to set two environment variables that point to the Hadoop configuration and JAR files collected earlier (again, recommended but not mandatory). You also need the IP address of the Dataproc cluster's name node. Then you can use the Hadoop LIBNAME engine along with the CData Apache Hive JDBC driver included in SAS Viya:

 

option set=SAS_HADOOP_JAR_PATH="/dataproc/jars" ; 
option set=SAS_HADOOP_CONFIG_PATH="/dataproc/conf" ;
libname dataproc hadoop user="hdfs" password="pw" read_method=hdfs 
   driverclass="cdata.jdbc.apachehive.ApacheHiveDriver"
   uri="jdbc:apachehive:Server=<ip-address>;QueryPassthrough=True;Database=default;DefaultColumnSize=1024" ;

 

From CAS, the CASLIB statement will look similar to the Hadoop libname:

 

caslib casdp datasource=(srctype="hadoop"
   username="hdfs" password="pw"
   uri="jdbc:apachehive:Server=<ip-address>;QueryPassthrough=True;Database=default;DefaultColumnSize=1024"
   hadoopJarPath="/dataproc/jars"
   hadoopConfigDir="/dataproc/conf") libref=casdp ;

 

The hadoopJarPath and hadoopConfigDir options replace the two environment variables used in SAS Compute.
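Once the caslib is assigned, tables can be loaded into CAS memory. A minimal sketch; the table name sales is again a placeholder:

/* Load a Hive table into CAS (table name is a placeholder) */
proc casutil incaslib="casdp" outcaslib="casdp";
   load casdata="sales" casout="sales" replace;
quit;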

 

Now you are ready to read/process/write Google Dataproc data from SAS Viya.

 

Special thanks to my colleague Bill Oliver for his help on this topic.  

 

Thanks for reading.

 

Find more articles from SAS Global Enablement and Learning here.
