It has been almost 8 years since Hadoop reached its peak in web-search interest. Spark then took over, and Hadoop/Spark services are still widely used at numerous companies.
SAS continually extends its support for a wide range of data providers. Recently, Google Dataproc was added to the data sources SAS Viya can work with.
“Google Dataproc is a fully managed and highly scalable service for running Apache Hadoop, Apache Spark, and 30+ open source tools and frameworks.”
Dataproc is easy to get started with: you can provision a ready-to-use cluster in about 90 seconds.
Hadoop (Hive) support for Google Dataproc was introduced in SAS Viya Stable Release 2022.11 through the included CData JDBC driver for Apache Hive.
The following steps have been tested with a Google Dataproc cluster set up on Compute Engine. Not discussed here are the steps required to allow network communication between SAS Viya and Google Dataproc, since they depend heavily on your setup.
This step is recommended, though not mandatory, if you want SAS to access HDFS under the covers. Bulkload and bulkunload capabilities are unlocked when SAS can use HDFS for staging data during read and write operations. HDFS access requires SAS to have certain configuration and JAR files collected beforehand.
To do this task, we will need to run the Hadoop Tracer Script on the Google Dataproc cluster’s name node.
The Hadoop Tracer Script can be downloaded from http://ftp.sas.com/techsup/download/blind/access/hadooptracer.zip. It has been recently updated to work with Google Dataproc.
We need to copy this ZIP file to the Dataproc name node and unzip it. The archive contains a Python script and a required JSON file.
The Hadoop Tracer Script makes use of the strace tool, which needs to be installed prior to running the script (for example, "sudo apt-get install strace -y" on the Debian image used by Google Dataproc).
We are ready to run the script using the following command:
python ./hadooptracer_py --filterby=latest --postprocess --jsonfile ./driver.json --jars ./jars --conf ./conf --logfile ./tracer.log
This script can take about 15 minutes to run. At the end, you should have a collection of configuration and JAR files in the ./jars and ./conf relative sub-folders on the Dataproc name node.
The collected files need to be available to both SAS Compute and CAS. Therefore, copy them from the Dataproc name node to a location accessible to both, such as a Kubernetes Persistent Volume or an NFS mount.
Once done, you are ready to use Google Dataproc from SAS Viya.
To access Google Dataproc from SAS Compute, you first need to assign two environment variables to point to the Hadoop configuration and JAR files collected earlier (again recommended, not mandatory). You also need the IP address of the Dataproc cluster's name node. Then you can use the Hadoop libname engine along with the CData Apache Hive JDBC driver included in SAS Viya:
option set=SAS_HADOOP_JAR_PATH="/dataproc/jars" ;
option set=SAS_HADOOP_CONFIG_PATH="/dataproc/conf" ;
libname dataproc hadoop user="hdfs" password="pw" read_method=hdfs
driverclass="cdata.jdbc.apachehive.ApacheHiveDriver"
uri="jdbc:apachehive:Server=<ip-address>;QueryPassthrough=True;Database=default;DefaultColumnSize=1024" ;
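As a quick check of the new libref, you can list the Hive tables it exposes and read one of them into a SAS work table. This is a minimal sketch; the table name class is a hypothetical example, not something created by the steps above:

```sas
/* List the Hive tables visible through the dataproc libref */
proc datasets lib=dataproc ;
quit ;

/* Copy a Hive table into WORK                               */
/* "class" is a hypothetical table name used for this sketch */
data work.class_copy ;
   set dataproc.class ;
run ;
```

With read_method=hdfs in the libname, reads like this are staged through HDFS rather than streamed serially through Hive, which is where the performance benefit of the collected configuration and JAR files comes in.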
From CAS, the CASLIB statement will look similar to the Hadoop libname:
caslib casdp datasource=(srctype="hadoop"
username="hdfs" password="pw"
uri="jdbc:apachehive:Server=<ip-address>;QueryPassthrough=True;Database=default;DefaultColumnSize=1024"
hadoopJarPath="/dataproc/jars"
hadoopConfigDir="/dataproc/conf" ) libref=casdp ;
The hadoopJarPath and hadoopConfigDir options replace the two environment variables used with SAS Compute.
Now you are ready to read/process/write Google Dataproc data from SAS Viya.
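For example, you can load a Hive table into CAS through the new caslib and list the loaded tables. Again, the table name class is a hypothetical example used for illustration:

```sas
/* Load a Hive table into CAS through the casdp caslib       */
/* "class" is a hypothetical table name used for this sketch */
proc casutil ;
   load casdata="class" incaslib="casdp"
        casout="class" outcaslib="casdp" ;
   list tables incaslib="casdp" ;
quit ;
```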
Special thanks to my colleague Bill Oliver for his help on this topic.
Thanks for reading.
Find more articles from SAS Global Enablement and Learning here.