Azure HDInsight is a cloud distribution of Hadoop components. A SAS user may wonder how to connect an application with an HDI cluster. The SAS/ACCESS Interface to Hadoop and SAS Data connector to Hadoop technology enables SAS users to connect and access data from the HDInsight cluster.
This post is about the components required for SAS 9.4 (M6/7) and CAS to connect and access the Hive tables from the Azure HDInsight cluster.
Azure HDInsight is a managed, open-source Hadoop cluster service in the cloud. Azure HDInsight enables a user to create a global and optimized cluster with open-source frameworks such as Apache Hadoop, Spark, Hive, Interactive Query, Kafka, etc. . It provides a scalable Hadoop cluster that enables users to scale up or down based on workload. HDInsight integrates seamlessly with the most popular big data solutions
The SAS 9.4 (M6/7) can access hive tables from the HDInsight Hadoop cluster using SAS/ACCESS Interface to Hadoop technology. The following diagram describes the data path and components required to access the HDInsight Hadoop cluster.
The HDInsight Hadoop cluster allows ssh to login to the master node server. A valid user can log in to the HDI master node and execute the SAS-Hadoop-Tracer script to collect required Hadoop Jars and config files.
The SAS Embedded Process ( SAS EP) for Hadoop is not supported at the HDInsight Hadoop cluster, hence the in-database processing components (Code Accelerator, Data Quality Accelerator, etc.) are not supported.
option set = SAS_HADOOP_CONFIG_PATH = "/opt/sas/viya/config/data/HDIHadoopData/conf";
option set = SAS_HADOOP_JAR_PATH = "/opt/sas/viya/config/data/HDIHadoopData/lib";
libname hdilib HADOOP
uri='jdbc:hive2://hdiutkuma5.azurehdinsight.net:443/default;ssl=true?hive.server2.transport.mode=http;hive.server2.thrift.http.path=hive2'
server='hdiutkuma5.azurehdinsight.net'
user="admin" pw=“XXXXXXXXX" schema="default";
data hdilib.cars_test;
set sashelp.cars;
run ;
CAS supports the Serial mechanism to load and save data to Azure HDInsight Hive table using SAS Data Connector to Hadoop. The following diagram describes the data path and components requires to access the Azure HDInsight.
The HDInsight Hadoop cluster allows ssh to log in to the master node server. A valid user can log in to the HDI master node and execute the SAS-Hadoop-Tracer script to collect required Hadoop Jars and config files.
The SAS Embedded Process ( SAS EP) for Hadoop is not yet supported at the HDInsight Hadoop cluster. Hence, the parallel data load to CAS is not supported. There is a known bug with the Multi-Node load method, the SAS data connector for Hadoop does not work well with specific hive numeric data column to create the split query. Hence, the Multi-Node data load/save to CAS is not supported.
CAS mySession SESSOPTS=(CASLIB=casuser TIMEOUT=99 LOCALE="en_US" metrics=true);
caslib cashive datasource=(srctype="hadoop",
server="hdiutkuma8.azurehdinsight.net",
username="admin", pwd=“XXXXXXXXXX", schema=default,
uri='jdbc:hive2://hdiutkuma8.azurehdinsight.net:443/default;ssl=true?hive.server2.transport.mode=http;hive.server2.thrift.http.path=hive2',
hadoopconfigdir="/opt/sas/viya/config/data/HDIHadoopData/conf",
hadoopjarpath="/opt/sas/viya/config/data/HDIHadoopData/lib");
proc casutil incaslib="cashive" outcaslib="cashive" ;
load data=sashelp.prdsal3 casout="prdsal3" replace;
save casdata="prdsal3" casout="prdsal3" replace;
quit;
proc casutil incaslib="cashive" outcaslib="cashive";
load casdata="prdsal3" casout="prdsal3_new" replace;
list tables;
quit;
CAS mySession TERMINATE;
Stay tuned for the next article on "SAS and CAS reading ADL2 files (Parquet, Avro, Jason, XML, etc.) via the HDInsight cluster".
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Data Literacy is for all, even absolute beginners. Jump on board with this free e-learning and boost your career prospects.