
SAS Viya/CAS accessing Azure Blob Storage data files via HDInsight.


SAS Viya/CAS and the Viya SPRE support data file access from Azure Blob Storage (ADLS2), but only for a limited set of file types. The SPRE reads and writes ORC and CSV data files using the ORC LIBNAME engine. SAS Viya 3.5/CAS reads and writes ORC and CSV data files using an ADLS CASLIB, and SAS Viya 4.0 adds Parquet data file read/write through the ADLS CASLIB. Base SAS 9.4 has no LIBNAME engine to read or write data files from ADLS2 Blob Storage.
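For the natively supported file types, the ADLS CASLIB path looks like the following. This is a minimal sketch, assuming SAS Viya 3.5 ADLS CASLIB syntax; the application ID, tenant ID, and the sales.csv file name are hypothetical placeholders.

caslib adlsblob datasource=(srctype="adls",
   accountName="utkuma8adls2strg",     /* ADLS2 storage account */
   fileSystem="fsdata",                /* ADLS2 file system (container) */
   applicationId="<azure-ad-app-id>",  /* placeholder Azure AD application ID */
   tenantId="<azure-ad-tenant-id>");   /* placeholder Azure AD tenant ID */

proc casutil incaslib="adlsblob" outcaslib="adlsblob";
   load casdata="sales.csv" casout="sales"
        importoptions=(filetype="csv") replace;  /* CSV is supported natively */
quit;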

 

Given these limited options, how do you load into CAS, or read into SAS 9.4, data files stored in Azure Blob Storage in formats such as Avro, JSON, Parquet, or XML?

 

Azure HDInsight can help in this situation! Azure HDInsight can expose all of these Blob Storage data files as Hive tables, and the SAS application can then access these HDI Hive tables using SAS/ACCESS Interface to Hadoop or the SAS Data Connector to Hadoop.

 

The following steps integrate Blob Storage with the HDI cluster and enable users to read and write Azure Blob Storage data files from Base SAS 9.4 and Viya CAS.

 

Create Azure HDInsight cluster with access to Blob Storage

Create the Azure HDInsight cluster with a managed Azure identity that has access to Blob Storage; this gives the HDI cluster seamless access to Azure Blob Storage. The following screen shows the Azure managed identity with access to Blob Storage.
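The same setup can be scripted with the Azure CLI. A hedged sketch: the resource group, managed-identity values, and password are placeholders, and the identity must hold the "Storage Blob Data Owner" role on the storage account before the cluster is created.

# Grant the user-assigned managed identity data access on the storage account
az role assignment create \
    --role "Storage Blob Data Owner" \
    --assignee-object-id <msi-principal-id> \
    --scope $(az storage account show --name utkuma8adls2strg \
              --resource-group <my-resource-group> --query id --output tsv)

# Create the HDInsight Hadoop cluster with ADLS2 as primary storage
az hdinsight create \
    --name hdiutkuma8 \
    --resource-group <my-resource-group> \
    --type hadoop \
    --http-password '<cluster-login-password>' \
    --storage-account utkuma8adls2strg \
    --storage-filesystem fsdata \
    --storage-account-managed-identity <msi-resource-id>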

 

[Image: uk_SAS_CAS_Aaccess_to_BlobStorage_Via_HDI_1.png]

 

 

The following screen shows the ADLS2 Blob Storage and managed identity selections used while creating the HDInsight cluster.

 

[Image: uk_SAS_CAS_Aaccess_to_BlobStorage_Via_HDI_2.png]

 

Create HDI Hive database and table with data files at ADLS2

To enable the HDI Hive service to read and write data on ADLS2 storage, the Hive schema must be created with its data files located on ADLS2. With access to the HDInsight Hadoop cluster, you can also collect the Hadoop JARs and configuration files needed in the SAS environment. The HDInsight Hadoop cluster allows SSH login to the head node, so a valid user can log in and run the SAS Hadoop tracer script to collect the required Hadoop JARs and configuration files.
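A hedged sketch of that collection step follows; the SSH user, the Viya host name, and the script's output locations are assumptions, so verify them against the tracer log for your SAS release.

# Log in to the HDI head node (the SSH endpoint is <cluster>-ssh.azurehdinsight.net)
ssh sshuser@hdiutkuma8-ssh.azurehdinsight.net

# Run the SAS Hadoop tracer script (obtain it per SAS documentation);
# it typically writes JARs to /tmp/jars and config files to /tmp/sitexmls
python ./hadooptracer.py

# Copy the results to the paths referenced by the SAS examples below
scp -r /tmp/jars/*     sas@<viya-host>:/opt/sas/viya/config/data/HDIHadoopData/lib/
scp -r /tmp/sitexmls/* sas@<viya-host>:/opt/sas/viya/config/data/HDIHadoopData/conf/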

 

Sample code to create HDI Hive database

 

%let HDINM=hdiutkuma8;           /* HDInsight cluster name */
%let MYSTRGACC=utkuma8adls2strg; /* ADLS2 storage account */
%let MYSTRGFS=fsdata;            /* ADLS2 file system (container) */

/* Point SAS at the Hadoop config files and JARs collected from the HDI cluster */
option set = SAS_HADOOP_CONFIG_PATH = "/opt/sas/viya/config/data/HDIHadoopData/conf";
option set = SAS_HADOOP_JAR_PATH = "/opt/sas/viya/config/data/HDIHadoopData/lib";
options sastrace = ',,,sd' sastraceloc = saslog ;

/* Create a Hive database whose data folder resides on ADLS2 */
PROC SQL ;
connect to hadoop as hvlib (uri="jdbc:hive2://&HDINM..azurehdinsight.net:443/default;ssl=true?hive.server2.transport.mode=http;hive.server2.thrift.http.path=hive2"
server="&HDINM..azurehdinsight.net" user="admin" pw="XXXXXXX" schema="default");
execute (CREATE database geltest LOCATION "abfs://&MYSTRGFS@&MYSTRGACC..dfs.core.windows.net/hive_db"
) by hvlib;
quit;

 

Executing the above code creates an HDI Hive database with its data folder located on ADLS2. When Base SAS 9.4 or CAS uses the "geltest" database/schema name in a LIBNAME or CASLIB, it accesses data from ADLS2, and any data saved from Base SAS 9.4 or CAS ends up in ADLS2 storage.
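To confirm that the database exists, you can list the Hive databases with explicit pass-through. This is a hedged sketch that reuses the connection options from the sample above.

proc sql;
connect to hadoop as hvlib (uri="jdbc:hive2://&HDINM..azurehdinsight.net:443/default;ssl=true?hive.server2.transport.mode=http;hive.server2.thrift.http.path=hive2"
server="&HDINM..azurehdinsight.net" user="admin" pw="XXXXXXX" schema="default");
/* SHOW DATABASES returns a result set, so it can be queried directly */
select * from connection to hvlib (SHOW DATABASES);
disconnect from hvlib;
quit;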

 

[Image: uk_SAS_CAS_Aaccess_to_BlobStorage_Via_HDI_3.png]

 

Sample code to create HDI Hive table

For existing data files (created by third-party applications), you can create the table schema using the following example statement. Notice the database/schema name and the data file location, which point to the ADLS2 Blob Storage location.

 

 

%let HDINM=hdiutkuma8;
%let MYSTRGACC=utkuma8adls2strg;
%let MYSTRGFS=fsdata;

option set = SAS_HADOOP_CONFIG_PATH = "/opt/sas/viya/config/data/HDIHadoopData/conf";
option set = SAS_HADOOP_JAR_PATH = "/opt/sas/viya/config/data/HDIHadoopData/lib";
options sastrace = ',,,sd' sastraceloc = saslog ;


PROC SQL ;
connect to hadoop as hvlib (uri="jdbc:hive2://&HDINM..azurehdinsight.net:443/geltest;ssl=true?hive.server2.transport.mode=http;hive.server2.thrift.http.path=hive2"
server="&HDINM..azurehdinsight.net" user="admin" pw="XXXXXX" schema="geltest");
execute (drop table if exists flight_reporting) by hvlib;
/* External table over existing delimited files in the ADLS2 flight_data folder */
execute (CREATE EXTERNAL TABLE flight_reporting (
    YEAR varchar(4) ,
    MONTH varchar(4),
    ORIGIN_AIRPORT_ID varchar(10),
    ORIGIN_AIRPORT_SEQ_ID varchar(10),
    ORIGIN_CITY_MARKET_ID varchar(10),
    ORIGIN_CITY_NAME varchar(20),
    DEST_AIRPORT_ID varchar(10),
    DEST_AIRPORT_SEQ_ID varchar(10),
    DEST_CITY_MARKET_ID varchar(10),
    DEST_CITY_NAME varchar(20)
    )
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION "abfs://&MYSTRGFS@&MYSTRGACC..dfs.core.windows.net/flight_data"
) by hvlib;
quit;
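You can confirm the folder contents on ADLS2 from the Azure CLI. A hedged sketch; --auth-mode login assumes your Azure AD identity has data-plane access to the storage account.

az storage fs file list \
    --account-name utkuma8adls2strg \
    --file-system fsdata \
    --path flight_data \
    --auth-mode login \
    --output table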

 

 

 

SAS 9.4 access to ADLS2 data files via HDI

SAS 9.4 (M6/M7) can access Hive tables from the HDInsight Hadoop cluster using SAS/ACCESS Interface to Hadoop. The following diagram describes the data access path from SAS 9.4 to Blob Storage via HDInsight.

 

[Image: uk_SAS_CAS_Aaccess_to_BlobStorage_Via_HDI_4.png]

 

Pre-requisites

  • SAS/ACCESS Interface to Hadoop at the SAS Compute server.
  • HDInsight Hadoop cluster with Hive catalog.
  • HDI Hadoop JARs and configuration files at the SAS Compute server.
  • JAVA_HOME path in LD_LIBRARY_PATH at the SAS Compute server (see the environment sketch after this list).
  • Database/table schema created in Hive with the data file location on ADLS2.
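The environment settings in the fourth item usually look like the following. This is a hedged sketch: file locations and the JDK path vary by deployment, so treat both as assumptions.

# For SAS 9.4, these exports typically go in sasenv_local (or an equivalent
# environment file for your deployment). JDK location below is assumed.
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk
# libjvm.so lives under jre/lib/amd64/server for JDK 8; adjust for your JDK
export LD_LIBRARY_PATH=$JAVA_HOME/jre/lib/amd64/server:$LD_LIBRARY_PATH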

 

The SAS Embedded Process (SAS EP) for Hadoop is not supported on the HDInsight Hadoop cluster; hence, the in-database processing components (Code Accelerator, Data Quality Accelerator, etc.) are not supported.

Sample code to read/write Blob Storage data files via HDInsight

The following code connects to HDInsight using the Hadoop LIBNAME engine. The LIBNAME statement uses "geltest" as the schema, whose data folder is located on ADLS2. Data saved through this library creates a subfolder and data files on ADLS2.

 

%let HDINM=hdiutkuma8;

option set = SAS_HADOOP_CONFIG_PATH = "/opt/sas/viya/config/data/HDIHadoopData/conf";
option set = SAS_HADOOP_JAR_PATH = "/opt/sas/viya/config/data/HDIHadoopData/lib";
options sastrace = ',,,sd' sastraceloc = saslog ;


/* Hadoop LIBNAME to the HDI Hive schema "geltest"; new tables are
   created as Parquet files in the schema's ADLS2 data folder */
libname hdilib HADOOP
uri="jdbc:hive2://&HDINM..azurehdinsight.net:443/geltest;ssl=true?hive.server2.transport.mode=http;hive.server2.thrift.http.path=hive2"
server="&HDINM..azurehdinsight.net"
user="admin" pw="XXXXXXX" schema="geltest"
DBCREATE_TABLE_OPTS="stored as parquet" ;

/* Write a SAS table to Hive; the data files land on ADLS2 */
data hdilib.prdsal2;
   set sashelp.prdsal2;
run;
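Reading back through the same library is plain SAS code; with a WHERE clause, the Hadoop engine passes the filtering to Hive implicitly where it can:

proc sql;
   select count(*) as nrows
   from hdilib.prdsal2
   where country='Mexico';   /* filter is pushed to Hive where possible */
quit;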

 

Data File at ADLS2:

 

[Image: uk_SAS_CAS_Aaccess_to_BlobStorage_Via_HDI_5.png]

  

CAS access to ADLS2 data files via HDI

CAS supports a serial mechanism to load and save data to Azure HDInsight Hive tables using the SAS Data Connector to Hadoop. The following diagram describes the data access path from CAS to Blob Storage via HDInsight.

 

[Image: uk_SAS_CAS_Aaccess_to_BlobStorage_Via_HDI_6.png]

 

Pre-requisites

  • SAS Data Connector to Hadoop at the CAS Controller.
  • HDInsight Hadoop cluster with Hive catalog.
  • HDI Hadoop JARs and configuration files at the CAS Controller.
  • JAVA_HOME path in LD_LIBRARY_PATH in the CAS configuration settings.
  • Database/table schema created in Hive with the data file location on ADLS2.

The SAS Embedded Process (SAS EP) for Hadoop is not yet supported on the HDInsight Hadoop cluster; hence, parallel data load to CAS is not supported. There is also a known bug with the multi-node load method: the SAS Data Connector to Hadoop does not work well with certain Hive numeric data columns when creating the split query. Hence, multi-node data load/save to CAS is not supported either.

Sample code to read/write Blob Storage data files via HDInsight

%let HDINM=hdiutkuma8;

/* Start a CAS session and define a CASLIB to the HDI Hive schema "geltest" */
CAS mySession  SESSOPTS=(CASLIB=casuser TIMEOUT=99 LOCALE="en_US" metrics=true);

caslib cashive datasource=(srctype="hadoop",
server="&HDINM..azurehdinsight.net",
username="admin", pwd="XXXXXXX", schema=geltest,
uri="jdbc:hive2://&HDINM..azurehdinsight.net:443/geltest;ssl=true?hive.server2.transport.mode=http;hive.server2.thrift.http.path=hive2",
hadoopconfigdir="/opt/sas/viya/config/data/HDIHadoopData/conf",
hadoopjarpath="/opt/sas/viya/config/data/HDIHadoopData/lib");
/* Without the SAS EP the transfer is serial; dataTransferMode="serial"
   could be added above to make that explicit */

/* Load a SAS table into CAS, then save it as a Hive table
   (the data files land in the schema's ADLS2 folder) */
proc casutil incaslib="cashive" outcaslib="cashive"  ;
load data=sashelp.prdsal3 casout="prdsal3" replace;
save casdata="prdsal3" casout="prdsal3"  replace;
quit;

/* Load the Hive table back into CAS and list the CASLIB tables */
proc casutil incaslib="cashive" outcaslib="cashive";
load casdata="prdsal3" casout="prdsal3_new" replace;
list tables;
quit;
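/* Optional check (a hedged sketch): fetch a few rows of the reloaded
   table to confirm the round trip through Hive and ADLS2 */
proc cas;
   session mySession;
   table.fetch / table={caslib="cashive", name="prdsal3_new"} to=5;
quit;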

CAS mySession  TERMINATE;

 

 

Important Link: Azure HDInsight Hadoop Data Connector

 

Blog: SAS Viya/CAS accessing Azure HDInsight  
