
Access Microsoft Azure Storage & Big Data


The popularity of cloud data storage has grown exponentially over the last decade as more and more organizations transition from on-premises to cloud data storage and data management. Microsoft Azure is one of the big players accelerating the move to the cloud. This article provides an overview of using SAS® to access data in Azure Storage and Big Data.

 

As organizations start to realize the impact of digital transformation, they are moving their storage to the cloud along with their computing. Like cloud compute, cloud data storage is elastic and responds to demand, and you pay only for what you use. Organizations must weigh data storage options, choose an efficient cloud platform and services, and plan the migration of SAS applications to the cloud.

 

Organizations can connect from the SAS platform to the various Azure data storage offerings. Whether it is Azure Data Lake Storage or the HDInsight big data service, SAS/ACCESS engines and SAS data connectors cover them with optimized data handling capabilities to support organizations on their digital transformation journey.

 

SAS AND AZURE DATA LAKE STORAGE

SAS Viya can read and write ORC and CSV data files in Azure Data Lake Storage Generation 2 (ADLS2). A new data connector, the SAS ORC Data Connector, facilitates data transfer between CAS and ADLS2: it enables you to load data from an Apache Optimized Row Columnar (ORC) file into CAS. This data connector can be used with either a path-based CASLIB or an Azure ADLS2 CASLIB, as sketched below.
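
For the path-based case, here is a minimal sketch, assuming a CAS session is already active; the directory /data/orc and the file class.orc are hypothetical and must be visible to the CAS server:

/* path-based caslib pointing at a directory of ORC files */
caslib mypath datasource=(srctype="path") path="/data/orc";

proc casutil;
   /* load a hypothetical ORC file into CAS memory */
   load casdata="class.orc" incaslib="mypath" casout="class" replace;
quit;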

 

SAS LIBRARY TO ADLS
/* create a CAS session */
cas casauto;

/* create a caslib named "adls" that points to ADLS2 */
caslib adls datasource=(
      srctype="adls"
      accountname="sasdemo"
      filesystem="data"
      dnsSuffix="dfs.core.windows.net"
      timeout=50000
      tenantid="<Azure Application Tenant ID UUID>"
      applicationId="<Azure registered application UUID>"
   )
   path="/" subdirs global
   libref=AzureDL;   /* creates a library reference for SAS compute */

/* assign librefs for all active caslibs */
caslib _all_ assign;

Here is an explanation of the parameters that are used to create a caslib:

  • CASLIB - A library reference for CAS. The caslib is the holding area for the specified data source; here it is named adls (matching the OUTCASLIB= reference in the PROC CASUTIL example that follows) and specifies the ADLS data source.

  • SRCTYPE - The source type; "adls" specifies an Azure Data Lake Storage connection.

  • ACCOUNTNAME - The Azure Data Lake Storage account name.

  • FILESYSTEM - The ADLS container (file system) name.

  • TENANTID & APPLICATIONID - Available from the Azure registered application page for your organization or individual use.

  • PATH - The directory within the file system where data is read and written.

  • LIBREF - Creates a SAS library reference along with the CAS library.

SAS Viya uses the available CAS session CASAUTO to create the CAS library reference to ADLS. This example uses a clustered CAS environment with 3 nodes, including 2 worker nodes. The GLOBAL option makes the caslib available to all users. ORC or CSV files can be loaded from the ADLS container file system "data" into the CAS in-memory cluster, or saved to ADLS from CAS.
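
As a quick check that the caslib is working, a minimal sketch that uses PROC CASUTIL to list the files visible through it:

proc casutil;
   /* list the ORC and CSV files available in the ADLS caslib */
   list files incaslib="adls";
quit;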

 

LOAD AND SAVE ORC DATA TO AZURE STORAGE FROM SAS VIYA

Let's look at an example that loads a SAS data set into the CAS in-memory server. Once the data is in CAS, it can be used for distributed data processing, reporting, analytics, or modeling. The final CAS in-memory output is then saved as an ORC file to Azure Data Lake Storage.

proc casutil;
   /* load a SAS data set into CAS memory in the adls caslib */
   load data=sashelp.class casout="class" outcaslib="adls" replace;
   /* save the CAS table back to ADLS as an ORC file */
   save casdata="class" incaslib="adls" casout="class.orc" outcaslib="adls" replace;
quit;

Important parameters:

  • PROC CASUTIL - The CASUTIL procedure works with CAS tables, covering data transfer, table and file information, and dropping or deleting tables.

  • LOAD - Loads data into the CAS in-memory server.

  • DATA - The traditional SAS data set used as input.

  • CASOUT - The output CAS in-memory distributed table.

  • OUTCASLIB - The output CAS library reference.

  • SAVE - Saves a CAS in-memory table to ADLS.

  • CASDATA - The input CAS table.

  • CASOUT - The ORC output file saved to ADLS.
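
The round trip can be completed by loading the saved ORC file from ADLS back into CAS. A minimal sketch, assuming the caslib and file names from the example above:

proc casutil;
   /* load the ORC file saved earlier back into CAS memory */
   load casdata="class.orc" incaslib="adls" casout="class_from_adls" outcaslib="adls" replace;
quit;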

 

SAS AND AZURE HDINSIGHT HADOOP DATA LAKE

Traditionally, the Hadoop platform has been deployed to meet an organization's data lake business needs on premises. With ever more incoming data, and with analytical applications such as SAS Viya being deployed in the cloud, organizations are moving some of their data lake needs to the cloud along with the data. Azure HDInsight is one such Hadoop service: it provides a cloud-native Hadoop platform with the elasticity to scale the data lake up or down. The SAS platform can leverage the Azure HDInsight Hadoop service to access, load, or save data using the SAS/ACCESS Interface to Hadoop engine.

 

LIBNAME STATEMENT TO CONNECT AND ACCESS DATA
/* point SAS to the HDInsight configuration and JAR files */
option set = SAS_HADOOP_CONFIG_PATH = "/opt/hdinsight/conf";
option set = SAS_HADOOP_JAR_PATH = "/opt/hdinsight/jars";

/* optional: trace data access activity to the SAS log */
options sastrace=',,,sd' sastraceloc=saslog nostsuffix fullstimer;

libname hdilib HADOOP
   uri='jdbc:hive2://sgfdemo.azurehdinsight.net:443/default;ssl=true?hive.server2.transport.mode=http;hive.server2.thrift.http.path=hive2'
   server='sgfdemo.azurehdinsight.net'
   user="admin"
   pw="demopassword"
   schema="default";

As a one-time configuration step, before creating a SAS library reference and connecting to an Azure HDInsight cluster, the Hadoop administrator runs the HadoopTracer Python script, available from SAS, to extract all the necessary JAR and configuration files. The SAS administrator then points SAS at these extracted files by setting two option variables, as shown above. This step is performed regardless of the type of Hadoop distribution or cloud platform.

 

Some of the important parameters:

  • SAS_HADOOP_CONFIG_PATH - Specifies the directory path for the Azure HDInsight cluster configuration files.

  • SAS_HADOOP_JAR_PATH - Specifies the directory path for the Azure HDInsight JAR files.

  • URI - The Azure HDInsight JDBC URI used to connect to Hive Server 2. Retrieve a similar JDBC URI from the Azure HDInsight documentation and modify only the HDInsight server name.

The JDBC URI contains several necessary parameters that are enabled and assigned values by default: SSL is set to true, and the transport mode is set to HTTP. Data can be loaded from an Azure HDInsight cluster to the SAS platform or saved back to the cloud. SAS procedures such as PROC APPEND, SORT, SUMMARY, MEANS, RANK, FREQ, and TRANSPOSE are supported against an Azure HDInsight cluster. Furthermore, DATA step and PROC SQL data preparation with bulk load are handled to save data efficiently to the cloud.
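
As an illustration, here is a minimal PROC SQL sketch against the hdilib library defined above; the Hive table stocks and its columns are hypothetical:

proc sql;
   /* the aggregation is passed to Hive where possible; results land in WORK */
   create table work.avg_close as
   select stock, avg(close) as avg_close
   from hdilib.stocks
   group by stock;
quit;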

 

SAS can access data directly from Azure Storage and Big Data, or through other Azure intermediary data sources that use ADLS2 or HDInsight as the persistent storage layer. As end users migrate from on premises to the cloud, they should also assess the data storage landscape for self-service data management rather than simply lifting and shifting to a similar data source in the cloud. The variety of data types (structured and unstructured) and the need for faster data access are some of the key factors in choosing between cloud blob and big data storage options.
