
Access Microsoft Azure Storage & Big Data


The popularity of cloud data storage has grown exponentially over the last decade as more and more organizations transition from on-premises to cloud data storage and data management. Microsoft Azure is one of the big players accelerating the move to the cloud. This article provides an overview of using SAS® to access data in Azure Storage and Big Data.

 

As organizations start to realize the impact of digital transformation, they are moving their storage to the cloud along with their computing. Like cloud compute, cloud data storage is elastic and responds to demand, and you pay only for what you use. Organizations must weigh data storage options, choose an efficient cloud platform and services, and plan the migration of SAS applications to the cloud.

 

Organizations can connect from the SAS platform to the various Azure data storage offerings. Whether it is Azure Data Lake Storage or the HDInsight big data service, SAS/ACCESS engines and SAS data connectors cover them with optimized data handling capabilities to support organizations on their digital transformation journey.

 

SAS AND AZURE DATA LAKE STORAGE

SAS Viya can read and write ORC and CSV data files in Azure Data Lake Storage Generation 2 (ADLS2). A new data connector, the SAS ORC Data Connector, facilitates data transfer between CAS and ADLS2: it enables you to load data from an Apache Optimized Row Columnar (ORC) file into CAS. This data connector can be used with either a path-based CASLIB or an Azure ADLS2 CASLIB, as sketched below.
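
For the path-based case, here is a minimal sketch, assuming a CAS session is already active; the directory /data/orc and the file class.orc are hypothetical and must be visible to the CAS server:

/* path-based caslib pointing at a directory of ORC files */
caslib mypath datasource=(srctype="path") path="/data/orc";

proc casutil;
   /* load a hypothetical ORC file into CAS memory */
   load casdata="class.orc" incaslib="mypath" casout="class" replace;
quit;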

 

SAS LIBRARY TO ADLS
/* create a CAS session */
cas casauto;

/* create a caslib named "adls" that points to ADLS2 */
caslib adls datasource=(
      srctype="adls"
      accountname="sasdemo"
      filesystem="data"
      dnsSuffix="dfs.core.windows.net"
      timeout=50000
      tenantid="<Azure Application Tenant ID UUID>"
      applicationId="<Azure registered application UUID>"
   )
   path="/" subdirs global
   libref=AzureDL;   /* creates a library reference for SAS compute */

/* assign librefs for all active caslibs */
caslib _all_ assign;

Here is an explanation of the parameters that are used to create a caslib:

  • CASLIB - A library reference for CAS. The caslib is the holding area for the specified data source; here it is named adls (matching the OUTCASLIB= reference in the PROC CASUTIL example that follows) and specifies the ADLS data source.

  • SRCTYPE - The source type; "adls" specifies an Azure Data Lake Storage connection.

  • ACCOUNTNAME - The Azure Data Lake Storage account name.

  • FILESYSTEM - The ADLS container (file system) name.

  • TENANTID & APPLICATIONID - Available from the Azure registered application page for your organization or individual use.

  • PATH - The directory within the file system where data is read and written.

  • LIBREF - Creates a SAS library reference along with the CAS library.

SAS Viya uses the available CAS session CASAUTO to create the CAS library reference to ADLS. This example uses a clustered CAS environment with 3 nodes, including 2 worker nodes. The GLOBAL option makes the caslib available to all users. ORC or CSV files can be loaded from the ADLS container file system "data" into the CAS in-memory cluster, or saved to ADLS from CAS.
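
As a quick check that the caslib is working, a minimal sketch that uses PROC CASUTIL to list the files visible through it:

proc casutil;
   /* list the ORC and CSV files available in the ADLS caslib */
   list files incaslib="adls";
quit;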

 

LOAD AND SAVE ORC DATA TO AZURE STORAGE FROM SAS VIYA

Let's look at an example that loads a SAS data set into the CAS in-memory server. Once the data is in CAS, it can be used for distributed data processing, reporting, analytics, or modeling. The final CAS in-memory output is then saved as an ORC file to Azure Data Lake Storage.

proc casutil;
   /* load a SAS data set into CAS memory in the adls caslib */
   load data=sashelp.class casout="class" outcaslib="adls" replace;
   /* save the CAS table back to ADLS as an ORC file */
   save casdata="class" incaslib="adls" casout="class.orc" outcaslib="adls" replace;
quit;

Important parameters:

  • PROC CASUTIL - The CASUTIL procedure works with CAS tables, covering data transfer, table and file information, and dropping or deleting tables.

  • LOAD - Loads data into the CAS in-memory server.

  • DATA - The traditional SAS data set used as input.

  • CASOUT - The output CAS in-memory distributed table.

  • OUTCASLIB - The output CAS library reference.

  • SAVE - Saves a CAS in-memory table to ADLS.

  • CASDATA - The input CAS table.

  • CASOUT - The ORC output file saved to ADLS.
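
The round trip can be completed by loading the saved ORC file from ADLS back into CAS. A minimal sketch, assuming the caslib and file names from the example above:

proc casutil;
   /* load the ORC file saved earlier back into CAS memory */
   load casdata="class.orc" incaslib="adls" casout="class_from_adls" outcaslib="adls" replace;
quit;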

 

SAS AND AZURE HDINSIGHT HADOOP DATA LAKE

Traditionally, the Hadoop platform has been deployed to meet an organization's data lake business needs on premises. With ever more incoming data, and with analytical applications such as SAS Viya being deployed in the cloud, organizations are moving some of their data lake needs to the cloud along with the data. Azure HDInsight is one such Hadoop service: it provides a cloud-native Hadoop platform with the elasticity to scale the data lake up or down. The SAS platform can leverage the Azure HDInsight Hadoop service to access, load, or save data using the SAS/ACCESS Interface to Hadoop engine.

 

LIBNAME STATEMENT TO CONNECT AND ACCESS DATA
/* point SAS to the HDInsight configuration and JAR files */
option set = SAS_HADOOP_CONFIG_PATH = "/opt/hdinsight/conf";
option set = SAS_HADOOP_JAR_PATH = "/opt/hdinsight/jars";

/* optional: trace data access activity to the SAS log */
options sastrace=',,,sd' sastraceloc=saslog nostsuffix fullstimer;

libname hdilib HADOOP
   uri='jdbc:hive2://sgfdemo.azurehdinsight.net:443/default;ssl=true?hive.server2.transport.mode=http;hive.server2.thrift.http.path=hive2'
   server='sgfdemo.azurehdinsight.net'
   user="admin"
   pw="demopassword"
   schema="default";

As a one-time configuration step, before creating a SAS library reference and connecting to an Azure HDInsight cluster, the Hadoop administrator runs the HadoopTracer Python script, available from SAS, to extract all the necessary JAR and configuration files. The SAS administrator then points SAS at these extracted files by setting two option variables, as shown above. This step is performed regardless of the type of Hadoop distribution or cloud platform.

 

Some of the important parameters:

  • SAS_HADOOP_CONFIG_PATH - Specifies the directory path for the Azure HDInsight cluster configuration files.

  • SAS_HADOOP_JAR_PATH - Specifies the directory path for the Azure HDInsight JAR files.

  • URI - The Azure HDInsight JDBC URI used to connect to Hive Server 2. Retrieve a similar JDBC URI from the Azure HDInsight documentation and modify only the HDInsight server name.

The JDBC URI contains several necessary parameters that are enabled and assigned values by default: SSL is set to true, and the transport mode is set to HTTP. Data can be loaded from an Azure HDInsight cluster to the SAS platform or saved back to the cloud. SAS procedures such as PROC APPEND, SORT, SUMMARY, MEANS, RANK, FREQ, and TRANSPOSE are supported against an Azure HDInsight cluster. Furthermore, DATA step and PROC SQL data preparation with bulk load are handled to save data efficiently to the cloud.
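
As an illustration, here is a minimal PROC SQL sketch against the hdilib library defined above; the Hive table stocks and its columns are hypothetical:

proc sql;
   /* the aggregation is passed to Hive where possible; results land in WORK */
   create table work.avg_close as
   select stock, avg(close) as avg_close
   from hdilib.stocks
   group by stock;
quit;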

 

SAS can access data directly from Azure Storage and Big Data, or through other Azure intermediary data sources that use ADLS2 or HDInsight as the persistent storage layer. As end users migrate from on premises to the cloud, they should also assess the data storage landscape for self-service data management rather than simply lifting and shifting to a similar data source in the cloud. The variety of data types (structured and unstructured) and the need for faster data access are some of the key factors in choosing between cloud blob and big data storage options.
