BookmarkSubscribeRSS Feed

SAS Viya 3.5 – SAS ORC LIBNAME engine to ADLS2

Started ‎06-05-2020 by
Modified ‎06-05-2020 by
Views 3,074

With the SAS Viya3.5 release, the SAS Viya programming environment (SAS 9.4) can read and write ORC data files to and from Azure Data Lake Storage 2 (ADLS2). We have a new Base SAS ORC engine to support data transfer between SAS 9.4 and ADLS2.

 

This blog talks about various components involved for SAS to access orc files from Azure ADLS2.  

 

ORC data file

Apache ORC (Optimized Row Columnar) is an open-source column-oriented data storage file format of the Apache Hadoop ecosystem. In column-based storage, you quickly access only the columns that you need to query, which can be highly efficient for analyzing big data.  

 

Storage system supported by SAS ORC engine

  • Azure Data Lake Storage 2 (ADLS2)
  • Linux for x64 supported by SAS

 

Pre-requisites

  • SAS ORC Engine at SAS Compute Server. (default install)
  • User access to Azure Storage Account with Storage Blob Data Contributor role.
  • Azure application with access permission to Azure Data lake and Azure Storage.

 

Data Path diagram

The following diagram describes the data access components and data path from Base SAS 9.4 to ADLS2.

 

SAS_Aaccess_to_ADLS2_1.png

Select any image to see a larger version.
Mobile users: If you do not see this image, scroll to the bottom of the page and select the "Full" version of this post.

 

 

Azure Configurations to access ADLS2

Before you can access the ADLS2 storage, you need to create a new or use existing Azure Application and Storage Account. The Azure application and Storage Account require appropriate configurations to access ADLS2 storage. The detailed steps with screenshots are discussed in the blog post (CAS accessing ADLS2) under “Azure configuration to access ADSL2” section.  

 

Azure information required to access ADLS2 data from CAS

To access the Azure ADLS2 data file from SAS, you need the following information to execute an ORC LIBNAME statement.  

    storage_account_name= “utkuma5adls2strg” storage_file_system="fsutkuma5adls2strg" storage_tenant_id="XXXXXXXXXXXXXXXXXXXXXX" storage_application_id="XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"

 

SAS ORC LIBNAME Authentication

An ORC LIBNAME uses device code authentication with Azure. Very first-time execution of ORC LIBNAME statement, it generates a misleading error message. It could not find the Azure access token at SAS compute server for SAS user. It created a .sasadls<>.json file under SAS user’s home directory with instructions to register the device and obtain the Azure Access token. The .sasadls<>.json file created for each SAS user. As part of the instruction, you need to login to https://microsoft.com/devicelogin Microsoft page and validate the listed device. Note: The default location for the Azure authorization key file is the home directory of the SAS user (e.g /home/viyademo01/.sasadls_.jason).   Error from very first-time execution of ORC LIBNAME statement:


90   ;
ERROR: Invalid physical name for library ORCLIB.
ERROR: Error in the LIBNAME statement.
91
 

 

After you log on to listed Microsoft page a series of windows provide further instruction to authorize the device. The subsequent attempts to access the data succeed and generates an Azure authorization key file at SAS compute server under the user’s home directory.  

 

[viyademo01@intviya01 ~]$ ls -l /home/viyademo01/.sasadls*
-rw-------. 1 viyademo01 sasusers 5718 Feb 13 11:52 /home/viyademo01/.sasadls_100001.json
[viyademo01@intviya01 ~]$

 

Data read/write from SAS to ADLS2

The following code describes the data save and read from ADLS2 Blob storage to SAS 9.4. In step-1, it executes the ORC LIBAME statement. In step-2, it saves SAS data sets to ADLS2 storage in ORC data file format. In step-3, it reads saved ORC file from ADLS2 to SAS.   Code:

 

/* Step-1 */
libname orclib ORC "/sample_data"
      storage_account_name = "utkuma5adls2strg"
      storage_file_system = "fsutkuma5adls2strg"
      storage_dns_suffix = "dfs.core.windows.net"
      storage_application_id="XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
      storage_tenant_id="XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
      DIRECTORIES_AS_DATA=YES
      FILE_NAME_EXTENSION=(orc ORC)
;

/* Step-2 */
data orclib.fish_orc;
   set sashelp.fish;
run;

/* Step-3 */
Proc SQL ;
select * from orclib.fish_orc ;
run;
quit;

 

Log extract:

 

…………
…………
82   data orclib.fish_orc ;
83   set sashelp.fish ;
84   run;
NOTE: There were 159 observations read from the data set SASHELP.FISH.
NOTE: The data set ORCLIB.fish_orc has 159 observations and 7 variables.
NOTE: DATA statement used (Total process time):
      real time           0.90 seconds
      cpu time            0.11 seconds

85
….
……………
        

 

The following screenshot describes the ORC data file saved to Azure ADLS2 by executing the above statement.

 

SAS_Aaccess_to_ADLS2_2.png

 

 

The following screenshot describes the ORC data file read from Azure ADLS2 by executing the above statement.

 

SAS_Aaccess_to_ADLS2_3.png

 

 

SAS reading orc files from an ADLS2 sub-folder

The SAS ORC engine supports reading n-number of orc files from an ADLS2 sub-folder. The LIBNAME definition support “DIRECTORIES_AS_DATA=YES” to enable read the sub-folder name as one table. The following screenshot describes more than one orc file in an ADLS2 sub-folder. These data files can be read at the sub-folder level and all data will end-up in a single SAS table.

 

SAS_Aaccess_to_ADLS2_4.png

 

The following code describes multiple orc file read from an ADLS2 sub-folder into a SAS table. The LIBNAME statement using “DIRECTORIES_AS_DATA=YES” to enable the read.   Code:

 

/* Step-1 */
libname orclib ORC "/sample_data"
      storage_account_name = "utkuma5adls2strg"
      storage_file_system = "fsutkuma5adls2strg"
      storage_dns_suffix = "dfs.core.windows.net"
      storage_application_id="XXXXXXXXXXXXXXXXXXXXXXXXX-XXXXXXXX"
      storage_tenant_id="XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
      DIRECTORIES_AS_DATA=YES 
      FILE_NAME_EXTENSION=(orc ORC)
;

/* Step-2 */
Proc SQL ;
select * from orclib.fish_n ; 
run;
quit;

 

Data Compression

The SAS ORC engine supports default ZLIB data compression while saving data to ADLS2. SAS can read compression types SNAPPY, LZO, LZ4, and ZSTD. However, these types are not supported for Write access.

    COMPRESS=NO | ZLIB

 

Restrictions

There are some restrictions for SAS features and ORC file features. Few of them are listed here, for a more comprehensive list see the documentation.

  • Data update to orc data file/s is not supported.
  • SAS Indexes, encryption, DS2 or Fed SQL procedures are not supported.
  • Compound data types, and Hive data partitioning are not supported.

    Important Link: Base SAS ORC LIBNAME Engine Related Blog: SAS Viya 3.5 – CAS accessing Azure Data Lake files    

Version history
Last update:
‎06-05-2020 10:04 AM
Updated by:
Contributors

SAS INNOVATE 2024

Innovate_SAS_Blue.png

Registration is open! SAS is returning to Vegas for an AI and analytics experience like no other! Whether you're an executive, manager, end user or SAS partner, SAS Innovate is designed for everyone on your team. Register for just $495 by 12/31/2023.

If you are interested in speaking, there is still time to submit a session idea. More details are posted on the website. 

Register now!

Free course: Data Literacy Essentials

Data Literacy is for all, even absolute beginners. Jump on board with this free e-learning  and boost your career prospects.

Get Started

Article Tags