With the SAS Viya3.5 release, the SAS Viya programming environment (SAS 9.4) can read and write ORC data files to and from Azure Data Lake Storage 2 (ADLS2). We have a new Base SAS ORC engine to support data transfer between SAS 9.4 and ADLS2.
This blog talks about various components involved for SAS to access orc files from Azure ADLS2.
Apache ORC (Optimized Row Columnar) is an open-source column-oriented data storage file format of the Apache Hadoop ecosystem. In column-based storage, you quickly access only the columns that you need to query, which can be highly efficient for analyzing big data.
The following diagram describes the data access components and data path from Base SAS 9.4 to ADLS2.
Select any image to see a larger version.
Mobile users: If you do not see this image, scroll to the bottom of the page and select the "Full" version of this post.
Before you can access the ADLS2 storage, you need to create a new or use existing Azure Application and Storage Account. The Azure application and Storage Account require appropriate configurations to access ADLS2 storage. The detailed steps with screenshots are discussed in the blog post (CAS accessing ADLS2) under “Azure configuration to access ADSL2” section.
To access the Azure ADLS2 data file from SAS, you need the following information to execute an ORC LIBNAME statement.
An ORC LIBNAME uses device code authentication with Azure. Very first-time execution of ORC LIBNAME statement, it generates a misleading error message. It could not find the Azure access token at SAS compute server for SAS user. It created a .sasadls<>.json file under SAS user’s home directory with instructions to register the device and obtain the Azure Access token. The .sasadls<>.json file created for each SAS user. As part of the instruction, you need to login to https://microsoft.com/devicelogin Microsoft page and validate the listed device. Note: The default location for the Azure authorization key file is the home directory of the SAS user (e.g /home/viyademo01/.sasadls_.jason). Error from very first-time execution of ORC LIBNAME statement:
90 ;
ERROR: Invalid physical name for library ORCLIB.
ERROR: Error in the LIBNAME statement.
91
After you log on to listed Microsoft page a series of windows provide further instruction to authorize the device. The subsequent attempts to access the data succeed and generates an Azure authorization key file at SAS compute server under the user’s home directory.
[viyademo01@intviya01 ~]$ ls -l /home/viyademo01/.sasadls* -rw-------. 1 viyademo01 sasusers 5718 Feb 13 11:52 /home/viyademo01/.sasadls_100001.json [viyademo01@intviya01 ~]$
The following code describes the data save and read from ADLS2 Blob storage to SAS 9.4. In step-1, it executes the ORC LIBAME statement. In step-2, it saves SAS data sets to ADLS2 storage in ORC data file format. In step-3, it reads saved ORC file from ADLS2 to SAS. Code:
/* Step-1 */ libname orclib ORC "/sample_data" storage_account_name = "utkuma5adls2strg" storage_file_system = "fsutkuma5adls2strg" storage_dns_suffix = "dfs.core.windows.net" storage_application_id="XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX" storage_tenant_id="XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX" DIRECTORIES_AS_DATA=YES FILE_NAME_EXTENSION=(orc ORC) ; /* Step-2 */ data orclib.fish_orc; set sashelp.fish; run; /* Step-3 */ Proc SQL ; select * from orclib.fish_orc ; run; quit;
Log extract:
………… ………… 82 data orclib.fish_orc ; 83 set sashelp.fish ; 84 run; NOTE: There were 159 observations read from the data set SASHELP.FISH. NOTE: The data set ORCLIB.fish_orc has 159 observations and 7 variables. NOTE: DATA statement used (Total process time): real time 0.90 seconds cpu time 0.11 seconds 85 …. ……………
The following screenshot describes the ORC data file saved to Azure ADLS2 by executing the above statement.
The following screenshot describes the ORC data file read from Azure ADLS2 by executing the above statement.
The SAS ORC engine supports reading n-number of orc files from an ADLS2 sub-folder. The LIBNAME definition support “DIRECTORIES_AS_DATA=YES” to enable read the sub-folder name as one table. The following screenshot describes more than one orc file in an ADLS2 sub-folder. These data files can be read at the sub-folder level and all data will end-up in a single SAS table.
The following code describes multiple orc file read from an ADLS2 sub-folder into a SAS table. The LIBNAME statement using “DIRECTORIES_AS_DATA=YES” to enable the read. Code:
/* Step-1 */ libname orclib ORC "/sample_data" storage_account_name = "utkuma5adls2strg" storage_file_system = "fsutkuma5adls2strg" storage_dns_suffix = "dfs.core.windows.net" storage_application_id="XXXXXXXXXXXXXXXXXXXXXXXXX-XXXXXXXX" storage_tenant_id="XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX" DIRECTORIES_AS_DATA=YES FILE_NAME_EXTENSION=(orc ORC) ; /* Step-2 */ Proc SQL ; select * from orclib.fish_n ; run; quit;
The SAS ORC engine supports default ZLIB data compression while saving data to ADLS2. SAS can read compression types SNAPPY, LZO, LZ4, and ZSTD. However, these types are not supported for Write access.
There are some restrictions for SAS features and ORC file features. Few of them are listed here, for a more comprehensive list see the documentation.
Important Link: Base SAS ORC LIBNAME Engine Related Blog: SAS Viya 3.5 – CAS accessing Azure Data Lake files
Registration is open! SAS is returning to Vegas for an AI and analytics experience like no other! Whether you're an executive, manager, end user or SAS partner, SAS Innovate is designed for everyone on your team. Register for just $495 by 12/31/2023.
If you are interested in speaking, there is still time to submit a session idea. More details are posted on the website.
Data Literacy is for all, even absolute beginners. Jump on board with this free e-learning and boost your career prospects.