
SAS Viya 3.5: CAS accessing Azure Data Lake files

Started ‎03-26-2020 by
Modified ‎03-17-2022 by

With the SAS Viya 3.5 release, CAS can read and write ORC and CSV data files to Azure Data Lake Storage Gen2 (ADLS2). A new data connector, the "SAS ORC Data Connector", facilitates data transfer between CAS and ADLS2.

 

The SAS ORC Data Connector enables you to load data from an Apache Optimized Row Columnar (ORC) table into CAS. This data connector can be used with a PATH or Azure ADLS CASLIB. This post describes the various components involved in accessing Azure ADLS2 data files from CAS.

 

CAS-supported data file types at ADLS2

  • ORC
  • CSV 

 

Prerequisites

  • SAS ORC Data Connector installed on the CAS nodes.
  • User access to an Azure Storage Account with the Storage Blob Data Contributor role.
  • An Azure application with access permission to Azure Data Lake and Azure Storage.

 

CAS-supported data access methods for ADLS2

CAS supports a mixed-mode approach to saving and loading ORC and CSV files to and from ADLS2. For ORC data files, both save and load use the serial method. For CSV data files, save uses the serial method, while load uses the parallel method. To save or load an ORC data file, only the CAS controller requires an Azure access token file. To load a CSV file into CAS, the CAS controller and all nodes require the token file; to save a CAS table as a CSV file to ADLS2, only the CAS controller requires it. The following pictures describe ORC and CSV data file access from CAS to ADLS2.

CAS supports the serial data load/save method for ADLS2 ORC files:

CAS_Aaccess_to_ADLS2_1.png


 

CAS supports the serial data save and the parallel data load for ADLS2 CSV files:

 

CAS_Aaccess_to_ADLS2_2.png

 

 

 

Azure Configurations to access ADLS2 from CAS

Before you can access ADLS2 storage, you need to create a new Azure application and Storage Account or use existing ones. The Azure application and Storage Account require appropriate configuration to access ADLS2 storage. The following options and screenshots describe the required configuration.

 

Azure Application Authentication

The Azure application used by CAS users to access ADLS2 data requires platform configuration with the following options:

 

      • Advanced Settings
      • Default client type: Treat application as a public client = Yes

CAS_Aaccess_to_ADLS2_5.png

Azure Application API Permission

The Azure application used by CAS users to access ADLS2 data requires API permissions to access Azure Data Lake and Azure Storage. These permissions are managed and maintained by the Azure cloud administrator (tenant administrator). Ask your Azure cloud administrator to grant these permissions for the registered Azure application.

CAS_Aaccess_to_ADLS2_6.png

 

Azure Storage Account configuration

There must be an Azure Storage Account (ADLS2) with the "Contributor" and "Storage Blob Data Contributor" roles assigned to the required users. The following screenshot shows an Azure Storage Account (ADLS2) named "utkuma1" with the Storage Blob Data Contributor role assigned to users. The role can be added just for the immediate resource or can be inherited from the user's Azure subscription/application role.

CAS_Aaccess_to_ADLS2_7.png

A file system in the Storage Account container

You need a file system created under the Storage Account container to store the data files. The container is massively scalable data lake storage. The following screenshot shows a file system under a Storage Account with a data file stored in a folder.

CAS_Aaccess_to_ADLS2_8.png

Azure information required to access ADLS2 data from CAS

To access an ADLS2 data file from CAS, you need the following information to create an ADLS CASLIB.

 

        accountname="utkuma1"
        filesystem="fsutkuma"
        tenantid="XXXXXXXXXXXXXXXXXXXXXX"
        applicationId="XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"

 

The Azure Storage Account name is the value for the accountname= parameter, and the file system name under the Storage Account container is the value for the filesystem= parameter. The Tenant ID and Application ID are available on your Azure application's overview page, which you can reach by using the following path.

 

Azure Services -> Azure Active Directory -> App Registration -> ( Select your application) -> Overview

CAS_Aaccess_to_ADLS2_9.png

 

ADLS CASLIB Authentication

An ADLS CASLIB uses device code authentication with Azure. The very first access to an ADLS CASLIB generates an error message containing a device code. You must log in to the Microsoft page at https://microsoft.com/devicelogin and enter the listed code to validate the device.

 

Note: The default location for the Azure authorization key file is the home directory of the user "cas" (/home/cas/.sasadls_.json).

Error from the very first access to an ADLS CASLIB:

 

ERROR: To sign in, use a web browser to open the page https://microsoft.com/devicelogin and enter the code XXXXX to authenticate.
ERROR: Pending end-user authorization.
ERROR: The action stopped due to errors.
 

 

After you log on to the listed Microsoft page, a series of windows provides further instructions to authorize the device. Subsequent attempts to access the data succeed and generate an Azure authorization key file, which is saved on the CAS controller server. The home directory of the user "cas" is the default location for the authorization key file, named ".sasadls_.json". The location of the Azure authorization key can be customized by including cas.AZUREAUTHCACHELOC="path" in the casconfig_usermod.lua configuration file. The specified path must be accessible to the user "cas".
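As a sketch, the override could look like the following entry in casconfig_usermod.lua (the directory shown is hypothetical; use a path on the CAS controller that the "cas" user can read and write):

```lua
-- Hedged sketch of a casconfig_usermod.lua entry overriding the
-- Azure authorization key cache location. The directory below is
-- hypothetical; it must exist and be accessible to the "cas" user.
cas.azureauthcacheloc = '/opt/sas/viya/config/data/cas/azureauth'
```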

 

[cas@intcas01 ~]$ ls -l /home/cas/.sasadls*
-rw-------. 1 cas sas 5718 Feb 13 11:52 /home/cas/.sasadls_2001.json
[cas@intcas01 ~]$

 

In a multi-node CAS environment, loading a CSV file into CAS requires you to authenticate a device from every CAS node (Linux server). You can view the /home/cas/.sasadls_.json file on each CAS node to see that node's device code, and use the same Microsoft page to authenticate each one. Once every device is authenticated, the subsequent CAS load statement succeeds, and the Azure authorization key file is saved to the default folder on each CAS node. The following example shows the device code from the .sasadls_.json file when the very first CSV data file load statement executed. The file contains the device code just before the authorization key is received; a file with a device code is located on each CAS node (1-5), and the authorization key overwrites the same file.

 

[cas@intcas01 ~]$  more .sasadls_2001.json
{"refresh_token":"","device_code":"AAQABAAEAAABeAFzDwllzTYGDLh_qYbH8ZUUcJrv7qCsAYFeXjEb29MsbYkBHzTSRa6v-DLNaOwtCDt9VvRqSXPNMuKsTxYF9fGLIxhsFrIjZL0gAA","message":
"To sign in, use a web browser to open the page https://microsoft.com/devicelogin and enter the code AEGEK2MB4 to authenticate.
","oauth":"","resource":""}

[cas@intcas01 ~]$ ssh sascas02
[cas@intcas02 ~]$ more .sasadls_2001.json
{"refresh_token":"","device_code":"AAQABAAEAAABeAFzDwllzTYGDLh_qYbH8bY1nIfYYv05NmZxCkKXIooCgXKGyJFAozb3uuxWL0HYKGduBxYqUcBaziYgAA","message":
"To sign in, use a web browser to open the page https://microsoft.com/devicelogin and enter the code ARVADUQW3 to authenticate.
","oauth":"","resource":""}
[cas@intcas02 ~]$ exit
logout
Connection to sascas02 closed.
…..
…………….
…………….......
[cas@intcas01 ~]$ ssh sascas05
Last login: Thu Feb 13 12:54:06 2020 from sascas01.race.sas.com
[cas@intcas05 ~]$ more .sasadls_2001.json
{"refresh_token":"","device_code":"AAQABAAEAAABeAFzDwllzTYGDLh_qYbH8sxwfsLW_gkKhBchSZL_bLYIEN4Hwave8VyYQ43jJVWXrgCWNE7imCiFajhhyr6wpggAA","message":
"To sign in, use a web browser to open the page https://microsoft.com/devicelogin and enter the code A7G77QXMK to authenticate.
","oauth":"","resource":""}

 

Data load from ADLS2 to CAS

The following code demonstrates saving and loading data between CAS and ADLS2 storage. In a CAS session, it creates an "adls" source-type CASLIB (session level). In step 1, it saves CAS tables to ADLS2 storage in ORC and CSV data file formats. In step 2, it loads the ORC and CSV data files from ADLS2 storage into CAS.

 

Code:

CAS mySession  SESSOPTS=(CASLIB=casuser TIMEOUT=99 LOCALE="en_US" metrics=true);

/*  CASLIB with adls storage */ 
 caslib ADLS2 datasource=(
      srctype="adls"
      accountname="utkuma1"
      filesystem="fsutkuma"
      dnsSuffix=dfs.core.windows.net
      timeout=50000
      tenantid="xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxxx" 
      applicationId="xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxxx"
   )
   path="/testdata"
   subdirs; 

proc casutil incaslib="ADLS2";
   list files ;    
run;
quit; 

/* Step-1 */
proc casutil  incaslib="DM"  outcaslib="ADLS2";
   save casdata="DM_FACT_MEGA_CORP" casout="DM_FACT_MEGA_CORP_1G.orc" replace ;
   save casdata="DM_FACT_MEGA_CORP" casout="DM_FACT_MEGA_CORP_1G.csv" replace ;
run;
quit;

/* Step-2 */
proc casutil incaslib="ADLS2"  outcaslib="ADLS2";
   load casdata="DM_FACT_MEGA_CORP_1G.orc" casout="DM_FACT_MEGA_CORP_1G_ORC" replace ;
   load casdata="DM_FACT_MEGA_CORP_1G.csv" casout="DM_FACT_MEGA_CORP_1G_CSV" replace ;
   list tables ;
run;
quit;

CAS mySession  TERMINATE;

 

Log extract:

 

…………
…………
83   proc casutil  incaslib="DM"  outcaslib="ADLS2";
NOTE: The UUID '20318d8b-fe8c-b849-9f32-1cb4ac75f0b0' is connected using session MYSESSION.
84      save casdata="DM_FACT_MEGA_CORP" casout="DM_FACT_MEGA_CORP_1G.orc" replace ;
NOTE: Executing action 'table.save'.
NOTE: Cloud Analytic Services saved the file DM_FACT_MEGA_CORP_1G.orc in caslib ADLS2.
……….
……………
85      save casdata="DM_FACT_MEGA_CORP" casout="DM_FACT_MEGA_CORP_1G.csv" replace ;
NOTE: Executing action 'table.save'.
NOTE: Cloud Analytic Services saved the file DM_FACT_MEGA_CORP_1G.csv in caslib ADLS2.
….
…………….
83   proc casutil incaslib="ADLS2"  outcaslib="ADLS2";
NOTE: The UUID '5fcb51ee-020b-1349-b0b4-2bbff70f1605' is connected using session MYSESSION.
84      
84 !     load casdata="DM_FACT_MEGA_CORP_1G.orc" casout="DM_FACT_MEGA_CORP_1G_ORC" replace ;
NOTE: Executing action 'table.loadTable'.
NOTE: Cloud Analytic Services made the external data from DM_FACT_MEGA_CORP_1G.orc available as table DM_FACT_MEGA_CORP_1G_ORC in caslib ADLS2.
…………
………………
85 !     load casdata="DM_FACT_MEGA_CORP_1G.csv" casout="DM_FACT_MEGA_CORP_1G_CSV" replace ;
NOTE: Executing action 'table.loadTable'.
NOTE: Cloud Analytic Services made the external data from DM_FACT_MEGA_CORP_1G.csv available as table DM_FACT_MEGA_CORP_1G_CSV in  caslib ADLS2.
….
……………
        

   

The following screenshot shows the ORC and CSV data files saved to Azure ADLS2 by executing the statements above.

CAS_Aaccess_to_ADLS2_10.png

 

Data Compression

The SAS ORC Data Connector supports data compression while saving data to ADLS2. By default, "ZLIB" is used to compress the data when no value is specified for COMPRESS= in the CASLIB statement.

        COMPRESS=YES | NO | ZLIB
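For illustration, the compression option could be set on a CASLIB statement like the one in the earlier example (a hedged sketch; the account name, file system, and IDs are placeholders, and the option spelling follows the COMPRESS= syntax shown above):

```
/* Hedged sketch: an ADLS CASLIB with ORC compression disabled      */
/* explicitly. All identifier values below are placeholders.        */
caslib ADLS2C datasource=(
      srctype="adls"
      accountname="utkuma1"
      filesystem="fsutkuma"
      tenantid="xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxxx"
      applicationId="xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxxx"
      compress=NO
   )
   path="/testdata";
```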

 

CAS load performance

Data load performance between CAS and ADLS2 depends on CAS resources and network speed. The following test results are from an environment where CAS, hosted on RACE servers, accessed a SAS federated Azure environment.

 

Test environment: RACE CAS servers = 1 + 4 nodes, with 32 GB memory and 4 CPUs on each node.

 

Run time for ORC files (CAS supports serial data save/load):

 

CAS_Aaccess_to_ADLS2_11.png

 

Run time for CSV files (CAS supports serial data save and parallel data load):

 

CAS_Aaccess_to_ADLS2_12.png

 

Issues When Converting Data from ORC to CAS

Character-based columns (CHAR, VARCHAR, STRING) that contain date, time, or timestamp values are not automatically converted to their respective SAS format types. They remain as character-string values.
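As a workaround, such columns can be converted after the load with a DATA step. The following is a minimal hedged sketch; the table and column names (MYCAS.ORDERS, ORDER_TS_CHAR) and the ANYDTDTM informat choice are assumptions for illustration:

```
/* Hedged sketch: convert a character timestamp column loaded from   */
/* an ORC file into a SAS datetime value. MYCAS.ORDERS and           */
/* ORDER_TS_CHAR are hypothetical names.                             */
data mycas.orders_fixed;
   set mycas.orders;
   order_ts = input(order_ts_char, anydtdtm32.);  /* character -> datetime */
   format order_ts datetime20.;
   drop order_ts_char;
run;
```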

 

Important Links:

Azure Data Lake Storage Data Source

ORC Data Connector

CAS Configuration File Options Reference (cas.AZUREAUTHCACHELOC=)  

 

Comments

Hi,

 

I've hit an issue with this, when doing a listing of the files on the data lake. It only seems to list the first 500 files it find on the data lake.

Is there any options when running proc cas or casutil to list all files?

 

Thanks

 

Andy

Hi Uttam,

Thanks for sharing this article to connect ADLS2 from Viya. Currently I am using Viya 4.0 through which I am trying to access ADLS2 data which is not successful yet and I have to make changes in ADLS Service Principal as suggested in this article. I want to achieve this changes with a pipeline through YAML file. Can you please share an yaml file, if you have already configured these changes into yaml file.

