SAS Viya/CAS accessing AWS Athena

Amazon AWS Athena is an interactive query service to view and analyze data from data files stored at S3 storage. The SAS Viya user can access the AWS Athena service to load CAS using the serial and multi-node mechanism. Various types of data files (Avro, CSV, Json, XML, Parquet, ORC, etc. ) located at S3 can be loaded to CAS using AWS Athena. This blog post is about loading the CAS table from AWS Athena.

Athena Pricing

A user only pays for the query executed on S3 data files. The charges are based on the amount of data scanned by each query. As per AWS doc user pays $5.00 per TB of data scanned. Athena Pricing

Data access path

The SAS Viya user can access the AWS Athena service to load CAS using the serial and multinode mechanism with the JDBC data connector. The following diagram describes the data access path from CAS to AWS Athena.

Select any image to see a larger version.
Mobile users: To view the images, select the "Full" version at the bottom of the page.

Pre-requisites

A schema in the glue catalog database.
JDBC Data Connector on CAS Nodes.
AWS Access keys available to CAS controller and node Pods via NFS mount.
Read/write permission to the S3 bucket.
AWS Athena JDBC Driver is available to CAS Pods.

The following screenshot describes an S3 bucket and folder having Parquet files that need to be loaded into CAS using AWS Athena.

Metadata creation

Before you can use the data from AWS Athena services, need to create the metadata for the data files stored at the S3 location. The metadata/schema for data files can be stored in the native AWS Glue catalog or another database. CAS can access Athena (Glue) S3 table using JDBC Data Connector. To read a data file stored on S3, the user must know the file structure to formulate a create table statement. The following SQL statement can be used to create a table under the Glue database catalog for the above S3 Parquet file. The SQL can be executed from the Athena query editor. Notice the data file location and row format SerDe used to read the file.

SQL Code:

CREATE EXTERNAL TABLE IF NOT EXISTS baseball_s3 (
  name varchar(18),
  team varchar(14),
  natbat double,
  nhits double,
  nhome double,
  nruns double,
  nrbi double,
  nbb double,
  yrmajor double,
  cratbat double,
  crhits double,
  crhome double,
  crruns double,
  crrbi double,
  crbb double,
  league varchar(8),
  division varchar(8),
  position varchar(8),
  nouts double,
  nassts double,
  nerror double,
  salary double,
  div_1 varchar(16),
  logsalary double
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES ('serialization.format' = '\u0001', 'field.delim' = '\u0001' )
LOCATION 's3://utkumadmviya4/data/PARQUET/baseball.parquet/'
TBLPROPERTIES ('has_encrypted_data'='false');

CAS load from AWS Athena table

Serial data load to CAS from Athena table

With JDBC Data Connector and valid AWS Access key available to CAS Controller, you can load CAS from Athena table with the serial method by executing the following statement at SAS Studio application.

The CAS PODs system environment variable AWS_CREDENTIAL_PROFILES_FILE=/mnt/viya-share/config/access-key/credentials is set using sas-access.properties config file. This environment variable let you specify the custom location for the AWS credential file. It enables the user process to look for the AWS credentials file at a specified location rather than the default ~/.aws/credentials file.

Notice the classpath= variable where Athena JDBC driver resides and is available to CAS Controller.

Code:

%let userid=utkuma;

%let schema=&userid;
options sastrace=',,,d' sastraceloc=saslog ;

CAS mySession  SESSOPTS=( CASLIB=casuser TIMEOUT=99 LOCALE="en_US" metrics=true);

caslib jdcaslib dataSource=(srctype='jdbc',
class="com.simba.athena.jdbc.Driver",
url="jdbc:awsathena://AwsRegion=us-east-1;S3OutputLocation=s3://gel-athena-workspace/tmp/;AwsCredentialsProviderClass=com.simba.athena.amazonaws.auth.profile.ProfileCredentialsProvider;AwsCredentialsProviderArguments=182696677754-testers;Schema=&schema",
classpath="/mnt/viya-share/config/access-clients/jdbc",
schema=&schema
);

/* load  Athena tables to CAS*/
proc casutil outcaslib="jdcaslib" incaslib="jdcaslib" ;
	load casdata="baseball_s3" casout="baseball_s3"  replace;
	list tables;
quit;

CAS mySession  TERMINATE;

Log extract :

....
..............
85   caslib jdcaslib dataSource=(srctype='jdbc',
86   class="com.simba.athena.jdbc.Driver",
87   url="jdbc:awsathena://AwsRegion=us-east-1;S3OutputLocation=s3://gel-athena-workspace/tmp/;
87 ! AwsCredentialsProviderClass=com.simba.athena.amazonaws.auth.profile.ProfileCredentialsProvider;
87 ! AwsCredentialsProviderArguments=182696677754-testers;Schema=&schema",
88   classpath="/mnt/viya-share/config/access-clients/jdbc",
89   schema=&schema
90   );
NOTE: Executing action 'table.addCaslib'.
NOTE: 'JDCASLIB' is now the active caslib.
NOTE: Cloud Analytic Services added the caslib 'JDCASLIB'.
NOTE: Action 'table.addCaslib' used (Total process time):
…………
……..
92   /* load a Athena table to CAS*/
93   proc casutil outcaslib="jdcaslib" incaslib="jdcaslib" ;
NOTE: The UUID 'ac88839b-aeb9-4c45-ab06-d2b942085b77' is connected using session MYSESSION.
94
94 !  load casdata="baseball_s3" casout="baseball_s3"  replace;
NOTE: Executing action 'table.loadTable'.
NOTE: Performing serial LoadTable action using SAS Data Connector to JDBC.
NOTE: Cloud Analytic Services made the external data from baseball_s3 available as table BASEBALL_S3 in caslib jdcaslib.
NOTE: Action 'table.loadTable' used (Total process time):
..............
.................

Result Output: CAS table loaded from Athena table.

Multinode data load to CAS from Athena table

With JDBC Data Connector and valid AWS Access key available to CAS Controller, you can load CAS from Athena table with Multi-Node method by executing the following statement at SAS Studio application. The CAS PODs system environment variable AWS_CREDENTIAL_PROFILES_FILE=/mnt/viya-share/config/access-key/credentials is set using sas-access.properties config file.

Notice the classpath= variable where Athena JDBC driver resides and is available to CAS Controller.

Code:

%let userid=utkuma;

%let schema=&userid;
options sastrace=',,,d' sastraceloc=saslog ;

CAS mySession  SESSOPTS=( CASLIB=casuser TIMEOUT=99 LOCALE="en_US" metrics=true);

caslib jdcaslib dataSource=(srctype='jdbc',
class="com.simba.athena.jdbc.Driver",
url="jdbc:awsathena://AwsRegion=us-east-1;S3OutputLocation=s3://gel-athena-workspace/tmp/;AwsCredentialsProviderClass=com.simba.athena.amazonaws.auth.profile.ProfileCredentialsProvider;AwsCredentialsProviderArguments=182696677754-testers;Schema=&schema",
classpath="/mnt/viya-share/config/access-clients/jdbc",
schema=&schema,
numreadnodes=10
);


/* load a Athena table to CAS using Multinode Method */

proc casutil outcaslib="jdcaslib" incaslib="jdcaslib" ;
 load casdata="baseball_s3" casout="baseball_cas" options=(sliceColumn="salary") replace;;
 list tables;
quit ;

proc cas ;
   action table.tabledetails / name="baseball_cas" caslib="jdcaslib" level="node" ;
quit;


CAS mySession  TERMINATE;

Log extract :

....
..............
93   /* load a Athena table to CAS*/
94   proc casutil outcaslib="jdcaslib" incaslib="jdcaslib" ;
NOTE: The UUID 'dd2c8b9c-6459-8a4c-9b89-a645e4503b53' is connected using session MYSESSION.
95    load casdata="baseball_s3" casout="baseball_cas" options=(sliceColumn="salary") replace;;
NOTE: Executing action 'table.loadTable'.
NOTE: Performing serial LoadTable action using SAS Data Connector to JDBC.
WARNING: The value of numReadNodes(10) exceeds the number of available worker nodes(3). The load will proceed with numReadNodes=3.
NOTE: Cloud Analytic Services made the external data from baseball_s3 available as table BASEBALL_CAS in caslib jdcaslib.
NOTE: Action 'table.loadTable' used (Total process time):
..............
.................

Result Output:

CAS table loaded from Athena table.

Important Links:

SAS Viya JDBC Data Connector

Amazon AWS Athena User Guide

Find more articles from SAS Global Enablement and Learning here.

SAS Communities Library