BookmarkSubscribeRSS Feed

SAS Viya/CAS accessing AWS Athena

Started ‎05-11-2022 by
Modified ‎05-11-2022 by
Views 3,374

Amazon AWS Athena is an interactive query service to view and analyze data from data files stored at S3 storage. The SAS Viya user can access the AWS Athena service to load CAS using the serial and multi-node mechanism. Various types of data files (Avro, CSV, Json, XML, Parquet, ORC, etc. ) located at S3 can be loaded to CAS using AWS Athena. This blog post is about loading the CAS table from AWS Athena.

 

Athena Pricing

 

A user only pays for the query executed on S3 data files. The charges are based on the amount of data scanned by each query. As per AWS doc user pays $5.00 per TB of data scanned. Athena Pricing

 

Data access path

 

The SAS Viya user can access the AWS Athena service to load CAS using the serial and multinode mechanism with the JDBC data connector. The following diagram describes the data access path from CAS to AWS Athena.

 

uk_1_CAS_Accessing__AWS_Athena_1.png

Select any image to see a larger version.
Mobile users: To view the images, select the "Full" version at the bottom of the page.

 

Pre-requisites

  • A schema in the glue catalog database.
  • JDBC Data Connector on CAS Nodes.
  • AWS Access keys available to CAS controller and node Pods via NFS mount.
  • Read/write permission to the S3 bucket.
  • AWS Athena JDBC Driver is available to CAS Pods.

 

The following screenshot describes an S3 bucket and folder having Parquet files that need to be loaded into CAS using AWS Athena.

 

uk_2_CAS_Accessing__AWS_Athena_2.png

 

Metadata creation

 

Before you can use the data from AWS Athena services, need to create the metadata for the data files stored at the S3 location. The metadata/schema for data files can be stored in the native AWS Glue catalog or another database. CAS can access Athena (Glue) S3 table using JDBC Data Connector. To read a data file stored on S3, the user must know the file structure to formulate a create table statement. The following SQL statement can be used to create a table under the Glue database catalog for the above S3 Parquet file. The SQL can be executed from the Athena query editor. Notice the data file location and row format SerDe used to read the file.

 

SQL Code:

 

CREATE EXTERNAL TABLE IF NOT EXISTS baseball_s3 (
  name varchar(18),
  team varchar(14),
  natbat double,
  nhits double,
  nhome double,
  nruns double,
  nrbi double,
  nbb double,
  yrmajor double,
  cratbat double,
  crhits double,
  crhome double,
  crruns double,
  crrbi double,
  crbb double,
  league varchar(8),
  division varchar(8),
  position varchar(8),
  nouts double,
  nassts double,
  nerror double,
  salary double,
  div_1 varchar(16),
  logsalary double
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES ('serialization.format' = '\u0001', 'field.delim' = '\u0001' )
LOCATION 's3://utkumadmviya4/data/PARQUET/baseball.parquet/'
TBLPROPERTIES ('has_encrypted_data'='false');

 

CAS load from AWS Athena table

 

Serial data load to CAS from Athena table

 

With JDBC Data Connector and valid AWS Access key available to CAS Controller, you can load CAS from Athena table with the serial method by executing the following statement at SAS Studio application.

 

The CAS PODs system environment variable AWS_CREDENTIAL_PROFILES_FILE=/mnt/viya-share/config/access-key/credentials is set using sas-access.properties config file. This environment variable let you specify the custom location for the AWS credential file. It enables the user process to look for the AWS credentials file at a specified location rather than the default ~/.aws/credentials file.

 

Notice the classpath= variable where Athena JDBC driver resides and is available to CAS Controller.

 

Code:

 

%let userid=utkuma;

%let schema=&userid;
options sastrace=',,,d' sastraceloc=saslog ;

CAS mySession  SESSOPTS=( CASLIB=casuser TIMEOUT=99 LOCALE="en_US" metrics=true);

caslib jdcaslib dataSource=(srctype='jdbc',
class="com.simba.athena.jdbc.Driver",
url="jdbc:awsathena://AwsRegion=us-east-1;S3OutputLocation=s3://gel-athena-workspace/tmp/;AwsCredentialsProviderClass=com.simba.athena.amazonaws.auth.profile.ProfileCredentialsProvider;AwsCredentialsProviderArguments=182696677754-testers;Schema=&schema",
classpath="/mnt/viya-share/config/access-clients/jdbc",
schema=&schema
);

/* load  Athena tables to CAS*/
proc casutil outcaslib="jdcaslib" incaslib="jdcaslib" ;
	load casdata="baseball_s3" casout="baseball_s3"  replace;
	list tables;
quit;

CAS mySession  TERMINATE;

 

Log extract :

 

....
..............
85   caslib jdcaslib dataSource=(srctype='jdbc',
86   class="com.simba.athena.jdbc.Driver",
87   url="jdbc:awsathena://AwsRegion=us-east-1;S3OutputLocation=s3://gel-athena-workspace/tmp/;
87 ! AwsCredentialsProviderClass=com.simba.athena.amazonaws.auth.profile.ProfileCredentialsProvider;
87 ! AwsCredentialsProviderArguments=182696677754-testers;Schema=&schema",
88   classpath="/mnt/viya-share/config/access-clients/jdbc",
89   schema=&schema
90   );
NOTE: Executing action 'table.addCaslib'.
NOTE: 'JDCASLIB' is now the active caslib.
NOTE: Cloud Analytic Services added the caslib 'JDCASLIB'.
NOTE: Action 'table.addCaslib' used (Total process time):
…………
……..
92   /* load a Athena table to CAS*/
93   proc casutil outcaslib="jdcaslib" incaslib="jdcaslib" ;
NOTE: The UUID 'ac88839b-aeb9-4c45-ab06-d2b942085b77' is connected using session MYSESSION.
94
94 !  load casdata="baseball_s3" casout="baseball_s3"  replace;
NOTE: Executing action 'table.loadTable'.
NOTE: Performing serial LoadTable action using SAS Data Connector to JDBC.
NOTE: Cloud Analytic Services made the external data from baseball_s3 available as table BASEBALL_S3 in caslib jdcaslib.
NOTE: Action 'table.loadTable' used (Total process time):
..............
.................

 

 Result Output: CAS table loaded from Athena table.

 

uk_3_CAS_Accessing__AWS_Athena_3.png

 

Multinode data load to CAS from Athena table

 

With JDBC Data Connector and valid AWS Access key available to CAS Controller, you can load CAS from Athena table with Multi-Node method by executing the following statement at SAS Studio application. The CAS PODs system environment variable AWS_CREDENTIAL_PROFILES_FILE=/mnt/viya-share/config/access-key/credentials is set using sas-access.properties config file.

 

Notice the classpath= variable where Athena JDBC driver resides and is available to CAS Controller.

 

Code:

 

%let userid=utkuma;

%let schema=&userid;
options sastrace=',,,d' sastraceloc=saslog ;

CAS mySession  SESSOPTS=( CASLIB=casuser TIMEOUT=99 LOCALE="en_US" metrics=true);

caslib jdcaslib dataSource=(srctype='jdbc',
class="com.simba.athena.jdbc.Driver",
url="jdbc:awsathena://AwsRegion=us-east-1;S3OutputLocation=s3://gel-athena-workspace/tmp/;AwsCredentialsProviderClass=com.simba.athena.amazonaws.auth.profile.ProfileCredentialsProvider;AwsCredentialsProviderArguments=182696677754-testers;Schema=&schema",
classpath="/mnt/viya-share/config/access-clients/jdbc",
schema=&schema,
numreadnodes=10
);


/* load a Athena table to CAS using Multinode Method */

proc casutil outcaslib="jdcaslib" incaslib="jdcaslib" ;
 load casdata="baseball_s3" casout="baseball_cas" options=(sliceColumn="salary") replace;;
 list tables;
quit ;

proc cas ;
   action table.tabledetails / name="baseball_cas" caslib="jdcaslib" level="node" ;
quit;


CAS mySession  TERMINATE;

 

Log extract :

 

....
..............
93   /* load a Athena table to CAS*/
94   proc casutil outcaslib="jdcaslib" incaslib="jdcaslib" ;
NOTE: The UUID 'dd2c8b9c-6459-8a4c-9b89-a645e4503b53' is connected using session MYSESSION.
95    load casdata="baseball_s3" casout="baseball_cas" options=(sliceColumn="salary") replace;;
NOTE: Executing action 'table.loadTable'.
NOTE: Performing serial LoadTable action using SAS Data Connector to JDBC.
WARNING: The value of numReadNodes(10) exceeds the number of available worker nodes(3). The load will proceed with numReadNodes=3.
NOTE: Cloud Analytic Services made the external data from baseball_s3 available as table BASEBALL_CAS in caslib jdcaslib.
NOTE: Action 'table.loadTable' used (Total process time):
..............
.................

 

Result Output:

 

CAS table loaded from Athena table.

   

uk_4_CAS_Accessing__AWS_Athena_4.png

 

Important Links:

 

SAS Viya JDBC Data Connector

 

Amazon AWS Athena User Guide  

 

Find more articles from SAS Global Enablement and Learning here.

Version history
Last update:
‎05-11-2022 03:14 PM
Updated by:
Contributors

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!

Submit your idea!

Free course: Data Literacy Essentials

Data Literacy is for all, even absolute beginners. Jump on board with this free e-learning  and boost your career prospects.

Get Started

Article Tags