Amazon AWS Athena is an interactive query service to view and analyze data from data files stored at S3 storage. The SAS Viya user can access the AWS Athena service to load CAS using the serial and multi-node mechanism. Various types of data files (Avro, CSV, Json, XML, Parquet, ORC, etc. ) located at S3 can be loaded to CAS using AWS Athena. This blog post is about loading the CAS table from AWS Athena.
A user only pays for the query executed on S3 data files. The charges are based on the amount of data scanned by each query. As per AWS doc user pays $5.00 per TB of data scanned. Athena Pricing
The SAS Viya user can access the AWS Athena service to load CAS using the serial and multinode mechanism with the JDBC data connector. The following diagram describes the data access path from CAS to AWS Athena.
Select any image to see a larger version.
Mobile users: To view the images, select the "Full" version at the bottom of the page.
The following screenshot describes an S3 bucket and folder having Parquet files that need to be loaded into CAS using AWS Athena.
Before you can use the data from AWS Athena services, need to create the metadata for the data files stored at the S3 location. The metadata/schema for data files can be stored in the native AWS Glue catalog or another database. CAS can access Athena (Glue) S3 table using JDBC Data Connector. To read a data file stored on S3, the user must know the file structure to formulate a create table statement. The following SQL statement can be used to create a table under the Glue database catalog for the above S3 Parquet file. The SQL can be executed from the Athena query editor. Notice the data file location and row format SerDe used to read the file.
SQL Code:
CREATE EXTERNAL TABLE IF NOT EXISTS baseball_s3 (
name varchar(18),
team varchar(14),
natbat double,
nhits double,
nhome double,
nruns double,
nrbi double,
nbb double,
yrmajor double,
cratbat double,
crhits double,
crhome double,
crruns double,
crrbi double,
crbb double,
league varchar(8),
division varchar(8),
position varchar(8),
nouts double,
nassts double,
nerror double,
salary double,
div_1 varchar(16),
logsalary double
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES ('serialization.format' = '\u0001', 'field.delim' = '\u0001' )
LOCATION 's3://utkumadmviya4/data/PARQUET/baseball.parquet/'
TBLPROPERTIES ('has_encrypted_data'='false');
With JDBC Data Connector and valid AWS Access key available to CAS Controller, you can load CAS from Athena table with the serial method by executing the following statement at SAS Studio application.
The CAS PODs system environment variable AWS_CREDENTIAL_PROFILES_FILE=/mnt/viya-share/config/access-key/credentials is set using sas-access.properties config file. This environment variable let you specify the custom location for the AWS credential file. It enables the user process to look for the AWS credentials file at a specified location rather than the default ~/.aws/credentials file.
Notice the classpath= variable where Athena JDBC driver resides and is available to CAS Controller.
Code:
%let userid=utkuma;
%let schema=&userid;
options sastrace=',,,d' sastraceloc=saslog ;
CAS mySession SESSOPTS=( CASLIB=casuser TIMEOUT=99 LOCALE="en_US" metrics=true);
caslib jdcaslib dataSource=(srctype='jdbc',
class="com.simba.athena.jdbc.Driver",
url="jdbc:awsathena://AwsRegion=us-east-1;S3OutputLocation=s3://gel-athena-workspace/tmp/;AwsCredentialsProviderClass=com.simba.athena.amazonaws.auth.profile.ProfileCredentialsProvider;AwsCredentialsProviderArguments=182696677754-testers;Schema=&schema",
classpath="/mnt/viya-share/config/access-clients/jdbc",
schema=&schema
);
/* load Athena tables to CAS*/
proc casutil outcaslib="jdcaslib" incaslib="jdcaslib" ;
load casdata="baseball_s3" casout="baseball_s3" replace;
list tables;
quit;
CAS mySession TERMINATE;
Log extract :
....
..............
85 caslib jdcaslib dataSource=(srctype='jdbc',
86 class="com.simba.athena.jdbc.Driver",
87 url="jdbc:awsathena://AwsRegion=us-east-1;S3OutputLocation=s3://gel-athena-workspace/tmp/;
87 ! AwsCredentialsProviderClass=com.simba.athena.amazonaws.auth.profile.ProfileCredentialsProvider;
87 ! AwsCredentialsProviderArguments=182696677754-testers;Schema=&schema",
88 classpath="/mnt/viya-share/config/access-clients/jdbc",
89 schema=&schema
90 );
NOTE: Executing action 'table.addCaslib'.
NOTE: 'JDCASLIB' is now the active caslib.
NOTE: Cloud Analytic Services added the caslib 'JDCASLIB'.
NOTE: Action 'table.addCaslib' used (Total process time):
…………
……..
92 /* load a Athena table to CAS*/
93 proc casutil outcaslib="jdcaslib" incaslib="jdcaslib" ;
NOTE: The UUID 'ac88839b-aeb9-4c45-ab06-d2b942085b77' is connected using session MYSESSION.
94
94 ! load casdata="baseball_s3" casout="baseball_s3" replace;
NOTE: Executing action 'table.loadTable'.
NOTE: Performing serial LoadTable action using SAS Data Connector to JDBC.
NOTE: Cloud Analytic Services made the external data from baseball_s3 available as table BASEBALL_S3 in caslib jdcaslib.
NOTE: Action 'table.loadTable' used (Total process time):
..............
.................
Result Output: CAS table loaded from Athena table.
With JDBC Data Connector and valid AWS Access key available to CAS Controller, you can load CAS from Athena table with Multi-Node method by executing the following statement at SAS Studio application. The CAS PODs system environment variable AWS_CREDENTIAL_PROFILES_FILE=/mnt/viya-share/config/access-key/credentials is set using sas-access.properties config file.
Notice the classpath= variable where Athena JDBC driver resides and is available to CAS Controller.
Code:
%let userid=utkuma;
%let schema=&userid;
options sastrace=',,,d' sastraceloc=saslog ;
CAS mySession SESSOPTS=( CASLIB=casuser TIMEOUT=99 LOCALE="en_US" metrics=true);
caslib jdcaslib dataSource=(srctype='jdbc',
class="com.simba.athena.jdbc.Driver",
url="jdbc:awsathena://AwsRegion=us-east-1;S3OutputLocation=s3://gel-athena-workspace/tmp/;AwsCredentialsProviderClass=com.simba.athena.amazonaws.auth.profile.ProfileCredentialsProvider;AwsCredentialsProviderArguments=182696677754-testers;Schema=&schema",
classpath="/mnt/viya-share/config/access-clients/jdbc",
schema=&schema,
numreadnodes=10
);
/* load a Athena table to CAS using Multinode Method */
proc casutil outcaslib="jdcaslib" incaslib="jdcaslib" ;
load casdata="baseball_s3" casout="baseball_cas" options=(sliceColumn="salary") replace;;
list tables;
quit ;
proc cas ;
action table.tabledetails / name="baseball_cas" caslib="jdcaslib" level="node" ;
quit;
CAS mySession TERMINATE;
Log extract :
....
..............
93 /* load a Athena table to CAS*/
94 proc casutil outcaslib="jdcaslib" incaslib="jdcaslib" ;
NOTE: The UUID 'dd2c8b9c-6459-8a4c-9b89-a645e4503b53' is connected using session MYSESSION.
95 load casdata="baseball_s3" casout="baseball_cas" options=(sliceColumn="salary") replace;;
NOTE: Executing action 'table.loadTable'.
NOTE: Performing serial LoadTable action using SAS Data Connector to JDBC.
WARNING: The value of numReadNodes(10) exceeds the number of available worker nodes(3). The load will proceed with numReadNodes=3.
NOTE: Cloud Analytic Services made the external data from baseball_s3 available as table BASEBALL_CAS in caslib jdcaslib.
NOTE: Action 'table.loadTable' used (Total process time):
..............
.................
Result Output:
CAS table loaded from Athena table.
Important Links:
Find more articles from SAS Global Enablement and Learning here.
Registration is open! SAS is returning to Vegas for an AI and analytics experience like no other! Whether you're an executive, manager, end user or SAS partner, SAS Innovate is designed for everyone on your team. Register for just $495 by 12/31/2023.
If you are interested in speaking, there is still time to submit a session idea. More details are posted on the website.
Data Literacy is for all, even absolute beginners. Jump on board with this free e-learning and boost your career prospects.