Amazon S3 is one of the most widely used platforms to store big data. It’s a secure, reliable, scalable, and affordable environment for storing huge data. Users can store various formats of a data files on S3 locations from different applications. This post is about how to read various data files stored on S3 locations using AWS Athena to SAS and CAS.
Amazon Web Services Athena is a service which enables a user to perform interactive queries on data files stored in S3. AWS Athena is a serverless query service, the user does not need to manage any underlying compute infrastructure unlike AWS EMR Hadoop cluster. A user can access Athena through either AWS Management console, API or JDBC driver. User needs to define a schema for the data file and can start executing SQL queries on S3 data. The schema for the S3-data files created and stored under AWS Glue catalog.
Data file format supported by Athena Query:
A user only pays for the query executed on S3 data files. The charges are based on the amount of data scanned by each query. As per AWS documentation, the user pays $5.00 per TB of data scanned.
When metadata/schema for data files are available in AWS Glue catalog, BASE SAS can access Athena (Glue) S3 table using JDBC connection. CAS can access Athena (Glue) S3 table using JDBC Data Connector. Means, user can use serial data load method to load an Athena table to CAS.
The following diagram describes the data access from BASE SAS to S3 via Athena.
Select any image to see a larger version.
Mobile users: To view the images, select the "Full" version at the bottom of the page.
The following diagram describes the data access from CAS to S3 via Athena.
The following method describes a step-by-step process to access S3 files from CAS and BASE SAS via Athena.
The following screen-shot describes an S3 bucket and folder having Parquet files and needs to be read into SAS and CAS using the following steps.
S3 Bucket and folder with Parquet file:
To create and store metadata for S3 data file, a user needs to create a database under Glue data catalog. User can switch to Glue catalog application from Athena query editor dashboard. On AWS Glue dashboard, under database section user can add a new database.
Option to switch to AWS Glue Data Catalog:
Creating a new database under Glue data catalog:
To read a data file stored on S3, the user must know the file structure to formulate a create table statement. The following SQL statement can be used to create a table under Glue database catalog for above S3 Parquet file. The SQL executed from Athena query editor. Notice the data file location and row format SerDe used to read the file.
CREATE EXTERNAL TABLE If NOT EXISTS baseball_athena_S3( name varchar(18), team varchar(14), natbat double, nhits double, nhome double, nruns double, nrbi double, nbb double, yrmajor double, cratbat double, crhits double, crhome double, crruns double, crrbi double, crbb double, league varchar(8), division varchar(8), position varchar(8), nouts double, nassts double, nerror double, salary double, div_1 varchar(16), logsalary double ) ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' WITH SERDEPROPERTIES ('serialization.format' = '\u0001', 'field.delim' = '\u0001' ) LOCATION 's3://sas-viyadeploymentworkshop/gel/Data/baseball_prqt/' TBLPROPERTIES ('has_encrypted_data'='false');
The following screen-shot describes the SQL execution at Athena SQL editor to create a table under a specific database to read S3 data files.
The following screen-shot describes the data read from S3 datafile under Athena query using schema/table stored in Glue catalog.
select * from baseball_athena_s3 limit 10 ;
SQL executed to verify the data read:
With SAS/ACCESS interface to JDBC installed and valid AWS Access keys placed at SAS compute server, you can access the Athena table by executing the following statement from SAS Studio application.
The Athena JDBC LIBNAME statement requires AWS credentials saved under the user’s home directory ~/.aws at SAS compute server. For example, if the user ‘viyademo01’ is executing the LIBNAME statement then the AWS credential must reside under /home/viyademo01/.aws folder.
Notice the classpath= variable where Athena JDBC driver resides at SAS Compute server.
options sastrace=',,,d' sastraceloc=saslog ; libname athena jdbc driverclass="com.simba.athena.jdbc.Driver" url="jdbc:awsathena://AwsRegion=us-east-1;S3OutputLocation=s3://sas-viyadeploymentworkshop/gel/Data/Athena/;AwsCredentialsProviderClass=com.simba.athena.amazonaws.auth.profile.ProfileCredentialsProvider;AwsCredentialsProviderArguments=default;Schema=gel_test" classpath="/opt/sas/viya/config/data/JDBCDriver" schema="gel_test" ; proc print data=athena.baseball_athena_s3(obs=10) ; run ;
1 OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK; 75 22 1568319371 no_name 0 PRINT JDBC_5: Prepared: on connection 0 23 1568319371 no_name 0 PRINT SELECT * FROM BASEBALL_ATHENA_S3 WHERE 0=1 24 1568319371 no_name 0 PRINT 25 1568319371 no_name 0 PRINT 76 proc print data=athena.baseball_athena_s3(obs=10) ; 77 run ; 26 1568319381 no_name 0 PRINT JDBC_6: Executed: on connection 0 27 1568319381 no_name 0 PRINT SELECT * FROM BASEBALL_ATHENA_S3 28 1568319381 no_name 0 PRINT 29 1568319381 no_name 0 PRINT NOTE: The PROCEDURE PRINT printed page 42.
With JDBC Data Connector installed and valid AWS Access keys placed at CAS Controller server, you can access the Athena table by executing the following statement from SAS Studio application.
The Athena JDBC CASLIB statement requires AWS credentials saved under ‘cas’ user home directory /home/cas/.aws at CAS Controller server, even though CASLIB statement executed by user ‘viyademo01’.
Notice the classpath= variable where Athena JDBC driver resides at CAS Controller server.
CAS mySession SESSOPTS=( CASLIB=casuser TIMEOUT=99 LOCALE="en_US" metrics=true); caslib jdcaslib dataSource=(srctype='jdbc', url="jdbc:awsathena://AwsRegion=us-east-1;S3OutputLocation=s3://sas-viyadeploymentworkshop/gel/Data/Athena/;AwsCredentialsProviderClass=com.simba.athena.amazonaws.auth.profile.ProfileCredentialsProvider;AwsCredentialsProviderArguments=default;Schema=gel_test", class="com.simba.athena.jdbc.Driver", classpath="/opt/sas/viya/config/data/JDBCDriver", schema="gel_test" ); proc casutil; load casdata="baseball_athena_s3" casout="baseball_athena_s3" outcaslib="jdcaslib" incaslib="jdcaslib" ; quit; proc casutil incaslib="jdcaslib" outcaslib="jdcaslib"; list tables; quit; CAS mySession TERMINATE;
………… …….. 77 78 caslib jdcaslib dataSource=(srctype='jdbc', 79 url="jdbc:awsathena://AwsRegion=us-east-1;S3OutputLocation=s3://sas-viyadeploymentworkshop/gel/Data/Athena/; 79 ! AwsCredentialsProviderClass=com.simba.athena.amazonaws.auth.profile.ProfileCredentialsProvider; 79 ! AwsCredentialsProviderArguments=default;Schema=gel_test", 80 class="com.simba.athena.jdbc.Driver", 81 classpath="/opt/sas/viya/config/data/JDBCDriver", 82 schema="gel_test" ); NOTE: Executing action 'table.addCaslib'. NOTE: 'JDCASLIB' is now the active caslib. NOTE: Cloud Analytic Services added the caslib 'JDCASLIB'. NOTE: Action 'table.addCaslib' used (Total process time): 83 84 proc casutil; NOTE: The UUID '696b6e87-073f-3b4b-a129-227218fbb7a3' is connected using session MYSESSION. 85 85 ! load casdata="baseball_athena_s3" casout="baseball_athena_s3" replace 86 outcaslib="jdcaslib" incaslib="jdcaslib" ; NOTE: Executing action 'table.loadTable'. NOTE: Performing serial LoadTable action using SAS Data Connector to JDBC. NOTE: Cloud Analytic Services made the external data from baseball_athena_s3 available as table BASEBALL_ATHENA_S3 in caslib jdcaslib. ….. ……………..
To access S3 data files having various file format supported by Athena, you must know the structure of the file and SerDe to use to read the file. You need to create a schema under Glue catalog for the S3 data files. The CAS and SAS can access the Athena table with valid AWS access key using JDBC connection. The CAS and SAS can only read the data from the Athena table with the S3 data file. The data save from CAS and SAS to S3 data file via Athena table is not supported.
Registration is open! SAS is returning to Vegas for an AI and analytics experience like no other! Whether you're an executive, manager, end user or SAS partner, SAS Innovate is designed for everyone on your team. Register for just $495 by 12/31/2023.
If you are interested in speaking, there is still time to submit a session idea. More details are posted on the website.
Data Literacy is for all, even absolute beginners. Jump on board with this free e-learning and boost your career prospects.