
Reading S3 (Avro, CSV, JSON, XML, Parquet, ORC) files to CAS and SAS via AWS Athena


Amazon S3 is one of the most widely used platforms for storing big data. It is a secure, reliable, scalable, and affordable environment for huge volumes of data, and applications can store data files on S3 in many different formats. This post describes how to read data files stored in S3 locations into SAS and CAS using AWS Athena.

Athena

Amazon Web Services (AWS) Athena is a service that enables a user to run interactive queries against data files stored in S3. Athena is a serverless query service, so unlike an AWS EMR Hadoop cluster there is no underlying compute infrastructure to manage. A user can access Athena through the AWS Management Console, an API, or a JDBC driver. After defining a schema for a data file, the user can start executing SQL queries against the S3 data. The schemas for the S3 data files are created and stored in the AWS Glue Data Catalog.
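For illustration, a query can also be submitted from the command line. The following is a minimal sketch using the AWS CLI; the table, database, and output bucket names here are placeholders, not values from this post:

# Submit a query to Athena from the AWS CLI (all names are hypothetical).
aws athena start-query-execution \
    --query-string "SELECT count(*) FROM my_table" \
    --query-execution-context Database=my_database \
    --result-configuration OutputLocation=s3://my-bucket/athena-results/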

 

Data file formats supported by Athena queries:

  • Avro
  • CSV
  • JSON
  • XML
  • Parquet
  • ORC

Pricing

A user pays only for the queries executed against S3 data files. Charges are based on the amount of data scanned by each query; per the AWS documentation, the cost is $5.00 per TB of data scanned.
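For example, at that rate a query that scans 50 GB of data would cost roughly 50/1024 × $5.00 ≈ $0.24. Athena rounds each query up to a minimum amount of data scanned, so consult the current AWS pricing page for exact figures.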

CAS and SAS Access to S3 data files via Athena

When the metadata/schema for a data file is available in the AWS Glue Data Catalog, BASE SAS can access the Athena (Glue) S3 table using a JDBC connection, and CAS can access it using the JDBC Data Connector. This means a user can use the serial data load method to load an Athena table into CAS.

 

The following diagram describes the data access from BASE SAS to S3 via Athena.

 

Read_S3_AWS_Athena_1.png


 

The following diagram describes the data access from CAS to S3 via Athena.

 

Read_S3_AWS_Athena_2.png

 

The following steps describe how to access S3 files from CAS and BASE SAS via Athena.

Prerequisites

  • SAS/ACCESS Interface to JDBC installed on the Viya (SAS) compute server.
  • JDBC Data Connector installed on the CAS servers.
  • A valid AWS access key to access the S3 bucket.
  • Read permission on the S3 bucket.
  • The AWS Athena JDBC driver available on the Viya (SAS) compute and CAS servers (see the quick check after this list).
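Before assigning any libraries, it can be worth confirming that the driver jar is actually in place. A minimal check; the directory is taken from the classpath= value used later in this post, and the jar file name pattern is an assumption:

# Confirm the Athena JDBC driver jar is present on the server
# (directory from this post; the jar name pattern is an assumption).
ls -l /opt/sas/viya/config/data/JDBCDriver/*Athena*.jar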

 

The following screenshot shows an S3 bucket and folder containing Parquet files, which are read into SAS and CAS using the steps below.

 

S3 bucket and folder with Parquet files:

 

Read_S3_AWS_Athena_3.png

Steps

1. Create a database in AWS Glue Data catalog.

To create and store metadata for an S3 data file, a user first needs to create a database in the Glue Data Catalog. The user can switch to the Glue catalog application from the Athena query editor dashboard; on the AWS Glue dashboard, under the database section, the user can add a new database.
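The same database can also be created from the command line. A minimal sketch using the AWS CLI, assuming the database name used later in this post (gel_test):

# Create a Glue Data Catalog database from the AWS CLI
# (database name taken from this post).
aws glue create-database --database-input '{"Name": "gel_test"}'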

 

Option to switch to AWS Glue Data Catalog:

 

Read_S3_AWS_Athena_4.png

 

Creating a new database in the Glue Data Catalog:

 

Read_S3_AWS_Athena_5.png

2. Create the metadata/table for the S3 data files in the Glue catalog database.

To read a data file stored in S3, the user must know the file structure in order to formulate a CREATE TABLE statement. The following SQL statement can be used to create a table in the Glue catalog database for the S3 Parquet files shown above; it is executed from the Athena query editor. Notice the data file location and the row format SerDe used to read the file.

 

CREATE EXTERNAL TABLE IF NOT EXISTS baseball_athena_s3( 
  name varchar(18),
  team varchar(14),
  natbat double,
  nhits double,
  nhome double,
  nruns double,
  nrbi double,
  nbb double,
  yrmajor double,
  cratbat double,
  crhits double,
  crhome double,
  crruns double,
  crrbi double,
  crbb double,
  league varchar(8),
  division varchar(8),
  position varchar(8),
  nouts double,
  nassts double,
  nerror double,
  salary double,
  div_1 varchar(16),
  logsalary double
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES ('serialization.format' = '\u0001', 'field.delim' = '\u0001' )
LOCATION 's3://sas-viyadeploymentworkshop/gel/Data/baseball_prqt/' 
TBLPROPERTIES ('has_encrypted_data'='false');
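If you want to verify the new table definition outside the Athena console, the Glue catalog entry can be inspected from the AWS CLI as well. A minimal sketch, using the database and table names from the statement above:

# Inspect the table definition stored in the Glue Data Catalog.
aws glue get-table --database-name gel_test --name baseball_athena_s3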

 

The following screenshot shows the SQL executed in the Athena SQL editor to create a table under a specific database to read the S3 data files.

 

Read_S3_AWS_Athena_6.png

3. Verify that an Athena query reads the S3 data files using the schema in the Glue catalog database.

The following screenshot shows the data read from the S3 data files by an Athena query, using the schema/table stored in the Glue catalog.

 

SQL:

 

select * from baseball_athena_s3 limit 10;

 

SQL executed to verify the data read:

 

Read_S3_AWS_Athena_7.png

4. BASE SAS access to S3 data files via Athena.

With SAS/ACCESS Interface to JDBC installed and valid AWS access keys placed on the SAS compute server, you can access the Athena table by executing the following statements from the SAS Studio application.

 

The Athena JDBC LIBNAME statement requires AWS credentials saved under the user's home directory (~/.aws) on the SAS compute server. For example, if the user 'viyademo01' is executing the LIBNAME statement, then the AWS credentials must reside under the /home/viyademo01/.aws folder.
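The credentials follow the standard AWS profile file layout. A minimal sketch of /home/viyademo01/.aws/credentials with placeholder key values; the [default] profile name matches the AwsCredentialsProviderArguments=default setting in the LIBNAME statement below:

# /home/viyademo01/.aws/credentials (key values are placeholders)
[default]
aws_access_key_id = AKIAXXXXXXXXXXXXXXXX
aws_secret_access_key = XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX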

 

Notice the classpath= option, which points to where the Athena JDBC driver resides on the SAS compute server.

 

Code:

 

/* Write JDBC trace information to the SAS log */
options sastrace=',,,d' sastraceloc=saslog;

/* Assign a SAS library to the Athena gel_test schema through the Simba JDBC driver */
libname athena jdbc driverclass="com.simba.athena.jdbc.Driver"
   url="jdbc:awsathena://AwsRegion=us-east-1;S3OutputLocation=s3://sas-viyadeploymentworkshop/gel/Data/Athena/;AwsCredentialsProviderClass=com.simba.athena.amazonaws.auth.profile.ProfileCredentialsProvider;AwsCredentialsProviderArguments=default;Schema=gel_test"
   classpath="/opt/sas/viya/config/data/JDBCDriver" schema="gel_test";

proc print data=athena.baseball_athena_s3(obs=10);
run;

 

Log:

 

1     OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
75    
  22 1568319371 no_name 0 PRINT   
JDBC_5: Prepared: on connection 0 23 1568319371 no_name 0 PRINT   
SELECT * FROM BASEBALL_ATHENA_S3 WHERE 0=1 24 1568319371 no_name 0 PRINT   
  25 1568319371 no_name 0 PRINT   
76    proc print data=athena.baseball_athena_s3(obs=10) ;
77    run ;
  26 1568319381 no_name 0 PRINT   
JDBC_6: Executed: on connection 0 27 1568319381 no_name 0 PRINT   
SELECT * FROM BASEBALL_ATHENA_S3 28 1568319381 no_name 0 PRINT   
  29 1568319381 no_name 0 PRINT   
NOTE: The PROCEDURE PRINT printed page 42.

 

Results:

 

Read_S3_AWS_Athena_8.png

5. Serial data load to CAS from S3 files via Athena.

With the JDBC Data Connector installed and valid AWS access keys placed on the CAS controller server, you can load the Athena table into CAS by executing the following statements from the SAS Studio application.

 

The Athena JDBC CASLIB statement requires AWS credentials saved under the 'cas' user's home directory (/home/cas/.aws) on the CAS controller server, even though the CASLIB statement is executed by user 'viyademo01'.
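One way to stage those credentials is to copy the profile file into the 'cas' account's home directory. A minimal sketch, run as root on the CAS controller; the 'sas' group ownership is an assumption and may differ in your deployment:

# Copy the AWS credentials to the 'cas' user's home directory
# (the cas:sas ownership is an assumption about the deployment).
sudo mkdir -p /home/cas/.aws
sudo cp /home/viyademo01/.aws/credentials /home/cas/.aws/credentials
sudo chown -R cas:sas /home/cas/.aws
sudo chmod 600 /home/cas/.aws/credentials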

 

Notice the classpath= parameter, which points to where the Athena JDBC driver resides on the CAS controller server.

 

Code:

 

CAS mySession SESSOPTS=( CASLIB=casuser TIMEOUT=99 LOCALE="en_US" metrics=true);

/* Define a caslib pointing at the Athena gel_test schema through the Simba JDBC driver */
caslib jdcaslib dataSource=(srctype='jdbc',
           url="jdbc:awsathena://AwsRegion=us-east-1;S3OutputLocation=s3://sas-viyadeploymentworkshop/gel/Data/Athena/;AwsCredentialsProviderClass=com.simba.athena.amazonaws.auth.profile.ProfileCredentialsProvider;AwsCredentialsProviderArguments=default;Schema=gel_test",
           class="com.simba.athena.jdbc.Driver",
           classpath="/opt/sas/viya/config/data/JDBCDriver",
           schema="gel_test" );

/* Serial load of the Athena table into CAS using the JDBC Data Connector */
proc casutil;
   load casdata="baseball_athena_s3" casout="baseball_athena_s3" replace
        outcaslib="jdcaslib" incaslib="jdcaslib";
quit;

proc casutil incaslib="jdcaslib" outcaslib="jdcaslib";
   list tables;
quit;

CAS mySession TERMINATE;

 

Log extract:

 

…………
……..
77    
78    caslib jdcaslib dataSource=(srctype='jdbc',
79               url="jdbc:awsathena://AwsRegion=us-east-1;S3OutputLocation=s3://sas-viyadeploymentworkshop/gel/Data/Athena/;
79  ! AwsCredentialsProviderClass=com.simba.athena.amazonaws.auth.profile.ProfileCredentialsProvider;
79  ! AwsCredentialsProviderArguments=default;Schema=gel_test",
80               class="com.simba.athena.jdbc.Driver",
81               classpath="/opt/sas/viya/config/data/JDBCDriver",
82               schema="gel_test" );
NOTE: Executing action 'table.addCaslib'.
NOTE: 'JDCASLIB' is now the active caslib.
NOTE: Cloud Analytic Services added the caslib 'JDCASLIB'.
NOTE: Action 'table.addCaslib' used (Total process time):
83    
84    proc casutil;
NOTE: The UUID '696b6e87-073f-3b4b-a129-227218fbb7a3' is connected using session MYSESSION.
85    
85  !  load casdata="baseball_athena_s3" casout="baseball_athena_s3"  replace
86    outcaslib="jdcaslib" incaslib="jdcaslib" ;
NOTE: Executing action 'table.loadTable'.
NOTE: Performing serial LoadTable action using SAS Data Connector to JDBC.
NOTE: Cloud Analytic Services made the external data from baseball_athena_s3 available as table BASEBALL_ATHENA_S3 in caslib 
      jdcaslib.
…..
……………..

 

Result extract:

 

Read_S3_AWS_Athena_9.png

Summary

To access S3 data files in the various file formats supported by Athena, you must know the structure of the file and the SerDe to use to read it. You need to create a schema under the Glue catalog for the S3 data files. With a valid AWS access key, CAS and SAS can then access the Athena table over a JDBC connection. Note that CAS and SAS can only read data from an Athena table backed by S3 data files; saving data from CAS and SAS to S3 via an Athena table is not supported.

Related Resource

Amazon AWS Athena User Guide

Comments

Hi,

I am trying to follow this and read files in an S3 bucket using the Athena capability, but I am facing issues.

The files in S3 are in multipart Parquet format. In this example you have created a table and read it from SAS, which I understand; however, in my case reading a table with Parquet-format files is not working.

Do you have any idea on how to get this working, please? It complains with the error below:

ERROR: Prepare error: [Simba][AthenaJDBC](100071) An error has been thrown from the AWS Athena client. SYNTAX_ERROR: line 1:15:
Schema ******** does not exist
SQL statement: SELECT * FROM *******._ACTIVITY

The database and schema definitely exist, as I can query them from AWS Athena. I am able to successfully establish the connection using the JDBC driver:

NOTE: Libref ATHENA was successfully assigned as follows:
Engine: JDBC

Any info on how to fix this would be very helpful and much appreciated.

