I have been reading about ways to access S3 data via NFS, and wondered if it was possible to use Amazon’s Athena as a data extraction interface to S3. Athena supports ODBC/JDBC, so it should be viable. So how would you read data from S3 via Athena in a simple “delta loading” data ingestion scenario?
Amazon Athena is a query service specifically designed for accessing data in S3. It is one of the core building blocks for serverless architectures in Amazon Web Services (AWS) and is often used in real-time data ingestion scenarios (e.g. IoT cases).
There are two major benefits to using Athena. First, it allows you to use standard SQL for data retrieval, which greatly simplifies data ingestion for raw data in S3. Second, it’s a serverless service which means that all infrastructure details are hidden. As a customer, you focus on using the service and AWS will manage everything else. Athena use is charged on a “per query” basis and the costs depend on the amount of data scanned.
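As a rough illustration of the "per query" pricing model, here is a small sketch. The $5-per-TB-scanned rate is the commonly quoted figure and may change; check the current AWS pricing page:

```python
def athena_query_cost(bytes_scanned, usd_per_tb=5.0):
    """Estimate the cost of a single Athena query from bytes scanned."""
    tb_scanned = bytes_scanned / 1024 ** 4
    return tb_scanned * usd_per_tb

# A query scanning 50 GB of S3 data:
cost = athena_query_cost(50 * 1024 ** 3)
print(f"${cost:.4f}")  # roughly $0.24
```

This is also why partitioning (covered below) matters: a query that prunes partitions scans fewer bytes and therefore costs less.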
There are basically two ways of using Athena from a third party client (SAS in this case): JDBC and ODBC. I chose the JDBC route because it seemed simpler. Athena is easier to understand if you’re familiar with Apache Hive and Hadoop. Like Hive, Athena is a schema-on-read query engine. This means that you store your data without needing to describe its layout (or schema). The schema is provided later, when you need to read the data.
The data is already available in S3 when you want to read it, so your schema definition (the CREATE TABLE statement) needs to point to the storage location. Athena therefore uses the same concept of EXTERNAL tables with a LOCATION attribute as Hive. Like Hive, Athena also treats your data in S3 as read-only, so only SELECT queries are supported.
Finally, Athena treats folders in S3 buckets much as Hive treats folders in HDFS: all data files in a folder or its subfolders are considered to belong to the table. If the subfolders follow a certain naming pattern, they are treated as partitions, which can be leveraged to optimize query performance.
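The naming pattern in question is the Hive convention of `column=value` folder names. A minimal sketch of how such an S3 key maps to partition values (the key shown matches the example data used later in this post):

```python
# Sketch: how a Hive/Athena-style object key maps to partition values.
# Folder segments of the form "col=value" become partition columns.
def parse_partitions(s3_key):
    """Extract partition column/value pairs from an S3 object key."""
    segments = s3_key.split("/")
    return {name: value
            for segment in segments[:-1]            # skip the file name itself
            for name, sep, value in [segment.partition("=")]
            if sep == "="}

key = "rigdata/rec_dt=2017-06-29-00-00/testfile-0.json"
print(parse_partitions(key))  # {'rec_dt': '2017-06-29-00-00'}
```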
I set up a new EC2 instance to serve as a client to Athena. While clicking through the setup wizard, I created and assigned an IAM role allowing the machine to access both the S3 and Athena services.
I would recommend assigning an IAM role at this point, because it simplifies the process of accessing S3 and Athena by avoiding the need for later client authentication. This is particularly helpful if you are using federated accounts.
The next step is to upload some data to S3. I’ve been using JSON-formatted sample data for my tests. This data has a timestamp attribute which I used for splitting the JSON files (one file per day). You can either use the S3 GUI for uploading or use the AWS CLI, which is an easy way to test if your IAM role is working.
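The per-day (or, as in my test, per-hour) split can be sketched as follows. The record layout and the `ts` attribute name are assumptions for illustration; note that the JSON SerDe used later expects one JSON object per line:

```python
# Sketch: group JSON records into per-hour buckets whose names follow the
# rec_dt=YYYY-MM-DD-HH-00 partition convention used in this post.
import json
from collections import defaultdict

records = [
    {"ts": "2017-06-29T00:15:00", "value": 1.0},
    {"ts": "2017-06-29T00:45:00", "value": 2.0},
    {"ts": "2017-06-29T01:05:00", "value": 3.0},
]

buckets = defaultdict(list)
for rec in records:
    day, hour = rec["ts"][:10], rec["ts"][11:13]
    buckets[f"rec_dt={day}-{hour}-00"].append(rec)

for partition, recs in sorted(buckets.items()):
    # one target file per partition folder, newline-delimited JSON
    payload = "\n".join(json.dumps(r) for r in recs)
    print(f"{partition}/data.json", len(recs))
```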
To install the AWS CLI:
python -m pip install awscli --upgrade
Now let’s create a bucket and upload some data:
aws s3 mb s3://gerhje-test-bucket-123/ --region us-east-1 --endpoint-url https://s3.us-east-1.amazonaws.com
aws s3 cp testfile-0.json s3://gerhje-test-bucket-123/rigdata/rec_dt=2017-06-29-00-00/
upload: ./testfile-0.json to s3://gerhje-test-bucket-123/rigdata/rec_dt=2017-06-29-00-00/testfile-0.json
aws s3 cp testfile-1.json s3://gerhje-test-bucket-123/rigdata/rec_dt=2017-06-29-01-00/
upload: ./testfile-1.json to s3://gerhje-test-bucket-123/rigdata/rec_dt=2017-06-29-01-00/testfile-1.json
aws s3 ls
2019-03-10 15:24:03 gerhje-test-bucket-123
I stored the data in a folder hierarchy made of a root folder (“rigdata”) with subfolders (“rec_dt=XYZ”). These subfolders will be treated as data partitions in Athena. The reason for this approach is that your (external) source data provider systems are assumed to constantly upload data to S3, so you need an efficient data ingestion strategy using delta loads.
With the data in place, you can now head over to the Athena GUI in the AWS web console. The first step is to run a CREATE DATABASE statement:
create database rigdb
Now define the rigdata table, pointing to the S3 data you have just uploaded:
CREATE EXTERNAL TABLE IF NOT EXISTS rigdb.rigdata (
  ts STRING,     -- example columns; match these to your JSON attributes
  value DOUBLE
)
PARTITIONED BY (rec_dt STRING)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
WITH SERDEPROPERTIES ('ignore.malformed.json' = 'true')
LOCATION 's3://gerhje-test-bucket-123/rigdata/'
Again, if you’ve worked with Apache Hive, this will look very familiar.
If you submit this statement, the GUI will warn you that you need to update the table metadata to make Athena aware of existing partitions. The easiest way to do this is to run this command:
MSCK REPAIR TABLE rigdb.rigdata
This will load all partitions at once. Of course, in real life, a data ingestion strategy using delta loads would use a different approach and continuously append new partitions (using an ALTER TABLE statement), but it’s probably best not to worry about that at this stage. It is then worth checking if this has worked:
SELECT count(*) AS c, rec_dt
FROM rigdb.rigdata
GROUP BY rec_dt
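For the delta-loading approach mentioned above, each newly arrived S3 object would be registered via an ALTER TABLE statement rather than a full MSCK REPAIR. A small sketch that derives such a statement from an incoming object key (table and bucket names taken from this example; the helper itself is hypothetical):

```python
# Sketch: turn a newly arrived S3 key into the ALTER TABLE statement that
# registers its partition with Athena.
def add_partition_ddl(table, bucket, s3_key):
    folder = s3_key.rsplit("/", 1)[0]      # strip the file name
    part = folder.rsplit("/", 1)[1]        # e.g. rec_dt=2017-06-29-02-00
    col, value = part.split("=", 1)
    return (f"ALTER TABLE {table} "
            f"ADD IF NOT EXISTS PARTITION ({col}='{value}') "
            f"LOCATION 's3://{bucket}/{folder}/'")

ddl = add_partition_ddl("rigdb.rigdata", "gerhje-test-bucket-123",
                        "rigdata/rec_dt=2017-06-29-02-00/testfile-2.json")
print(ddl)
```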
As a last preparation step before you can finally switch to SAS, you need to download the Athena JDBC driver to the EC2 instance:
yum -y install wget java
mkdir -p /opt/sas/athena-jdbc
# example download; check the AWS documentation for the current driver version and URL
wget -P /opt/sas/athena-jdbc https://s3.amazonaws.com/athena-downloads/drivers/JDBC/SimbaAthenaJDBC_2.0.9/AthenaJDBC42_2.0.9.jar
With this in place, you can now move to SAS Studio and use the test code to read the S3 data:
libname athena clear;
/* the driver class, classpath and URL below are examples for the Simba Athena
   JDBC driver; adjust region, driver version and S3 staging location for your
   own environment */
libname athena jdbc schema="rigdb"
   driverclass="com.simba.athena.jdbc.Driver"
   classpath="/opt/sas/athena-jdbc"
   URL="jdbc:awsathena://athena.us-east-1.amazonaws.com:443;AwsCredentialsProviderClass=com.simba.athena.amazonaws.auth.InstanceProfileCredentialsProvider;S3OutputLocation=s3://gerhje-test-bucket-123/athena-output/";
cas mySession sessopts=(caslib=casuser timeout=1800 locale="en_US");
caslib _all_ assign;
proc casutil;
   load data=athena.rigdata outcaslib="public" casout="rigdata";
quit;
cas mySession terminate;
The most interesting piece in this code is the JDBC URL= parameter. It bundles quite a lot of information: the regional Athena endpoint, the credentials provider class to use (here, the EC2 instance profile created earlier) and the S3 staging location where Athena writes its query results.
You should be able to see that the existing S3 data has been successfully loaded to CAS by now. But what happens if you want to add another slice of data?
aws s3 cp testfile-2.json s3://gerhje-test-bucket-123/rigdata/rec_dt=2017-06-29-02-00/
ALTER TABLE rigdb.rigdata
ADD PARTITION (rec_dt='2017-06-29-02-00')
There are, of course, many other options for SAS to read data from AWS S3: a simple FILENAME statement, PROC S3, and CAS in particular can even read CSV and SASHDAT files directly from S3. There are also the NFS options that I mentioned at the start of this blog.
The approach I have described simply gives another option. Depending on the task you need to solve, it may be useful. I think the main benefits of using Athena are its flexibility in dealing with raw data formats and its support for filtering and partitioning. This is particularly helpful in delta loading scenarios.
I hope this has provided some useful and interesting information. Please do feel free to send me your feedback!