Using Amazon’s Athena as a data extraction interface to S3

1 Like

I have been reading about ways to access S3 data via NFS, and wondered if it was possible to use Amazon’s Athena as a data extraction interface to S3. Athena supports ODBC/JDBC, so it should be viable. So how would you read data from S3 via Athena in a simple “delta loading” data ingestion scenario?

What is Amazon Athena?

Amazon Athena is a query service specifically designed for accessing data in S3. It is one of the core building blocks for serverless architectures in Amazon Web Services (AWS) and is often used in real-time data ingestion scenarios (e.g. IoT cases).

There are two major benefits to using Athena. First, it allows you to use standard SQL for data retrieval, which greatly simplifies data ingestion for raw data in S3. Second, it’s a serverless service which means that all infrastructure details are hidden. As a customer, you focus on using the service and AWS will manage everything else. Athena use is charged on a “per query” basis and the costs depend on the amount of data scanned.

How can you use Amazon Athena to retrieve data from S3?

There are basically two ways of using Athena from a third party client (SAS in this case): JDBC and ODBC. I chose the JDBC route because it seemed simpler. Athena is easier to understand if you’re familiar with Apache Hive and Hadoop. Like Hive, Athena is a schema-on-read query engine. This means that you store your data without needing to describe its layout (or schema). The schema is provided later, when you need to read the data.

The data is already available in S3 when you want to read it, so your schema definition (the CREATE TABLE statement) needs to point to the storage location. Athena therefore uses the same concept of EXTERNAL tables with a LOCATION attribute as Hive. Like Hive, Athena also treats your data in S3 as read-only, so only SELECT queries are supported.

Finally, Athena treats folders in S3 buckets very like Hive treats folders in HDFS: all data files in a folder or subfolders are considered to belong to the table. If the subfolders fit a certain naming pattern, they are treated as partitions, and this can be leveraged to optimize query performance.

Accessing S3 data via Amazon Athena using Python and SAS

I set up a new EC2 instance which served as a client to Athena. When clicking through the setup wizard, I created and assigned an IAM role, allowing the machine to access both the S3 and Athena services:

I would recommend assigning an IAM role at this point, because it simplifies the process of accessing S3 and Athena by avoiding the need for later client authentication. This is particularly helpful if you are using federated accounts.

Create a bucket and upload some data

The next step is to upload some data to S3. I’ve been using JSON-formatted sample data for my tests. This data has a timestamp attribute which I used for splitting the JSON files (one file per day). You can either use the S3 GUI for uploading or use the AWS CLI, which is an easy way to test if your IAM role is working.

To install the AWS CLI:

python -m pip install awscli --upgrade

Now let’s create a bucket and upload some data:

aws s3 mb s3://gerhje-test-bucket-123/ --region us-east-1 --endpoint-url https://s3.us-east-1.amazonaws.com
make_bucket: gerhje-test-bucket-123

aws s3 cp testfile-0.json s3://gerhje-test-bucket-123/rigdata/rec_dt=2017-06-29-00-00/
upload: ./testfile-0.json to s3://gerhje-test-bucket-123/rigdata/rec_dt=2017-06-29-00-00/testfile-0.json

aws s3 cp testfile-1.json s3://gerhje-test-bucket-123/rigdata/rec_dt=2017-06-29-01-00/
upload: ./testfile-1.json to s3://gerhje-test-bucket-123/rigdata/rec_dt=2017-06-29-01-00/testfile-1.json

aws s3 ls
2019-03-10 15:24:03 gerhje-test-bucket-123

I stored the data in a folder hierarchy made of a root folder (“rigdata”) with subfolders (“rec_dt=XYZ”). These subfolders will be treated as data partitions in Athena. The reason for this approach is that your (external) source data provider systems is assumed to constantly upload data to S3 and you need an efficient data ingestion strategy using delta loads.

Create Athena metadata for accessing the S3 data

With the data in place, you can now head over to the Athena GUI in the AWS web console . The first step is to run a CREATE DATABASE statement:

create database rigdb

Now define the rigdata table, pointing to the S3 data you have just uploaded:

CREATE EXTERNAL TABLE IF NOT EXISTS rigdb.rigdata (
  rig            STRING,
  well_depth     INT,
  bit_depth      FLOAT,
  recorded_at    TIMESTAMP,
  status_code    INT,
  status_message STRING
)
PARTITIONED BY (rec_dt STRING)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
WITH SERDEPROPERTIES ('ignore.malformed.json' = 'true')
LOCATION 's3://gerhje-test-bucket-123/rigdata/'

Again, if you’ve worked with Apache Hive, this will look very familiar:

PARTITIONED BY: this declares that the table uses partitions. The subfolder names will be treated as a virtual column called “rec_dt”
ROW FORMAT SERDE: the source data is held in the JSON format, so you need to specify an appropriate serde (“serializer-deserializer”) class
LOCATION: this points to the top folder in S3 where the data is stored

If you submit this statement, the GUI will warn you that you need to update the table metadata to make Athena aware of existing partitions. The easiest way to do this is to run this command:

MSCK REPAIR TABLE rigdb.rigdata

This will load all partitions at once. Of course, in real life, a data ingestion strategy using delta loads would use a different approach and continuously append new partitions (using an ALTER TABLE statement), but it’s probably best not to worry about that at this stage. It is then worth checking if this has worked:

SELECT count(*) as c, rec_dt
  FROM rigdb.rigdata
  GROUP BY rec_dt

Set up a JDBC connection from SAS to upload the S3 data to CAS

As a last preparation step before you can finally switch to SAS, you need to download the Athena JDBC driver to the EC2 instance:

yum -y install wget java
mkdir -p /opt/sas/athena-jdbc
cd /opt/sas/athena-jdbc
wget https://s3.amazonaws.com/athena-downloads/drivers/JDBC/SimbaAthenaJDBC_2.0.7/AthenaJDBC41_2.0.7.jar

With this in place, you can now move to SAS Studio and use the test code to read the S3 data:

libname athena clear;
libname athena jdbc schema="rigdb"
   driverclass="com.simba.athena.jdbc.Driver"
   URL="jdbc:awsathena://AwsRegion=us-east-1;S3OutputLocation=s3://gerhje-test-bucket-123/athena-results/;AwsCredentialsProviderClass=com.simba.athena.amazonaws.auth.InstanceProfileCredentialsProvider;Schema=rigdb"
   classpath="/opt/sas/athena-jdbc/AthenaJDBC41_2.0.7.jar";

cas mySession sessopts=(caslib=casuser timeout=1800 locale="en_US");
caslib _all_ assign;

proc casutil;
   load data=athena.rigdata outcaslib="public"
   casout="rigdata";
run;

cas mySession terminate;

The most interesting piece in this code is the JDBC URL= parameter. It contains quite a lot of information:

AwsRegion: this points to the AWS region where you set up your Athena database and table (usually “us-east-1” as a default).
S3OutputLocation: this specifies a temporary folder which Athena can use for storing query results.
AwsCredentialsProviderClass: one of the pluggable credential provider classes. “InstanceProfileCredentialsProvider” is the one to choose if using an IAM role attached to an EC2 instance.
Schema: the Athena database to use. Note that you need to repeat the schema a second time in the libname statement.

Moving to delta loading

You should be able to see that the existing S3 data has been successfully loaded to CAS by now. But what happens if you want to add another slice of data?

Upload the data to S3

aws s3 cp testfile-2.json s3://gerhje-test-bucket-123/rigdata/rec_dt=2017-06-29-02-00/

Update Athena’s metadata to recognize the new partition

ALTER TABLE rigdb.rigdata
ADD PARTITION (rec_dt='2017-06-29-02-00')
LOCATION 's3://gerhje-test-bucket-123/rigdata/rec_dt=2017-06-29-02-00/'

Append the new slice of data to your CAS table

data public.rigdata(append=yes);
   set athena.rigdata(where=(rec_dt='2017-06-29-02-00'));
run;

Check the results

Conclusion

There are, of course, many other options available for SAS to read data from AWS S3. Using a simple filename statement or using PROC S3, CAS in particular can even read CSV and SASHDAT files directly from S3. There are also the NFS options that I mentioned at the start of this blog.

The approach I have described simply gives another option. Depending on the task you need to solve, it may be useful. I think the main benefits of using Athena are its flexibility in dealing with raw data formats and its support for filtering and partitioning. This is particularly helpful in delta loading scenarios.

I hope this has provided some useful and interesting information. Please do feel free to send me your feedback!

Patrick · ‎11-10-2019

Another interesting option which we're currently exploring in a project is Lustre FSx. It allows to link the Lustre file system with an S3 bucket. Lustre then surfaces all the objects in S3 (metadata) but only replicates the object to Lustre the first time it gets actually used (as background process).

This allows the consuming processes to only interact with the Lustre distributed file system even though the source of truth is stored in S3.

There are also commands to then write back the objects to S3 and to release them from Lustre (=only keep the metadata).

HansEdert · ‎11-11-2019

Hi Patrick, that sounds like a interesting option. I know that we’re often using Lustre for SAS Grid environments because it’s reliable and fast, but it also comes at a certain price tag. So your approach might be a way to get a handle on that, I guess?

Ramprakash · ‎08-19-2020

Hi

I am trying to follow this and read files in S3 bucket using the capability of Athena but undergoing issues.

The files in s3 are in multipart parquet format. Here in this example you have created a table and reading that from SAS, which i understand . however in my case reading a table having parquet format files are not working.

Do you have any idea on how to get this working please?. it complains about the below error

ERROR: Prepare error: [Simba][AthenaJDBC](100071) An error has been thrown from the AWS Athena client. SYNTAX_ERROR: line 1:15:

Schema ******** does not exist

SQL statement: SELECT * FROM *******._ACTIVITY

The database and schema definitely exists as i can query from AWS Athena.

I am able to successfully establish the connection using the JDBC driver.

NOTE: Libref ATHENA was successfully assigned as follows:

Engine: JDBC

Any info on how to fix this would be very helpful and much appreciated.