I have been reading about ways to access S3 data via NFS, and wondered if it was possible to use Amazon’s Athena as a data extraction interface to S3. Athena supports ODBC/JDBC, so it should be viable. So how would you read data from S3 via Athena in a simple “delta loading” data ingestion scenario?
Amazon Athena is a query service specifically designed for accessing data in S3. It is one of the core building blocks for serverless architectures in Amazon Web Services (AWS) and is often used in real-time data ingestion scenarios (e.g. IoT cases).
There are two major benefits to using Athena. First, it allows you to use standard SQL for data retrieval, which greatly simplifies data ingestion for raw data in S3. Second, it is a serverless service, which means that all infrastructure details are hidden: as a customer you focus on using the service and AWS manages everything else. Athena is charged on a “per query” basis and the cost depends on the amount of data scanned (at the time of writing roughly $5 per terabyte, so a query that scans 10 GB costs about five cents).
There are basically two ways of using Athena from a third party client (SAS in this case): JDBC and ODBC. I chose the JDBC route because it seemed simpler. Athena is easier to understand if you’re familiar with Apache Hive and Hadoop. Like Hive, Athena is a schema-on-read query engine. This means that you store your data without needing to describe its layout (or schema). The schema is provided later, when you need to read the data.
The data is already available in S3 when you want to read it, so your schema definition (the CREATE TABLE statement) needs to point to the storage location. Athena therefore uses the same concept of EXTERNAL tables with a LOCATION attribute as Hive. Like Hive, Athena also treats your data in S3 as read-only, so only SELECT queries are supported.
Finally, Athena treats folders in S3 buckets much as Hive treats folders in HDFS: all data files in a folder or its subfolders are considered to belong to the table. If the subfolders follow a key=value naming pattern, they are treated as partitions, and this can be leveraged to optimize query performance.
I set up a new EC2 instance to serve as a client to Athena. While clicking through the setup wizard, I created and assigned an IAM role, allowing the machine to access both the S3 and Athena services.
I would recommend assigning an IAM role at this point, because it simplifies the process of accessing S3 and Athena by avoiding the need for later client authentication. This is particularly helpful if you are using federated accounts.
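If you prefer to script this part, the required permissions can also be attached to the role from the command line. This is only a sketch: the role name is a placeholder, and the two AWS-managed policies used here are deliberately broad, so you would normally scope them down to the specific bucket.

# Hypothetical role name; AmazonAthenaFullAccess and AmazonS3FullAccess are broad
# AWS-managed policies -- restrict them to your bucket in a real setup.
aws iam attach-role-policy --role-name my-athena-client-role \
    --policy-arn arn:aws:iam::aws:policy/AmazonAthenaFullAccess
aws iam attach-role-policy --role-name my-athena-client-role \
    --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess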
The next step is to upload some data to S3. I’ve been using JSON-formatted sample data for my tests. This data has a timestamp attribute which I used for splitting the JSON files (one file per day). You can either use the S3 GUI for uploading or use the AWS CLI, which is an easy way to test if your IAM role is working.
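For reference, a record in these files might look roughly like the lines below. The field values are invented purely for illustration; they simply match the table schema defined further down, with one JSON object per line as the JSON SerDe expects.

# create a small local test file with made-up records
cat > testfile-0.json <<'EOF'
{"rig": "rig-0042", "well_depth": 1200, "bit_depth": 1187.5, "recorded_at": "2017-06-29 00:15:00", "status_code": 0, "status_message": "drilling"}
{"rig": "rig-0042", "well_depth": 1210, "bit_depth": 1195.2, "recorded_at": "2017-06-29 00:45:00", "status_code": 1, "status_message": "tripping"}
EOF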
To install the AWS CLI:
python -m pip install awscli --upgrade
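Once the IAM role is attached to the instance, the CLI can pick up temporary credentials from the instance metadata, so no access keys need to be configured. A quick way to check which identity it is using before touching S3:

# should report the account and the ARN of the instance profile role
aws sts get-caller-identity --region us-east-1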
Now let’s create a bucket and upload some data:
aws s3 mb s3://gerhje-test-bucket-123/ --region us-east-1 --endpoint-url https://s3.us-east-1.amazonaws.com
make_bucket: gerhje-test-bucket-123
aws s3 cp testfile-0.json s3://gerhje-test-bucket-123/rigdata/rec_dt=2017-06-29-00-00/
upload: ./testfile-0.json to s3://gerhje-test-bucket-123/rigdata/rec_dt=2017-06-29-00-00/testfile-0.json
aws s3 cp testfile-1.json s3://gerhje-test-bucket-123/rigdata/rec_dt=2017-06-29-01-00/
upload: ./testfile-1.json to s3://gerhje-test-bucket-123/rigdata/rec_dt=2017-06-29-01-00/testfile-1.json
aws s3 ls
2019-03-10 15:24:03 gerhje-test-bucket-123
I stored the data in a folder hierarchy made up of a root folder (“rigdata”) with subfolders (“rec_dt=XYZ”). These subfolders will be treated as data partitions in Athena. The reason for this approach is that your (external) data provider systems are assumed to upload data to S3 continuously, so you need an efficient data ingestion strategy based on delta loads.
With the data in place, you can now head over to the Athena GUI in the AWS web console. The first step is to run a CREATE DATABASE statement:
CREATE DATABASE rigdb
Now define the rigdata table, pointing to the S3 data you have just uploaded:
CREATE EXTERNAL TABLE IF NOT EXISTS rigdb.rigdata (
    rig STRING,
    well_depth INT,
    bit_depth FLOAT,
    recorded_at TIMESTAMP,
    status_code INT,
    status_message STRING
)
PARTITIONED BY (rec_dt STRING)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
WITH SERDEPROPERTIES ('ignore.malformed.json' = 'true')
LOCATION 's3://gerhje-test-bucket-123/rigdata/'
Again, if you’ve worked with Apache Hive, this will look very familiar.
If you submit this statement, the GUI will warn you that you need to update the table metadata to make Athena aware of existing partitions. The easiest way to do this is to run this command:
MSCK REPAIR TABLE rigdb.rigdata
This will load all partitions at once. Of course, in real life, a delta-loading ingestion strategy would take a different approach and continuously append new partitions using an ALTER TABLE statement (shown further down), but there is no need to worry about that at this stage. It is then worth checking whether this has worked:
SELECT count(*) as c, rec_dt
FROM rigdb.rigdata
GROUP BY rec_dt
As a last preparation step before you can finally switch to SAS, you need to download the Athena JDBC driver to the EC2 instance:
yum -y install wget java
mkdir -p /opt/sas/athena-jdbc
cd /opt/sas/athena-jdbc
wget https://s3.amazonaws.com/athena-downloads/drivers/JDBC/SimbaAthenaJDBC_2.0.7/AthenaJDBC41_2.0.7.jar
With this in place, you can now move to SAS Studio and use the test code to read the S3 data:
/* clear any previous assignment of the athena libref */
libname athena clear;

/* connect to Athena via JDBC; credentials come from the EC2 instance profile */
libname athena jdbc schema="rigdb"
   driverclass="com.simba.athena.jdbc.Driver"
   URL="jdbc:awsathena://AwsRegion=us-east-1;S3OutputLocation=s3://gerhje-test-bucket-123/athena-results/;AwsCredentialsProviderClass=com.simba.athena.amazonaws.auth.InstanceProfileCredentialsProvider;Schema=rigdb"
   classpath="/opt/sas/athena-jdbc/AthenaJDBC41_2.0.7.jar";

/* start a CAS session and assign the predefined caslibs */
cas mySession sessopts=(caslib=casuser timeout=1800 locale="en_US");
caslib _all_ assign;

/* load the Athena table into CAS memory */
proc casutil;
   load data=athena.rigdata outcaslib="public"
      casout="rigdata";
run;

cas mySession terminate;
The most interesting piece in this code is the JDBC URL= parameter. It packs quite a lot of information into a single string:
- AwsRegion: the AWS region in which Athena is called (us-east-1 here).
- S3OutputLocation: an S3 location where Athena stages its query results.
- AwsCredentialsProviderClass: tells the driver to pick up credentials from the EC2 instance profile, i.e. the IAM role assigned earlier, so no keys need to be stored on the client.
- Schema: the default Athena database to use (rigdb).
You should now be able to see that the existing S3 data has been successfully loaded into CAS. But what happens if you want to add another slice of data? Three steps are needed: upload the new file to a new partition folder, make Athena aware of the new partition, and append the new rows to the CAS table.
First, upload the new file:

aws s3 cp testfile-2.json s3://gerhje-test-bucket-123/rigdata/rec_dt=2017-06-29-02-00/

Then register the new partition with Athena:

ALTER TABLE rigdb.rigdata
ADD PARTITION (rec_dt='2017-06-29-02-00')
LOCATION 's3://gerhje-test-bucket-123/rigdata/rec_dt=2017-06-29-02-00/'

Finally, append just that slice to the CAS table from SAS:

data public.rigdata(append=yes);
   set athena.rigdata(where=(rec_dt='2017-06-29-02-00'));
run;
There are, of course, many other options for SAS to read data from AWS S3. You can use a simple FILENAME statement or PROC S3, and CAS in particular can even read CSV and SASHDAT files directly from S3. There are also the NFS options that I mentioned at the start of this blog.
The approach I have described simply gives another option. Depending on the task you need to solve, it may be useful. I think the main benefits of using Athena are its flexibility in dealing with raw data formats and its support for filtering and partitioning. This is particularly helpful in delta loading scenarios.
I hope this has provided some useful and interesting information. Please do feel free to send me your feedback!
Another interesting option which we're currently exploring in a project is Lustre FSx. It allows you to link a Lustre file system to an S3 bucket. Lustre then surfaces all the objects in S3 (as metadata) but only replicates an object to Lustre the first time it is actually used (as a background process).
This allows the consuming processes to interact only with the Lustre distributed file system, even though the source of truth is stored in S3.
There are also commands to write the objects back to S3 and to release them from Lustre (i.e. keep only the metadata).
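Roughly, and assuming the standard Lustre HSM tooling (the file path here is invented for illustration), the write-back and release look like this:

lfs hsm_archive /fsx/rigdata/testfile-0.json   # copy the file back to the linked S3 bucket
lfs hsm_release /fsx/rigdata/testfile-0.json   # free the space on Lustre, keeping only the metadata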
Hi Patrick, that sounds like an interesting option. I know that we’re often using Lustre for SAS Grid environments because it’s reliable and fast, but it also comes with a certain price tag. So your approach might be a way to get a handle on that, I guess?
Hi,
I am trying to follow this and read files from an S3 bucket using Athena, but I am running into issues. The files in S3 are in multipart Parquet format. In this example you have created a table and read it from SAS, which I understand; however, in my case reading a table based on Parquet files does not work.
Do you have any idea how to get this working, please? It complains about the error below: