Connecting RAM to Amazon S3 buckets using DuckDB

Organisations hate data movement beyond the necessary minimum. DuckDB, a versatile, open-source, lightweight query processing engine enables agents built in SAS Retrieval Agent Manager to query and access data in object storage such as Amazon S3 buckets. Let's look at an example of how to set this up.

Agents fulfill user requests using Large Language Models (LLMs), retrieval mechanisms and tools. Retrieval mechanisms and tools process data. This raises two questions.

1. Where does the data reside?

2. How can an agent access data?

Regarding the first question, organisations have started exploring object storage such as Amazon S3 buckets, Azure data lakes and Azure blob storage as options for persisting their data due to their support for open file formats. Open file formats such as Parquet, CSV and JSON help organisations avoid vendor lock-in and offer efficiency and portability benefits. Cheaper cost over other traditional data stores such as Relational Database Management Systems (RDBMS) and filesystems also plays a part.

The answer to the second question is tied closely to the agent - LLM contract. In most cases, an agent first retrieves data prior to calling an LLM, at which stage this data is known as context. Context is the raw material used by LLMs to process the question and generate answers (as the marketer will happily prattle, context 'grounds the LLM in local knowledge' to improve relevance). The approach through which this data is retrieved and made available as context is called Retrieval Augmented Generation. A typical RAG flow is illustrated here. Those who get the slightly dated joke have permission to chuckle.

Typical flow of information in a RAG approach

However, this is not the only occasion to access data during an agentic transaction. The LLM might either directly answer the question, or it might request further data operations. This further operation is executed using a tool call, which usually runs in a Model Context Protocol (MCP) server (or, as another favourite regurgitation commonly goes around, the "USB-C for AI").

Flow of information involving MCP tool calls

The agent needs to retrieve data performantly, given that agentic interactions are mostly characterised as real-time and involve possibly many concurrent interactions. We must avoid time-consuming and laborious data connections and having to copy large data volumes from a persistence layer to a compute layer.

DuckDB, a lightweight, open-source query processing engine proves ideal for such workloads. As you familiarise yourselves with DuckDB, you'll appreciate that it's not like most conventional database systems running in a separate server but rather runs in-process in the same compute environment intended for data processing. Furthermore, DuckDB's wide and expanding extension ecosystem facilitates native access to several open file formats such as Parquet, JSON and CSV.

This article looks at SAS Retrieval Agent Manager (RAM), a low-code SAS offering for building and deploying agents and details how to use DuckDB to connect an agent (and its supporting components) to data files in an Amazon S3 bucket offered through Amazon Web Services (AWS). RAM offers three coding templates - the source template, the agent template and the tool template - all of which can be used to connect to S3 based on the use case. As a suggestion,

1. Initial retrieval of data to use within a collection for RAG uses the Source template.

2. Retrieval of non-vectorised data during an interaction, i.e. to perform RAG, uses the Agent template

3. Retrieving data to process during the execution of a tool call uses the Tool Server template

Create an S3 bucket and upload data to the same

There are several tutorials available on the internet showing how to do this. My favourite is AWS's own Getting Started with Amazon S3.

For this example, I created a general purpose bucket and uploaded a parquet file containing structured data about retail product reviews.

Example of a general-purpose S3 bucket

Establish credentials and access to the S3 bucket

Enterprises stablish access controls to prevent making their S3 buckets and data contents publicly accessible. For this purpose, I (actually, my colleague who's an administrator, to be precise) used the Identity Access Management (IAM) service of AWS which controls access to AWS resources. A simple overview of the process is as follows.

i. Create an IAM user which is an identity created for specific person, application, or project that requires access to AWS resources.

ii. Grant permissions to the IAM user through IAM policies. These policies define what actions the user can perform on S3 resources. Permissions can range from read-only access to full administrative control, depending on the requirements of the use case.

iii. Finally, obtain credentials which allow applications to request operations under the IAM identity. This comprises of two values, the aws_access_key_id which acts like a username, and the aws_secret_access_key, which acts like a password.

The third step is important. Save your aws_access_key_id & aws_secret_access_key in a secure location because you shall need it to use in RAM.

Connecting RAM to S3

While this example focusses on the MCP Tool Server template in RAM, the same approach applies for the other two templates - source and agent - as well. To begin with, let's save our credentials in our RAM server for access by programs.

1. Create a new tool server template (or open an existing one)

2. Navigate to the Environment Variables tab

3. Click on the (+) button to create a new variable.

4. Name it s3_access_key_id (note the change in name - 's3' instead of 'aws', meant as a convenience) and add the appropriate value for aws_access_key_id in the Default Value field.

5. Check the "Secret" check box which prevents unnecessary exposure of these values to other eyes (though, remember that this is not complete protection of credentials)

6. Do the same (steps 3-5) for s3_secret_access_key (which is the name you give in the Environment Variables tab, containing the value for aws_secret_access_key)

At all times, take care to prevent unauthorised access to keys. Also remember that template edit access is available to only RAM Admins.

The net effect of this step is that the credentials are available as environment variables in the Code section of the MCP template. The next step involves using the DuckDB Python package to help establish a connection and query S3 files.

1. Move to the Code tab and the "requirements.txt" sub-tab of the MCP Tool Server template.

2. Add duckdb to requirements.txt. This ensures that DuckDB is installed the next time the tool server is initialised.

3. Move to the "run.py" sub-tab and enter the following code.

import duckdb as dd

con = dd.connect()
con.execute("INSTALL httpfs;")
con.execute("LOAD httpfs;")
con.execute(f"set s3_access_key_id= '{os.getenv('s3_access_key_id')}'")
con.execute(f"set s3_secret_access_key = '{os.getenv('s3_secret_access_key')}'")

httpfs is a DuckDB extension that allows reading of data from remote files in S3 and HTTP-compatible file storage. The renaming of the credential variables from "aws" to "s3" was a convenience to facilitate mapping to the correct settings within DuckDB that enable DuckDB to authenticate against the S3 bucket while attempting a query.

The query itself is usually defined within a tool definition block. As an example, the following tool counts the number of records in a table (table name provided as a parameter) and returns the same. Note that the file_name parameter value contains the full S3 URL to the Parquet file queried. And, of course, other functions apply when you wish to read from JSON or CSV or any other format.

@tool
def get_nbr_records(file_name: str) -> int:
    """Return the total number of records in the specified file"""
    nbr_records = con.execute(f"SELECT COUNT(*) as nbr_records FROM read_parquet('{file_name}');").fetchone()[0]
    return nbr_records

Refer RAM documentation for further instructions on creating and running a tool server. The following video illustrates a couple of commands that take advantage of the S3 connection. Notice the details window which, when opened, illustrates the chain of commands flowing from the agent to the LLM and then to the tool server.

(view in My Videos)

DuckDB serves as a universal access connector to query data in remote filesystems and object storage in an efficient manner. For further details regarding how Parquet files and DuckDB support these workloads, refer these articles.

1. Performant Aggregations on Parquet using DuckDB

2. Chatting with Your Data, Wherever It Lives: Unlock Insights through DuckDB and Open File Formats

3. Parquet Partitioning in DuckDB: Some Points for Consideration

Enjoy interacting with your S3 files in Retrieval Agent Manager. Do get in touch with your comments.