BookmarkSubscribeRSS Feed

Connecting RAM to Amazon S3 buckets using DuckDB

Started 3 hours ago by
Modified 3 hours ago by
Views 74
Spoiler
Organisations hate data movement beyond the necessary minimum.  DuckDB, a versatile, open-source, lightweight query processing engine enables agents built in SAS Retrieval Agent Manager to query and access data in object storage such as Amazon S3 buckets.  Let's look at an example of how to set this up.

 

Agents fulfill user requests using Large Language Models (LLMs), retrieval mechanisms and tools.  Retrieval mechanisms and tools process data.  This raises two questions.

 

1. Where does the data reside?

2. How can an agent access data?

 

Regarding the first question, organisations have started exploring object storage such as Amazon S3 buckets, Azure data lakes and Azure blob storage as options for persisting their data due to their support for open file formats. Open file formats such as Parquet, CSV and JSON help organisations avoid vendor lock-in and offer efficiency and portability benefits. Cheaper cost over other traditional data stores such as Relational Database Management Systems (RDBMS) and filesystems also plays a part.

 

The answer to the second question is tied closely to the agent - LLM contract. In most cases, an agent first retrieves data prior to calling an LLM, at which stage this data is known as context.  Context is the raw material used by LLMs to process the question and generate answers (as the marketer will happily prattle, context 'grounds the LLM in local knowledge' to improve relevance).   The approach through which this data is retrieved and made available as context is called Retrieval Augmented Generation.  A typical RAG flow is illustrated here.  Those who get the slightly dated joke have permission to chuckle.

 

Typical flow of information in a RAG approachTypical flow of information in a RAG approach

 

 

However, this is not the only occasion to access data during an agentic transaction.  The LLM might either directly answer the question, or it might request further data operations.  This further operation is executed using a tool call, which usually runs in a Model Context Protocol (MCP) server (or, as another favourite regurgitation commonly goes around, the "USB-C for AI").

 

 

Flow of information involving MCP tool callsFlow of information involving MCP tool calls

 

The agent needs to retrieve data performantly, given that agentic interactions are mostly characterised as real-time and involve possibly many concurrent interactions. We must avoid time-consuming and laborious data connections and having to copy large data volumes from a persistence layer to a compute layer.

 

DuckDB, a lightweight, open-source query processing engine proves ideal for such workloads.  As you familiarise yourselves with DuckDB, you'll appreciate that it's not like most conventional database systems running in a separate server but rather runs in-process in the same compute environment intended for data processing.  Furthermore, DuckDB's wide and expanding extension ecosystem facilitates native access to several open file formats such as Parquet, JSON and CSV.  

 

This article looks at SAS Retrieval Agent Manager (RAM), a low-code SAS offering for building and deploying agents and details how to use DuckDB to connect an agent (and its supporting components) to data files in an Amazon S3 bucket offered through Amazon Web Services (AWS).    RAM offers three coding templates - the source template, the agent template and the tool template - all of which can be used to connect to S3 based on the use case. As a  suggestion,

 

1.  Initial retrieval of data to use within a collection for RAG uses the Source template.

2. Retrieval of non-vectorised data during an interaction, i.e. to perform RAG,  uses the Agent template

3. Retrieving data to process during the execution of a tool call  uses the Tool Server template

 

Create an S3 bucket and upload data to the same

 

There are several tutorials available on the internet showing how to do this.  My favourite is AWS's own Getting Started with Amazon S3.

 

For this example, I created a general purpose bucket and uploaded a parquet file containing structured data about retail product reviews.

 

Example of a general-purpose S3 bucketExample of a general-purpose S3 bucket

 

Establish credentials and access to the S3 bucket

 

Enterprises stablish access controls to prevent making their S3 buckets and data contents publicly accessible.  For this purpose, I (actually, my colleague who's an administrator, to be precise) used the Identity Access Management (IAM) service of AWS which controls access to AWS resources.  A simple overview of the process is as follows.

i. Create an IAM user which is an identity created for specific person, application, or project that requires access to AWS resources.


ii. Grant permissions to the IAM user through IAM policies. These policies define what actions the user can perform on S3 resources. Permissions can range from read-only access to full administrative control, depending on the requirements of the use case.


iii. Finally, obtain credentials which allow applications to request operations under the IAM identity.  This comprises of two values, the aws_access_key_id which acts like a username, and the aws_secret_access_key, which acts like a password.

The third step is important.  Save your aws_access_key_id & aws_secret_access_key in a secure location because you shall need it to use in RAM.

 

 

Connecting RAM to S3

 

While this example focusses on the MCP Tool Server template in RAM, the same approach applies for the other two templates - source and agent - as well.  To begin with, let's save our credentials in our RAM server for access by programs.  

 

1. Create a new tool server template (or open an existing one)

 

2. Navigate to the Environment Variables tab

 

3. Click on the (+) button to create a new variable.

 

4. Name it s3_access_key_id (note the change in name - 's3' instead of 'aws', meant as a convenience)  and add the appropriate value for aws_access_key_id  in the Default Value field.

 

5. Check the "Secret" check box which prevents unnecessary  exposure of these values to other eyes (though, remember that this is not complete protection of credentials)

 

6.  Do the same (steps 3-5) for s3_secret_access_key (which is the name you give in the Environment Variables tab, containing the value for aws_secret_access_key)

 

 

At all times, take care to prevent unauthorised access to keys.  Also remember that template edit access is available to only RAM Admins.

 

The net effect of this step is that the credentials are available as environment variables in the Code section of the MCP template.  The next step involves using the DuckDB Python package to help establish a connection and query S3 files.

 

1. Move to the Code tab and the "requirements.txt" sub-tab of the MCP Tool Server template.

 

2. Add duckdb to requirements.txt.  This ensures that DuckDB is installed the next time the tool server is initialised.

 

3. Move to the "run.py" sub-tab and enter the following code.

 

import duckdb as dd

con = dd.connect()
con.execute("INSTALL httpfs;")
con.execute("LOAD httpfs;")
con.execute(f"set s3_access_key_id= '{os.getenv('s3_access_key_id')}'")
con.execute(f"set s3_secret_access_key = '{os.getenv('s3_secret_access_key')}'")

 

httpfs  is a DuckDB extension that allows reading of data from remote files in S3 and HTTP-compatible file storage.   The renaming of the credential variables from "aws" to "s3" was a convenience to facilitate mapping to the correct settings within DuckDB that enable DuckDB to authenticate against the S3 bucket while attempting a query.

 

The query itself is usually defined within a tool definition block. As an example, the following tool counts the number of records in a table (table name provided as a parameter) and returns the same.  Note that the file_name parameter value contains the full S3 URL to the Parquet file queried.  And, of course, other functions apply when you wish to read from JSON or CSV or any other format.

 

@tool
def get_nbr_records(file_name: str) -> int:
    """Return the total number of records in the specified file"""
    nbr_records = con.execute(f"SELECT COUNT(*) as nbr_records FROM read_parquet('{file_name}');").fetchone()[0]
    return nbr_records

 

Refer RAM documentation for further instructions on creating and running a tool server.  The following video illustrates a couple of commands that take advantage of the S3 connection.  Notice the details window which, when opened, illustrates the chain of commands flowing from the agent to the LLM and then to the tool server.  

 

 

(view in My Videos)

 

DuckDB serves as a universal access connector to query data in remote filesystems and object storage in an efficient manner.  For further details regarding how Parquet files and DuckDB support these workloads, refer these articles.

 

1. Performant Aggregations on Parquet using DuckDB

2. Chatting with Your Data, Wherever It Lives: Unlock Insights through DuckDB and Open File Formats

3.  Parquet Partitioning in DuckDB: Some Points for Consideration

 

 

Enjoy interacting with your S3 files in Retrieval Agent Manager.  Do get in touch with your comments.

Contributors
Version history
Last update:
3 hours ago
Updated by:

Viya Copilot Motion Graphic.gif

Ready to see what SAS Viya Copilot can do?

Visit the Tips & Tricks page for setup guidance, demos, and practical examples that show how Copilot supports your workflows.

Get Started →

SAS AI and Machine Learning Courses

The rapid growth of AI technologies is driving an AI skills gap and demand for AI talent. Ready to grow your AI literacy? SAS offers free ways to get started for beginners, business leaders, and analytics professionals of all skill levels. Your future self will thank you.

Get started

Article Tags