Wikipedia gave the venerable Encyclopaedia Britannica a jolt. Large Language Models (LLMs) haven’t quite done the same to Wikipedia yet but, thanks to their nature as aggregators and summarisers (for good or bad) of knowledge, they present a viable alternative to Wikipedia browsing. Also, you can avoid those increasingly frequent requests to donate a dollar (which, I’d argue, one should consider from time to time, but that’s not the issue here).
Wikipedia continues to enrich the knowledge ecosystem as a reliable, collaboratively curated corpus. I, for one, like to have the best of both worlds: maintain my connection to Wikipedia while also consuming its knowledge better through LLMs.
So, I used Retrieval Agent Manager (RAM), a new offering from SAS for simplifying and automating knowledge query systems, to design a nifty Wiki of the Day solution. My idea involved a scheduled daily program to fetch a random article from Wikipedia for purposes of learning and my general betterment. Of course, it’s debatable whether learning about the performance of the Slovakian team at the 1992 World Aquatics Championships (my Wiki of the Day for 18th October) leads to my betterment, but I tend to keep a broad mind about such things.
RAM offers automation and convenient access to Retrieval Augmented Generation (RAG) methods. At a minimum, a RAM project consists of the following main components: a source that supplies documents, a collection (with one or more configurations) governing how those documents are vectorised and queried, and a chat interface through which an LLM answers questions over the collection.
While we can experiment with all of the above and more, my interest in this use case centres on automated data ingestion, updates, and processing. Let’s look at these steps one by one.
RAM provides three source mechanisms. “Local” has the user upload documents manually, “Git” pulls documents located in a folder on a git repository, and “Custom”, as the name implies, stands for a custom data source. A simple way to think of a Custom source is “Bring Your Own Code”: in this case, we execute a Python function to populate the source with data.
To start, log on to RAM, go to Sources, and select New Source. After naming it whatever you want, set the source type to Custom. Once you do so, notice that a new tab called Code appears.
Examine the code provided in the Code tab. This is an example and serves as a scaffold. Setting it aside for a moment, let’s think about our problem independent of SAS, RAM, or any particular piece of code. To fulfil my purpose, I break down my steps into the following:
1. Fetch a random article from Wikipedia.
2. Extract readable text from the HTML response.
3. Save the text to a file and attach it to the Custom source.
To access a random article, Wikipedia has made our life easy through a link on its website pointing to https://en.wikipedia.org/wiki/Special:Random. We can retrieve the content at this URL through a web request.
The requests module in Python serves this purpose. While some commonly used packages might be provided as part of your RAM implementation, you may like to use the package.zip mechanism to add a requirements.txt file containing any required Python packages. Refer to the RAM documentation for more details regarding the package.zip method.
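For this project, for instance, the requirements.txt would list just the two extra packages the code needs (the exact package.zip layout is covered in the documentation):

requests
beautifulsoup4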
Since a program, and not an actual user, makes this web request, you need to add a header to assure Wikipedia that the request comes from a credible and identifiable source. This is accomplished by including a User-Agent header as part of your request. Note that this is a useful pattern for many websites, not just Wikipedia, and it prevents your request from being denied.
random_wiki_url = "https://en.wikipedia.org/wiki/Special:Random"
headers = {
    'User-Agent': 'MyWikiRandom/1.0 (https://provide/a/website; provide.an@email.com)'
}
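With the URL and headers in place, the fetch itself is a single call. A minimal sketch (the raise_for_status() guard is my suggested addition rather than part of the original snippet):

import requests

# Wikipedia redirects Special:Random to a random article; requests follows it
resp = requests.get(url=random_wiki_url, headers=headers)
resp.raise_for_status()  # optional guard: fail fast on a non-2xx response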
To improve readability and analysis, save the content as readable text rather than HTML. To do this, I use beautifulsoup4, a popular Python package to parse and extract text from HTML. In a future article, I’ll share other ways to clean this content further. For now, we use the get_text() method of beautifulsoup4 (also known as bs4, because obviously you know who’s bs number one) to extract text content from an HTML file.
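Continuing from the request above, the extraction itself takes just two lines:

from bs4 import BeautifulSoup

soup = BeautifulSoup(resp.content, 'html.parser')
text = soup.get_text()  # strips the tags, leaving only readable text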
Finally, we need to associate the content with the “source” we defined earlier. For this purpose, the client object associated with RAM comes in useful. This is the reason you see
from sasram.source import SourceClient
as part of the code. We thus write the content out to a file and save the file through the client object, which represents the source, using the following lines of code.
with open("/tmp/Wiki_of_the_Day.txt", "w") as f:
    f.write(soup.get_text())
client.save_file("/tmp/Wiki_of_the_Day.txt")
Note that I keep my filename static, i.e. I always call it Wiki_of_the_Day.txt. This makes every automated run of this code overwrite the same file, ensuring it always holds today’s (or, to be precise, the most recent) random Wikipedia article.
Our final code looks like this:
from sasram.source import SourceClient
import requests
from bs4 import BeautifulSoup

filename = "Wiki_of_the_Day.txt"
random_wiki_url = "https://en.wikipedia.org/wiki/Special:Random"
headers = {
    'User-Agent': 'MyWikiRandom/1.0 (https://github.com/SundareshSankaran; sundaresh.sankaran@gmail.com)'
}

def exec(client: SourceClient):
    # Fetch a random article; Wikipedia redirects Special:Random to one
    resp = requests.get(url=random_wiki_url, headers=headers)
    # Parse the HTML and keep only the readable text
    soup = BeautifulSoup(resp.content, 'html.parser')
    with open(f"/tmp/{filename}", "w") as f:
        f.write(soup.get_text())
    # Attach the file to the RAM source this code populates
    client.save_file(f"/tmp/{filename}")
Navigate to the Info tab and go to the Schedule field. Here, enter a cron expression to schedule the code to run at 2 a.m. every day: 0 2 * * * (the five fields are minute, hour, day of month, month, and day of week, so this reads as “minute 0 of hour 2, every day”). This way, the article is ready for you every morning when you sit at your desk, coffee in hand.
Save the definition and open the Jobs tab at the bottom of the Sources page. If you are like me, you do not want to wait until 2 a.m. for the job to run. Manually trigger the run through the icon on the top right side of the Jobs section. Move over to the Files section, and you will notice a file named Wiki_of_the_Day.txt appear. You can preview or download the file through the respective icons on the top right-hand side of this section.
I need RAM to follow a workflow which vectorises this data upon update. Vectorising refers to converting text into numeric vectors, i.e. points in a space with a fixed number of dimensions, so that similar passages end up close together. This vectorised data is stored in a collection for later retrieval and query through a candidate LLM. All of this is controlled by a configuration specific to a collection.
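To make “vectorising” concrete, here is a minimal sketch of an embedding step, using the sentence-transformers package purely for illustration (an assumption on my part; RAM performs this internally with whichever embedding model you configure):

from sentence_transformers import SentenceTransformer

# Illustrative only: RAM handles embedding internally via your configured model
model = SentenceTransformer("all-MiniLM-L6-v2")
passages = ["First passage of the article...", "Second passage..."]
vectors = model.encode(passages)
print(vectors.shape)  # (2, 384): each passage becomes a 384-dimensional point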
You can create more than one configuration based on need. In this case, I go with just one, the champion configuration. To create a collection, go to the Collections tab and select New Collection. Provide details such as the name and description and, importantly, select your Wiki_of_the_Day source (or whatever you named it) to be used in the collection.
Next, create a configuration within the collection. This involves multiple steps, such as choosing an embedding model, a vector database, the LLM to be used for queries, filters for document types, and parameters governing vectorisation, such as the chunking strategy.
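To make “chunking strategy” concrete: chunking can be as simple as sliding a fixed-size window with some overlap across the text. A minimal sketch, purely illustrative (RAM exposes its own chunking parameters):

def chunk_text(text: str, size: int = 500, overlap: int = 50) -> list:
    # Slide a window of `size` characters, stepping by (size - overlap)
    # so that neighbouring chunks share some context
    step = size - overlap
    return [text[start:start + size] for start in range(0, len(text), step)]

# Each chunk is embedded and stored as a separate point in the vector database
with open("/tmp/Wiki_of_the_Day.txt") as f:
    chunks = chunk_text(f.read())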
It’s important to select the correct Configuration Update Strategy, which determines how changes to a source document are handled in the collection. The choice of an appropriate strategy (Disabled, Append, Append and sync, or Append, sync and delete) depends on the use case. In this case, my purpose is to have a Wiki of the Day, i.e. to allow the document to change its contents every day. I therefore choose the “Append and sync” strategy (“Append, sync and delete” would also work) so that my document collection always reflects the most recent updates. Refer to the documentation for an illustration of how these strategies work.
A small road bump regarding user experience: the distinction between a collection and a configuration is still not very intuitive, and you may find yourself missing some parameters. The LLM tab is especially tricky (remember to click the additional save button after choosing the LLM). Refer to the documentation (https://go.documentation.sas.com/doc/en/ragntmgrcdc/default/ragntmgrug/titlepage.htm?fromDefault=) in case of any questions.
We want daily changes in the source file to reflect automatically in the collection. With the update strategy chosen above in place, a vectorisation job is triggered every time a source file is created or updated, keeping the collection in sync with the source.
It’s time to query your collection through a chat interface, powered by an LLM using Retrieval Augmented Generation.
A pedantic approach would suggest you create an Agent to orchestrate this chat experience but, for this simple example, I subscribe to the principle that an “Agent is as Agent does”. For my use case, the Assistant interface in RAM performs the role of an agent.
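Conceptually, each chat turn follows the classic RAG loop. Here is a rough sketch, where query_vector_db and call_llm are hypothetical placeholders for what RAM does behind the scenes:

def answer(question: str) -> str:
    # 1. Retrieve the chunks most similar to the question from the collection
    top_chunks = query_vector_db(question, k=5)  # hypothetical helper
    # 2. Ground the LLM in the retrieved context
    context = "\n\n".join(top_chunks)
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    # 3. Generate the final, grounded answer
    return call_llm(prompt)  # hypothetical helper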
To interact with your collection, go to the Chat tab, select your configuration and an available LLM from your configuration, and start chatting. Need to break the ice? Simply type,
“What’s today’s article all about?”
And sure enough, after a short while, you get a response.
And just like that, it’s working already! I was ready to spend the rest of my life believing The Hustler was just a Paul Newman movie. Now, I know much better!
As mentioned earlier, this was a simple and fun example to get you started (and give your brain a bit of daily food for thought). However, you can extend this project in multiple ways. For example, you could clean the extracted text further before vectorising it (more on this in a future article), orchestrate the chat experience through a fully fledged Agent, or fetch articles from specific categories rather than completely at random.
These are some initial thoughts; expect some follow-up articles on the same. Feel free to let me know your thoughts and ideas through email or the comments section.