Wikipedia gave the venerable Encyclopaedia Britannica a jolt. Large Language Models (LLMs) haven’t quite done the same to Wikipedia yet but, thanks to their nature as aggregators and summarisers (for good or bad) of knowledge, they present a viable alternative to Wikipedia browsing. Also, you can avoid those increasingly frequent requests to donate a dollar (which, I’d argue, one should consider from time to time, but that’s not the issue here).
Wikipedia continues to enrich the knowledge ecosystem as a reliable, collaboratively curated corpus. I, for one, like to have the best of both worlds: maintain my connection to Wikipedia while also consuming its knowledge better through LLMs.
So, I used Retrieval Agent Manager (RAM), a new offering from SAS for simplifying and automating knowledge query systems, to design a nifty Wiki of the Day solution. My idea involved a scheduled daily program to fetch a random article from Wikipedia for purposes of learning and my general betterment. Of course, it’s debatable whether learning about the performance of the Slovakian team at the 1992 World Aquatics Championships (my Wiki of the Day for 18th October) leads to my betterment, but I tend to keep a broad mind about such things.
RAM offers automation and convenient access to Retrieval Augmented Generation (RAG) methods. At a minimum, a RAM project consists of the following main components: a source that supplies documents, a collection (with one or more configurations) governing how those documents are vectorised and queried, and a chat interface through which an LLM answers questions over the collection.
While we can experiment with all of the above and more, my interest in this use case centres on automated data ingestion, updates, and processing. Let’s look at these steps one by one.
RAM provides three source mechanisms. “Local” has the user upload documents manually, “Git” pulls documents located in a folder on a git repository, and “Custom”, as the name implies, stands for a custom data source. A simple way to think of a Custom source is “Bring Your Own Code”: in this case, we execute a Python function to populate the source with data.
To start, log on to RAM, go to Sources, and select New Source. After naming it whatever you want, set the source type to Custom. Once you do so, notice that a new tab called Code appears.
Examine the code provided in the Code tab. This is an example and serves as a scaffold. Setting it aside for a moment, let’s think about our problem independent of SAS, RAM, or any particular piece of code. To fulfil my purpose, I break down my steps into the following:
1. Fetch a random article from Wikipedia.
2. Extract readable text from the HTML response.
3. Save the text to a file and attach it to the Custom source.
To access a random article, Wikipedia has made our life easy through a link on its website pointing to https://en.wikipedia.org/wiki/Special:Random. We can retrieve the content at this URL through a web request.
The requests module in Python serves this purpose. While some commonly used packages might be provided as part of your RAM implementation, you may like to use the package.zip mechanism to add a requirements.txt file containing any required Python packages. Refer to the RAM documentation for more details regarding the package.zip method.
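For this project, for instance, the requirements.txt would list just the two extra packages the code needs (the exact package.zip layout is covered in the documentation):

requests
beautifulsoup4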
Since a program, and not an actual user, makes this web request, you need to add a header to assure Wikipedia that the request comes from a credible and identifiable source. This is accomplished by including a User-Agent header as part of your request. Note that this is a useful pattern for many websites, not just Wikipedia, and it prevents your request from being denied.
random_wiki_url = "https://en.wikipedia.org/wiki/Special:Random"
headers = {
    'User-Agent': 'MyWikiRandom/1.0 (https://provide/a/website; provide.an@email.com)'
}
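With the URL and headers in place, the fetch itself is a single call. A minimal sketch (the raise_for_status() guard is my suggested addition rather than part of the original snippet):

import requests

# Wikipedia redirects Special:Random to a random article; requests follows it
resp = requests.get(url=random_wiki_url, headers=headers)
resp.raise_for_status()  # optional guard: fail fast on a non-2xx response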
To improve readability and analysis, save the content as readable text rather than HTML. To do this, I use beautifulsoup4, a popular Python package to parse and extract text from HTML. In a future article, I’ll share other ways to clean this content further. For now, we use the get_text() method of beautifulsoup4 (also known as bs4, because obviously you know who’s bs number one) to extract text content from an HTML file.
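Continuing from the request above, the extraction itself takes just two lines:

from bs4 import BeautifulSoup

soup = BeautifulSoup(resp.content, 'html.parser')
text = soup.get_text()  # strips the tags, leaving only readable text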
Finally, we need to associate the content with the “source” we defined earlier. For this purpose, the client object associated with RAM comes in useful. This is the reason you see
from sasram.source import SourceClient
as part of the code. We thus write the content out to a file and save the file through the client object, which represents the source, using the following lines of code.
with open("/tmp/Wiki_of_the_Day.txt", "w") as f:
    f.write(soup.get_text())
client.save_file("/tmp/Wiki_of_the_Day.txt")
Note that I keep my filename static, i.e. I always call it Wiki_of_the_Day.txt. This makes every automated run of this code overwrite the same file, ensuring it always holds today’s (or, to be precise, the most recent) random Wikipedia article.
Our final code looks like this:
from sasram.source import SourceClient
import requests
from bs4 import BeautifulSoup

filename = "Wiki_of_the_Day.txt"
random_wiki_url = "https://en.wikipedia.org/wiki/Special:Random"
headers = {
    'User-Agent': 'MyWikiRandom/1.0 (https://github.com/SundareshSankaran; sundaresh.sankaran@gmail.com)'
}

def exec(client: SourceClient):
    # Fetch a random article; Wikipedia redirects Special:Random to one
    resp = requests.get(url=random_wiki_url, headers=headers)
    # Parse the HTML and keep only the readable text
    soup = BeautifulSoup(resp.content, 'html.parser')
    with open(f"/tmp/{filename}", "w") as f:
        f.write(soup.get_text())
    # Attach the file to the RAM source this code populates
    client.save_file(f"/tmp/{filename}")
Navigate to the Info tab and go to the Schedule field. Here, enter a cron expression to schedule the code to run at 2 a.m. every day: 0 2 * * * (the five fields are minute, hour, day of month, month, and day of week, so this reads as “minute 0 of hour 2, every day”). This way, the article is ready for you every morning when you sit at your desk, coffee in hand.
Save the definition and open the Jobs tab at the bottom of the Sources page. If you are like me, you do not want to wait until 2 a.m. for the job to run. Manually trigger the run through the icon on the top right side of the Jobs section. Move over to the Files section, and you will notice a file named Wiki_of_the_Day.txt appear. You can preview or download the file through the respective icons on the top right-hand side of this section.
I need RAM to follow a workflow which vectorises this data upon update. Vectorising refers to converting text into numeric vectors, i.e. points in a space with a fixed number of dimensions, so that similar passages end up close together. This vectorised data is stored in a collection for later retrieval and query through a candidate LLM. All of this is controlled by a configuration specific to a collection.
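To make “vectorising” concrete, here is a minimal sketch of an embedding step, using the sentence-transformers package purely for illustration (an assumption on my part; RAM performs this internally with whichever embedding model you configure):

from sentence_transformers import SentenceTransformer

# Illustrative only: RAM handles embedding internally via your configured model
model = SentenceTransformer("all-MiniLM-L6-v2")
passages = ["First passage of the article...", "Second passage..."]
vectors = model.encode(passages)
print(vectors.shape)  # (2, 384): each passage becomes a 384-dimensional point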
You can create more than one configuration based on need. In this case, I go with just one, the champion configuration. To create a collection, go to the Collections tab and select New Collection. Provide details such as the name and description and, importantly, select your Wiki_of_the_Day source (or whatever you named it) to be used in the collection.
Next, create a configuration within the collection. This involves multiple steps, such as choosing an embedding model, a vector database, the LLM to be used for queries, filters for document types, and parameters governing vectorisation, such as the chunking strategy.
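To make “chunking strategy” concrete: chunking can be as simple as sliding a fixed-size window with some overlap across the text. A minimal sketch, purely illustrative (RAM exposes its own chunking parameters):

def chunk_text(text: str, size: int = 500, overlap: int = 50) -> list:
    # Slide a window of `size` characters, stepping by (size - overlap)
    # so that neighbouring chunks share some context
    step = size - overlap
    return [text[start:start + size] for start in range(0, len(text), step)]

# Each chunk is embedded and stored as a separate point in the vector database
with open("/tmp/Wiki_of_the_Day.txt") as f:
    chunks = chunk_text(f.read())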
It’s important to select the correct Configuration Update Strategy, which determines how changes to a source document are handled in the collection. The choice of an appropriate strategy (Disabled, Append, Append and sync, or Append, sync and delete) depends on the use case. In this case, my purpose is to have a Wiki of the Day, i.e. to allow the document to change its contents every day. I therefore choose the “Append and sync” strategy (“Append, sync and delete” would also work) so that my document collection always reflects the most recent updates. Refer to the documentation for an illustration of how these strategies work.
A small road bump regarding user experience: the distinction between a collection and a configuration is still not very intuitive, and you may find yourself missing some parameters. The LLM tab is especially tricky (remember to click the additional save button after choosing the LLM). Refer to the documentation (https://go.documentation.sas.com/doc/en/ragntmgrcdc/default/ragntmgrug/titlepage.htm?fromDefault=) in case of any questions.
We want daily changes in the source file to reflect automatically in the collection. With the update strategy chosen above in place, a vectorisation job is triggered every time a source file is created or updated, keeping the collection in sync with the source.
It’s time to query your collection through a chat interface, powered by an LLM using Retrieval Augmented Generation.
A pedantic approach would suggest you create an Agent to orchestrate this chat experience but, for this simple example, I subscribe to the principle that an “Agent is as Agent does”. For my use case, the Assistant interface in RAM performs the role of an agent.
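Conceptually, each chat turn follows the classic RAG loop. Here is a rough sketch, where query_vector_db and call_llm are hypothetical placeholders for what RAM does behind the scenes:

def answer(question: str) -> str:
    # 1. Retrieve the chunks most similar to the question from the collection
    top_chunks = query_vector_db(question, k=5)  # hypothetical helper
    # 2. Ground the LLM in the retrieved context
    context = "\n\n".join(top_chunks)
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    # 3. Generate the final, grounded answer
    return call_llm(prompt)  # hypothetical helper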
To interact with your collection, go to the Chat tab, select your configuration and an available LLM from your configuration, and start chatting. Need to break the ice? Simply type,
“What’s today’s article all about?”
And sure enough, after a short while, you get a response.
And just like that, it’s working already! I was ready to spend the rest of my life believing The Hustler was just a Paul Newman movie. Now, I know much better!
As mentioned earlier, this was a simple and fun example to get you started (and give your brain a bit of daily food for thought). However, you can extend this project in multiple ways. For example, you could clean the extracted text further before vectorising it (more on this in a future article), orchestrate the chat experience through a fully fledged Agent, or fetch articles from specific categories rather than completely at random.
These are some initial thoughts; expect some follow-up articles on the same. Feel free to let me know your thoughts and ideas through email or the comments section.