A knowledge library is only as good as what's in it. In this third post of the series, we follow Ramona, our EMS Education Director, as she brings content into SAS® Retrieval Agent Manager using three different source types: local files, custom Python scripts, and Git repositories. Along the way, we unpack what sources and collections actually are and why the distinction matters.
In the previous post, Ramona configured the three pillars of her Retrieval-Augmented Generation (RAG) system: Large Language Models (LLMs), embedding models, and a vector database. The infrastructure is in place. But an empty library, no matter how well-organized, cannot answer a single question.
Now it's time to stock the shelves.
The building blocks are straightforward. A source is where raw content comes from: PDFs, research papers, reference documents. A collection groups one or more sources into a searchable, securable unit. Think of sources as acquisition channels and collections as labeled shelves: everything on a shelf shares a common topic and a common access policy. Sources bring content in; collections organize it for use.
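To make the relationship concrete, here is a minimal sketch in Python. The classes and field names are illustrative only, not the SAS Retrieval Agent Manager API:

from dataclasses import dataclass, field
from typing import List

@dataclass
class Source:
    name: str         # e.g., "Protocols"
    source_type: str  # "local", "custom", or "git"

@dataclass
class Collection:
    name: str
    sources: List[Source] = field(default_factory=list)      # one collection, many sources
    allowed_groups: List[str] = field(default_factory=list)  # access is granted per collection

# Ramona's protocol shelf: two acquisition channels, one access policy
protocols_shelf = Collection(
    name="EMS Protocols",
    sources=[Source("Protocols", "local"), Source("Research", "custom")],
    allowed_groups=["ems-educators"],
)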
SAS Retrieval Agent Manager supports three ways to bring documents in:

- Local files, uploaded manually
- Custom Python scripts, which fetch content automatically
- Git repositories, pulled directly from a repo
Each type serves a different scenario.
Watch the full demo (4–5 min), then read on for a deeper look at each source type.
The simplest option. Ramona creates a source named Protocols, sets the type to Local, and uploads her PDFs: updated protocol documents from her Medical Director and guidelines from the American Heart Association.
As of the 2026.01 release, local sources accept many file types such as PDF, TXT, CSV, and SAS datasets. Local files are best suited for controlled, curated content that changes infrequently. If Ramona's protocols are revised quarterly, a manual upload is perfectly reasonable. For content that changes daily or weekly, the next two options remove that burden.
For content that evolves continuously, like recent research literature, Ramona creates a Custom source. This type accepts a zip package containing two files:

- run.py, the script the platform executes to fetch content
- requirements.txt, the Python dependencies the script needs
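Packaging the two files is simple enough to script. Here is a minimal sketch using Python's standard zipfile module; the archive name arxiv_source.zip is arbitrary:

import zipfile

# Bundle the two files a Custom source expects into one zip package
with zipfile.ZipFile("arxiv_source.zip", "w") as zf:
    zf.write("run.py")            # entry-point script, shown below
    zf.write("requirements.txt")  # dependencies, shown below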
Her script targets arXiv, the open research archive, querying it with medical RAG-related search terms, downloading matching papers as PDFs, and saving them to the source automatically.
run.py:
from datetime import date, datetime, timedelta, timezone
from typing import List, Dict, Optional, Union
import os
import requests
import arxiv
# ============================================================
# 1️⃣ Entry point for Custom Source execution
# ============================================================
def exec(client):
"""
Custom source entry point.
Fetches recent arXiv research papers and saves them to the source.
"""
# You can change this query to any topic of interest
queries = [
'all:"retrieval augmented generation medicine"',
'all:"retrieval augmented generation medical question answering"',
'all:"ExpertRAG"',
'all:"Expert-CoT"',
'all:"RAG medicine"',
'all:"RAG medical"',
'all:"medical RAG evaluation"',
]
category_list = None
print(f"🔍 Starting arXiv fetch for '{queries}' on {date.today()}")
pdf_files = []
for q in queries:
pdf_files.extend(
collect_research_papers(
query=q,
days_back=365,
max_results=5,
category=category_list,
)
)
    # Fallback: if fewer results than the target are found, retry with broader queries
queries_extended = [
'("retrieval augmented" OR "retrieval-augmented") AND (medical OR clinical)',
'(retrieval AND "large language model") AND (medical OR clinical)',
'("medical question answering" AND retrieval)',
'("clinical decision support" AND retrieval)',
'("knowledge augmented" AND medical)',
'("evidence grounded" AND medical)',
]
    if len(pdf_files) < 3:
        print(f"⚠️ Only {len(pdf_files)} files found, retrying with broader queries...")
        for q in queries_extended:
            pdf_files.extend(
                collect_research_papers(
                    query=q,
                    days_back=365,
                    max_results=1,
                    category=None,
                )
            )

    for f in pdf_files:
        client.save_file(f)

    print("✅ Research PDFs saved to source successfully.")

# ============================================================
# 2️⃣ Main orchestration
# ============================================================
def collect_research_papers(
    query: str,
    days_back: int = 365,
    max_results: int = 10,
    category: Optional[Union[str, List[str]]] = None,
) -> List[str]:
"""
Fetch and download recent arXiv papers for a given query.
Returns a list of local file paths (PDFs) saved locally.
"""
papers = fetch_recent_arxiv_papers(
query=query,
max_results=max_results,
days_back=days_back,
category=category,
)
print(f"Found {len(papers)} papers for '{query}'.")
saved_files = []
for p in papers:
print(f"{p['published'][:10]} — {p['title']} — {p['categories']}")
print(f"Summary: {p['summary']}\n")
if p["pdf_url"]:
path = download_arxiv_pdf(p["pdf_url"])
if path:
saved_files.append(path)
return saved_files
# ============================================================
# 3️⃣ Core data-fetching functions
# ============================================================
def arxiv_to_dict(result: arxiv.Result) -> Dict:
"""Convert an arxiv.Result object into a serializable dictionary."""
return {
"arxiv_id": result.entry_id.split("/")[-1],
"title": result.title.strip(),
"summary": result.summary.strip(),
"published": result.published.isoformat(),
"updated": result.updated.isoformat(),
"authors": [a.name for a in result.authors],
"primary_category": result.primary_category.strip(),
"categories": result.categories,
"pdf_url": result.pdf_url,
"comment": getattr(result, "comment", None),
"doi": getattr(result, "doi", None),
}
def fetch_recent_arxiv_papers(
query: str,
max_results: int = 10,
days_back: int = 365,
category: Optional[Union[str, List[str]]] = None,
) -> List[Dict]:
"""
Fetch recent arXiv papers matching a query and optional category or categories.
Returns a list of dictionaries with metadata for each paper.
"""
cutoff_date = datetime.now(timezone.utc) - timedelta(days=days_back)
# ✅ Use the query exactly as passed in (no double wrapping)
full_query = query.strip()
if category:
if isinstance(category, str):
full_query += f' AND cat:{category}'
elif isinstance(category, (list, tuple)):
cats_query = " OR ".join(f"cat:{c}" for c in category)
full_query += f" AND ({cats_query})"
print(f"🔎 Querying arXiv with: {full_query}")
search = arxiv.Search(
query=full_query,
max_results=max_results,
sort_by=arxiv.SortCriterion.SubmittedDate,
sort_order=arxiv.SortOrder.Descending,
)
client = arxiv.Client()
results = []
for result in client.results(search):
if result.published >= cutoff_date:
results.append(arxiv_to_dict(result))
return results
# ============================================================
# 4️⃣ PDF Downloading
# ============================================================
def download_arxiv_pdf(pdf_url: str, output_dir: str = "/tmp") -> Optional[str]:
"""
Download a PDF from arXiv given its URL and save it locally.
Returns the local file path, or None on failure.
"""
os.makedirs(output_dir, exist_ok=True)
filename = pdf_url.split("/")[-1] + ".pdf"
filepath = os.path.join(output_dir, filename)
try:
response = requests.get(pdf_url, timeout=15)
response.raise_for_status()
with open(filepath, "wb") as f:
f.write(response.content)
print(f"✅ Saved PDF: {filepath}")
return filepath
except Exception as e:
print(f"⚠️ Failed to download {pdf_url}: {e}")
return None
requirements.txt:
arxiv
requests
urllib3==2.5.0 # Pin to match sasram-pylib requirements in pod (avoids SSL/validation issues)
One detail worth borrowing for any custom source: the script includes a fallback. If fewer than three papers are found with the primary queries, it retries automatically with broader terms. In an automated pipeline, silent failure is worse than imperfect results.
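Before zipping and uploading, run.py can be smoke-tested locally. The only interface the script relies on is client.save_file(path), so a small stand-in for the platform client is enough to run it end to end. This harness is a hypothetical helper, not part of SAS Retrieval Agent Manager:

# test_run.py — local smoke test for run.py
import os
import shutil

from run import exec as run_source  # the entry point the platform calls

class FakeClient:
    """Stand-in for the platform client: copies saved files into ./out."""
    def __init__(self, out_dir: str = "out"):
        self.out_dir = out_dir
        os.makedirs(out_dir, exist_ok=True)

    def save_file(self, path: str) -> None:
        shutil.copy(path, self.out_dir)
        print(f"(test) captured {os.path.basename(path)}")

if __name__ == "__main__":
    run_source(FakeClient())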
On the Scheduler tab, the source can run on demand or on a recurring schedule.
Once scheduled, new research papers appear in Ramona's library without anyone remembering to upload them. And if arXiv is temporarily unavailable (it does happen), she can fall back to a local source and upload the PDFs directly.
The third option connects SAS Retrieval Agent Manager to a Git repository. Ramona points the system at a public GitHub repository hosting clinical guidelines maintained by her team. The system fetches the files directly.
When the repository is updated, she triggers a Git pull. Content refreshes without re-uploading anything. There is one authoritative copy, and the library stays in sync with whatever the team publishes. Git sources are best suited for team-managed, version-controlled content: technical documentation, policy libraries, and shared reference materials.
Sources are the raw material.
A collection is more than an organizational folder. It is the smallest securable object in SAS Retrieval Agent Manager, the unit at which access is granted or restricted. One collection can include multiple sources. It is also where downstream configuration happens: the LLMs used for querying, evaluation, and agents are all assigned at the collection level. Get the collection boundaries right early; restructuring them later is the harder path.
Three source types, one decision framework:
| Source type | Best for |
| --- | --- |
| Local | Stable, curated content updated infrequently |
| Custom script | Continuously changing content, automated retrieval |
| Git | Team-managed, version-controlled documents |
Collections are the organizational and security boundary that makes sources usable. Content quality and how it is organized shape retrieval quality more than almost any other variable. Get these right, and the vectorization step that follows has clean, well-structured input to work with.
In the next post, Ramona configures and runs vectorization jobs. Vectorization turns documents into searchable vectors. If adding sources is stocking the shelves, vectorization is indexing every book so the catalog becomes searchable.
Ready to try it yourself? The hands-on workshop is available now: A Smarter Way to Unlock Unstructured Data: SAS® Retrieval Agent Manager in Action.
For further guidance, reach out for assistance.
Find more articles from SAS Global Enablement and Learning here.