A knowledge library is only as good as what's in it. In this third post of the series, we follow Ramona, our EMS Education Director, as she brings content into SAS® Retrieval Agent Manager using three different source types: local files, custom Python scripts, and Git repositories. Along the way, we unpack what sources and collections actually are and why the distinction matters.
In the previous post, Ramona configured the three pillars of her Retrieval-Augmented Generation (RAG) system: Large Language Models (LLMs), embedding models, and a vector database. The infrastructure is in place. But an empty library, no matter how well-organized, cannot answer a single question.
Now it's time to stock the shelves.
The building blocks are straightforward. A source is where raw content comes from: PDFs, research papers, reference documents. A collection groups one or more sources into a searchable, securable unit. Think of sources as acquisition channels and collections as labeled shelves: everything on a shelf shares a common topic and a common access policy. Sources bring content in; collections organize it for use.
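To make the relationship concrete, here is a minimal sketch in Python. The classes and field names are illustrative only, not the SAS Retrieval Agent Manager API:

from dataclasses import dataclass, field
from typing import List

@dataclass
class Source:
    name: str         # e.g., "Protocols"
    source_type: str  # "local", "custom", or "git"

@dataclass
class Collection:
    name: str
    sources: List[Source] = field(default_factory=list)      # one collection, many sources
    allowed_groups: List[str] = field(default_factory=list)  # access is granted per collection

# Ramona's protocol shelf: two acquisition channels, one access policy
protocols_shelf = Collection(
    name="EMS Protocols",
    sources=[Source("Protocols", "local"), Source("Research", "custom")],
    allowed_groups=["ems-educators"],
)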
SAS Retrieval Agent Manager supports three ways to bring documents in:

- Local files, uploaded manually
- Custom Python scripts, which fetch content automatically
- Git repositories, pulled directly from a repo
Each type serves a different scenario.
Watch the full demo (4–5 min), then read on for a deeper look at each source type.
The simplest option. Ramona creates a source named Protocols, sets the type to Local, and uploads her PDFs: updated protocol documents from her Medical Director and guidelines from the American Heart Association.
As of the 2026.01 release, local sources accept many file types such as PDF, TXT, CSV, and SAS datasets. Local files are best suited for controlled, curated content that changes infrequently. If Ramona's protocols are revised quarterly, a manual upload is perfectly reasonable. For content that changes daily or weekly, the next two options remove that burden.
For content that evolves continuously, like recent research literature, Ramona creates a Custom source. This type accepts a zip package containing two files:

- run.py, the script the platform executes to fetch content
- requirements.txt, the Python dependencies the script needs
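Packaging the two files is simple enough to script. Here is a minimal sketch using Python's standard zipfile module; the archive name arxiv_source.zip is arbitrary:

import zipfile

# Bundle the two files a Custom source expects into one zip package
with zipfile.ZipFile("arxiv_source.zip", "w") as zf:
    zf.write("run.py")            # entry-point script, shown below
    zf.write("requirements.txt")  # dependencies, shown below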
Her script targets arXiv, the open research archive, querying it with medical RAG-related search terms, downloading matching papers as PDFs, and saving them to the source automatically.
run.py:
from datetime import date, datetime, timedelta, timezone
from typing import List, Dict, Optional, Union
import os
import requests
import arxiv
# ============================================================
# 1️⃣ Entry point for Custom Source execution
# ============================================================
def exec(client):
"""
Custom source entry point.
Fetches recent arXiv research papers and saves them to the source.
"""
# You can change this query to any topic of interest
queries = [
'all:"retrieval augmented generation medicine"',
'all:"retrieval augmented generation medical question answering"',
'all:"ExpertRAG"',
'all:"Expert-CoT"',
'all:"RAG medicine"',
'all:"RAG medical"',
'all:"medical RAG evaluation"',
]
category_list = None
print(f"🔍 Starting arXiv fetch for '{queries}' on {date.today()}")
pdf_files = []
for q in queries:
pdf_files.extend(
collect_research_papers(
query=q,
days_back=365,
max_results=5,
category=category_list,
)
)
    # Fallback: if fewer results than the target are found, retry with broader queries
queries_extended = [
'("retrieval augmented" OR "retrieval-augmented") AND (medical OR clinical)',
'(retrieval AND "large language model") AND (medical OR clinical)',
'("medical question answering" AND retrieval)',
'("clinical decision support" AND retrieval)',
'("knowledge augmented" AND medical)',
'("evidence grounded" AND medical)',
]
    if len(pdf_files) < 3:
        print(f"⚠️ Only {len(pdf_files)} files found, retrying with broader queries...")
        for q in queries_extended:
            pdf_files.extend(
                collect_research_papers(
                    query=q,
                    days_back=365,
                    max_results=1,
                    category=None,
                )
            )

    for f in pdf_files:
        client.save_file(f)

    print("✅ Research PDFs saved to source successfully.")

# ============================================================
# 2️⃣ Main orchestration
# ============================================================
def collect_research_papers(
    query: str,
    days_back: int = 365,
    max_results: int = 10,
    category: Optional[Union[str, List[str]]] = None,
) -> List[str]:
"""
Fetch and download recent arXiv papers for a given query.
Returns a list of local file paths (PDFs) saved locally.
"""
papers = fetch_recent_arxiv_papers(
query=query,
max_results=max_results,
days_back=days_back,
category=category,
)
print(f"Found {len(papers)} papers for '{query}'.")
saved_files = []
for p in papers:
print(f"{p['published'][:10]} — {p['title']} — {p['categories']}")
print(f"Summary: {p['summary']}\n")
if p["pdf_url"]:
path = download_arxiv_pdf(p["pdf_url"])
if path:
saved_files.append(path)
return saved_files
# ============================================================
# 3️⃣ Core data-fetching functions
# ============================================================
def arxiv_to_dict(result: arxiv.Result) -> Dict:
"""Convert an arxiv.Result object into a serializable dictionary."""
return {
"arxiv_id": result.entry_id.split("/")[-1],
"title": result.title.strip(),
"summary": result.summary.strip(),
"published": result.published.isoformat(),
"updated": result.updated.isoformat(),
"authors": [a.name for a in result.authors],
"primary_category": result.primary_category.strip(),
"categories": result.categories,
"pdf_url": result.pdf_url,
"comment": getattr(result, "comment", None),
"doi": getattr(result, "doi", None),
}
def fetch_recent_arxiv_papers(
query: str,
max_results: int = 10,
days_back: int = 365,
category: Optional[Union[str, List[str]]] = None,
) -> List[Dict]:
"""
Fetch recent arXiv papers matching a query and optional category or categories.
Returns a list of dictionaries with metadata for each paper.
"""
cutoff_date = datetime.now(timezone.utc) - timedelta(days=days_back)
# ✅ Use the query exactly as passed in (no double wrapping)
full_query = query.strip()
if category:
if isinstance(category, str):
full_query += f' AND cat:{category}'
elif isinstance(category, (list, tuple)):
cats_query = " OR ".join(f"cat:{c}" for c in category)
full_query += f" AND ({cats_query})"
print(f"🔎 Querying arXiv with: {full_query}")
search = arxiv.Search(
query=full_query,
max_results=max_results,
sort_by=arxiv.SortCriterion.SubmittedDate,
sort_order=arxiv.SortOrder.Descending,
)
client = arxiv.Client()
results = []
for result in client.results(search):
if result.published >= cutoff_date:
results.append(arxiv_to_dict(result))
return results
# ============================================================
# 4️⃣ PDF Downloading
# ============================================================
def download_arxiv_pdf(pdf_url: str, output_dir: str = "/tmp") -> Optional[str]:
"""
Download a PDF from arXiv given its URL and save it locally.
Returns the local file path, or None on failure.
"""
os.makedirs(output_dir, exist_ok=True)
filename = pdf_url.split("/")[-1] + ".pdf"
filepath = os.path.join(output_dir, filename)
try:
response = requests.get(pdf_url, timeout=15)
response.raise_for_status()
with open(filepath, "wb") as f:
f.write(response.content)
print(f"✅ Saved PDF: {filepath}")
return filepath
except Exception as e:
print(f"⚠️ Failed to download {pdf_url}: {e}")
return None
requirements.txt:
arxiv
requests
urllib3==2.5.0 # Pin to match sasram-pylib requirements in pod (avoids SSL/validation issues)
One detail worth borrowing for any custom source: the script includes a fallback. If fewer than three papers are found with the primary queries, it retries automatically with broader terms. In an automated pipeline, silent failure is worse than imperfect results.
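Before zipping and uploading, run.py can be smoke-tested locally. The only interface the script relies on is client.save_file(path), so a small stand-in for the platform client is enough to run it end to end. This harness is a hypothetical helper, not part of SAS Retrieval Agent Manager:

# test_run.py — local smoke test for run.py
import os
import shutil

from run import exec as run_source  # the entry point the platform calls

class FakeClient:
    """Stand-in for the platform client: copies saved files into ./out."""
    def __init__(self, out_dir: str = "out"):
        self.out_dir = out_dir
        os.makedirs(out_dir, exist_ok=True)

    def save_file(self, path: str) -> None:
        shutil.copy(path, self.out_dir)
        print(f"(test) captured {os.path.basename(path)}")

if __name__ == "__main__":
    run_source(FakeClient())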
On the Scheduler tab, the source can run on demand or on a recurring schedule.
Once scheduled, new research papers appear in Ramona's library without anyone remembering to upload them. And if arXiv is temporarily unavailable (it does happen), she can fall back to a local source and upload the PDFs directly.
The third option connects SAS Retrieval Agent Manager to a Git repository. Ramona points the system at a public GitHub repository hosting clinical guidelines maintained by her team. The system fetches the files directly.
When the repository is updated, she triggers a Git pull. Content refreshes without re-uploading anything. There is one authoritative copy, and the library stays in sync with whatever the team publishes. Git sources are best suited for team-managed, version-controlled content: technical documentation, policy libraries, and shared reference materials.
Sources are the raw material.
A collection is more than an organizational folder. It is the smallest securable object in SAS Retrieval Agent Manager, the unit at which access is granted or restricted. One collection can include multiple sources. It is also where downstream configuration happens: the LLMs used for querying, evaluation, and agents are all assigned at the collection level. Get the collection boundaries right early; restructuring them later is the harder path.
Three source types, one decision framework:
| Source type | Best for |
| --- | --- |
| Local | Stable, curated content updated infrequently |
| Custom script | Continuously changing content, automated retrieval |
| Git | Team-managed, version-controlled documents |
Collections are the organizational and security boundary that makes sources usable. Content quality and how it is organized shape retrieval quality more than almost any other variable. Get these right, and the vectorization step that follows has clean, well-structured input to work with.
In the next post, Ramona configures and runs vectorization jobs. Vectorization turns documents into searchable vectors. If adding sources is stocking the shelves, vectorization is indexing every book so the catalog becomes searchable.
Ready to try it yourself? The hands-on workshop is available now: A Smarter Way to Unlock Unstructured Data: SAS® Retrieval Agent Manager in Action.
For further guidance, reach out for assistance.
Find more articles from SAS Global Enablement and Learning here.