From Documents to Vectors: Configuring and Vectorizing a Collection in SAS® Retrieval Agent Manager

A library full of raw documents or PDFs can't answer a question. Vectors can. This post unpacks the thinking behind the four configuration decisions that determine whether your RAG pipeline retrieves brilliantly or fumbles in the dark.

From Stocked Shelves to a Searchable Index

Ramona has a well-stocked library of protocols, research papers, and clinical guidelines. What she doesn't have yet is a way to query it by meaning.

That's the job of vectorization. An embedding model reads each piece of text and converts it into coordinates in a high-dimensional space where meaning is geometry: similar concepts cluster together, unrelated ones drift apart. Ask a question and the system finds the closest coordinates. That's semantic search. That's what separates a knowledge base from a fancy file share.
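To make "meaning is geometry" concrete, here is a minimal sketch of nearest-neighbor ranking by cosine similarity. The three-dimensional vectors and chunk labels are made up for illustration; a real embedding model produces vectors with hundreds of dimensions, and this is not a description of how SAS Retrieval Agent Manager computes similarity internally.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings" for three chunks (illustrative values only).
chunks = {
    "dosage guidance": [0.9, 0.1, 0.0],
    "side effects": [0.6, 0.6, 0.1],
    "parking policy": [0.0, 0.1, 0.9],
}

# Stand-in for the embedded form of a question about dosing.
query_vector = [0.8, 0.2, 0.0]

# Rank chunks from most to least similar to the query.
ranked = sorted(
    chunks,
    key=lambda name: cosine_similarity(query_vector, chunks[name]),
    reverse=True,
)
print(ranked)  # ['dosage guidance', 'side effects', 'parking policy']
```

The unrelated chunk lands last because its vector points in a different direction, not because it shares fewer keywords with the question.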


The above image was generated using gpt-image-2 (api-version=2025-04-01-preview).

Before running the job, though, Ramona needs to configure how it runs. Every setting is a decision with real consequences for retrieval quality. Let's unpack each one.

Watch the full demo (~9 min) for the step-by-step walkthrough, then read on for the reasoning behind each decision.


What Is a Configuration?

One rule before diving in: a saved configuration cannot be edited.

This surprises people, but it's intentional. Retrieval quality is empirical: you configure, evaluate, adjust, and compare. If configurations were mutable, you'd have no reliable baseline to compare against. Immutability makes experimentation reproducible. Plan from the start to have more than one configuration; that's the workflow, not a workaround.
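The same idea can be sketched in code with a frozen dataclass: a baseline stays untouched, and each experiment is a new object derived from it. The class name, field names, and the 256/51 variant values below are hypothetical and chosen only for illustration; they are not the product's actual configuration schema.

```python
from dataclasses import dataclass, replace, FrozenInstanceError

@dataclass(frozen=True)
class CollectionConfig:
    """Hypothetical, simplified stand-in for a saved configuration."""
    name: str
    embedding_model: str
    chunk_size: int
    chunk_overlap: int

baseline = CollectionConfig(
    name="baseline",
    embedding_model="granite-embedding-125m-english",
    chunk_size=384,
    chunk_overlap=77,
)

# An experiment is a new configuration, not an edit of the old one.
variant = replace(baseline, name="smaller-chunks", chunk_size=256, chunk_overlap=51)

try:
    baseline.chunk_size = 256  # attempting to edit the saved baseline
except FrozenInstanceError:
    print("baseline is immutable; create a new configuration instead")
```

Because `baseline` can never change, any difference in evaluation scores between `baseline` and `variant` can be traced to the settings that differ.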

Basics: Choosing an Embedding Model

The Embedding Model: Choosing Your Coordinate System

The embedding model is the most foundational choice in the whole configuration. Everything downstream (chunking, storage, retrieval) operates inside the coordinate system this model defines.

Different models encode meaning differently, and switching models later means re-vectorizing from scratch. The coordinate systems are incompatible. So this is a commit-early, validate-before-scaling decision.

The trade-off is straightforward:

Embedding model size	Precision	Speed	Cost
Smaller (e.g., all-MiniLM-L6-v2)	Lower	Faster	Lower
Larger (e.g., granite-embedding-125m-english)	Higher	Slower	Higher

For a medical RAG application where retrieval accuracy is patient-safety-adjacent, Ramona leans toward precision and picks a larger embedding model. For a high-throughput internal FAQ, the lighter model might be the smarter call. Neither is universally right. Choose for your use case, then let evaluations confirm it.

One housekeeping note on update strategy: the "Append, sync, and delete" option keeps the vector database clean as source documents change. Without it, deleted or updated documents leave ghost vectors behind. Stale content in a retrieval system isn't just noisy; it can be dangerous when the content demands high precision, as clinical protocols do.
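Conceptually, an append, sync, and delete pass boils down to comparing what is in the source library now against what was vectorized last time. The sketch below shows one way to compute that difference with content hashes. The function names and document IDs are hypothetical, and the product's actual mechanism may work differently.

```python
import hashlib

def fingerprint(text):
    """Content hash used to detect that a document has changed."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def plan_sync(source_docs, indexed_fingerprints):
    """Return which document IDs to add, re-vectorize, or delete.

    source_docs: {doc_id: current text}
    indexed_fingerprints: {doc_id: fingerprint stored at last vectorization}
    """
    current = {doc_id: fingerprint(text) for doc_id, text in source_docs.items()}
    to_add = sorted(d for d in current if d not in indexed_fingerprints)
    to_update = sorted(
        d for d in current
        if d in indexed_fingerprints and current[d] != indexed_fingerprints[d]
    )
    # IDs still in the index but gone from the source are the "ghost vectors".
    to_delete = sorted(d for d in indexed_fingerprints if d not in current)
    return {"add": to_add, "update": to_update, "delete": to_delete}

# Hypothetical state: one protocol revised, one paper removed, one memo added.
indexed = {
    "protocol_a": fingerprint("version 1"),
    "guideline_b": fingerprint("unchanged text"),
    "paper_c": fingerprint("retired paper"),
}
source = {
    "protocol_a": "version 2",
    "guideline_b": "unchanged text",
    "memo_d": "new memo",
}

plan = plan_sync(source, indexed)
print(plan)  # {'add': ['memo_d'], 'update': ['protocol_a'], 'delete': ['paper_c']}
```

An append-only strategy would act on the `add` list alone, which is exactly how outdated versions and removed documents linger in the index.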

Settings: Chunking Strategy

Chunking: The Art of the Useful Slice

Embedding models don't process whole documents. They process chunks: the individual segments that become retrieval units. When a query arrives, the system finds the most relevant chunks, not documents. This makes chunking a hidden lever on retrieval quality.

Think of a long document as a loaf of bread. Chunking is slicing it:

Chunk size is how thick each slice is.
Overlap means adjacent slices share a small edge, so nothing important gets lost at a boundary and each chunk carries a little context from its neighbor.
Too thick: each chunk carries multiple ideas and retrieval gets imprecise.
Too thin: context evaporates and the model can't reason from what it retrieves. Both extremes hurt retrieval.

A reliable starting heuristic: set chunk size to about 75% of the model's token limit, and overlap to about 20% of chunk size. For a model with a 512-token limit, that gives 384 tokens and 77 tokens (76.8, rounded up) respectively, which is exactly what Ramona uses.
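The arithmetic and the slicing can be sketched in a few lines. This is an illustration under simplifying assumptions: the function names are hypothetical, and the "tokens" are plain list items, whereas a real pipeline counts tokens with the embedding model's own tokenizer.

```python
def chunk_settings(token_limit, size_ratio=0.75, overlap_ratio=0.20):
    """Apply the 75% / 20% starting heuristic to a model's token limit."""
    chunk_size = int(token_limit * size_ratio)
    overlap = round(chunk_size * overlap_ratio)
    return chunk_size, overlap

def split_with_overlap(tokens, chunk_size, overlap):
    """Slice a token list into chunks that share `overlap` tokens at each seam."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this chunk already reaches the end of the document
    return chunks

chunk_size, overlap = chunk_settings(512)
print(chunk_size, overlap)  # 384 77

# Stand-in for a tokenized 1,000-token document.
tokens = list(range(1000))
chunks = split_with_overlap(tokens, chunk_size, overlap)
print([len(c) for c in chunks])  # [384, 384, 384, 79]
```

Each chunk starts 307 tokens after the previous one (384 minus 77), so the last 77 tokens of one chunk reappear as the first 77 of the next.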

The hard rule: never exceed the model's token limit. Many embedding services silently truncate the overflow rather than throwing an error. The document looks processed and the vectors are generated, but part of every oversized chunk is simply gone. Silent failures are the worst kind because you won't know to investigate.
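Since truncation raises no error, a simple guard before embedding can surface oversized chunks. The sketch below is hypothetical: `find_oversized` is not a product function, and the default whitespace word count is only a rough stand-in for the embedding model's real tokenizer, which usually produces more tokens than words.

```python
def find_oversized(chunks, token_limit, count_tokens=lambda text: len(text.split())):
    """Return (index, token_count) for every chunk that exceeds the limit.

    count_tokens defaults to a whitespace word count, which is only an
    approximation; pass the model's tokenizer-based counter for real checks.
    """
    oversized = []
    for index, chunk in enumerate(chunks):
        n = count_tokens(chunk)
        if n > token_limit:
            oversized.append((index, n))
    return oversized

# Two synthetic chunks: one within a 512-token limit, one well over it.
sample_chunks = ["word " * 300, "word " * 600]
problems = find_oversized(sample_chunks, token_limit=512)
print(problems)  # [(1, 600)]
```

Failing loudly on a non-empty result turns a silent data loss into something you can investigate before the vectors are stored.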

For content type, nudge accordingly: denser technical text (tables, dosages, procedures) benefits from smaller chunks; narrative or conceptual text can handle larger ones.

Text Extraction: OCR and Tables

Embedding models expect clean text. Real-life documents (medical or otherwise) rarely provide it.

Scanned pages, figures with text baked into images, and dosage tables that collapse into meaningless number strings when extracted naively are the norm, not the exception. The text extraction configuration is where you deal with that reality.

OCR (Optical Character Recognition) converts images and scanned pages into machine-readable text before chunking ever runs.

PaddleOCR handles complex layouts, multi-column text, and variable scan quality well (a natural fit for clinical documents).
Tesseract is the lighter alternative for simpler, cleaner sources.

Table extraction reconstructs grid structure into text the model can actually process. A number without its row and column headers is just a number. Preserve the structure, and suddenly those protocol tables become retrievable and meaningful.
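One common way to keep a value attached to its headers is to serialize each row as labeled text before chunking. The sketch below illustrates that idea only. The function name, the table title, and the table contents are invented for the example, and the product's own table extraction may produce a different text layout.

```python
def table_to_text(title, headers, rows):
    """Turn each table row into a self-describing line of text."""
    lines = []
    for row in rows:
        # Pair every cell with its column header so no number stands alone.
        cells = "; ".join(f"{header}: {value}" for header, value in zip(headers, row))
        lines.append(f"{title} | {cells}")
    return lines

# Entirely made-up table, for illustration only.
headers = ["Drug", "Dose", "Frequency"]
rows = [
    ["Drug A", "10 mg", "twice daily"],
    ["Drug B", "5 mg", "once daily"],
]

for line in table_to_text("Dosing schedule", headers, rows):
    print(line)
# Dosing schedule | Drug: Drug A; Dose: 10 mg; Frequency: twice daily
# Dosing schedule | Drug: Drug B; Dose: 5 mg; Frequency: once daily
```

Compare that with naive extraction, which might yield "Drug A 10 mg twice daily Drug B 5 mg once daily": the same characters, but with nothing to tell a retriever which value belongs to which column.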

Configuring LLMs for the Collection

LLM Assignment: Don't Let the Model Grade Its Own Homework

The LLMs tab is where Ramona assigns models for answering queries and running evaluations. The query side is intuitive: choose a capable model for users and a lighter one for data generation. The evaluation side is where the thinking gets interesting.

The split may look like this:

Role	Model
User Eval	gpt-4.1
Auto Eval — Data generation	gpt-4o-mini
Auto Eval — Critic	gpt-4.5
Auto Eval — Eval	mistral

The Critic role is the hardest job in the evaluation chain. It has to:

Detect subtle factual errors or hallucinations.
Judge whether an answer is faithful to the retrieved context, not just plausible-sounding.
Catch cases where the model retrieved the right chunk but answered from its parametric memory instead.

That last one is particularly tricky. A model of similar capability to the one being evaluated may not reliably catch it. There's a real risk the Critic sees a fluent, confident answer and rates it well without noticing the grounding failure underneath. This is the strongest argument for assigning a genuinely more capable model to the Critic role.

Another reason for the separation of roles: LLMs exhibit self-preference bias. A model asked to score its own output will tend to be generous. That's the AI equivalent of grading your own exam. Separating the model that generates answers from the model that evaluates them produces more honest results.
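A quick sanity check can flag assignments where a model would end up judging output it produced. The sketch below uses the example split from the table above. The dictionary keys, the helper name, and the grouping of roles into "generators" and "judges" are assumptions made for illustration, not settings exposed by the product.

```python
# Role assignment mirroring the example table (keys are hypothetical names).
roles = {
    "user_eval": "gpt-4.1",
    "auto_eval_data_generation": "gpt-4o-mini",
    "auto_eval_critic": "gpt-4.5",
    "auto_eval_eval": "mistral",
}

def self_grading_conflicts(roles, generator_keys, judge_keys):
    """List (generator, judge) role pairs that are assigned the same model."""
    return sorted(
        (g, j)
        for g in generator_keys
        for j in judge_keys
        if roles[g] == roles[j]
    )

# Assumed grouping: data generation produces content, Critic and Eval judge it.
generators = ["auto_eval_data_generation"]
judges = ["auto_eval_critic", "auto_eval_eval"]

print(self_grading_conflicts(roles, generators, judges))  # []

# If the Critic were switched to the data-generation model, the check flags it.
risky = dict(roles, auto_eval_critic="gpt-4o-mini")
print(self_grading_conflicts(risky, generators, judges))
# [('auto_eval_data_generation', 'auto_eval_critic')]
```

The check only catches identical model names; it says nothing about whether the Critic is actually more capable, which still has to be judged from evaluation results.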

The cost logic is sound too. Data generation is high-volume and tolerates imperfection, so the lighter model earns its place there. Critique and scoring require sharper reasoning, so the stronger model does that work.

Treat this as a starting point. Adjust based on what your evaluation metrics actually show.

Read Evaluating the Performance of Retrieval Augmented Generation Pipelines using the Ragas Framework for a more detailed analysis.

Running the Vectorization Job

Once configuration and LLM assignment are done, the vectorization job itself is mechanical: read documents, clean text, split into chunks, embed, and store in the vector database (PostgreSQL or Weaviate). A small collection takes about 10 to 20 minutes. Heavier OCR workloads take longer, because image processing and table extraction are usually the bottleneck.

The good news: evaluation test setup doesn't have to wait for the job to finish. The two workstreams run independently. No reason to just watch a progress bar.

Conclusion

Four decisions drive configuration quality:

Embedding model: choose for your precision vs. latency trade-off, validate before scaling. Switching later means starting over.
Chunking: 75% of the token limit for size, 20% of chunk size for overlap. Never exceed the limit, because the overflow can be silently truncated. Tune for content density.
Text extraction: if your documents have scanned pages or tables, OCR and table extraction aren't optional. Unprocessed structure is a silent retrieval tax.
LLM separation: keep generation and evaluation apart. Bias is real, even in machines.

Next Steps

Next up: evaluations. Ramona runs both user-driven and automated tests to find out whether the pipeline actually performs.

Ready to build your own? The hands-on workshop is waiting: A Smarter Way to Unlock Unstructured Data: SAS® Retrieval Agent Manager in Action.

Official SAS Documentation

SAS Retrieval Agent Manager documentation.
