A library full of raw documents or PDFs can't answer a question. Vectors can. This post unpacks the thinking behind the four configuration decisions that determine whether your RAG pipeline retrieves brilliantly or fumbles in the dark.
Ramona has a well-stocked library made up of protocols, research papers, clinical guidelines. What she doesn't have yet is something queryable by meaning.
That's the job of vectorization. An embedding model reads each piece of text and converts it into coordinates in a high-dimensional space where meaning is geometry: similar concepts cluster together, unrelated ones drift apart. Ask a question and the system finds the closest coordinates. That's semantic search. That's what separates a knowledge base from a fancy file share.
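To make "meaning is geometry" concrete, here is a minimal sketch of nearest-neighbor lookup by cosine similarity. The three-dimensional vectors and the chunk labels are made up for illustration; a real embedding model produces vectors with hundreds of dimensions, and a real vector database uses an index rather than a full scan.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings", hand-written for this example.
chunks = {
    "sepsis antibiotic protocol": [0.9, 0.1, 0.0],
    "hand hygiene guideline": [0.2, 0.9, 0.1],
    "cafeteria opening hours": [0.0, 0.1, 0.9],
}

def nearest(query_vector, indexed, top_k=1):
    """Return the top_k chunk labels ranked by similarity to the query."""
    ranked = sorted(
        indexed,
        key=lambda label: cosine_similarity(query_vector, indexed[label]),
        reverse=True,
    )
    return ranked[:top_k]

# Stand-in for an embedded question about treating an infection.
query = [0.8, 0.2, 0.1]
best = nearest(query, chunks)
print(best)
```

The query never has to share a keyword with the chunk it retrieves; it only has to land near it in the vector space.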
The above image was generated using gpt-image-2 (api-version=2025-04-01-preview).
Before running the job, though, Ramona needs to configure how it runs. Every setting is a decision with real consequences for retrieval quality. Let's unpack each one.
Watch the full demo (~9 min) for the step-by-step walkthrough, then read on for the reasoning behind each decision.
One rule before diving in: a saved configuration cannot be edited.
This surprises people, but it's intentional. Retrieval quality is empirical: you configure, evaluate, adjust, and compare. If configurations were mutable, you'd have no reliable baseline to compare against. Immutability makes experimentation reproducible. Plan from the start to have more than one configuration; that's the workflow, not a workaround.
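The same idea can be sketched in code with a frozen dataclass. This is only an analogy for the workflow, not how the product stores configurations; the class name `VectorizationConfig` and its fields are assumptions made for the example.

```python
from dataclasses import dataclass, replace, FrozenInstanceError

@dataclass(frozen=True)
class VectorizationConfig:
    # Hypothetical fields chosen for illustration.
    name: str
    embedding_model: str
    chunk_size: int
    chunk_overlap: int

baseline = VectorizationConfig("baseline", "granite-embedding-125m-english", 384, 77)

# To try a different setting, derive a new configuration from the saved one.
variant = replace(baseline, name="smaller-chunks", chunk_size=256, chunk_overlap=51)

# Editing the saved configuration in place is rejected.
try:
    baseline.chunk_size = 256
    edit_blocked = False
except FrozenInstanceError:
    edit_blocked = True

print(edit_blocked, baseline.chunk_size, variant.chunk_size)
```

Because the baseline can never change, any difference in evaluation scores between `baseline` and `variant` can be attributed to the settings that differ.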
The Embedding Model: Choosing Your Coordinate System
The embedding model is the most foundational choice in the whole configuration. Everything downstream (chunking, storage, retrieval) operates inside the coordinate system this model defines.
Different models encode meaning differently, and switching models later means re-vectorizing from scratch. The coordinate systems are incompatible. So this is a commit-early, validate-before-scaling decision.
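One way to picture the incompatibility is an index that remembers which model built it and refuses vectors from any other. The sketch below is an in-memory stand-in with hypothetical names (`VectorIndex`, `model-a`, `model-b`); it is not the API of any particular vector database.

```python
class VectorIndex:
    """In-memory stand-in for a vector store tied to one embedding model."""

    def __init__(self, model_name, dimensions):
        self.model_name = model_name
        self.dimensions = dimensions
        self.vectors = []

    def add(self, vector, model_name):
        # Vectors from a different model live in a different coordinate system,
        # so mixing them would make distances meaningless.
        if model_name != self.model_name:
            raise ValueError(
                f"index was built with {self.model_name!r}, not {model_name!r}; "
                "re-vectorize the collection instead"
            )
        if len(vector) != self.dimensions:
            raise ValueError(
                f"expected {self.dimensions} dimensions, got {len(vector)}"
            )
        self.vectors.append(vector)

index = VectorIndex("model-a", dimensions=3)
index.add([0.1, 0.2, 0.3], model_name="model-a")

try:
    index.add([0.1, 0.2, 0.3], model_name="model-b")
    mixed_models_rejected = False
except ValueError:
    mixed_models_rejected = True

print(mixed_models_rejected)
```

Note that the second vector has the right number of dimensions and is still rejected: matching shapes do not make two models' coordinates comparable.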
The trade-off is straightforward:
| Embedding model size | Precision | Speed | Cost |
| Smaller (e.g., all-MiniLM-L6-v2) | Lower | Faster | Lower |
| Larger (e.g., granite-embedding-125m-english) | Higher | Slower | Higher |
For a medical RAG application where retrieval accuracy is patient-safety-adjacent, Ramona leans toward precision and picks a larger embedding model. For a high-throughput internal FAQ, the lighter model might be the smarter call. Neither is universally right. Choose for your use case, then let evaluations confirm it.
One housekeeping note on update strategy: the Append, sync, and delete option keeps the vector database clean as source documents change. Without it, deleted or updated documents leave ghost vectors behind. Stale content in a retrieval system isn't just noisy; it can be dangerous for content that demands high precision, such as clinical protocols.
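The logic behind a sync-style update can be sketched as a comparison between what is in the source library and what is already indexed. This is an illustration of the idea only, not the product's implementation; the function `plan_sync`, the file names, and the version strings standing in for content hashes are all assumptions.

```python
def plan_sync(source_docs, indexed_docs):
    """Compare source documents with indexed ones.

    Both arguments map a document ID to a content hash (or version marker).
    """
    to_add = sorted(doc for doc in source_docs if doc not in indexed_docs)
    to_update = sorted(
        doc for doc in source_docs
        if doc in indexed_docs and source_docs[doc] != indexed_docs[doc]
    )
    # Anything indexed but no longer in the source would otherwise linger as a ghost vector.
    to_delete = sorted(doc for doc in indexed_docs if doc not in source_docs)
    return {"add": to_add, "update": to_update, "delete": to_delete}

# Hypothetical library state.
source = {"sepsis.pdf": "v2", "hygiene.pdf": "v1", "triage.pdf": "v1"}
indexed = {"sepsis.pdf": "v1", "hygiene.pdf": "v1", "retired-protocol.pdf": "v1"}

plan = plan_sync(source, indexed)
print(plan)
```

An append-only strategy would act on the `add` list alone, leaving the outdated `sepsis.pdf` vectors and the retired protocol retrievable.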
Chunking: The Art of the Useful Slice
Embedding models don't process whole documents. They process chunks: the individual segments that become retrieval units. When a query arrives, the system finds the most relevant chunks, not documents. This makes chunking a hidden lever on retrieval quality.
Think of a long document as a loaf of bread. Chunking is slicing it: chunk size is the thickness of each slice, and overlap is how much neighboring slices share.
A reliable starting heuristic: set chunk size to about 75% of the model's token limit, and overlap to about 20% of chunk size. For a model with a 512 token limit, that gives 384 tokens and 77 tokens respectively (exactly what Ramona uses).
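The arithmetic and the resulting sliding window can be sketched as follows. The function names are made up for this example, and the "tokens" are plain list items; a real pipeline would count tokens with the embedding model's own tokenizer.

```python
def chunk_settings(token_limit, size_ratio=0.75, overlap_ratio=0.20):
    """Apply the 75% / 20% starting heuristic to a model's token limit."""
    chunk_size = int(token_limit * size_ratio)
    overlap = round(chunk_size * overlap_ratio)
    return chunk_size, overlap

def split_tokens(tokens, chunk_size, overlap):
    """Slide a window of chunk_size over tokens, stepping by chunk_size - overlap."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the window already reached the end of the document
    return chunks

chunk_size, overlap = chunk_settings(512)
print(chunk_size, overlap)  # 384 77

# A stand-in document of 1000 "tokens".
pieces = split_tokens(list(range(1000)), chunk_size, overlap)
print(len(pieces), [len(p) for p in pieces])
```

Each chunk repeats the last 77 tokens of the one before it, so a sentence that straddles a boundary still appears whole in at least one chunk.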
The hard rule: never exceed the model's token limit. Most embedding services silently truncate the overflow rather than throwing an error. The document looks processed, the vectors are generated, but part of every oversized chunk is simply gone. Silent failures are the worst kind because you won't know to investigate.
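Since the service may not raise an error, a check before embedding is a cheap safeguard. The sketch below flags oversized chunks; `find_oversized` and `whitespace_token_count` are hypothetical helpers, and counting whitespace-separated words is a crude assumption that usually undercounts compared with a real subword tokenizer.

```python
def find_oversized(chunks, token_limit, count_tokens):
    """Return (index, token_count) for every chunk longer than token_limit."""
    oversized = []
    for index, chunk in enumerate(chunks):
        n = count_tokens(chunk)
        if n > token_limit:
            oversized.append((index, n))
    return oversized

def whitespace_token_count(text):
    # Assumption: a crude stand-in for the embedding model's tokenizer.
    return len(text.split())

chunks = ["short chunk", "word " * 600, "another short chunk"]
problems = find_oversized(chunks, token_limit=512, count_tokens=whitespace_token_count)
print(problems)  # [(1, 600)]
```

Failing the job loudly when `problems` is non-empty turns a silent truncation into something you can investigate.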
For content type, nudge accordingly: denser technical text (tables, dosages, procedures) benefits from smaller chunks; narrative or conceptual text can handle larger ones.
Embedding models expect clean text. Real-life documents (medical or otherwise) rarely provide it.
Scanned pages, figures with text baked into images, dosage tables that collapse into meaningless number strings when extracted naively: these are the norm, not the exception. The text extraction configuration is where you deal with that reality.
OCR (Optical Character Recognition) converts images and scanned pages into machine-readable text before chunking ever runs.
Table extraction reconstructs grid structure into text the model can actually process. A number without its row and column headers is just a number. Preserve the structure, and suddenly those protocol tables become retrievable and meaningful.
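As a rough illustration of why structure matters, the sketch below pairs every cell with its column header before the text is chunked. The function name and the table contents are placeholders invented for the example (not real dosing information), and actual table extraction tools may serialize tables differently.

```python
def table_to_lines(headers, rows):
    """Turn each row into one line that pairs every cell with its column header."""
    return [
        ", ".join(f"{header}: {cell}" for header, cell in zip(headers, row))
        for row in rows
    ]

# Placeholder table contents, made up for illustration.
headers = ["Medication", "Dose", "Frequency"]
rows = [
    ["Drug A", "10 mg", "every 8 h"],
    ["Drug B", "250 mg", "once daily"],
]

naive = " ".join(cell for row in rows for cell in row)
structured = table_to_lines(headers, rows)

print(naive)
for line in structured:
    print(line)
```

The naive string keeps the numbers but loses what they mean; each structured line can be retrieved and understood on its own.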
LLM Assignment: Don't Let the Model Grade Its Own Homework
The LLMs tab is where Ramona assigns models for answering queries and running evaluations. The query side is intuitive: choose a capable model for users and a lighter one for data generation. The evaluation side is where the thinking gets interesting.
The split may look like this:
| Role | Model |
| User Eval | gpt-4.1 |
| Auto Eval — Data generation | gpt-4o-mini |
| Auto Eval — Critic | gpt-4.5 |
| Auto Eval — Eval | mistral |
The Critic role is the hardest job in the evaluation chain. Among its tasks, it has to notice when an answer is not grounded in the retrieved content.
That task is particularly tricky. A model of similar capability to the one being evaluated may not reliably catch it. There's a real risk the Critic sees a fluent, confident answer and rates it well without noticing the grounding failure underneath. This is the strongest argument for assigning a genuinely more capable model for the Critic LLM.
Another reason for the separation of roles: LLMs exhibit self-preference bias. A model asked to score its own output will tend to be generous. That's the AI equivalent of grading your own exam. Separating the model that generates answers from the model that evaluates them produces more honest results.
The cost logic is sound too. Data generation is high-volume and tolerates imperfection, so the lighter model earns its place there. Critique and scoring require sharper reasoning, so the stronger model does that work.
Treat this as a starting point. Adjust based on what your evaluation metrics actually show.
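A quick sanity check on any role assignment is to confirm that no judging role reuses the generating model. The sketch below uses the role names and models from the table above; the dictionary layout and the helper `shared_models` are assumptions for illustration, not a product feature.

```python
# Role assignments copied from the example split.
roles = {
    "User Eval": "gpt-4.1",
    "Auto Eval - Data generation": "gpt-4o-mini",
    "Auto Eval - Critic": "gpt-4.5",
    "Auto Eval - Eval": "mistral",
}

def shared_models(assignments, generator_role, judge_roles):
    """Return the judge roles that use the same model as the generator role."""
    generator = assignments[generator_role]
    return [role for role in judge_roles if assignments[role] == generator]

conflicts = shared_models(
    roles,
    generator_role="Auto Eval - Data generation",
    judge_roles=["Auto Eval - Critic", "Auto Eval - Eval"],
)
print(conflicts)  # [] means no model is grading its own homework
```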
Read Evaluating the Performance of Retrieval Augmented Generation Pipelines using the Ragas Framework for a more detailed analysis.
Once configuration and LLM assignment are done, the vectorization job itself is mechanical: read documents, clean text, split into chunks, embed, store in the vector database (PostgreSQL or Weaviate). For a small collection, expect 10–20 minutes. Heavier OCR workloads take longer: image processing and table extraction are usually the bottlenecks.
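The stages of the job can be sketched end to end in a few lines. Everything here is a toy stand-in: `toy_embed` counts vowels instead of calling an embedding model, the store is a Python list rather than PostgreSQL or Weaviate, and chunking works on whitespace-separated words.

```python
import re

def clean(text):
    """Collapse runs of whitespace left over from extraction."""
    return re.sub(r"\s+", " ", text).strip()

def chunk(words, size, overlap):
    """Split words into overlapping windows; an empty input yields no chunks."""
    if not words:
        return []
    step = size - overlap
    return [words[i:i + size] for i in range(0, max(len(words) - overlap, 1), step)]

def toy_embed(text):
    # Placeholder, not a real embedding: one count per vowel.
    return [text.count(ch) for ch in "aeiou"]

def vectorize(documents, size=8, overlap=2):
    """Read -> clean -> chunk -> embed -> store, for a dict of doc_id -> raw text."""
    store = []  # stand-in for the vector database
    for doc_id, raw in documents.items():
        words = clean(raw).split()
        for n, piece in enumerate(chunk(words, size, overlap)):
            text = " ".join(piece)
            store.append({"doc": doc_id, "chunk": n, "text": text, "vector": toy_embed(text)})
    return store

documents = {"a.txt": "one  two\nthree four five six seven eight nine ten"}
store = vectorize(documents)
for record in store:
    print(record["doc"], record["chunk"], record["text"])
```

In a real job, the embedding call and any OCR step dominate the run time; the surrounding loop stays this simple.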
The good news: evaluation test setup doesn't have to wait for the job to finish. The two workstreams run independently. No reason to just watch a progress bar.
Four decisions drive configuration quality: the embedding model, chunking, text extraction, and LLM assignment.
Next up: evaluations. Ramona runs both user-driven and automated tests to find out whether the pipeline actually performs.
Ready to build your own? The hands-on workshop is waiting: A Smarter Way to Unlock Unstructured Data: SAS® Retrieval Agent Manager in Action.
For further guidance, reach out for assistance.
Find more articles from SAS Global Enablement and Learning here.