Discover how to wire the foundation of a Retrieval-Augmented Generation (RAG) system in SAS® Retrieval Agent Manager. In this second post of the series, we follow Ramona, our EMS Education Director, as she connects Large Language Models, embedding models, and vector databases so her knowledge library can actually think, search, and answer.
In the first post, we watched Ramona's full day unfold: ingesting documents, chatting with her knowledge library, and automating updates. But none of that works without a solid foundation.
Before she can ask a single question, Ramona needs three things in place:

- A Large Language Model (LLM) to reason over retrieved content and generate answers
- An embedding model to translate text into searchable numerical vectors
- A vector database to store those vectors and find the closest matches
Think of it this way. An LLM without an embedding model is a brilliant librarian who can't read the catalog. An embedding model without a vector database is like a pile of books with no shelves. And a vector database without an LLM is a warehouse full of boxes that nobody can open.
Configuration is where you bring all three together.
All configuration happens in the Settings panel of SAS Retrieval Agent Manager. This post is based on the Stable 2026.01 release.
Here is how these pieces, Large Language Models (LLMs), embedding models, and vector databases, flow together in practice:
See it in action: In this short demo, Ramona walks through the Settings panel to configure all three pillars: LLMs, embedding models, and vector databases.
LLMs are the conversational engine of the RAG workflow. When Ramona types a question, an LLM reads the retrieved document chunks, reasons over them, and produces an answer. But LLMs do more than chat. In SAS Retrieval Agent Manager, different LLMs can play different roles:

- Chat: generating answers from retrieved document chunks
- Evaluation: critiquing and scoring the answers another model produced
- Synthetic test-data generation: producing question-and-answer pairs used to test the pipeline
Using separate models for generation and evaluation avoids a subtle trap: a model scoring its own homework. If the same LLM writes and grades the answer, biased self-assessment can creep in. Ramona assigns gpt-4.1 for chat and evaluation critique, and gpt-4o-mini for synthetic test-data generation, each model playing to its strengths. For more information about LLM “role separation,” read Evaluating the Performance of Retrieval Augmented Generation Pipelines using the RAGAS Framework.
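To make role separation concrete outside the product, here is a minimal Python sketch using the OpenAI SDK. The role-to-model mapping mirrors Ramona's choices, but the ROLE_MODELS dictionary and complete() helper are hypothetical illustrations, not SAS Retrieval Agent Manager APIs:

```python
# A minimal sketch of LLM role separation, assuming the standard OpenAI
# Python SDK. ROLE_MODELS and complete() are hypothetical helpers.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

ROLE_MODELS = {
    "chat": "gpt-4.1",           # answers user questions over retrieved chunks
    "evaluation": "gpt-4.1",     # critiques and scores generated answers
    "test_data": "gpt-4o-mini",  # generates synthetic test questions
}

def complete(role: str, prompt: str) -> str:
    """Route a prompt to the model assigned to this role."""
    response = client.chat.completions.create(
        model=ROLE_MODELS[role],
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```

Keeping the mapping in one place makes it easy to swap models per role without touching the rest of the pipeline, which is the point of role separation.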
SAS Retrieval Agent Manager follows a "bring your own LLM" philosophy. You are not locked into a single vendor. The application supports:

- Cloud-hosted models, such as those served through Azure OpenAI
- Local open-source models, such as Llama 3, served through Ollama
Ollama is an open-source project that lets you run LLMs such as Llama 2 or Mistral on your local computer or in a Kubernetes cluster.
Local models like Llama 3 via Ollama can run inside the cluster, but without GPU acceleration, inference is too slow for interactive use (roughly 1–2 tokens per second). For Ramona's use case, where fast, reliable answers matter, cloud-hosted models are the pragmatic choice today.
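If you want to gauge that latency for yourself, a quick local experiment with Ollama's Python client looks roughly like this. This is a sketch, assuming Ollama is installed, the server is running, and the model has been pulled; it is independent of SAS Retrieval Agent Manager:

```python
# A quick local latency test with Ollama's Python client
# (pip install ollama), assuming `ollama pull llama3` has completed.
import ollama

response = ollama.chat(
    model="llama3",
    messages=[{"role": "user",
               "content": "What is the adult epinephrine dose in cardiac arrest?"}],
)
print(response["message"]["content"])
```

Without a GPU, watching the tokens trickle in makes the 1–2 tokens-per-second caveat above very tangible.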
If LLMs are the brains, embedding models are the translators. They convert human-readable text into high-dimensional numerical vectors: arrays of numbers that capture meaning. Two sentences that say the same thing in different words end up close together in the vector space. That proximity is what makes semantic search work.
Imagine the sentence "The patient is not breathing" becomes something like [1.0, 2.0, ... ], a list of 1,536 numbers (dimensions). A similar sentence like "The victim has stopped respiration" would produce a nearly identical list. That numerical closeness is how the system knows they mean the same thing, even though they share almost no words.
Think of dimensions as the number of traits you use to describe something. If you describe a condition using only two traits, human physiology and action, many different conditions will look identical. Add 1,536 traits and suddenly each condition and situation is distinct. That's what higher dimensions do for text: they let the system tell apart passages that are similar but not identical.
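You can see this closeness yourself with a few lines of Python. This sketch uses the open-source sentence-transformers library and the all-MiniLM-L6-v2 model (one of the models bundled with SAS Retrieval Agent Manager, described below); exact scores will vary slightly by library version:

```python
# A minimal sketch of semantic similarity
# (pip install sentence-transformers).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dimensional vectors

emb = model.encode([
    "The patient is not breathing",
    "The victim has stopped respiration",
    "The ambulance needs new tires",
])

# Cosine similarity: near-synonyms score high, unrelated text scores low.
print(util.cos_sim(emb[0], emb[1]).item())  # high: same meaning, different words
print(util.cos_sim(emb[0], emb[2]).item())  # low: different topic entirely
```

The first pair shares almost no vocabulary, yet their vectors sit close together; that numerical proximity is exactly what the retrieval step exploits.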
SAS Retrieval Agent Manager includes roughly eight open-source embedding models deployed in its Kubernetes cluster. They range from the lightweight all-MiniLM-L6-v2 (about 22 million parameters, 384-dimensional vectors, very fast) to the robust ibm-granite-embedding-278M-multilingual (278 million parameters, very high precision across languages).
You can also add external models from Azure OpenAI or Hugging Face. Azure models run remotely. Hugging Face models are downloaded and served locally via the Text Embeddings Inference (TEI) server. TEI is a toolkit for deploying and serving open-source text embeddings and sequence classification models.
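As a rough sketch of what local serving looks like under the hood, a deployed TEI instance exposes an /embed endpoint that accepts text and returns vectors. The route and payload follow TEI's documented API, but the host and port below are assumptions:

```python
# A hedged sketch of calling a locally served embedding model through
# Text Embeddings Inference (TEI). Host and port are assumptions.
import requests

resp = requests.post(
    "http://localhost:8080/embed",               # assumed TEI endpoint
    json={"inputs": "The patient is not breathing"},
    timeout=30,
)
vector = resp.json()[0]   # TEI returns one embedding per input string
print(len(vector))        # the model's output dimensionality
```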
Here’s a comparison table of five representative embedding models:
| Model | Language | Size (approx.) | Precision | Recommended Use Case |
|---|---|---|---|---|
| all-MiniLM-L6-v2 | English | ~22M | Low–Medium | Lightweight search, edge deployments |
| ibm-granite-embedding-125M-english | English | ~125M | High | English RAG with strict accuracy needs |
| ibm-granite-embedding-278M-multilingual | Multilingual | ~278M | Very High | Mission-critical multilingual retrieval |
| text-embedding-3-small (Azure OpenAI) | Multilingual | Cloud | High | Cost-effective cloud-based RAG |
| nomic-embed-text-v2-moe | Multilingual | ~475M (MoE) | Very High | Large-scale RAG, diverse corpora |
Source: Best Open-Source Embedding Models for RAG and other sources.
Every embedding model is defined by two key parameters:

- Maximum tokens: the upper limit on how much text the model can embed in a single pass
- Dimensions: the length of the vector the model produces
Chunk size (configured separately during ingestion) is the actual size of each text piece you send. In practice, you chunk documents well below this limit to avoid "topic dilution," where one vector tries to represent too many ideas.
For example, if a single chunk contains an entire 3-page protocol covering airway management, medication dosages, and patient transport procedures, the resulting vector becomes a blurry average of all three topics. When Ramona searches for "epinephrine dosage," that chunk might not surface because its vector is pulled in too many directions. Smaller chunks, and ideally one idea per chunk, produce sharper vectors.
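To illustrate the idea (actual chunking in SAS Retrieval Agent Manager is configured during ingestion, covered in the next post), here is a minimal Python sketch of fixed-size chunking with overlap. The file name is hypothetical:

```python
# A minimal sketch of fixed-size chunking with overlap, illustrating why
# smaller chunks yield sharper, less "diluted" vectors.
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping character windows."""
    step = chunk_size - overlap
    return [text[start:start + chunk_size]
            for start in range(0, len(text), step)]

protocol = open("airway_protocol.txt").read()   # hypothetical document
for chunk in chunk_text(protocol):
    print(len(chunk), chunk[:60], "...")
```

The overlap keeps sentences that straddle a boundary from being orphaned; production chunkers typically split on sentence or section boundaries rather than raw characters.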
When Ramona adds the Azure OpenAI text-embedding-3-small model, the system infers 8191 max tokens after she clicks Verify Model Existence.
She knows what she wants: high-precision retrieval.
Small chunks + high dimensions = high-precision retrieval. Therefore, she adjusts the settings in that direction: a lower token ceiling so each chunk stays small and focused, and dimensions kept at their maximum.
If Ramona's priority were speed and cost efficiency over maximum precision, she would flip both settings the other way, trading some retrieval sharpness for lower cost: coarser chunks mean fewer vectors to store, and fewer dimensions mean faster searches.
A freshly added embedding model starts in an unpublished state. Before it can appear in any downstream configuration, collection, vectorization job, or evaluation, Ramona must click Publish. This workflow prevents half-configured models from being accidentally used in a pipeline.
Models can also be deprecated: existing configurations keep working, but no new ones can reference a deprecated model. This is useful when migrating to a better model without breaking what already runs.
Vectors need a home. In SAS Retrieval Agent Manager, that home is called a destination, a vector store where embeddings and their metadata land after a vectorization job.
Why not a regular database? Traditional databases are built for exact lookups: "find every record where status = active." They match on precise values.
A vector database does something fundamentally different: it searches by meaning. Ask it to "find the five passages closest in meaning to this query," and it compares numerical vectors to find the best semantic matches, even when the wording differs completely.
This operation is called approximate nearest neighbor (ANN) search. It requires specialized indexing that relational databases were never designed for, and it's the engine that makes RAG retrieval work.
Two options are available in SAS Retrieval Agent Manager: PGVector and Weaviate.
| Feature | PGVector | Weaviate |
|---|---|---|
| Purpose-built for vectors | No (extension on relational DB) | Yes |
| Hybrid search | Limited (SQL filters only) | Yes (semantic + keyword) |
| Scalability | Vertical; practical up to millions | Horizontal; scales to billions |
| Latency at scale | Higher for large ANN queries | Sub-100 ms |
| Ease of adoption | Familiar SQL ecosystem | New system, new APIs |
| Best for | Small/medium datasets, unified storage | RAG and semantic search at scale |
Back to Ramona: For her regional EMS library, containing dozens of protocols, not millions of records, PGVector is more than enough. If her network later scales to a statewide system with hundreds of agencies and tens of thousands of documents, Weaviate offers a clear upgrade path without rearchitecting the rest of the pipeline.
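For the curious, here is roughly what an ANN query looks like at the PGVector level. This is an illustrative sketch using the psycopg driver and pgvector's cosine-distance operator; the connection string, table, and column names are hypothetical, since SAS Retrieval Agent Manager manages its own schema:

```python
# A hedged sketch of the ANN query PGVector runs under the hood
# (pip install psycopg). Table and column names are hypothetical.
import psycopg

query_vector = [0.12, -0.03, 0.88]  # toy 3-d example; real vectors have 1,536 dims

with psycopg.connect("dbname=rag user=ramona") as conn:
    rows = conn.execute(
        "SELECT content FROM chunks "          # hypothetical chunks table
        "ORDER BY embedding <=> %s::vector "   # pgvector cosine-distance operator
        "LIMIT 5",
        (str(query_vector),),                  # vector passed as '[0.12, -0.03, 0.88]'
    ).fetchall()

for (content,) in rows:
    print(content[:80])
```

Note that it still reads like SQL, which is exactly the "familiar ecosystem" advantage from the table above; Weaviate would express the same search through its own query API.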
From the broader RAG community, here are a few practices that consistently pay off:

- Keep chunks small, ideally one idea per chunk, to avoid topic dilution
- Use separate LLMs for generation and evaluation so a model never grades its own homework
- Match the embedding model to your language and precision requirements rather than defaulting to the largest one
- Size the vector database for the corpus you have today, with a clear upgrade path for tomorrow
Configuration is not glamorous, but it is the foundation everything else stands on. In this post, we walked through the three pillars Ramona sets up before her knowledge library can answer a single question:

- LLMs: the conversational engine that reads retrieved chunks and generates answers
- Embedding models: the translators that turn text into searchable vectors
- Vector databases: the destinations where those vectors live and are searched
Get these right, and the rest of the RAG workflow (ingestion, evaluation, chat, agents, automation) has a reliable base to build on.
In the next post, we will follow Ramona into document ingestion and vectorization: how she organizes sources into collections, chooses chunking strategies, and kicks off vectorization jobs. If configuration is the foundation, ingestion is the first floor.
Ready to try it yourself? The hands-on workshop is available now: A Smarter Way to Unlock Unstructured Data: SAS Retrieval Agent Manager in Action.