Discover how to wire the foundation of a Retrieval-Augmented Generation (RAG) system in SAS® Retrieval Agent Manager. In this second post of the series, we follow Ramona, our EMS Education Director, as she connects Large Language Models, embedding models, and vector databases so her knowledge library can actually think, search, and answer.
In the first post, we watched Ramona's full day unfold: ingesting documents, chatting with her knowledge library, and automating updates. But none of that works without a solid foundation.
Before she can ask a single question, Ramona needs three things in place:

- A Large Language Model (LLM) to reason over retrieved content and generate answers
- An embedding model to translate text into searchable numerical vectors
- A vector database to store those vectors and find the closest matches
Think of it this way. An LLM without an embedding model is a brilliant librarian who can't read the catalog. An embedding model without a vector database is like a pile of books with no shelves. And a vector database without an LLM is a warehouse full of boxes that nobody can open.
Configuration is where you bring all three together.
All configuration happens in the Settings panel of SAS Retrieval Agent Manager. This post is based on the Stable 2026.01 release.
Here is how these pieces, Large Language Models (LLMs), embedding models, and vector databases, flow together in practice:
See it in action: In this short demo, Ramona walks through the Settings panel to configure all three pillars: LLMs, embedding models, and vector databases.
LLMs are the conversational engine of the RAG workflow. When Ramona types a question, an LLM reads the retrieved document chunks, reasons over them, and produces an answer. But LLMs do more than chat. In SAS Retrieval Agent Manager, different LLMs can play different roles:

- Chat: generating answers from retrieved document chunks
- Evaluation: critiquing and scoring the answers another model produced
- Synthetic test-data generation: producing question-and-answer pairs used to test the pipeline
Using separate models for generation and evaluation avoids a subtle trap: a model scoring its own homework. If the same LLM writes and grades the answer, biased self-assessment can creep in. Ramona assigns gpt-4.1 for chat and evaluation critique, and gpt-4o-mini for synthetic test-data generation, each model playing to its strengths. For more information about LLM “role separation,” read Evaluating the Performance of Retrieval Augmented Generation Pipelines using the RAGAS Framework.
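To make role separation concrete outside the product, here is a minimal Python sketch using the OpenAI SDK. The role-to-model mapping mirrors Ramona's choices, but the ROLE_MODELS dictionary and complete() helper are hypothetical illustrations, not SAS Retrieval Agent Manager APIs:

```python
# A minimal sketch of LLM role separation, assuming the standard OpenAI
# Python SDK. ROLE_MODELS and complete() are hypothetical helpers.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

ROLE_MODELS = {
    "chat": "gpt-4.1",           # answers user questions over retrieved chunks
    "evaluation": "gpt-4.1",     # critiques and scores generated answers
    "test_data": "gpt-4o-mini",  # generates synthetic test questions
}

def complete(role: str, prompt: str) -> str:
    """Route a prompt to the model assigned to this role."""
    response = client.chat.completions.create(
        model=ROLE_MODELS[role],
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```

Keeping the mapping in one place makes it easy to swap models per role without touching the rest of the pipeline, which is the point of role separation.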
SAS Retrieval Agent Manager follows a "bring your own LLM" philosophy. You are not locked into a single vendor. The application supports:

- Cloud-hosted models, such as those served through Azure OpenAI
- Local open-source models, such as Llama 3, served through Ollama
Ollama is an open-source project that lets you run LLMs such as Llama 2 or Mistral on your local computer or in a Kubernetes cluster.
Local models like Llama 3 via Ollama can run inside the cluster, but without GPU acceleration, inference is too slow for interactive use (roughly 1–2 tokens per second). For Ramona's use case, where fast, reliable answers matter, cloud-hosted models are the pragmatic choice today.
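If you want to gauge that latency for yourself, a quick local experiment with Ollama's Python client looks roughly like this. This is a sketch, assuming Ollama is installed, the server is running, and the model has been pulled; it is independent of SAS Retrieval Agent Manager:

```python
# A quick local latency test with Ollama's Python client
# (pip install ollama), assuming `ollama pull llama3` has completed.
import ollama

response = ollama.chat(
    model="llama3",
    messages=[{"role": "user",
               "content": "What is the adult epinephrine dose in cardiac arrest?"}],
)
print(response["message"]["content"])
```

Without a GPU, watching the tokens trickle in makes the 1–2 tokens-per-second caveat above very tangible.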
If LLMs are the brains, embedding models are the translators. They convert human-readable text into high-dimensional numerical vectors: arrays of numbers that capture meaning. Two sentences that say the same thing in different words end up close together in the vector space. That proximity is what makes semantic search work.
Imagine the sentence "The patient is not breathing" becomes something like [1.0, 2.0, ... ], a list of 1,536 numbers (dimensions). A similar sentence like "The victim has stopped respiration" would produce a nearly identical list. That numerical closeness is how the system knows they mean the same thing, even though they share almost no words.
Think of dimensions as the number of traits you use to describe something. If you describe a condition using only two traits, human physiology and action, many different conditions will look identical. Add 1,536 traits and suddenly each condition and situation is distinct. That's what higher dimensions do for text: they let the system tell apart passages that are similar but not identical.
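You can see this closeness yourself with a few lines of Python. This sketch uses the open-source sentence-transformers library and the all-MiniLM-L6-v2 model (one of the models bundled with SAS Retrieval Agent Manager, described below); exact scores will vary slightly by library version:

```python
# A minimal sketch of semantic similarity
# (pip install sentence-transformers).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dimensional vectors

emb = model.encode([
    "The patient is not breathing",
    "The victim has stopped respiration",
    "The ambulance needs new tires",
])

# Cosine similarity: near-synonyms score high, unrelated text scores low.
print(util.cos_sim(emb[0], emb[1]).item())  # high: same meaning, different words
print(util.cos_sim(emb[0], emb[2]).item())  # low: different topic entirely
```

The first pair shares almost no vocabulary, yet their vectors sit close together; that numerical proximity is exactly what the retrieval step exploits.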
SAS Retrieval Agent Manager includes roughly eight open-source embedding models deployed in its Kubernetes cluster. They range from the lightweight all-MiniLM-L6-v2 (about 22 million parameters, 384-dimensional vectors, very fast) to the robust ibm-granite-embedding-278M-multilingual (278 million parameters, very high precision across languages).
You can also add external models from Azure OpenAI or Hugging Face. Azure models run remotely. Hugging Face models are downloaded and served locally via the Text Embeddings Inference (TEI) server. TEI is a toolkit for deploying and serving open-source text embeddings and sequence classification models.
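As a rough sketch of what local serving looks like under the hood, a deployed TEI instance exposes an /embed endpoint that accepts text and returns vectors. The route and payload follow TEI's documented API, but the host and port below are assumptions:

```python
# A hedged sketch of calling a locally served embedding model through
# Text Embeddings Inference (TEI). Host and port are assumptions.
import requests

resp = requests.post(
    "http://localhost:8080/embed",               # assumed TEI endpoint
    json={"inputs": "The patient is not breathing"},
    timeout=30,
)
vector = resp.json()[0]   # TEI returns one embedding per input string
print(len(vector))        # the model's output dimensionality
```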
Here’s a comparison table of five representative embedding models:
| Model | Language | Size (approx.) | Precision | Recommended Use Case |
|---|---|---|---|---|
| all-MiniLM-L6-v2 | English | ~22M | Low–Medium | Lightweight search, edge deployments |
| ibm-granite-embedding-125M-english | English | ~125M | High | English RAG with strict accuracy needs |
| ibm-granite-embedding-278M-multilingual | Multilingual | ~278M | Very High | Mission-critical multilingual retrieval |
| text-embedding-3-small (Azure OpenAI) | Multilingual | Cloud | High | Cost-effective cloud-based RAG |
| nomic-embed-text-v2-moe | Multilingual | ~475M (MoE) | Very High | Large-scale RAG, diverse corpora |
Source: Best Open-Source Embedding Models for RAG and other sources.
Every embedding model is defined by two key parameters:

- Maximum tokens: the upper limit on how much text the model can embed in a single pass
- Dimensions: the length of the vector the model produces
Chunk size (configured separately during ingestion) is the actual size of each text piece you send. In practice, you chunk documents well below this limit to avoid "topic dilution," where one vector tries to represent too many ideas.
For example, if a single chunk contains an entire 3-page protocol covering airway management, medication dosages, and patient transport procedures, the resulting vector becomes a blurry average of all three topics. When Ramona searches for "epinephrine dosage," that chunk might not surface because its vector is pulled in too many directions. Smaller chunks, and ideally one idea per chunk, produce sharper vectors.
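To illustrate the idea (actual chunking in SAS Retrieval Agent Manager is configured during ingestion, covered in the next post), here is a minimal Python sketch of fixed-size chunking with overlap. The file name is hypothetical:

```python
# A minimal sketch of fixed-size chunking with overlap, illustrating why
# smaller chunks yield sharper, less "diluted" vectors.
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping character windows."""
    step = chunk_size - overlap
    return [text[start:start + chunk_size]
            for start in range(0, len(text), step)]

protocol = open("airway_protocol.txt").read()   # hypothetical document
for chunk in chunk_text(protocol):
    print(len(chunk), chunk[:60], "...")
```

The overlap keeps sentences that straddle a boundary from being orphaned; production chunkers typically split on sentence or section boundaries rather than raw characters.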
When Ramona adds the Azure OpenAI text-embedding-3-small model, the system infers 8191 max tokens after she clicks Verify Model Existence.
She knows what she wants: high-precision retrieval.
Small chunks + high dimensions = high-precision retrieval. Therefore, she adjusts the settings in that direction: a lower token ceiling so each chunk stays small and focused, and dimensions kept at their maximum.
If Ramona's priority were speed and cost efficiency over maximum precision, she would flip both settings the other way, trading some retrieval sharpness for lower cost: coarser chunks mean fewer vectors to store, and fewer dimensions mean faster searches.
A freshly added embedding model starts in an unpublished state. Before it can appear in any downstream configuration, collection, vectorization job, or evaluation, Ramona must click Publish. This workflow prevents half-configured models from being accidentally used in a pipeline.
Models can also be deprecated: existing configurations keep working, but no new ones can reference a deprecated model. This is useful when migrating to a better model without breaking what already runs.
Vectors need a home. In SAS Retrieval Agent Manager, that home is called a destination, a vector store where embeddings and their metadata land after a vectorization job.
Why not a regular database? Traditional databases are built for exact lookups: "find every record where status = active." They match on precise values.
A vector database does something fundamentally different: it searches by meaning. Ask it to "find the five passages closest in meaning to this query," and it compares numerical vectors to find the best semantic matches, even when the wording differs completely.
This operation is called approximate nearest neighbor (ANN) search. It requires specialized indexing that relational databases were never designed for, and it's the engine that makes RAG retrieval work.
Two options are available in SAS Retrieval Agent Manager: PGVector and Weaviate.
| Feature | PGVector | Weaviate |
|---|---|---|
| Purpose-built for vectors | No (extension on relational DB) | Yes |
| Hybrid search | Limited (SQL filters only) | Yes (semantic + keyword) |
| Scalability | Vertical; practical up to millions | Horizontal; scales to billions |
| Latency at scale | Higher for large ANN queries | Sub-100 ms |
| Ease of adoption | Familiar SQL ecosystem | New system, new APIs |
| Best for | Small/medium datasets, unified storage | RAG and semantic search at scale |
Back to Ramona: For her regional EMS library, containing dozens of protocols, not millions of records, PGVector is more than enough. If her network later scales to a statewide system with hundreds of agencies and tens of thousands of documents, Weaviate offers a clear upgrade path without rearchitecting the rest of the pipeline.
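For the curious, here is roughly what an ANN query looks like at the PGVector level. This is an illustrative sketch using the psycopg driver and pgvector's cosine-distance operator; the connection string, table, and column names are hypothetical, since SAS Retrieval Agent Manager manages its own schema:

```python
# A hedged sketch of the ANN query PGVector runs under the hood
# (pip install psycopg). Table and column names are hypothetical.
import psycopg

query_vector = [0.12, -0.03, 0.88]  # toy 3-d example; real vectors have 1,536 dims

with psycopg.connect("dbname=rag user=ramona") as conn:
    rows = conn.execute(
        "SELECT content FROM chunks "          # hypothetical chunks table
        "ORDER BY embedding <=> %s::vector "   # pgvector cosine-distance operator
        "LIMIT 5",
        (str(query_vector),),                  # vector passed as '[0.12, -0.03, 0.88]'
    ).fetchall()

for (content,) in rows:
    print(content[:80])
```

Note that it still reads like SQL, which is exactly the "familiar ecosystem" advantage from the table above; Weaviate would express the same search through its own query API.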
From the broader RAG community, here are a few practices that consistently pay off:

- Keep chunks small, ideally one idea per chunk, to avoid topic dilution
- Use separate LLMs for generation and evaluation so a model never grades its own homework
- Match the embedding model to your language and precision requirements rather than defaulting to the largest one
- Size the vector database for the corpus you have today, with a clear upgrade path for tomorrow
Configuration is not glamorous, but it is the foundation everything else stands on. In this post, we walked through the three pillars Ramona sets up before her knowledge library can answer a single question:

- LLMs: the conversational engine that reads retrieved chunks and generates answers
- Embedding models: the translators that turn text into searchable vectors
- Vector databases: the destinations where those vectors live and are searched
Get these right, and the rest of the RAG workflow (ingestion, evaluation, chat, agents, automation) has a reliable base to build on.
In the next post, we will follow Ramona into document ingestion and vectorization: how she organizes sources into collections, chooses chunking strategies, and kicks off vectorization jobs. If configuration is the foundation, ingestion is the first floor.
Ready to try it yourself? The hands-on workshop is available now: A Smarter Way to Unlock Unstructured Data: SAS Retrieval Agent Manager in Action.