
Considerations for evaluating the effectiveness of RAG models


How much do you trust your AI chat ‘partner’ to give you reliable answers, or even advice? The responses you get are designed to sound confident, almost as if they came from a knowledgeable and sentient being.

 

This may be enough to convince many casual users to trust them, perhaps even for making critical decisions. But trust needs to be earned. In this post I highlight some things to be aware of when engaging with these AI systems.

 


 

I encourage readers of this post to share their knowledge of the workings of Large Language Models (LLMs) with their friends to increase awareness about the risks of trusting the generated answers without reservation.

 

A recent Wall Street Journal article describes a situation where a doctor asked ChatGPT to retrieve case studies related to the safety of a medical procedure for a patient he was treating. ChatGPT responded that the procedure was safe, citing realistic-looking references to reports published in medical journals. But the references were completely fabricated. Trusting this answer without verification could have had dire consequences.

 

Large Language Models (LLMs) are trained on very large quantities of text to perform natural language tasks like text generation, translation, and summarization. They are stochastic, designed to predict the next token (roughly, the next word) in a sequence based on context, and by extension the next sentence, paragraph, or full document. They accept user prompts and return results that can include text, audio, and images. Training them requires enormous computing power, electricity, water for cooling servers, time, and money. As for scale, some of these models use trillions of parameters, or weights.

 

These large-scale models can also have limitations that may not be obvious to the end user.

 

ChatGPT 5.2, for example, is a Large Language Model built on static, pre-trained weights estimated from data available up to its cutoff date of August 31, 2025. Information generated after that date is not encoded in the model's weights and cannot be reflected in its answers.

 

However, recent enhancements allow certain LLM versions to access data via real-time web search and to call external APIs, supplementing the available data beyond the training cutoff date.

 

This is where Retrieval-Augmented Generation (RAG) comes into play. RAG supplements an existing LLM with company- or domain-specific information (such as medical, financial, legal, or technical manuals) to improve the accuracy of results returned from query prompts, without retraining the model. The addition of domain-specific information may reduce hallucinations by providing real, “ground truth” data to the query engine.

 

A RAG pipeline starts with a retrieval component that fetches relevant text from an external knowledge base. The documents are “chunked” based on a specified token length, and vector embeddings are generated to represent each chunk numerically. A small amount of chunk overlap is often used when breaking down larger documents to help maintain context across chunk boundaries.
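To make the chunking step concrete, here is a minimal sketch in Python. It is illustrative only: whitespace-split words stand in for tokens, and a real pipeline would use the embedding model's own tokenizer and an embedding API.

```python
# A minimal sketch of fixed-length chunking with overlap. Whitespace-split
# words stand in for tokens here; a real pipeline would use the embedding
# model's own tokenizer.

def chunk_text(text: str, chunk_size: int = 200, overlap: int = 20) -> list[str]:
    """Split text into chunks of about chunk_size tokens, sharing `overlap` tokens."""
    assert 0 <= overlap < chunk_size
    tokens = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break
    return chunks

# Each chunk would then be passed to an embedding model, with the resulting
# vectors stored in a vector database for similarity search.
```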

 

[Image: 02_saspch_b12-1-300x163.png]

 

The resulting augmented document collection is used to provide context intended to improve the generated responses. The system then generates a final response based on the augmented information. For more details, see RAG 101.
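The augmentation step itself is straightforward to sketch. Assuming the retriever has already returned the top-ranked chunks, a hypothetical prompt builder might look like this:

```python
# A minimal sketch of the augmentation step: retrieved chunks are stitched
# into the prompt that the generation LLM receives. `retrieved_chunks`
# would come from a vector-similarity search over the embedded chunks.

def build_augmented_prompt(question: str, retrieved_chunks: list[str]) -> str:
    context = "\n\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(retrieved_chunks))
    return (
        "Answer the question using ONLY the context below. If the context "
        "does not contain the answer, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
```

Instructing the model to answer only from the supplied context is one of the ways RAG works to reduce hallucinations.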

 

RAG can reduce hallucinations by using the retrieved text to provide vetted information. New data supplied during the augmentation phase is available to the generation LLM without requiring retraining beyond the cutoff date. This is especially useful for retrieving industry-specific information, such as legal or medical content, where accuracy matters.

 

SAS Retrieval Agent Manager (RAM) is an example of an application that integrates AI and LLMs using the RAG framework. It is designed to reduce manual data extraction efforts and overcome the limitations of traditional RAG implementations. It supports both user and automated evaluations using key metrics from the Ragas framework.

 

Implications:

 

Users of large language models are usually charged by the number of tokens sent and received while processing their queries. A model that returns useful information early in the search results is less costly to run than one that requires more rounds of processing before producing good information.
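A quick back-of-the-envelope calculation shows why this matters. The per-token prices below are made up purely for illustration; check your provider's actual rates.

```python
# Hypothetical per-token prices, for illustration only.
PRICE_PER_1K_INPUT = 0.0025   # USD per 1,000 prompt tokens (assumed)
PRICE_PER_1K_OUTPUT = 0.0100  # USD per 1,000 completion tokens (assumed)

def query_cost(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens / 1000) * PRICE_PER_1K_INPUT \
         + (output_tokens / 1000) * PRICE_PER_1K_OUTPUT

# One round trip that answers the question immediately:
print(f"one-shot answer: ${query_cost(1500, 300):.4f}")
# Three round trips, each re-sending the same context, to get the same answer:
print(f"three attempts:  ${3 * query_cost(1500, 300):.4f}")
```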

 

It may seem plausible that increasing the scale of these models (i.e., the amount of data and number of parameters) can improve results. Yet there are limits: larger models perform better but also fail more confidently. If you’re interested in a discussion of the limits of LLM scale from a computational and statistical perspective, check out this recent paper: On the fundamental limits of LLMs at scale.

 

Hallucinations:

 

To better understand the causes of hallucinations, we'll now consider the training data for LLMs.

 

The data sources used may contain unverified claims, satirical material treated as fact, or conflicting information. How the model will interpret this information is difficult to predict, and it may contribute to unreliable results. Data sources also disagree on facts depending on cultural perspective, political framing, or when they were written.

 

Training data might contain noise or factual errors (for example, some articles may state “the capital of Australia is Sydney” while others state “the capital of Australia is Canberra”, but only one is true).

 

Models built on data that does not account for the temporal ordering of events may learn both correct and outdated information about the same topic, since a statement that was a fact at one time may no longer be true later. The result that is returned also relies heavily on how specific the prompt is.

 

LLMs have been known to make up a false but reasonable and confident-sounding response rather than admitting they “don’t know the answer”. This tendency is tied to scoring schemes used in training and evaluation that grade an “I don’t know” response as incorrect. Scoring that discourages an honest “I don’t know” encourages the model to guess, and to hallucinate.
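A tiny expected-score calculation shows why such scoring rewards guessing. The probability value is an assumed example:

```python
# Under a rubric that awards 1 point for a correct answer, 0 for anything
# else, and treats "I don't know" as incorrect, guessing always has a
# higher expected score than abstaining. p_correct is an assumed value.
p_correct = 0.25                                  # chance a confident guess is right
expected_guess = p_correct * 1 + (1 - p_correct) * 0
expected_abstain = 0.0                            # "I don't know" scored as wrong
print(expected_guess > expected_abstain)          # True for any p_correct > 0
```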

 

Measuring LLM performance and accuracy:

 

The process of measuring and tracking traditional Machine Learning models for accuracy and performance over time is well established and uses proven, reliable techniques.

 

Measuring the performance of LLMs, RAG models, and Agentic AI applications is still an emerging area. Without the ability to evaluate the accuracy of these models, decisions based on their output may be decisions based on bad or inaccurate information.

 

To measure performance of a RAG system, different metrics can be used for the retrieval portion and for the generation portion.

 

Retrieval:

 

A sampling of the many measures that can be used to evaluate the retrieval phase includes:

 

Precision@k measures relevance: how well the retrieved information satisfies the query. The @k part asks how many of the first k documents returned are relevant. If 6 of the first 10 documents returned by LLM1 contain relevant information, its precision is higher than that of LLM2, where only 2 of the 10 retrieved documents are relevant. The first system would probably be the preferred choice for the data selected.
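Here is a minimal sketch of Precision@k, using the 6-of-10 example above. The document IDs and relevance judgments are made up:

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the first k retrieved documents judged relevant."""
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k

retrieved = [f"doc{i}" for i in range(1, 11)]                  # top 10, in rank order
relevant = {"doc1", "doc3", "doc4", "doc6", "doc8", "doc10"}   # human judgments
print(precision_at_k(retrieved, relevant, k=10))               # 0.6
```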

 

There is also a metric based on how early in the results a match occurs (Mean Reciprocal Rank, described below). A model that finds a matching document in the 1st position is preferable to one whose first match for a query appears in the 19th position. This has cost implications: since providers charge by the ‘token’ (think of this roughly as a word) sent or received, a valid match early in the search avoids processing more tokens before the Generation phase gets the information it needs.

 

Recall@k measures how many of all the relevant documents are returned within the first k results.
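Recall@k follows the same conventions as the Precision@k sketch above:

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the first k results."""
    if not relevant:
        return 0.0
    return sum(1 for doc in retrieved[:k] if doc in relevant) / len(relevant)

# If 8 documents in the collection are relevant and 6 appear in the top 10:
print(recall_at_k([f"doc{i}" for i in range(1, 11)],
                  {"doc1", "doc3", "doc4", "doc6", "doc8", "doc10",
                   "doc42", "doc77"},
                  k=10))  # 0.75
```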

 

nDCG (Normalized Discounted Cumulative Gain) evaluates ranking quality by combining graded relevance with document position. It rewards relevant items that appear higher in the ranking.
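A minimal sketch of DCG and nDCG with binary relevance (1 = relevant, 0 = not):

```python
import math

def dcg(relevances: list[int]) -> float:
    """Discounted cumulative gain: position i is discounted by log2(i + 2)."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg(relevances: list[int]) -> float:
    """DCG normalized by the DCG of the ideal (best possible) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

print(ndcg([1, 1, 1, 0, 0]))  # relevant items ranked first -> 1.0
print(ndcg([0, 1, 1, 0, 1]))  # same items ranked lower -> about 0.71
```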

 

Mean Reciprocal Rank (MRR) measures how early the first relevant result appears in the list.
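MRR makes the 1st-position versus 19th-position comparison from earlier directly measurable:

```python
def mean_reciprocal_rank(first_hit_ranks: list[int]) -> float:
    """first_hit_ranks holds the 1-based rank of the first relevant result
    per query; 0 means no relevant result was returned for that query."""
    return sum(1 / r for r in first_hit_ranks if r > 0) / len(first_hit_ranks)

# A system that hits at rank 1 versus one whose first match is at rank 19:
print(mean_reciprocal_rank([1]))   # 1.0
print(mean_reciprocal_rank([19]))  # about 0.053
```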

 

For more details, check out A complete guide to RAG evaluation: metrics, testing and best practices.

 

These metrics rely on knowing which documents are relevant for a query, which is easier said than done. Producing these relevance judgments takes effort: manually labeling documents for each query, using published benchmarks, or building custom datasets that provide this information.

 

The set of relevant documents is sometimes determined by asking another LLM to make the assessment. If this sounds a little like asking students to grade their own papers, you may be right. Still, this technique is an option that is sometimes used.

 

There is also a risk of propagating unexpected errors through the system, especially if interim retrieval results are combined with queries to feed the generation portion of RAG.

 

Another approach to designing an evaluation mechanism takes a different perspective. To evaluate a RAG model, developers might ask an LLM to generate a question that can be answered from a given document, then write their own ground-truth answer based on that document, building up evaluation data for the model. This is also labor intensive and does not guarantee the absence of bias.
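A hypothetical shape for one such evaluation record is sketched below; every field name and value is illustrative:

```python
# One synthetic evaluation record for the generate-a-question approach
# described above. All field names and values here are hypothetical.
eval_record = {
    "source_document": "policy_manual.pdf",               # document the question came from
    "question": "What is the annual claim limit?",         # generated by an LLM
    "ground_truth": "The annual claim limit is $5,000.",   # written by a human reviewer
    "retrieved_contexts": [],                              # filled in at evaluation time
    "generated_answer": None,                              # filled in by the RAG pipeline
}
```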

 

There is also an approach based on “reference-free” tests, which try to assess the effectiveness of an LLM without relying on ground-truth data. See Evaluation metrics.

 

Generation Metrics:

 

A sampling of the many measures that can be used to evaluate the generation phase includes:

 

Faithfulness / Groundedness – the percentage of answer claims supported by the retrieved context (claim-level support).

 

Answer Relevancy – semantic alignment of the answer to the question (not just fluency).

 

BERTScore / Answer Similarity – semantic similarity to references when wording differs.
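As a concrete, if simplified, stand-in for BERTScore-style answer similarity, here is a sentence-level cosine similarity sketch. It assumes the third-party sentence-transformers package, and the model name is just a common example choice:

```python
# Embedding-based answer similarity: high cosine similarity even though
# the wording differs. BERTScore proper matches token-level embeddings;
# this sentence-level version illustrates the same idea more simply.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # example model choice
reference = "Canberra is the capital of Australia."
candidate = "Australia's capital city is Canberra."
embeddings = model.encode([reference, candidate])
print(util.cos_sim(embeddings[0], embeddings[1]).item())  # close to 1.0
```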

 

Recommendations:

 

Use LLMs to assist with mundane tasks such as drafting emails. Be more cautious if you use them for summarizing documents or generating programming code. Do not rely on LLMs exclusively for medical, financial, or relationship advice.

 

In your prompts, ask the LLM to provide citations and references in its responses. Always check that the references are legitimate.

 

Summary:

 

LLMs are powerful predictive text models.

 

RAG strengthens LLM results by grounding them in up-to-date, factual context.

 

Accuracy challenges include subjectivity, hallucinations, bias, and time-sensitive facts.

 

Evaluation metrics exist for retrieval relevance, response faithfulness, and generated-answer accuracy.

 

For a more in-depth discussion, refer to the post "Evaluating the Performance of Retrieval Augmented Generation Pipelines using the Ragas Framework" by Ari Zitin.

 

Thanks for reading!

 

 

Find more articles from SAS Global Enablement and Learning here.
