
Orchestrating & Governing Large Language Models in SAS Viya


Enterprises and organisations want to harness the power of large language models (LLMs). The question is, how do they turn these models into useful applications that generate business value? How do they utilize these models with minimal risk?

 

With SAS Viya, there are three products that help organisations get value from LLMs: Information Catalog (IC) to govern prompts as a prompt catalog, Model Manager (MM) for LLM governance, and Intelligent Decisioning (ID) to orchestrate LLMs and other ML components into applications. Additionally, we use Visual Text Analytics (VTA) to build LLM guardrails, which we will elaborate on in greater detail, and Visual Analytics (VA) to create the LLM evaluation and monitoring dashboard.

 

In this post, we'll explore this using an example use case of a customer service chatbot for a bank. This chatbot is primarily driven by an LLM enhanced with Retrieval Augmented Generation (RAG). If you're not familiar with RAG, think of it as connecting your LLM to a database of documents so the LLM can reason from those documents. In our use case, the LLM is connected to documents detailing all the different services and products the bank offers, so it can answer any customer enquiry about those products.
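To make the RAG idea concrete, here is a minimal, self-contained sketch of the retrieve-then-prompt pattern. It is purely illustrative: the document list and keyword scoring are toy stand-ins, and a real pipeline (including the one in this example) would use embeddings, a vector store, and an LLM call.

```python
# Toy illustration of the RAG pattern: retrieve the most relevant product
# documents for a question, then build a grounded prompt for the LLM.
# The documents and the keyword-overlap "retrieval" are placeholders only.
PRODUCT_DOCS = [
    {"title": "Savings accounts", "text": "ABC Bank savings accounts earn interest monthly."},
    {"title": "Home loans", "text": "ABC Bank offers fixed and variable rate home loans."},
]

def retrieve(question: str, top_k: int = 1):
    # Score documents by word overlap with the question (stand-in for vector search).
    q_words = set(question.lower().split())
    ranked = sorted(PRODUCT_DOCS,
                    key=lambda d: len(q_words & set(d["text"].lower().split())),
                    reverse=True)
    return ranked[:top_k]

def build_prompt(question: str) -> str:
    context = "\n\n".join(d["text"] for d in retrieve(question))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

# The resulting prompt would then be sent to the LLM for a grounded answer.
print(build_prompt("What home loan products do you offer?"))
```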

 

Example Use Case and Architecture

 

YiJianChing_1-1744284749793.png

 

In this example, the LLM enhanced with RAG is deployed as an endpoint that can be called through an API. Typically, you would build a frontend application, for example with a framework like Streamlit, and interact with your LLM through chat.
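A frontend or test script might call the RAG endpoint directly along these lines. This is only a sketch: the URL and the "question"/"answer" fields are assumptions for illustration, not the actual contract of the endpoint used in this example.

```python
import requests

# Hypothetical RAG endpoint URL and payload shape, for illustration only.
RAG_ENDPOINT = "https://my-rag-service.example.com/v1/chat"

def ask_rag(question: str) -> str:
    response = requests.post(RAG_ENDPOINT, json={"question": question}, timeout=60)
    response.raise_for_status()
    return response.json()["answer"]

print(ask_rag("What types of savings accounts does ABC Bank offer?"))
```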

 

llm_demo_architecture.png

 

What we do is augment this with SAS Viya: instead of calling the API endpoint directly, we embed the RAG endpoint within a decision flow in ID and publish it to the SAS Micro Analytic Service (MAS), where it is callable through its own SAS API endpoint. By doing so, we can embed additional guardrails; for this use case, that means customers can use the LLM more securely and have a better customer experience. It also lets us govern our prompts in IC and the LLM in MM before integrating them into the decision flow and operationalising it, which improves our overall governance. Again, instead of calling an LLM provider API (or in this case, the RAG endpoint), you wrap it inside a SAS API that your consuming applications call.
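Once the decision is published, consuming applications call the SAS endpoint rather than the RAG endpoint. The sketch below shows roughly what that could look like, assuming the decision is published to MAS as a module named customer_service_chat with an execute step, an OAuth access token has already been obtained, and the decision exposes a customer_enquiry input; the exact URI, module name, and variable names depend on your deployment.

```python
import requests

VIYA_HOST = "https://viya.example.com"   # hypothetical Viya host
ACCESS_TOKEN = "..."                     # OAuth token obtained from SAS Logon
MODULE_ID = "customer_service_chat"      # hypothetical name of the published decision

def ask_decision(question: str) -> dict:
    # Execute the published MAS module with the customer enquiry as input.
    url = f"{VIYA_HOST}/microanalyticScore/modules/{MODULE_ID}/steps/execute"
    payload = {"inputs": [{"name": "customer_enquiry", "value": question}]}
    headers = {"Authorization": f"Bearer {ACCESS_TOKEN}",
               "Content-Type": "application/json"}
    response = requests.post(url, json=payload, headers=headers, timeout=60)
    response.raise_for_status()
    # The outputs list carries the decision's output variables, e.g. the chatbot reply.
    return {item["name"]: item.get("value") for item in response.json()["outputs"]}
```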

 

YiJianChing_0-1744284988352.png

 

 

 

The actual ID flow is shown above, and this flow of operations executes when the SAS API is called. Let's take a look at what it takes to bring this together, as well as the specifics for this use case.

 

Governing the LLM in Model Manager

 

YiJianChing_0-1744285258342.png

 

 

Let's first take a look at MM, where we govern the model. LLMs are incredibly heavy and contain billions of parameters. Rather than storing the model weights directly in MM like a traditional ML model, we store the score code that executes the LLM through an API. Using this method, we can still govern the LLM in question, and we gain flexibility since the "model" in MM can represent anything from a foundation model (think GPT-3.5, GPT-4o, Gemini, etc.), to a fine-tuned open-source model from Hugging Face, to a RAG pipeline as in our specific example.
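As an illustration, the score code registered in MM could be a small Python function that simply calls the governed endpoint. This is a sketch under assumptions (the endpoint URL, payload fields, and variable names are invented for the example), not the exact artifact used here.

```python
import requests

RAG_ENDPOINT = "https://my-rag-service.example.com/v1/chat"  # hypothetical endpoint

def score(customer_enquiry):
    "Output: chatbot_response"
    # The "model" is just a thin wrapper: send the enquiry to the RAG API
    # and return its answer as the score output.
    try:
        resp = requests.post(RAG_ENDPOINT, json={"question": customer_enquiry}, timeout=60)
        resp.raise_for_status()
        chatbot_response = resp.json()["answer"]
    except requests.RequestException as exc:
        chatbot_response = f"Error calling RAG endpoint: {exc}"
    return chatbot_response
```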

 

YiJianChing_1-1744285319043.png

 

On top of that, we can even use the model card to store critical information about the GenAI model, such as details about the use case, how the model should be used, and even the prompts related to the model. For example, the stated aims of using GenAI here are to improve customer satisfaction and net promoter score, and we also clearly state that customers should not provide private and sensitive information through this chatbot. We will see later how SAS Viya can bring in guardrails that enforce these goals.

 

For this RAG pipeline, we want to track the system prompts that have been approved by our enterprise for use with this customer service use case. If this were a fine-tuned model, these could be the prompts the model was fine-tuned on. By attaching a table of prompts, we can connect our LLM repository with the prompt catalog (IC) and enhance the governance of our GenAI applications.

 

Governing the Prompts in IC as a Prompt Catalog

 

YiJianChing_2-1744285471542.png

 

 

Prompts carefully designed by SMEs and the business can be stored and governed in Information Catalog. For our specific customer service chatbot use case, we can see that these prompts designed by the business have been approved for use and have been tagged appropriately. Data tagging is an important concept, especially since prompts can potentially contain private and sensitive information that could also be used for fine-tuning.

 

YiJianChing_3-1744285519642.png

 

 

We also get a view of what the prompts in our prompt catalog look like using the out-of-the-box analysis from IC.

 

YiJianChing_4-1744285586107.png

 

In this specific example, we have five prompts that have been approved for use with this use case. Each prompt has been given a score, with 1 being the lowest and 5 the highest; the highest-scoring prompt is used as the system prompt for the customer service use case.

 

Now that we've looked at the governance side of things, let's look at how we operationalize the RAG endpoint, and orchestrate it together with additional guardrails and components.

 

For further reading on the prompt catalog, please also refer to the following blog.

 

Orchestrating the LLM in Intelligent Decisioning

 

YiJianChing_0-1744284988352.png 

 

As stated earlier, instead of sending the prompt directly to the API endpoint (containing the RAG pipeline), we route it through a decision flow. ID is able to orchestrate LLMs along with business rules and analytical models (ML, text, etc.) in a single workflow that can be operationalized. This is known as an agentic workflow. In this specific example, the customer enquiry goes through a combination of business rules and text analytics models before being sent to the RAG API endpoint.

 

Since we store the RAG API endpoint as a model, we can easily bring it into the decision flow, and we know we're using it with minimal risk since it's governed. We can also use a query node to retrieve the specific system prompt from the prompt catalog, maintaining similar governance.

What's the purpose of the text analytics models? These models, built in Visual Text Analytics (VTA), are our guardrails! Recall that we specified two requirements for this use case (from the model card in MM):

  1. Ensure private and sensitive information is not shared by the customer
  2. Improve customer experience and satisfaction

These guardrails we're implementing act as our controls for these two requirements and enforce our goals.

 

YiJianChing_6-1744285892431.png

 

 

In the first example, we use a category model that looks for sensitive information, such as bank account numbers. In our decision flow, if we detect sensitive information, a plain business rule responds to the customer that they should not provide sensitive and private information. With this method, we ensure compliance and also save cost, since we don't need to call the LLM.
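In the actual flow this check is a VTA category model plus an ID business rule; the Python sketch below only illustrates the equivalent logic, with a simple regular expression standing in for the category model.

```python
import re

# Patterns standing in for the VTA category model that flags sensitive data.
SENSITIVE_PATTERNS = [
    re.compile(r"\b\d{6,10}\b"),                      # account-number-like digit runs
    re.compile(r"\baccount number\b", re.IGNORECASE),
]

CANNED_RESPONSE = ("Please do not share private or sensitive information, "
                   "such as account numbers, in this chat.")

def guardrail_check(enquiry: str):
    # If sensitive information is detected, return the compliance response
    # immediately and never call the LLM (which also saves cost).
    if any(p.search(enquiry) for p in SENSITIVE_PATTERNS):
        return CANNED_RESPONSE
    return None  # nothing detected; continue on to the RAG endpoint
```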

 

For our second requirement, we might expect that most customers using this particular service chatbot are quite frustrated with the service they're experiencing - otherwise, why are they looking for help? Therefore, we want the customer experience to be more dynamic and understanding towards their wellbeing.

 

We can use a sentiment analysis model that detects whether the sentiment in the prompt is negative. If it is, we add an instruction to the system prompt telling the LLM to be more understanding. This dynamically changes the language of the LLM, and we can see in the example how it responds to a frustrated customer.
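As another illustrative sketch (the real flow uses a VTA sentiment model; the keyword check below is only a stand-in), the logic amounts to conditionally extending the system prompt:

```python
BASE_SYSTEM_PROMPT = "You are ABC Bank's customer service assistant."
EMPATHY_ADDON = (" The customer appears frustrated. Acknowledge their concern and "
                 "respond in an especially understanding, empathetic tone.")

NEGATIVE_WORDS = {"frustrated", "angry", "terrible", "unacceptable", "worst"}

def detect_sentiment(text: str) -> str:
    # Toy stand-in for the VTA sentiment model.
    return "negative" if NEGATIVE_WORDS & set(text.lower().split()) else "positive"

def build_system_prompt(enquiry: str) -> str:
    # Append the empathy instruction only when the enquiry reads as negative.
    prompt = BASE_SYSTEM_PROMPT
    if detect_sentiment(enquiry) == "negative":
        prompt += EMPATHY_ADDON
    return prompt
```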

 

YiJianChing_7-1744285969904.png

 

 

To summarize, ID is able to orchestrate GenAI components along with business rules and models into a single decision flow, and then deploy that flow into production where it can be called by external applications.

 

Evaluation using Visual Analytics

 

The final phase of all of this is: how do we know how our customer service chatbot is performing? While we're able to deploy it, as with any ML problem we need to monitor the application to ensure we're getting the right outcomes.

 

YiJianChing_8-1744286080333.png

 

Evaluation is especially tricky with LLMs - let's use an example to demonstrate this. With our use case, if we ask a question about home loans, the chatbot responds that it doesn't know. Since it's a customer service chatbot, it stands to reason it should be able to answer this, which means we have a gap in our application. However, we can't have a process where someone has to manually determine where the gaps are by repeatedly asking questions. We need a more industrial way to identify these gaps, along with a process to fix them.

 

YiJianChing_9-1744286143067.png

 

 

In SAS Viya, we can use Visual Analytics (VA) as an LLM evaluation dashboard so that we can continually monitor and evaluate the LLM application to ensure it's performing the way we want. For example, as shown above, we can simulate the run of different prompts on the decision flow itself.

We can also evaluate the performance of the LLM itself. SAS Viya integrates with any custom or third-party framework that performs LLM monitoring, allowing greater flexibility since potentially any framework can be used. In this use case, we use a Python evaluation framework known as DeepEval. This is a Python package that uses judge LLMs (essentially LLMs that are prompted to evaluate responses) to create diagnostics for evaluating the LLM. Once this framework is run, either outside of SAS Viya or orchestrated as a SAS Job on Viya, the evaluation data can be brought onto the platform and viewed in VA.

 

YiJianChing_10-1744286191928.png

 

YiJianChing_11-1744286232673.png

 

In this instance, we're interested in answer relevancy and hallucination. Broadly speaking, these metrics tell us whether the LLM is actually answering the question we're asking it. When we look at the report, even though the application is performing well overall, we can see there is one response that is flagged as not answering the question. If we dig deeper, we can see this is a question about home loans; through both answer relevancy and hallucination, we're able to systematically identify a gap in the application's ability.

 

If you're interested in reading more about the metrics being used, please refer to the DeepEval documentation on how Answer Relevancy and Hallucination are defined.

 

Deploying New Outcomes with Minimal Risk

 

How do we fix this gap? Issues like this could have multiple causes, such as a lack of data in the RAG pipeline itself, or the prompt being used. Let's investigate changing the prompt, and use this example to show how new changes can be re-deployed and picked up by the consuming application, all with minimal risk.

 

YiJianChing_4-1744285586107.png

 

If you recall, we chose to use the prompt with score 5. Following on from our investigation, we might decide that the prompt with score 4 is a better choice. Prompt 4 differs by one key line: "If a question is asked that is not related to ABC [Bank], respond with that you do not have expertise in that area". By removing this line from the prompt, perhaps the LLM can use more of its reasoning capabilities.

 

YiJianChing_12-1744286338033.png

 

 

In ID, this is as simple as changing our decision flow variable to use the new prompt from the prompt catalog, and then re-deploying.

 

YiJianChing_13-1744286368921.png

 

We can see now that when we ask the same question, we get a different response! The customer service chatbot is now able to answer the customer's question about the bank's home loan policy more accurately. Of course, there might still be avenues to improve the application, either by tuning the LLM itself or by adding data sources to the RAG pipeline, but the key loop of implementing changes in ID and then re-deploying makes it easy for any organisation to deploy those changes with minimal risk.

 

Summary

 

I hope from reading this blog you've gained an appreciation for how SAS Viya can enhance LLM applications by providing capabilities around governance and orchestration. While we walked through a customer service chatbot example for a bank, the portfolio of capabilities discussed here can be applied to a wide variety of LLM and GenAI use cases!
