Instead of directly searching in the SAS Information Catalog, we can use an intelligent agent, powered by an LLM, to examine a catalog extract. This extract, a CSV file, houses metrics about cataloged tables and files obtained from the SAS Information Catalog through the Catalog REST API. This indirect approach provides meaningful insights about the Catalog's assets. As we'll explore, there are potential advantages to using an LLM. It allows for follow-up questions based on the answers received and enables output formatting according to the user's preference.
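For instance, the extract step flattens the JSON returned by the Catalog REST API into CSV rows. Below is a minimal sketch of that flattening, assuming a hypothetical 'items' payload shape and metric names; the real Catalog API response schema may differ.

```python
import csv
import io
import json

# Hypothetical payload: the field names below are assumptions for
# illustration, not the actual Catalog REST API response schema.
payload = json.loads("""{"items": [
    {"name": "SALES", "type": "table", "completeness": 0.90},
    {"name": "HR",    "type": "table", "completeness": 0.99}
]}""")

# Flatten the items into CSV rows, one row per cataloged asset
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["name", "type", "completeness"])
writer.writeheader()
for item in payload["items"]:
    writer.writerow(item)

print(buf.getvalue().splitlines()[0])  # name,type,completeness
```

In the real workflow, the payload would come from an authenticated GET request against the Catalog REST API and the rows would be written to catalog_download.csv.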
For clarity, throughout this post 'the LLM' refers to GPT-4 from Azure OpenAI, and 'the Catalog' refers to the SAS Information Catalog.
In brief, the steps to use an LLM to search through cataloged assets are as follows:
1. Extract the cataloged assets' metrics to a CSV file through the Catalog REST API.
2. Define a LangChain agent, powered by the LLM, that reads the CSV file.
3. Ask the agent questions about the cataloged assets, with follow-up questions as needed.
The following diagram highlights well the different components:
Disclaimer: the image was generated using my personal ChatGPT Plus subscription.
Watch the following videos, where we explore the approach.
The comparison is carried out through several examples where tables are queried using specific keywords. Both methods demonstrate similar proficiency in finding tables with column keywords such as 'temple' and 'brussels', and even keywords in different languages. Whether searching tables by table keywords, filtering by data quality completeness below 95%, or identifying columns that contain content in Mandarin, the search results from both methods are closely matched.
However, there are instances where the results differ, such as when querying for columns with private data or for column semantic types.
This discrepancy occurs because the catalog uses facets for search, while the LLM searches for a keyword across all cataloged facets.
The flexibility of the LLM becomes apparent, especially when it comes to customizing the query output.
The LLM also excels at free-text searches, such as locating references to specific keywords or queries that use Chinese characters (汉字 Hànzì).
Findings:
Below are a few queries and their results using either the catalog search or the LLM query.
Prompt: How many unique tables have data quality completeness less than 0.95 or 95%? What is their name?
Catalog: 16 results.
LLM: 16 results.
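Under the hood, the LangChain agent answers a question like this by generating pandas code against the CSV extract. The equivalent query can be sketched directly; the column names 'Table name' and 'Completeness' below are assumptions, not the actual catalog_download.csv schema.

```python
import pandas as pd

# Hypothetical extract: 'Table name' and 'Completeness' are assumed
# column names, not the real catalog_download.csv schema.
df = pd.DataFrame({
    "Table name": ["SALES", "SALES", "HR", "FINANCE"],
    "Column name": ["amount", "region", "salary", "budget"],
    "Completeness": [0.90, 0.99, 0.93, 0.97],
})

# Unique tables where any completeness metric falls below 95%
low = df[df["Completeness"] < 0.95]["Table name"].unique()
print(len(low), sorted(low))  # 2 ['HR', 'SALES']
```

Note that the same table can appear on several rows (one per column), which is why the question asks for unique tables.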
Prompt: List the Library – the Table – the Column – semantic type, where the information privacy of the column is private (private data was found in that column).
Catalog: 20 results.
LLM: 15 results. The difference can be attributed to the manner in which we phrase the question. If we aim to identify all columns with private data, tables may be repeated. In contrast, the catalog displays all tables with at least one column containing private data. The LLM search should yield more results. However, it doesn't. Could this be due to token limits affecting the completeness of the output?
Prompt: Which tables contain columns semantic type cataloged as ‘place’?
Catalog: 7 results.
LLM: 12 results. This is intriguing and can be attributed to the fact that 'place' is searched across various fields, not just confined to the semantic type column.
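The count difference can be reproduced on a toy extract: a faceted search filters a single field, while a keyword search scans every field of every row. A sketch with made-up rows and illustrative column names:

```python
import pandas as pd

# Toy extract: the rows and column names are illustrative assumptions,
# not the real catalog_download.csv schema.
df = pd.DataFrame({
    "Table name": ["TEMPLES", "HOTELS", "MARKETS"],
    "Column name": ["site", "address", "place"],
    "Semantic type": ["place", "address", "text"],
})

# Facet-style search: only the semantic-type field is checked
facet = df[df["Semantic type"] == "place"]

# Keyword-style search: 'place' can match in any field of a row
keyword = df[df.apply(lambda row: row.str.contains("place", case=False).any(), axis=1)]

print(len(facet), len(keyword))  # 1 2
```

The keyword search matches the extra row where 'place' appears as a column name rather than as a semantic type, mirroring the 7-versus-12 discrepancy above.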
Here is the Python program that defines the LangChain intelligent agent, which interacts with the SAS Information Catalog:
# The program allows you to chat with a CSV file.
# Install packages:
#   pip install -U python-dotenv langchain langchain-openai langchain_experimental langchain_core pandas tabulate
import os

# Import Azure OpenAI
from langchain_openai import AzureChatOpenAI

# Import the Azure OpenAI variables from a .env file, assumed to be in the
# same folder as this program, and set the environment variables needed by LangChain.
from dotenv import load_dotenv

load_dotenv()
azure_oai_endpoint = os.getenv("AZURE_OAI_ENDPOINT")
azure_oai_key = os.getenv("AZURE_OAI_KEY")
azure_oai_deployment = os.getenv("AZURE_OAI_DEPLOYMENT")
azure_oai_model = os.getenv("AZURE_OAI_DEPLOYMENT")

os.environ["OPENAI_API_TYPE"] = "azure"
os.environ["AZURE_OPENAI_API_KEY"] = azure_oai_key
os.environ["AZURE_OPENAI_ENDPOINT"] = azure_oai_endpoint
os.environ["OPENAI_API_VERSION"] = "2023-12-01-preview"

# Define the LLM
llm = AzureChatOpenAI(deployment_name=azure_oai_deployment, temperature=0, max_tokens=2000)
print(llm)

# CSV file - the catalog download
my_csv_file = 'catalog_download.csv'

# Use the CSV DataFrame agent in LangChain
from langchain.agents.agent_types import AgentType
from langchain_experimental.agents.agent_toolkits import create_csv_agent

# Set up the prompt and memory
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder

MEMORY_KEY = "chat_history"
prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            """Answer the questions step by step. The history of the messages is critical and important to use. """,
        ),
        MessagesPlaceholder(variable_name=MEMORY_KEY),
        ("user", "{input}"),
        MessagesPlaceholder(variable_name="agent_scratchpad"),
    ]
)

from langchain.agents.format_scratchpad import format_to_openai_function_messages
from langchain.agents.output_parsers import OpenAIFunctionsAgentOutputParser
from langchain_core.messages import AIMessage, HumanMessage

chat_history = []

# A custom agent chain composed with the prompt and memory defined above.
agent = (
    {
        "input": lambda x: x["input"],
        "agent_scratchpad": lambda x: format_to_openai_function_messages(
            x["intermediate_steps"]
        ),
        "chat_history": lambda x: x["chat_history"],
    }
    | prompt
    | llm
    | OpenAIFunctionsAgentOutputParser()
)

# Note: the CSV agent created here replaces the chain above as the agent
# actually used to answer questions.
agent = create_csv_agent(
    llm, my_csv_file, verbose=True, agent_type=AgentType.OPENAI_FUNCTIONS)

# Ask questions
print('Start chatting with your file / table...\n')

# Continuously listen for text input to send to Azure OpenAI
def chat_with_open_ai():
    while True:
        print("Azure OpenAI is listening. Write 'Stop' or press Ctrl-Z to end the conversation.")
        try:
            # Get the question
            q = input("Enter your question: \n")
            if q == "Stop":
                print("Conversation ended.")
                break
            else:
                print("\nI will ask: ", q)
                result = agent.invoke({"input": q, "chat_history": chat_history})
                # Keep the question and answer in memory for follow-up questions
                chat_history.extend(
                    [
                        HumanMessage(content=q),
                        AIMessage(content=result["output"]),
                    ]
                )
        except EOFError:
            break

# Main
try:
    chat_with_open_ai()
except Exception as err:
    print("Encountered exception. {}".format(err))
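The program expects a .env file alongside it. A sample layout with placeholder values follows; the variable names match the os.getenv calls in the program, while the values themselves are assumptions you would replace with your own Azure OpenAI resource details.

```
AZURE_OAI_ENDPOINT=https://your-resource.openai.azure.com/
AZURE_OAI_KEY=your-api-key
AZURE_OAI_DEPLOYMENT=your-gpt-4-deployment-name
```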
This post compares the efficacy of the SAS Information Catalog search and an LLM-powered intelligent agent (GPT-4 from Azure OpenAI) in querying cataloged assets.
The comparison is conducted through several examples where tables are searched using specific keywords. Both systems display similar proficiency in locating tables with single keywords like 'temple', 'brussels', and even multi-lingual variations of a keyword. The same pattern is consistent when searching multiple keywords, such as 'fun', 'entertainment', 'night', 'show', searching for tables with data quality completeness less than 95% or identifying columns cataloged as containing content in Mandarin.
However, there are instances when the results differ, such as when querying for columns with private data or columns cataloged as 'place'.
The LLM also proves superior in handling free-text searches, such as finding references to 'Feng Feng' or 'St Patrick', or handling non-alphabetic characters.
Finally, the LLM's flexibility and ability to provide fresh perspectives are highlighted, particularly when context awareness and memory are required.
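The context awareness comes from the chat_history list in the program above: each question and answer is appended after every turn and passed back on the next call. A minimal stand-in, with the agent stubbed out, illustrates just the memory mechanics without requiring LangChain:

```python
# Minimal memory pattern: the agent call is stubbed; only the
# history-threading mechanics mirror the LangChain program above.
def fake_agent(question, history):
    # A real agent would condition its answer on the history;
    # here we just number the turns.
    return f"answer #{len(history) // 2 + 1} to: {question}"

chat_history = []
for q in ["List tables with private columns.", "Format that as a bullet list."]:
    a = fake_agent(q, chat_history)
    # Append the turn so the next question can refer back to it
    chat_history.extend([("user", q), ("assistant", a)])

print(len(chat_history))  # 4 entries: two (question, answer) pairs
```

This is what makes the follow-up question "Format that as a bullet list." resolvable: the previous answer is part of the input on the next turn.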
Using a Catalog extract also has its disadvantages, the most significant being that it is an offline process, disconnected from updates in the Catalog or the data structure. Whenever new data is cataloged or existing entries change, the CSV extract must be regenerated to reflect those changes.
In conclusion, LLM's advanced capabilities make it a powerful tool for information retrieval from cataloged assets.
Thanks to Nancy Rausch and Lavanya Ganesh for initiating the question:
Can we use an LLM to query the catalog data?
Thank you for your time reading this post. If you liked the post, give it a thumbs up! Please comment and tell us what you think about having conversations with your data. If you wish to get more information, please write me an email.
Find more articles from SAS Global Enablement and Learning here.