Instead of directly searching in the SAS Information Catalog, we can use an intelligent agent, powered by an LLM, to examine a catalog extract. This extract, a CSV file, houses metrics about cataloged tables and files obtained from the SAS Information Catalog through the Catalog REST API. This indirect approach provides meaningful insights about the Catalog's assets. As we'll explore, there are potential advantages to using an LLM. It allows for follow-up questions based on the answers received and enables output formatting according to the user's preference.
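For instance, the extract step flattens the JSON returned by the Catalog REST API into CSV rows. Below is a minimal sketch of that flattening, assuming a hypothetical 'items' payload shape and metric names; the real Catalog API response schema may differ.

```python
import csv
import io
import json

# Hypothetical payload: the field names below are assumptions for
# illustration, not the actual Catalog REST API response schema.
payload = json.loads("""{"items": [
    {"name": "SALES", "type": "table", "completeness": 0.90},
    {"name": "HR",    "type": "table", "completeness": 0.99}
]}""")

# Flatten the items into CSV rows, one row per cataloged asset
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["name", "type", "completeness"])
writer.writeheader()
for item in payload["items"]:
    writer.writerow(item)

print(buf.getvalue().splitlines()[0])  # name,type,completeness
```

In the real workflow, the payload would come from an authenticated GET request against the Catalog REST API and the rows would be written to catalog_download.csv.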
For clarity, throughout this post 'the LLM' refers to GPT-4 from Azure OpenAI, and 'the Catalog' refers to the SAS Information Catalog.
In brief, the steps to use an LLM to search through cataloged assets are as follows:
1. Extract the cataloged assets' metrics to a CSV file through the Catalog REST API.
2. Define a LangChain agent, powered by the LLM, that reads the CSV file.
3. Ask the agent questions about the cataloged assets, with follow-up questions as needed.
The following diagram highlights well the different components:
Disclaimer: the image was generated using my personal ChatGPT Plus subscription.
Watch the following videos, where we explore the approach.
The comparison is carried out through several examples where tables are queried using specific keywords. Both methods demonstrate similar proficiency in finding tables with column keywords such as 'temple' and 'brussels', and even keywords in different languages. Whether searching tables by table keywords, filtering by data quality completeness below 95%, or identifying columns that contain content in Mandarin, the search results from both methods are closely matched.
However, there are instances where the results differ, such as when querying for columns with private data or for column semantic types.
This discrepancy occurs because the catalog uses facets for search, while the LLM searches for a keyword across all cataloged facets.
The flexibility of the LLM becomes apparent, especially when it comes to customizing the query output.
The LLM also excels at free-text searches, such as locating references to specific keywords or queries that use Chinese characters (汉字 Hànzì).
Findings:
Below are a few queries and their results using either the catalog search or the LLM query.
Prompt: How many unique tables have data quality completeness less than 0.95 or 95%? What is their name?
Catalog: 16 results.
LLM: 16 results.
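Under the hood, the LangChain agent answers a question like this by generating pandas code against the CSV extract. The equivalent query can be sketched directly; the column names 'Table name' and 'Completeness' below are assumptions, not the actual catalog_download.csv schema.

```python
import pandas as pd

# Hypothetical extract: 'Table name' and 'Completeness' are assumed
# column names, not the real catalog_download.csv schema.
df = pd.DataFrame({
    "Table name": ["SALES", "SALES", "HR", "FINANCE"],
    "Column name": ["amount", "region", "salary", "budget"],
    "Completeness": [0.90, 0.99, 0.93, 0.97],
})

# Unique tables where any completeness metric falls below 95%
low = df[df["Completeness"] < 0.95]["Table name"].unique()
print(len(low), sorted(low))  # 2 ['HR', 'SALES']
```

Note that the same table can appear on several rows (one per column), which is why the question asks for unique tables.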
Prompt: List the Library – the Table – the Column – semantic type, where the information privacy of the column is private (private data was found in that column).
Catalog: 20 results.
LLM: 15 results. The difference can be attributed to the manner in which we phrase the question. If we aim to identify all columns with private data, tables may be repeated. In contrast, the catalog displays all tables with at least one column containing private data. The LLM search should yield more results. However, it doesn't. Could this be due to token limits affecting the completeness of the output?
Prompt: Which tables contain columns semantic type cataloged as ‘place’?
Catalog: 7 results.
LLM: 12 results. This is intriguing and can be attributed to the fact that 'place' is searched across various fields, not just confined to the semantic type column.
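The count difference can be reproduced on a toy extract: a faceted search filters a single field, while a keyword search scans every field of every row. A sketch with made-up rows and illustrative column names:

```python
import pandas as pd

# Toy extract: the rows and column names are illustrative assumptions,
# not the real catalog_download.csv schema.
df = pd.DataFrame({
    "Table name": ["TEMPLES", "HOTELS", "MARKETS"],
    "Column name": ["site", "address", "place"],
    "Semantic type": ["place", "address", "text"],
})

# Facet-style search: only the semantic-type field is checked
facet = df[df["Semantic type"] == "place"]

# Keyword-style search: 'place' can match in any field of a row
keyword = df[df.apply(lambda row: row.str.contains("place", case=False).any(), axis=1)]

print(len(facet), len(keyword))  # 1 2
```

The keyword search matches the extra row where 'place' appears as a column name rather than as a semantic type, mirroring the 7-versus-12 discrepancy above.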
Here is the Python program that defines the LangChain intelligent agent, which interacts with the SAS Information Catalog:
# The program allows you to chat with a CSV file.
# Install packages:
#   pip install -U python-dotenv langchain langchain-openai langchain_experimental langchain_core pandas tabulate
import os

# Import Azure OpenAI
from langchain_openai import AzureChatOpenAI

# Import the Azure OpenAI variables from a .env file, assumed to be in the
# same folder as this program, and set the environment variables needed by LangChain.
from dotenv import load_dotenv

load_dotenv()
azure_oai_endpoint = os.getenv("AZURE_OAI_ENDPOINT")
azure_oai_key = os.getenv("AZURE_OAI_KEY")
azure_oai_deployment = os.getenv("AZURE_OAI_DEPLOYMENT")
azure_oai_model = os.getenv("AZURE_OAI_DEPLOYMENT")

os.environ["OPENAI_API_TYPE"] = "azure"
os.environ["AZURE_OPENAI_API_KEY"] = azure_oai_key
os.environ["AZURE_OPENAI_ENDPOINT"] = azure_oai_endpoint
os.environ["OPENAI_API_VERSION"] = "2023-12-01-preview"

# Define the LLM
llm = AzureChatOpenAI(deployment_name=azure_oai_deployment, temperature=0, max_tokens=2000)
print(llm)

# CSV file - the catalog download
my_csv_file = 'catalog_download.csv'

# Use the CSV DataFrame agent in LangChain
from langchain.agents.agent_types import AgentType
from langchain_experimental.agents.agent_toolkits import create_csv_agent

# Set up the prompt and memory
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder

MEMORY_KEY = "chat_history"
prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            """Answer the questions step by step. The history of the messages is critical and important to use. """,
        ),
        MessagesPlaceholder(variable_name=MEMORY_KEY),
        ("user", "{input}"),
        MessagesPlaceholder(variable_name="agent_scratchpad"),
    ]
)

from langchain.agents.format_scratchpad import format_to_openai_function_messages
from langchain.agents.output_parsers import OpenAIFunctionsAgentOutputParser
from langchain_core.messages import AIMessage, HumanMessage

chat_history = []

# A custom agent chain composed with the prompt and memory defined above.
agent = (
    {
        "input": lambda x: x["input"],
        "agent_scratchpad": lambda x: format_to_openai_function_messages(
            x["intermediate_steps"]
        ),
        "chat_history": lambda x: x["chat_history"],
    }
    | prompt
    | llm
    | OpenAIFunctionsAgentOutputParser()
)

# Note: the CSV agent created here replaces the chain above as the agent
# actually used to answer questions.
agent = create_csv_agent(
    llm, my_csv_file, verbose=True, agent_type=AgentType.OPENAI_FUNCTIONS)

# Ask questions
print('Start chatting with your file / table...\n')

# Continuously listen for text input to send to Azure OpenAI
def chat_with_open_ai():
    while True:
        print("Azure OpenAI is listening. Write 'Stop' or press Ctrl-Z to end the conversation.")
        try:
            # Get the question
            q = input("Enter your question: \n")
            if q == "Stop":
                print("Conversation ended.")
                break
            else:
                print("\nI will ask: ", q)
                result = agent.invoke({"input": q, "chat_history": chat_history})
                # Keep the question and answer in memory for follow-up questions
                chat_history.extend(
                    [
                        HumanMessage(content=q),
                        AIMessage(content=result["output"]),
                    ]
                )
        except EOFError:
            break

# Main
try:
    chat_with_open_ai()
except Exception as err:
    print("Encountered exception. {}".format(err))
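The program expects a .env file alongside it. A sample layout with placeholder values follows; the variable names match the os.getenv calls in the program, while the values themselves are assumptions you would replace with your own Azure OpenAI resource details.

```
AZURE_OAI_ENDPOINT=https://your-resource.openai.azure.com/
AZURE_OAI_KEY=your-api-key
AZURE_OAI_DEPLOYMENT=your-gpt-4-deployment-name
```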
This post compares the efficacy of the SAS Information Catalog search and an LLM-powered intelligent agent (GPT-4 from Azure OpenAI) in querying cataloged assets.
The comparison is conducted through several examples where tables are searched using specific keywords. Both systems display similar proficiency in locating tables with single keywords like 'temple', 'brussels', and even multi-lingual variations of a keyword. The same pattern is consistent when searching multiple keywords, such as 'fun', 'entertainment', 'night', 'show', searching for tables with data quality completeness less than 95% or identifying columns cataloged as containing content in Mandarin.
However, there are instances when the results differ, such as when querying for columns with private data or columns cataloged as 'place'.
The LLM also proves superior in handling free-text searches, such as finding references to 'Feng Feng' or 'St Patrick', or handling non-alphabetic characters.
Finally, the LLM's flexibility and ability to provide fresh perspectives are highlighted, particularly when context awareness and memory are required.
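The context awareness comes from the chat_history list in the program above: each question and answer is appended after every turn and passed back on the next call. A minimal stand-in, with the agent stubbed out, illustrates just the memory mechanics without requiring LangChain:

```python
# Minimal memory pattern: the agent call is stubbed; only the
# history-threading mechanics mirror the LangChain program above.
def fake_agent(question, history):
    # A real agent would condition its answer on the history;
    # here we just number the turns.
    return f"answer #{len(history) // 2 + 1} to: {question}"

chat_history = []
for q in ["List tables with private columns.", "Format that as a bullet list."]:
    a = fake_agent(q, chat_history)
    # Append the turn so the next question can refer back to it
    chat_history.extend([("user", q), ("assistant", a)])

print(len(chat_history))  # 4 entries: two (question, answer) pairs
```

This is what makes the follow-up question "Format that as a bullet list." resolvable: the previous answer is part of the input on the next turn.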
Using a Catalog extract also has its disadvantages, the most significant being that it is an offline process, disconnected from updates in the Catalog or the data structure. Whenever new data is cataloged or existing entries change, the CSV extract must be regenerated to reflect those changes.
In conclusion, LLM's advanced capabilities make it a powerful tool for information retrieval from cataloged assets.
Thanks to Nancy Rausch and Lavanya Ganesh for initiating the question:
Can we use an LLM to query the catalog data?
Thank you for your time reading this post. If you liked the post, give it a thumbs up! Please comment and tell us what you think about having conversations with your data. If you wish to get more information, please write me an email.
Find more articles from SAS Global Enablement and Learning here.