In the realm of Large Language Models (LLMs), the ability to chat with unstructured text data is well established. But what if we could extend this capability to structured data? What if an AI could write SQL queries, and even perform table joins, to answer your data-related questions? This is not mere SQL fiction: with LangChain, a framework for developing applications powered by language models, and Azure OpenAI, it is reality.
To illustrate this, consider the structured data stored in an Azure SQL database. This database houses two tables: the 'Titanic' dataset from Kaggle and an invented 'Social' table filled with additional information. These tables can be joined using a common key.
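To make the setup concrete, here is a minimal, self-contained sketch of the kind of join involved, using SQLite in place of Azure SQL. The column names (`PassengerId` as the common key, `Status` in the Social table) and the sample rows are assumptions for illustration, not the actual schema:

```python
import sqlite3

# In-memory stand-in for the Azure SQL database (schema is illustrative only)
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Titanic (PassengerId INTEGER PRIMARY KEY, Name TEXT, Survived INTEGER);
CREATE TABLE Social  (PassengerId INTEGER, Status TEXT);

INSERT INTO Titanic VALUES (1, 'Braund, Mr. Owen', 0), (2, 'Cumings, Mrs. John', 1);
INSERT INTO Social  VALUES (1, 'Low'), (2, 'Middle');
""")

# The kind of query the agent has to come up with on its own:
rows = conn.execute("""
    SELECT t.Name
    FROM Titanic t
    JOIN Social s ON t.PassengerId = s.PassengerId
    WHERE s.Status = 'Middle' AND t.Survived = 1
""").fetchall()
print(rows)  # [('Cumings, Mrs. John',)]
```

The interesting part is that the agent is never told this JOIN; it has to discover the common key from the schema by itself.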
LangChain, with its SQL agent, can connect to the database and execute actions. Combined with the GPT-4 model deployed in an Azure OpenAI resource, it can answer questions by executing SQL queries, even figuring out how to join the two tables if necessary.
To demonstrate, consider a Python program run from Visual Studio Code. This program sends questions to the LangChain agent and waits for an answer. The objective is to determine if the LangChain SQL agent can join two tables to retrieve the correct answer.
Sounds like SQL-Fiction? Watch the video:
The chain of thought triggered by the question: "What is the Name of the passenger that has a Middle Status and has survived?"
"""
This script connects to a SQL database and runs a few basic queries.
It also initializes an Agent for chat interaction with the database.
"""
import os
from sqlalchemy import create_engine
from sqlalchemy.engine.url import URL
from langchain.sql_database import SQLDatabase
from langchain.chat_models import AzureChatOpenAI
from langchain.agents.agent_toolkits import SQLDatabaseToolkit
from langchain.agents import AgentExecutor
from langchain.agents.agent_types import AgentType
from langchain.agents import create_sql_agent
import config
# Set up database connection parameters
username = "myuser"
os.environ["SQL_SERVER_USERNAME"] = username
server_short = "mysqlsrv"
os.environ["SQL_SERVER"] = server_short
server = "mysqlsrv.database.windows.net"
os.environ["SQL_SERVER_ENDPOINT"] = server
password = "*********"
os.environ["SQL_SERVER_PASSWORD"] = password
database = "geldb"
os.environ["SQL_SERVER_DATABASE"] = database
driver = "{ODBC Driver 18 for SQL Server}"
# Set up SQLAlchemy connection
db_config = {
    'drivername': 'mssql+pyodbc',
    'username': os.environ["SQL_SERVER_USERNAME"] + '@' + os.environ["SQL_SERVER"],
    'password': os.environ["SQL_SERVER_PASSWORD"],
    'host': os.environ["SQL_SERVER_ENDPOINT"],
    'port': 1433,
    'database': os.environ["SQL_SERVER_DATABASE"],
    'query': {'driver': 'ODBC Driver 18 for SQL Server'}
}
db_url = URL.create(**db_config)
db = SQLDatabase.from_uri(db_url)
# Set up Azure OpenAI
os.environ["OPENAI_API_TYPE"] = "azure"
os.environ["OPENAI_API_KEY"] = config.api_key
os.environ["OPENAI_API_BASE"] = config.api_base
os.environ["OPENAI_API_VERSION"] = config.api_version
llm = AzureChatOpenAI(deployment_name="gpt-4", temperature=0, max_tokens=4000)
# Set up SQL toolkit for LangChain Agent
toolkit = SQLDatabaseToolkit(db=db, llm=llm)
# Initialize and run the Agent
agent_executor = create_sql_agent(
    llm=llm,
    toolkit=toolkit,
    verbose=True,
    agent_type=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
)
# Ask Questions
agent_executor.run("how many tables can be joined? by what key?")
# ... Add more questions here ...
In this script, several operations are performed: the database connection is configured through environment variables and an SQLAlchemy URL, the Azure OpenAI credentials are set, a GPT-4 AzureChatOpenAI model is instantiated, an SQLDatabaseToolkit is built on top of the database, and finally a zero-shot ReAct SQL agent is created and run against a question.
Prerequisites: an Azure SQL database holding the two tables, an Azure OpenAI resource with a GPT-4 deployment, and a Python environment with the langchain, sqlalchemy, and pyodbc packages (plus the ODBC Driver 18 for SQL Server).
The results are impressive. The LangChain agent successfully identifies the common key and joins the two tables to retrieve the required information. The accuracy of the results depends on the question asked and the keywords used: for example, "combine", which is semantically very similar to "join", produced the best result.
The SQL agent utilizes table metadata to determine which columns can be used in a join, underscoring the importance of metadata and data quality. Because the LLM leverages the database schema (table and column names, data types, and frequency distributions) to select columns, the metadata becomes crucial. I think that in the future, if we want AI agents to probe our data, we will be forced to choose our column names and descriptions carefully.
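As a rough sketch of the metadata the agent sees (LangChain's `SQLDatabase.get_table_info` returns the `CREATE TABLE` statements together with a few sample rows), here is an approximation built with SQLite in place of Azure SQL; the `Social` schema is invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Social (PassengerId INTEGER, Status TEXT)")
conn.execute("INSERT INTO Social VALUES (1, 'Low'), (2, 'Middle')")

# The CREATE statement gives the LLM the column names and data types ...
ddl = conn.execute(
    "SELECT sql FROM sqlite_master WHERE name = 'Social'"
).fetchone()[0]

# ... and a few sample rows let it judge what the columns actually contain
sample = conn.execute("SELECT * FROM Social LIMIT 3").fetchall()

table_info = f"{ddl}\n/* sample rows: {sample} */"
print(table_info)
```

Everything the agent knows about your data arrives through this textual snapshot, which is why well-chosen names matter so much.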
When it comes to code generation, the LangChain agent prompts the LLM to create SQL queries. These queries are then tested against the database until a successful query is achieved. Even in case of a SQL syntax error, the agent feeds the error back into the LLM, which then refines the SQL query, exemplifying an effective 'brute-force' approach.
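That feedback loop can be sketched as follows; the stub `llm` function stands in for the real model call, and the bounded retry logic is a simplification of what the agent actually does:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Titanic (Name TEXT, Survived INTEGER)")
conn.execute("INSERT INTO Titanic VALUES ('Cumings, Mrs. John', 1)")

# Stub LLM: the first attempt contains a syntax error; the second, produced
# after "seeing" the error message, is corrected. A real agent calls the
# model here instead of reading from a canned list.
attempts = iter([
    "SELEC Name FROM Titanic WHERE Survived = 1",   # typo: SELEC
    "SELECT Name FROM Titanic WHERE Survived = 1",
])

def llm(prompt):
    return next(attempts)

prompt = "Write SQL to list the survivors."
for _ in range(3):                      # bounded number of retries
    query = llm(prompt)
    try:
        result = conn.execute(query).fetchall()
        break                           # success: stop retrying
    except sqlite3.OperationalError as err:
        # Feed the database error back so the model can refine the query
        prompt = f"{prompt}\nPrevious query failed with: {err}"

print(result)  # [('Cumings, Mrs. John',)]
```

The essential trick is that the raw database error message becomes part of the next prompt, which is usually enough context for the model to repair its own SQL.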
It seems that, whether human or machine, the tendency to take shortcuts to avoid effort or mental fatigue is a shared trait: an AI that learnt from the code we threw at it also learnt to take shortcuts, just as we do when we program. Interestingly, the LLM is capable of inferring that two columns contain similar information even when their data types differ. In my test, however, that inference was incorrect. Unless you know the data and test these inferences, you will not know you have been given a wrong answer.
In conclusion, LangChain and Azure OpenAI have shown remarkable capabilities in interacting with structured data, writing SQL queries, and performing table joins. As we move forward, the importance of metadata, data quality and the ability of humans to detect logic errors, will continue to play a significant role in achieving accurate results.
My inspiration comes from Valentina Alto's post, Building your own DB Copilot for Azure SQL with Azure OpenAI GPT-4.
Thank you for your time reading this post. If you liked the post, give it a thumbs up! Please comment and tell us what you think about having conversations with your data. If you wish to get more information, please write me an email.