In the realm of Large Language Models (LLMs), the ability to chat with unstructured text data is well established. But what if we could extend this capability to structured data? What if an AI could write SQL queries, and even perform table joins, to answer your data-related questions? This is not mere SQL fiction: with LangChain, a framework for developing applications powered by language models, and Azure OpenAI, it is reality.
To illustrate this, consider the structured data stored in an Azure SQL database. This database houses two tables: the 'Titanic' dataset from Kaggle and an invented 'Social' table filled with additional information. These tables can be joined using a common key.
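To make the setup concrete, here is a minimal, self-contained sketch of the kind of join involved, using SQLite in place of Azure SQL. The column names (`PassengerId` as the common key, `Status` in the Social table) and the sample rows are assumptions for illustration, not the actual schema:

```python
import sqlite3

# In-memory stand-in for the Azure SQL database (schema is illustrative only)
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Titanic (PassengerId INTEGER PRIMARY KEY, Name TEXT, Survived INTEGER);
CREATE TABLE Social  (PassengerId INTEGER, Status TEXT);

INSERT INTO Titanic VALUES (1, 'Braund, Mr. Owen', 0), (2, 'Cumings, Mrs. John', 1);
INSERT INTO Social  VALUES (1, 'Low'), (2, 'Middle');
""")

# The kind of query the agent has to come up with on its own:
rows = conn.execute("""
    SELECT t.Name
    FROM Titanic t
    JOIN Social s ON t.PassengerId = s.PassengerId
    WHERE s.Status = 'Middle' AND t.Survived = 1
""").fetchall()
print(rows)  # [('Cumings, Mrs. John',)]
```

The interesting part is that the agent is never told this JOIN; it has to discover the common key from the schema by itself.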
LangChain, with its SQL agent, can connect to the database and execute actions. Combined with the GPT-4 model deployed in an Azure OpenAI resource, it can answer questions by executing SQL queries, even figuring out how to join the two tables if necessary.
To demonstrate, consider a Python program run from Visual Studio Code. This program sends questions to the LangChain agent and waits for an answer. The objective is to determine if the LangChain SQL agent can join two tables to retrieve the correct answer.
Sounds like SQL-Fiction? Watch the video:
The chain of thought triggered by the question: "What is the Name of the passenger that has a Middle Status and has survived?"
"""
This script connects to a SQL database and runs a few basic queries.
It also initializes an Agent for chat interaction with the database.
"""
import os
from sqlalchemy import create_engine
from sqlalchemy.engine.url import URL
from langchain.sql_database import SQLDatabase
from langchain.chat_models import AzureChatOpenAI
from langchain.agents.agent_toolkits import SQLDatabaseToolkit
from langchain.agents import AgentExecutor
from langchain.agents.agent_types import AgentType
from langchain.agents import create_sql_agent
import config
# Set up database connection parameters
username = "myuser"
os.environ["SQL_SERVER_USERNAME"] = username
server_short = "mysqlsrv"
os.environ["SQL_SERVER"] = server_short
server = "mysqlsrv.database.windows.net"
os.environ["SQL_SERVER_ENDPOINT"] = server
password = "*********"
os.environ["SQL_SERVER_PASSWORD"] = password
database = "geldb"
os.environ["SQL_SERVER_DATABASE"] = database
driver = "{ODBC Driver 18 for SQL Server}"
# Set up SQLAlchemy connection
db_config = {
    'drivername': 'mssql+pyodbc',
    'username': os.environ["SQL_SERVER_USERNAME"] + '@' + os.environ["SQL_SERVER"],
    'password': os.environ["SQL_SERVER_PASSWORD"],
    'host': os.environ["SQL_SERVER_ENDPOINT"],
    'port': 1433,
    'database': os.environ["SQL_SERVER_DATABASE"],
    'query': {'driver': 'ODBC Driver 18 for SQL Server'}
}
db_url = URL.create(**db_config)
db = SQLDatabase.from_uri(db_url)
# Set up Azure OpenAI
os.environ["OPENAI_API_TYPE"] = "azure"
os.environ["OPENAI_API_KEY"] = config.api_key
os.environ["OPENAI_API_BASE"] = config.api_base
os.environ["OPENAI_API_VERSION"] = config.api_version
llm = AzureChatOpenAI(deployment_name="gpt-4", temperature=0, max_tokens=4000)
# Set up SQL toolkit for LangChain Agent
toolkit = SQLDatabaseToolkit(db=db, llm=llm)
# Initialize and run the Agent
agent_executor = create_sql_agent(
    llm=llm,
    toolkit=toolkit,
    verbose=True,
    agent_type=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
)
# Ask Questions
agent_executor.run("how many tables can be joined? by what key?")
# ... Add more questions here ...
In this script, several operations are performed: the database connection is configured through environment variables and an SQLAlchemy URL, the Azure OpenAI credentials are set, a GPT-4 AzureChatOpenAI model is instantiated, an SQLDatabaseToolkit is built on top of the database, and finally a zero-shot ReAct SQL agent is created and run against a question.
Prerequisites: an Azure SQL database holding the two tables, an Azure OpenAI resource with a GPT-4 deployment, and a Python environment with the langchain, sqlalchemy, and pyodbc packages (plus the ODBC Driver 18 for SQL Server).
The results are impressive. The LangChain agent successfully identifies the common key and joins the two tables to retrieve the required information. The accuracy of the results depends on the question asked and the keywords used: for example, "combine", which is semantically very similar to "join", produced the best result.
The SQL agent utilizes table metadata to determine which columns can be used in a join, underscoring the importance of metadata and data quality. Because the LLM leverages the database schema (table and column names, data types, and frequency distributions) to select columns, the metadata becomes crucial. I think that in the future, if we want AI agents to probe our data, we will be forced to choose our column names and descriptions carefully.
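As a rough sketch of the metadata the agent sees (LangChain's `SQLDatabase.get_table_info` returns the `CREATE TABLE` statements together with a few sample rows), here is an approximation built with SQLite in place of Azure SQL; the `Social` schema is invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Social (PassengerId INTEGER, Status TEXT)")
conn.execute("INSERT INTO Social VALUES (1, 'Low'), (2, 'Middle')")

# The CREATE statement gives the LLM the column names and data types ...
ddl = conn.execute(
    "SELECT sql FROM sqlite_master WHERE name = 'Social'"
).fetchone()[0]

# ... and a few sample rows let it judge what the columns actually contain
sample = conn.execute("SELECT * FROM Social LIMIT 3").fetchall()

table_info = f"{ddl}\n/* sample rows: {sample} */"
print(table_info)
```

Everything the agent knows about your data arrives through this textual snapshot, which is why well-chosen names matter so much.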
When it comes to code generation, the LangChain agent prompts the LLM to create SQL queries. These queries are then tested against the database until a successful query is achieved. Even in case of a SQL syntax error, the agent feeds the error back into the LLM, which then refines the SQL query, exemplifying an effective 'brute-force' approach.
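That feedback loop can be sketched as follows; the stub `llm` function stands in for the real model call, and the bounded retry logic is a simplification of what the agent actually does:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Titanic (Name TEXT, Survived INTEGER)")
conn.execute("INSERT INTO Titanic VALUES ('Cumings, Mrs. John', 1)")

# Stub LLM: the first attempt contains a syntax error; the second, produced
# after "seeing" the error message, is corrected. A real agent calls the
# model here instead of reading from a canned list.
attempts = iter([
    "SELEC Name FROM Titanic WHERE Survived = 1",   # typo: SELEC
    "SELECT Name FROM Titanic WHERE Survived = 1",
])

def llm(prompt):
    return next(attempts)

prompt = "Write SQL to list the survivors."
for _ in range(3):                      # bounded number of retries
    query = llm(prompt)
    try:
        result = conn.execute(query).fetchall()
        break                           # success: stop retrying
    except sqlite3.OperationalError as err:
        # Feed the database error back so the model can refine the query
        prompt = f"{prompt}\nPrevious query failed with: {err}"

print(result)  # [('Cumings, Mrs. John',)]
```

The essential trick is that the raw database error message becomes part of the next prompt, which is usually enough context for the model to repair its own SQL.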
It seems that, whether human or machine, the tendency to take shortcuts to avoid effort or mental fatigue is a shared trait: an AI that learnt from the code we threw at it also learnt to take shortcuts, just as we do when we program. Interestingly, the LLM is capable of inferring that two columns contain similar information even when their data types differ. In my test, however, that inference was incorrect. Unless you know the data and test these inferences, you will not know you have been given a wrong answer.
In conclusion, LangChain and Azure OpenAI have shown remarkable capabilities in interacting with structured data, writing SQL queries, and performing table joins. As we move forward, the importance of metadata, data quality and the ability of humans to detect logic errors, will continue to play a significant role in achieving accurate results.
My inspiration comes from Valentina Alto's post, Building your own DB Copilot for Azure SQL with Azure OpenAI GPT-4.
Thank you for your time reading this post. If you liked the post, give it a thumbs up! Please comment and tell us what you think about having conversations with your data. If you wish to get more information, please write me an email.