Building a Database Agent — using Hugging Face and LangChain
Database agents are an interesting way to interact with complex data and extract what is relevant without needing a query language like SQL. The LLM can perform the database request for you regardless of the database provider, data model, or query language. In effect, the LLM acts as an abstraction layer: we talk to it in natural language, and it handles the necessary database interaction for us.
You don't need to be an expert in writing queries to get data out of a database; you just have to let the LLM do that for you. Here is a small project showing how we can leverage an LLM agent to do the necessary querying while we describe our requirement in plain English.
Instead of a database, I would like to start with a plain CSV file that has around 20k records of health data for different states in the USA. Let us start building it.
Step 1 — Build an AI agent using LangChain and a Hugging Face LLM
Create the model using the Transformers library. Since I want to run on CPU, it is best to use a lightweight LLM; in this scenario I used the "Qwen/Qwen2.5-0.5B-Instruct" model.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
from langchain_huggingface import ChatHuggingFace, HuggingFacePipeline
from langchain.schema import HumanMessage, SystemMessage

MODEL_ID = "Qwen/Qwen2.5-0.5B-Instruct"  # model you chose

tok = AutoTokenizer.from_pretrained(MODEL_ID, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    device_map="cpu",            # force CPU
    torch_dtype=torch.float32,   # CPU runs best in float32
    trust_remote_code=True,
)

gen = pipeline(
    "text-generation",
    model=model,
    tokenizer=tok,
    max_new_tokens=128,
    do_sample=True,
)

# Wrap the HF pipeline as a LangChain LLM
hf_llm = HuggingFacePipeline(pipeline=gen)
- MODEL_ID: Chooses which model to load from Hugging Face. Here you’re using Qwen2.5-0.5B-Instruct, a small instruction-tuned LLM.
- Tokenizer: Converts human text into numerical tokens that the model can understand. use_fast=True makes tokenization faster using a Rust-based implementation.
- AutoModelForCausalLM: Loads a model for text generation (causal language modeling = predicting next word).
- device_map="cpu": Forces it to run on CPU instead of GPU.
- torch_dtype=torch.float32: Ensures stable numerical precision on CPU (faster and safer than float16 on CPU).
- trust_remote_code=True: Allows custom model code from Hugging Face repo (needed for some community models).
- Pipeline: High-level Hugging Face wrapper that makes inference easy.
- Task = “text-generation”: Tells pipeline you want autoregressive text output.
- max_new_tokens=128: The model can generate up to 128 new tokens.
- do_sample=True: Enables randomness (sampling) instead of greedy output → makes answers more varied.
So gen("Hello") would produce a continuation of "Hello ...".
Finally, we wrap the pipeline for LangChain.
- Hugging Face pipelines are great, but LangChain expects an LLM interface.
- HuggingFacePipeline acts as an adapter so you can use this model inside LangChain workflows (chains, agents, etc.).
Wrapping hf_llm further with ChatHuggingFace means you can pass SystemMessage and HumanMessage into the model like a chatbot.
Step 2 — Read the CSV using pandas and create a Pandas DataFrame agent
Now that we have the LLM ready and wrapped for LangChain, we can read the CSV file using the pandas library and then create the DataFrame agent.
import pandas as pd

df = pd.read_csv("./all-states-history.csv").fillna(value=0)

from langchain.agents.agent_types import AgentType
from langchain_experimental.agents.agent_toolkits import create_pandas_dataframe_agent

agent = create_pandas_dataframe_agent(
    llm=hf_llm,
    df=df,                        # your existing DataFrame
    allow_dangerous_code=True,
    agent_type=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    include_df_in_prompt=False,   # keep input small
    number_of_head_rows=0,
    max_iterations=4,
    max_execution_time=30,
    return_intermediate_steps=True,
    verbose=False,
    agent_executor_kwargs={"handle_parsing_errors": True},
)
- pd.read_csv() → loads the CSV file into a Pandas DataFrame (df).
- .fillna(value=0) → replaces all missing values (NaN) with 0, so queries won’t break when the model encounters missing data.
Now df is a structured table with rows and columns.
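As a quick illustration of what .fillna(0) buys you, here is a tiny made-up frame (not the real CSV) with one missing value:

```python
import pandas as pd

# Hypothetical miniature version of the health dataset
raw = pd.DataFrame({
    "state": ["NY", "CA", "TX"],
    "hospitalizedIncrease": [120, None, 45],  # CA value is missing
})

# Replace NaN with 0 so downstream aggregations don't break
df = raw.fillna(value=0)

print(df["hospitalizedIncrease"].tolist())  # [120.0, 0.0, 45.0]
```

Without the fillna, a SUM or comparison generated by the agent could silently drop or mishandle the NaN rows.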
- AgentType → tells LangChain how to reason about tasks (different strategies).
- create_pandas_dataframe_agent → creates an agent that can “talk” to your Pandas DataFrame using natural language + code execution.
- llm=hf_llm → This is the Hugging Face pipeline you wrapped earlier. The LLM interprets your questions and decides what code to run.
- df=df → The Pandas DataFrame with your CSV data.
- allow_dangerous_code=True → Lets the agent run generated Python code on the DataFrame. ⚠️ This is powerful but risky if you don’t trust your inputs.
- agent_type=AgentType.ZERO_SHOT_REACT_DESCRIPTION → Strategy where the LLM is given the description of tools and has to decide step-by-step what to do (“React” = Reason + Act).
- include_df_in_prompt=False → Doesn’t dump the entire DataFrame into the prompt (saves tokens). Instead, the agent figures out which code snippets to run.
- number_of_head_rows=0 → If >0, the first N rows of the DataFrame would be shown to the LLM in the prompt. Here you skip it (again, to save tokens).
- max_iterations=4 → The agent can make at most 4 reasoning/code execution steps before giving up.
- max_execution_time=30 → Stop if execution takes more than 30 seconds.
- return_intermediate_steps=True → Useful for debugging: you can see what reasoning steps/code the agent took.
- verbose=False → Suppresses extra logs.
- agent_executor_kwargs={"handle_parsing_errors": True} → Helps handle errors when the model outputs invalid JSON or code.
If we instead want the data to be accessed through SQL, we can move it from the CSV into a SQL database.
from sqlalchemy import create_engine

# Path to your SQLite database file
database_file_path = "./test.db"

# Create an engine to connect to the SQLite database.
# SQLite only requires the path to the database file.
engine = create_engine(f"sqlite:///{database_file_path}")

file_url = "./all-states-history.csv"
df = pd.read_csv(file_url).fillna(value=0)
df.to_sql(
    "all_states_history",
    con=engine,
    if_exists="replace",
    index=False,
)
From here we can use the SQL database toolkit directly and load it into the agent.
from langchain.agents import create_sql_agent
from langchain.agents.agent_toolkits import SQLDatabaseToolkit
from langchain.sql_database import SQLDatabase
db = SQLDatabase.from_uri(f'sqlite:///{database_file_path}')
toolkit = SQLDatabaseToolkit(db=db, llm=hf_llm)
- create_sql_agent → builds an agent that can query SQL databases in natural language.
- SQLDatabaseToolkit → provides the agent with the tools it needs (like running SQL queries, inspecting schema).
- SQLDatabase → a LangChain wrapper around your database connection (SQLite, MySQL, Postgres, etc.).
- from_uri: Creates a connection from a database URI.
Here you’re connecting to an SQLite database file. Example: if database_file_path = "mydata.db", this connects to sqlite:///mydata.db.
- SQLDatabase not only connects but also lets LangChain inspect schema (tables, columns, datatypes) so the LLM knows what data is available.
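Conceptually, the schema inspection that SQLDatabase performs boils down to something like the following sketch, shown here with Python's built-in sqlite3 rather than LangChain's actual implementation:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE all_states_history "
    "(state TEXT, date TEXT, hospitalizedIncrease INTEGER)"
)

# Tables: roughly what gets listed for the LLM
tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table'")]
print(tables)  # ['all_states_history']

# Columns and types: what goes into the schema description
cols = [(r[1], r[2]) for r in conn.execute(
    "PRAGMA table_info(all_states_history)")]
print(cols)  # [('state', 'TEXT'), ('date', 'TEXT'), ('hospitalizedIncrease', 'INTEGER')]
```

This is why the LLM can write valid SQL against tables it has never seen before: the schema is fed to it as context.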
The SQLDatabaseToolkit bundles together:
- Your database (db),
- Your LLM (hf_llm),
- Tools the agent will use:
- Look up database schema (so it knows what tables/columns exist).
- Generate and run SQL queries.
- Return results back in a readable way.
Time to create the agent and then run it.
QUESTION = """How many patients were hospitalized during October 2020
in New York, and nationwide as the total of all states?
Use the hospitalizedIncrease column.
"""

# MSSQL_AGENT_PREFIX and MSSQL_AGENT_FORMAT_INSTRUCTIONS are custom prompt
# strings you define yourself; see the explanation below.
agent_executor_SQL = create_sql_agent(
    prefix=MSSQL_AGENT_PREFIX,
    format_instructions=MSSQL_AGENT_FORMAT_INSTRUCTIONS,
    llm=hf_llm,
    toolkit=toolkit,
    top_k=30,
    verbose=True,
)

agent_executor_SQL.invoke(QUESTION)
This piece of code is where everything comes together: you’re building a LangChain SQL agent that can answer natural language questions by generating and executing SQL queries. The agent will need to generate SQL queries that filter on date and state, then sum that column.
- prefix=MSSQL_AGENT_PREFIX → A custom prompt prefix you supply. It usually contains instructions like: "You are an agent that queries a Microsoft SQL database. Always use T-SQL syntax. Only output SQL queries inside sql tags." It helps the LLM stick to valid SQL formatting.
- format_instructions=MSSQL_AGENT_FORMAT_INSTRUCTIONS → Another prompt helper that tells the agent how to format its response. Example: "Return the SQL you executed and the final answer as plain text."
- llm=hf_llm → Your Hugging Face model (Qwen here) is the reasoning engine.
- toolkit=toolkit → The SQLDatabaseToolkit created earlier, which lets the LLM inspect the schema, run SQL queries, and return results.
- top_k=30 → Instructs the agent to limit its SELECT queries to at most 30 rows of results, keeping answers (and token usage) manageable.
- verbose=True → Prints the step-by-step reasoning, generated SQL queries, and execution results to your console. Very helpful for debugging.
- invoke() sends your natural language QUESTION into the agent.

Step 3 — (Optional) Create tools and pass them to the LangChain framework for cleaner code
We can create tools that perform tasks we define and then register them with LangChain, so that the agent interacts with the tools to generate the output.
Here is an example specific to the hospital dataset used above.
import numpy as np
import pandas as pd
from sqlalchemy import text

def get_hospitalized_increase_for_state_on_date(state_abbr, specific_date):
    try:
        # Bound parameters avoid SQL injection from user-supplied values
        query = text("""
            SELECT date, hospitalizedIncrease
            FROM all_states_history
            WHERE state = :state AND date = :date
        """)
        with engine.connect() as connection:
            result = pd.read_sql_query(
                query,
                connection,
                params={"state": state_abbr, "date": specific_date},
            )
        if not result.empty:
            return result.to_dict("records")[0]
        return np.nan
    except Exception as e:
        print(e)
        return np.nan
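To sanity-check a helper like this, you can point a throwaway engine at an in-memory database with a couple of made-up rows (the values below are invented for illustration):

```python
import numpy as np
import pandas as pd
from sqlalchemy import create_engine, text

# Throwaway in-memory database with sample rows
engine = create_engine("sqlite:///:memory:")
pd.DataFrame({
    "state": ["NY", "CA"],
    "date": ["2020-10-01", "2020-10-01"],
    "hospitalizedIncrease": [120, 85],
}).to_sql("all_states_history", con=engine, index=False)

def get_hospitalized_increase_for_state_on_date(state_abbr, specific_date):
    query = text("""
        SELECT date, hospitalizedIncrease
        FROM all_states_history
        WHERE state = :state AND date = :date
    """)
    with engine.connect() as connection:
        result = pd.read_sql_query(
            query,
            connection,
            params={"state": state_abbr, "date": specific_date},
        )
    return result.to_dict("records")[0] if not result.empty else np.nan

rec = get_hospitalized_increase_for_state_on_date("NY", "2020-10-01")
print(rec["date"], rec["hospitalizedIncrease"])  # 2020-10-01 120
```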
Having defined the task, we now describe the tool in the function-calling schema format.
tools_sql = [
    {
        "type": "function",
        "function": {
            "name": "get_hospitalized_increase_for_state_on_date",
            "description": (
                "Retrieves the daily increase in hospitalizations "
                "for a specific state on a specific date."
            ),
            "parameters": {
                "type": "object",
                "properties": {
                    "state_abbr": {
                        "type": "string",
                        "description": "The abbreviation of the state (e.g., 'NY', 'CA').",
                    },
                    "specific_date": {
                        "type": "string",
                        "description": "The specific date for the query in 'YYYY-MM-DD' format.",
                    },
                },
                "required": ["state_abbr", "specific_date"],
            },
        },
    },
]
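A quick sketch of how a caller would dispatch a model-emitted tool call against this schema (pure stdlib, no model involved; the payload below is hypothetical):

```python
import json

# Minimal copy of the relevant schema fields from tools_sql above
spec = {
    "name": "get_hospitalized_increase_for_state_on_date",
    "parameters": {"required": ["state_abbr", "specific_date"]},
}

# Pretend the model emitted this tool call as a JSON string
call = json.loads(
    '{"name": "get_hospitalized_increase_for_state_on_date", '
    '"arguments": {"state_abbr": "NY", "specific_date": "2020-10-01"}}'
)

# Check the call names the right tool and supplies every required argument
assert call["name"] == spec["name"]
missing = [p for p in spec["parameters"]["required"] if p not in call["arguments"]]
print(missing)  # []
```

In practice the client library performs this validation for you, but it is the same contract: the schema tells the model what to emit, and the caller checks it before running any code.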
We can then exercise the tool call through an OpenAI-compatible chat-completions client (the client and messages objects are assumed to be set up beforehand).
response = client.chat.completions.create(
    model=model,
    messages=messages,
    tools=tools_sql,
    tool_choice="auto",
    temperature=0.0,
)

# Map tool names to the Python functions that implement them
available_functions = {
    "get_hospitalized_increase_for_state_on_date": get_hospitalized_increase_for_state_on_date,
}

import json
import re

msg = response.choices[0].message
tool_calls = msg.tool_calls or []

# Some models emit tool calls inside <tool-use> tags in the content
# instead of the structured tool_calls field; handle that fallback here.
if not tool_calls and "<tool-use>" in (msg.content or ""):
    m = re.search(r"<tool-use>(.*?)</tool-use>", msg.content, re.S)
    if m:
        blob = json.loads(m.group(1))
        for call in blob.get("tool_calls", []):
            name = call["function"]["name"]
            args = call.get("parameters", {})  # convert 'parameters' -> arguments
            result = available_functions[name](**args)
            messages += [
                msg,
                {"role": "tool", "tool_call_id": call.get("id", "manual"), "content": str(result)},
            ]
        # then ask for the final answer
        final = client.chat.completions.create(model=model, messages=messages)
        print(final.choices[0].message.content)
else:
    print(msg.content)
Here we extract the tool output and send it wherever the UI needs it. Tools are a bigger topic that deserves deeper exploration with other use cases; for now, the SQL agent above can be combined with LangChain's tool-calling facility.
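The `<tool-use>` fallback parsing above can be exercised in isolation with a canned string, no model required (the payload is a hypothetical example of that format):

```python
import json
import re

# Canned model output in the fallback format
content = (
    'Thinking... <tool-use>{"tool_calls": [{"id": "1", '
    '"function": {"name": "get_hospitalized_increase_for_state_on_date"}, '
    '"parameters": {"state_abbr": "NY", "specific_date": "2020-10-01"}}]}</tool-use>'
)

# Same regex as in the handler: grab the JSON between the tags
m = re.search(r"<tool-use>(.*?)</tool-use>", content, re.S)
blob = json.loads(m.group(1))
call = blob["tool_calls"][0]

print(call["function"]["name"])  # get_hospitalized_increase_for_state_on_date
print(call["parameters"])        # {'state_abbr': 'NY', 'specific_date': '2020-10-01'}
```

Testing the parsing path separately like this is useful because it is the part most likely to break when you swap models.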
For more details, you can look into my Jupyter notebook.
Happy Learning!!
