How to Talk with Your Spreadsheet Data

From scattered sheets to conversational SQL—with agentic reliability

Why “talking to spreadsheets” often disappoints

Spreadsheets are the universal interface of business data. Teams use them for budgeting, inventory, sales tracking, and even as ad-hoc databases. Yet the same flexibility that makes them indispensable also makes them messy for automation.

Natural language interfaces to spreadsheets usually fail because they expect the LLM to guess meaning out of unstructured chaos: inconsistent headers, hidden rows, merged cells, ambiguous date formats, and silently broken formulas.

In practice, this happens when users upload a spreadsheet file (like an Excel or Google Sheet) directly into a general-purpose LLM such as ChatGPT, Gemini, or Claude, expecting it to “understand” the file’s content. The model receives only a flattened text representation of the sheet, losing its tabular structure, formulas, and context. Without a defined schema or validation layer, the LLM is forced to infer meaning from messy raw data, which leads to unreliable reasoning and inconsistent answers.

What people imagine is: “Ask in English, get an answer.”

What actually happens is: “The model hallucinated a column that doesn’t exist.”

Before a model can reason reliably, it needs the same thing analysts and BI tools need: a schema. Defining that schema (explicit tables, clear types, sample data) is the difference between toy demos and production reliability. Only once the data has a structured contract can we attach a conversational layer that feels intelligent instead of fragile.

The failure isn’t the LLM; it’s the missing structure and guardrails. Talking to spreadsheets becomes feasible only when we treat them like a database first and let the LLM operate within a controlled, validated environment.

Uploading spreadsheets directly into ChatGPT or Claude often disappoints because these tools face hard technical and structural limits. Both cap file sizes (around 30–50 MB) and struggle with large, multi-sheet or formula-heavy files, sometimes truncating content to fit their context windows. They also misinterpret complex spreadsheet features such as hidden rows, merged cells, or formulas, since these models extract plain text rather than preserving data structure. Even when uploads succeed, context or token limits mean only part of the data is analyzed, leading to incomplete or misleading answers. Frequent issues like slow performance, crashes, or model fallbacks further undermine reliability.

Fundamentally, this happens because LLMs operate semantically, not structurally. They understand natural language, not the relational logic of data tables. The new agent modes in ChatGPT and Claude attempt to compensate by generating and executing Python code to inspect and interpret spreadsheets, but this approach remains error-prone and non-deterministic since the model must first infer the data’s structure, then write code to parse it correctly. In contrast, a SQL agent works over an explicit, validated schema where reasoning is grounded in structured queries rather than improvised scripts. This allows safe, auditable, and precise computation, turning semantic intent into deterministic results instead of guessing through ad-hoc code generation.

So how can we actually make it work?

After understanding why “talking to spreadsheets” often fails, the path forward becomes much clearer.

Instead of uploading raw Excel or Google Sheets files into ChatGPT, Claude, or Gemini and hoping the model can interpret the chaos, we can build a reliable agent that truly reasons over our data.

The key isn’t magic; it’s engineering. The idea is to turn spreadsheets into relational schemas, load them into a validated environment, and let a model operate within clear, auditable constraints. Unlike ChatGPT or Claude’s agent modes, which must first generate and execute Python code to “guess” a spreadsheet’s structure, a SQL agent operates on an explicit, validated schema. This allows deterministic reasoning, safer execution, and full traceability.

  1. Model the data → Spreadsheet(s) → Relational schema (tables, columns, keys, descriptions, a few sample rows per table)
  2. Load & validate → ETL/ELT to a warehouse (e.g., Postgres/BigQuery/Snowflake)
  3. Ground the LLM → RAG/Schema Linking to select only relevant tables/columns
  4. Generate SQL → LLM proposes a query (few‑shots + business context)
  5. Pre‑execution validation → SQL parser + policy checks (SELECT‑only, LIMIT, existence checks)
  6. Execute with least privilege → Read‑only DB user
  7. Answer → Turn rows into a concise, business‑friendly reply

Step 1 — From spreadsheets to a trustworthy schema

Before any agent can “talk” with data, it must first understand its structure. Imagine inheriting a dozen messy spreadsheets: columns with vague names, random totals, and hidden rows. The first step is to turn that chaos into a clear story the model can follow. Design a schema where each column name speaks for itself, every table has a purpose, and relationships are explicit. Add short descriptions and a few example rows to give context, much like a quick scene-setting in a movie. Normalizing avoids duplication, while a bit of denormalization makes frequent questions easy to answer. Once that foundation exists, the agent finally has something reliable to reason about.

In summary:

  • Columns must be descriptive: customer_id, order_date, gross_amount_usd
  • Add table/column descriptions that use business terms
  • Include 3 synthetic sample rows as comments per table (helps few‑shot grounding)
  • Normalize enough to avoid duplication; denormalize sparingly for common queries

Example DDL skeleton

-- Customers
CREATE TABLE dim_customer (
  customer_id SERIAL PRIMARY KEY,
  full_name TEXT NOT NULL, -- e.g., legal or preferred name
  email TEXT, -- unique if enforced upstream
  created_at TIMESTAMP NOT NULL
);
-- Example rows (for the agent; do not load in prod)
-- INSERT INTO dim_customer (customer_id, full_name, email, created_at) VALUES
-- (101, 'Ana Pereira', '[email protected]', '2024-01-04 09:12:00'),
-- (102, 'Carlos Méndez', '[email protected]', '2024-01-12 15:33:00'),
-- (103, 'Lucía Díaz', '[email protected]', '2024-02-01 11:05:00');

-- Orders
CREATE TABLE fct_order (
  order_id SERIAL PRIMARY KEY,
  customer_id INT NOT NULL REFERENCES dim_customer(customer_id),
  order_date DATE NOT NULL,
  gross_amount_usd NUMERIC(12,2) NOT NULL,
  channel TEXT CHECK (channel IN ('web','store','partner'))
);
-- Example rows
-- INSERT INTO fct_order (order_id, customer_id, order_date, gross_amount_usd, channel) VALUES
-- (5001, 101, '2024-03-01', 120.00, 'web'),
-- (5002, 102, '2024-03-02', 60.50, 'store'),
-- (5003, 101, '2024-03-05', 240.00, 'partner');

Step 2 — ETL pipeline (from sheets to database)

This is where discipline meets data. The goal is to take every spreadsheet and move it through a flow: extraction, validation, and loading. Using tools like pandas and SQLAlchemy, the data is read, cleaned, and written into the warehouse with consistent types and referential integrity. Think of it as training customs officers for your data: each file must pass inspection before entering the database. The pipeline not only structures the flow (Stage → Clean → Dim/Facts) but also ensures the resulting tables reflect the DDL you defined earlier. Once validated, the database becomes a single source of truth that your agent can safely explore. A minimal code sketch of this flow follows the list below.

  • Extract: Read .xlsx/.csv from Drive/SharePoint/S3 (e.g., Python pandas)
  • Validate: Types, nullability, referential integrity
  • Load: Warehouse of choice; run DDL above first
  • Transform: Stage → Clean → Dim/Facts (use dbt or SQL scripts)
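
Below is a minimal sketch of the Extract → Validate → Load flow, assuming an illustrative orders.xlsx file and the dim_customer/fct_order tables from Step 1; the connection string, file path, and column checks are placeholders to adapt to your own environment.

import pandas as pd
from sqlalchemy import create_engine

# Illustrative connection string -- use a dedicated ETL user, not the agent's read-only role
engine = create_engine("postgresql+psycopg2://etl_user:***@host:5432/warehouse")

# 1) Extract: read the raw sheet (path and sheet name are placeholders)
df = pd.read_excel("orders.xlsx", sheet_name="Orders")

# 2) Validate: enforce types, nullability, and basic referential integrity
df["order_date"] = pd.to_datetime(df["order_date"], errors="raise").dt.date
df["gross_amount_usd"] = pd.to_numeric(df["gross_amount_usd"], errors="raise")
assert df["customer_id"].notna().all(), "orders with missing customer_id"
assert df["channel"].isin(["web", "store", "partner"]).all(), "unknown channel value"

known = pd.read_sql("SELECT customer_id FROM dim_customer", engine)
missing = set(df["customer_id"]) - set(known["customer_id"])
assert not missing, f"orders reference unknown customers: {missing}"

# 3) Load: append into the fact table created by the DDL in Step 1
df.to_sql("fct_order", engine, if_exists="append", index=False)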

Step 3 — Grounding & retrieval (Schema Linking / mini‑RAG)

If the database schema is small enough, you can include all its table and column descriptions directly inside the system prompt of the LLM; otherwise, you need a grounding and retrieval layer to avoid overloading the model with irrelevant context. 

This step ensures the LLM only sees the parts of the schema that matter for each question, improving accuracy and reducing hallucinations. The process works by indexing lightweight schema metadata (such as table and column names, descriptions, sample values, and key relationships) and retrieving the most relevant fragments using vector or keyword search. 

When a user asks a question, the system parses it, retrieves the relevant tables and columns, and constructs a compact, focused context for the model to generate SQL queries safely. In short, grounding acts as the LLM’s map to your database, scaling from a single small schema embedded in the prompt to a retrieval-driven approach for larger warehouses, ensuring efficient, accurate, and reliable query generation.

Retrieval objects

  • Table names and descriptions, column names and descriptions, table use cases
  • Optional: top value samples per column (e.g., common channels), query examples

Retrieval loop

  1. Parse user question
  2. Retrieve candidate tables/columns (vector or keyword search)
  3. Build the model context: only those tables + few‑shot examples (see the sketch below)
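
Below is a minimal, keyword-based sketch of that loop. The schema metadata and scoring heuristic are illustrative; at scale you would swap the keyword overlap for vector search over the same metadata, but the shape of the loop stays the same.

# Lightweight schema metadata index (descriptions are illustrative)
SCHEMA_INDEX = [
    {
        "table": "dim_customer",
        "description": "One row per customer: customer_id, full_name, email, created_at",
    },
    {
        "table": "fct_order",
        "description": "One row per order: order_id, customer_id, order_date, "
                       "gross_amount_usd, channel (web/store/partner)",
    },
]

def retrieve_tables(question: str, top_k: int = 3) -> list[dict]:
    """Score each table by keyword overlap with the question; replace with embeddings at scale."""
    q_tokens = set(question.lower().split())
    scored = []
    for entry in SCHEMA_INDEX:
        doc_tokens = set(entry["description"].lower().replace(",", " ").split())
        scored.append((len(q_tokens & doc_tokens), entry))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [entry for score, entry in scored[:top_k] if score > 0]

def build_context(question: str) -> str:
    """Compact, question-specific schema context to prepend to the SQL-generation prompt."""
    relevant = retrieve_tables(question)
    return "\n".join(f"- {e['table']}: {e['description']}" for e in relevant)

print(build_context("total sales by channel last month"))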

Step 4 — Prompting for reliability (few‑shots + policy)

Now the agent needs to learn how to “speak SQL” responsibly. Through prompt design, you’re not just giving instructions; you’re shaping its reasoning habits. The system prompt provides the tone, context, and rules of engagement: it should always limit results, check syntax, and never modify data. Few-shot examples act as prior experiences that guide the agent toward safe, structured answers. Policies like SELECT-only execution and mandatory LIMIT clauses transform the prompt from casual conversation into a contract of trust between human and model.

System prompt snippet

You are an agent designed to interact with a SQL database.
Given an input question, create a syntactically correct {db.dialect} query to run, then look at the results of the query and return the answer. Unless the user specifies a specific number of examples they wish to obtain, always limit your query to at most 5 results.

You can order the results by a relevant column to return the most interesting examples in the database. Never query for all the columns from a specific table, only ask for the relevant columns given the question.

You MUST double check your query before executing it. If you get an error while executing a query, rewrite the query and try again.

DO NOT make any DML statements (INSERT, UPDATE, DELETE, DROP etc.) to the database.

# DATABASE SCHEMA

Below is the complete database schema for the Zonamerica database:

## Master Tables (edificios, locales, clientes)
{master_tables_sql}

## Secondary Tables
{secondary_tables_sql}

## Indexes
{indexes_sql}

# EXAMPLE QUERIES

Below are example queries that solve a set of common questions. You can reuse these queries:

{example_queries_sql}

Use these schemas and examples to guide your query construction.

Under the example_queries_sql variable, it is important to include a few real example queries (few-shots) that the agent is expected to answer, as shown below.
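
For instance, the few-shot block could look like this (the questions and queries are illustrative; replace them with real questions your users ask against your own schema):

# Illustrative few-shot pairs injected into the {example_queries_sql} placeholder
example_queries_sql = """
-- Question: What were total gross sales by channel last month?
SELECT channel, SUM(gross_amount_usd) AS total_sales_usd
FROM fct_order
WHERE order_date >= date_trunc('month', CURRENT_DATE - INTERVAL '1 month')
  AND order_date <  date_trunc('month', CURRENT_DATE)
GROUP BY channel
ORDER BY total_sales_usd DESC
LIMIT 5;

-- Question: Who are our 5 most recent customers?
SELECT customer_id, full_name, created_at
FROM dim_customer
ORDER BY created_at DESC
LIMIT 5;
"""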

Step 5 — Build the agent (LangChain v1 + SQL Toolkit)

Below is a compact pattern you can adapt. If the full schema context fits in‑prompt, the list‑tables, schema, query‑checker, and execute tools are sufficient. (LangChain SQL Agent)

from langchain_openai import ChatOpenAI
from langchain_community.agent_toolkits import SQLDatabaseToolkit
from langchain_community.utilities import SQLDatabase
from langchain.agents import create_agent

# 1) DB connection with a READ-ONLY user
sql_db = SQLDatabase.from_uri("postgresql+psycopg2://agent_ro:***@host:5432/warehouse")

# 2) LLM
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# 3) Toolkit exposes: list tables, get schema, check query, run query
toolkit = SQLDatabaseToolkit(db=sql_db, llm=llm)
tools = toolkit.get_tools()

# 4) Create the agent with the built-in SQL tools and the Step 4 system prompt
agent = create_agent(
    model=llm,
    tools=tools,
    system_prompt=system_prompt,
)

# 5) Ask a question
question = "Show total sales by channel last month"
for step in agent.stream(
    {"messages": [{"role": "user", "content": question}]},
    stream_mode="values",
):
    step["messages"][-1].pretty_print()

Upgrade path: For larger schemas, move to a LangGraph state machine: nodes for (list tables → choose → fetch schema → generate → check → execute → refine). Add retries on DB error messages. A skeleton of this graph follows the note below.

If your database schema and few-shot examples all fit inside the system prompt, you can skip the following SQL Toolkit tools: sql_db_list_tables and sql_db_schema.
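
For the larger-schema upgrade path, here is a possible skeleton of that LangGraph state machine, condensing the listed nodes into four. The node bodies are stubs that only show the wiring and the capped retry loop; the state fields and the retry budget of 3 are assumptions to adapt.

from typing import Optional, TypedDict
from langgraph.graph import StateGraph, START, END

class SQLAgentState(TypedDict):
    question: str
    schema_context: str      # tables/columns retrieved in Step 3
    sql: str                 # current candidate query
    error: Optional[str]     # last validation/DB error, if any
    rows: list               # query results
    retries: int

# Node stubs -- each reads the shared state and returns a partial update
def fetch_schema(state):
    return {"schema_context": "...retrieved tables and columns..."}

def generate_sql(state):
    # LLM drafts SQL from question + schema context + previous error feedback
    return {"sql": "SELECT 1;", "retries": state["retries"] + 1}

def check_sql(state):
    # Parser + policy checks (see Step 6); set "error" when validation fails
    return {"error": None}

def execute_sql(state):
    # Run against the read-only connection; capture DB errors into "error"
    return {"rows": [], "error": None}

def should_retry(state):
    # On error, loop back to generation with the error message, up to 3 attempts
    if state["error"] and state["retries"] < 3:
        return "generate_sql"
    return END

graph = StateGraph(SQLAgentState)
graph.add_node("fetch_schema", fetch_schema)
graph.add_node("generate_sql", generate_sql)
graph.add_node("check_sql", check_sql)
graph.add_node("execute_sql", execute_sql)

graph.add_edge(START, "fetch_schema")
graph.add_edge("fetch_schema", "generate_sql")
graph.add_edge("generate_sql", "check_sql")
graph.add_edge("check_sql", "execute_sql")
graph.add_conditional_edges("execute_sql", should_retry)

app = graph.compile()
result = app.invoke({"question": "Show total sales by channel last month",
                     "schema_context": "", "sql": "", "error": None,
                     "rows": [], "retries": 0})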

Step 6 — Pre‑execution validation & security

Before any SQL touches the database, every query must pass through a validation and policy layer. This step ensures the agent never executes unsafe, nonsensical, or privilege-escalating commands. LangChain v1 introduces the middleware concept, which fits perfectly here: you can attach logic that inspects, modifies, or blocks a tool call before it runs.

In this layer, you can:

  • Enforce policies. Allow only SELECT statements, require LIMIT clauses, restrict access to specific schemas or tables, and block dangerous keywords such as DELETE, DROP, ALTER, or UPDATE.
  • Parse and lint queries. Run the LLM-generated SQL through a parser to verify syntax and ensure table and column names exist in the schema. This step can also normalize capitalization or formatting for traceability.
  • Perform static analysis. Detect cartesian joins, excessive scans, or unfiltered aggregations that might return huge result sets.
  • Validate against business rules. For example, reject queries that expose sensitive fields (PII) or violate row level security policies.
  • Optionally involve a human or policy engine. Middleware can interrupt execution to request confirmation or approval for certain query types, enabling a human-in-the-loop workflow in production.

By centralizing these checks inside middleware, you decouple validation logic from the agent’s reasoning flow. The model can freely generate SQL, but actual execution happens only if the middleware deems it compliant with organizational and security policies. Combined with a read-only database user, this creates a strong defense-in-depth boundary between language generation and data access.
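
As a concrete starting point, here is a minimal sketch of such a check using the SQLGlot parser (the same library suggested in the next-steps section). The table allowlist and limit cap are illustrative, and the function could be wired into a tool-call middleware or simply called right before execution.

import sqlglot
from sqlglot import exp

ALLOWED_TABLES = {"dim_customer", "fct_order"}  # illustrative schema allowlist
MAX_LIMIT = 1000

def validate_sql(query: str, dialect: str = "postgres") -> str:
    """Raise ValueError if the query violates policy; return normalized SQL otherwise."""
    parsed = sqlglot.parse_one(query, read=dialect)

    # 1) SELECT-only: reject DML/DDL statements outright
    if not isinstance(parsed, exp.Select):
        raise ValueError("Only SELECT statements are allowed")

    # 2) Schema allowlist: every referenced table must be known
    tables = {t.name for t in parsed.find_all(exp.Table)}
    unknown = tables - ALLOWED_TABLES
    if unknown:
        raise ValueError(f"Query references forbidden or unknown tables: {unknown}")

    # 3) Required LIMIT: inject a default cap instead of rejecting
    if parsed.args.get("limit") is None:
        parsed = parsed.limit(MAX_LIMIT)

    # 4) Return normalized SQL for logging and traceability
    return parsed.sql(dialect=dialect)

# validate_sql("DELETE FROM fct_order")  # -> ValueError: Only SELECT statements are allowed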

Step 7 — Observability & traceability

You need to see what the agent thought and did (LangSmith; see the setup sketch after the list):

  • LLM traces (inputs, tool calls, outputs)
  • Query logs (validated SQL, execution time, row counts)
  • Error analytics (top failure modes: unknown column, bad join, empty set)
  • PII guardrails (masking, redact in traces)
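
Enabling that visibility is mostly configuration. A minimal sketch, assuming LangSmith's standard environment-variable setup (the API key and project name are placeholders):

import os

# Turn on tracing for every chain/agent/tool call in this process
os.environ["LANGSMITH_TRACING"] = "true"
os.environ["LANGSMITH_API_KEY"] = "..."        # placeholder
os.environ["LANGSMITH_PROJECT"] = "sql-agent"  # traces grouped under this project

# From here on, agent.stream(...) calls are traced automatically:
# inputs, tool calls (including the generated SQL), outputs, latency, and errors.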

Possible improvements & next steps

Your current architecture (schema ⇒ ETL ⇒ grounding ⇒ SQL gen ⇒ validation ⇒ execute ⇒ refine) is a solid base to ship an internal pilot. Below are concrete upgrades that harden it for production scale and messy enterprise data.

Stronger grounding: Schema-RAG & value-aware linking

  • Schema RAG at scale. Index table/column descriptions, the PK/FK graph, and top values; retrieve only what’s relevant per question. This reduces hallucinations and improves join accuracy. Recent surveys highlight retrieval/schema-linking as the biggest win when moving from demos to prod. (arXiv)
  • Robust schema linking. Prefer linkers that are resilient to naming drift and synonyms and treat linking as a first-class retrieval problem. (arXiv)

Constrained decoding + SQL lint/validation

  • Syntax-safe generation (PICARD). Constrain decoding so the model cannot emit invalid SQL. This measurably boosts execution success. (aclanthology.org)
  • Programmatic guardrails. Parse and lint with a real parser (e.g., SQLGlot) and add allow/deny policies (SELECT-only, required LIMIT, schema allowlist). You can also adopt ready validators (e.g., Guardrails’ valid_sql) in the tool-call middleware. (GitHub)
  • Policy exceptions path. When rules trigger (PII columns, row-count risk), route to approval (see the HITL section below). (Pairs naturally with your “pre-execution validation” step.)

Error-aware self-correction & planning

  • Execution-feedback loops. On DB errors (unknown column, bad join), auto-repair the query using the error message + schema context; keep a capped retry budget. Industry patterns and recent papers show large gains from this “reflect → repair → re-run” loop. (Amazon Web Services, Inc.)
  • Reasoning scaffolds. Use light planning (outline → SQL draft → check) and dynamic few-shots pulled from successful past queries against the same tables. Surveys note this consistently helps on real schemas. (arXiv)

Human-in-the-loop (HITL) where it matters

  • Risk-based review. Auto-approve safe patterns and require human approval for queries touching sensitive tables, high cost, or ambiguous intent.
  • Feedback → training data. Capture user edits/approvals and mine them into new few-shots or synthetic training examples; HITL pipelines are now standard to lift accuracy on your own workloads. (arXiv)

Observability & Governance

  • Traces you can trust. Log prompts, retrieved schema shards, generated SQL, policy decisions, execution time, row counts, and user feedback tags.
  • Benchmarks that reflect reality. Track execution accuracy and answer helpfulness on your logs; optionally add BIRD/Spider style evals to catch regressions. (Medium)
  • Failure taxonomy. Classify top failure modes (linking, syntax, empty set, slow query) and target them with tests.
  • Least privilege by design. Read-only role, schema allowlist, row-level security, and column masking for PII; redact PII from traces.
  • Guardrail framework. Centralize input/output validation (prompt filters, schema/SQL validators, result size caps, jailbreak checks). See recent guardrails surveys/patterns. (arXiv)

Production checklist

Before shipping your SQL agent to production, it helps to have a quick sanity check: a short list of non-negotiables that ensure your system behaves predictably under real conditions.

The following checklist distills everything discussed in this post into concrete, verifiable steps.

  • Read‑only DB role for agent (least privilege)
  • SQL parser + policy checks (SELECT‑only, LIMIT, WHERE rules)
  • Scoped schema context or RAG
  • Few‑shot examples aligned to business questions
  • ETL with type checks & primary/foreign keys
  • Observability (LLM traces, query logs, error taxonomy)
  • Row‑level security / tenant scoping if needed
  • Clear fallback when the question is ambiguous (ask clarifying question)

Conclusion

“Talking” with spreadsheets is not about making language models magically understand chaos; it is about giving them structure, guardrails, and feedback loops that mirror how reliable software systems work. Once a spreadsheet becomes a schema and queries flow through validated, observable pipelines, the conversation stops being fragile and starts becoming productive.

The core message is simple: treat your spreadsheet like a database and your LLM like an analyst who needs context and supervision. With a minimal ETL, schema-aware retrieval, SQL validation, and least-privilege execution, you can move from copy-pasting CSVs into chat to a maintainable conversational layer that scales.

This is the direction the next generation of SQL agents is taking: grounded, governable, and measurable. Whether you are piloting a small internal tool or designing an enterprise data interface, reliability is not a bonus feature; it is the foundation. Build that foundation once, and your LLM will finally start talking with your data, not just about it.
