Data, SQL, and Analytics

An analyst's job involves constant context-switching between a SQL client, a Jupyter notebook, and schema documentation. Asking "what is the product_id column?" means opening DBeaver, finding the table, reading the comment (if one exists), switching back to the notebook, and trying not to forget what you were about to write. Claude Code collapses this process: the agent looks up the schema, generates SQL, verifies it, and edits the notebook — all in a single context.

Быстрое повторение

Какая главная проблема в классическом workflow аналитика, которую решает Claude Code?

Connecting a Database via MCP

Ready-made MCP servers exist for most relational databases. The official one for PostgreSQL is @modelcontextprotocol/server-postgres:

claude mcp add postgres -- npx -y @modelcontextprotocol/server-postgres \
  "postgresql://localhost/analytics"

Once connected, the agent gains tools for listing tables, reading the schema of a specific table, and executing SQL queries. These are all standard MCP tool primitives — see Model Context Protocol: Architecture and Fundamentals for more on the underlying mechanism.

To lock in the configuration for your entire team, add the server to .mcp.json in the repository root:

{
  "mcpServers": {
    "postgres": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-postgres", "${DATABASE_URL}"],
      "env": {
        "DATABASE_URL": "${DATABASE_URL}"
      }
    }
  }
}

The secret is passed via an environment variable rather than stored in the file — an essential practice for shared team configs. For more on MCP server scopes (local, project, user), see Connecting MCP Servers in Claude Code.

sequenceDiagram participant А as Analyst participant CC as Claude Code participant MCP as Postgres MCP participant DB as PostgreSQL А->>CC: «Show revenue by category for Q1» CC->>MCP: describe_table(orders) MCP->>DB: INFORMATION_SCHEMA query DB-->>MCP: table structure MCP-->>CC: schema orders, products, order_items CC->>MCP: execute_query(SELECT category, SUM(...)) MCP->>DB: executes query DB-->>MCP: aggregated data (20 rows) MCP-->>CC: result CC-->>А: summary + interpretation

sequenceDiagram
    participant А as Analyst
    participant CC as Claude Code
    participant MCP as Postgres MCP
    participant DB as PostgreSQL

    А->>CC: «Show revenue by category for Q1»
    CC->>MCP: describe_table(orders)
    MCP->>DB: INFORMATION_SCHEMA query
    DB-->>MCP: table structure
    MCP-->>CC: schema orders, products, order_items
    CC->>MCP: execute_query(SELECT category, SUM(...))
    MCP->>DB: executes query
    DB-->>MCP: aggregated data (20 rows)
    MCP-->>CC: result
    CC-->>А: summary + interpretation

Agent loop when working with a database via the Postgres MCP server

SQL Workflow Patterns

"Schema First" — the Foundation of Good Queries

The first rule: before asking for SQL, give the agent the schema. When Claude Code is connected to the database via MCP, it fetches the schema automatically on first access. Without MCP, provide it explicitly:

Here is the schema for our analytics DB:
@schema/analytics.sql

Write a query that calculates revenue by product category
for the last 30 days, broken down by week.
Tables: orders, order_items, products.

The @ symbol loads the file into context without any extra steps — see CLAUDE.md and the Memory System for details.

Iterating on SQL in Dialogue

SQL rarely comes out right on the first try — especially when business metric logic is complex. Work iteratively: treat the first query as a draft, and each follow-up message as a targeted refinement:

[first response — query generated]

Good, but we need to exclude test orders (status = 'test')
and include only completed ones (status IN ('shipped', 'delivered')).
Update the query.

The dialogue preserves the schema context and the previous logic, so each correction is minimal and precise.

Verification via EXPLAIN and Dry-Run

Before running a heavy analytical query against a production database, request the execution plan:

Before running, do an EXPLAIN ANALYZE and show the plan.
If there's a seq scan on a table with >10M rows —
suggest an index or rewrite the query.

The agent will spot that a join is on an unindexed column and warn you before the query could have taken down a replica. After EXPLAIN, send a separate message: "now execute it."

Проверь себя

Why run EXPLAIN ANALYZE before a heavy analytical query instead of running the query itself? When is this especially important?

Jupyter Notebooks: Two Approaches

Claude Code supports .ipynb files through two distinct mechanisms that differ significantly in capability.

Built-in NotebookEdit

A dedicated tool that works out of the box with no configuration. It correctly parses the .ipynb file structure: adding, replacing, and deleting cells, and clearing stale outputs when code changes.

The limitations are significant: no kernel access — the agent cannot execute a cell and read its output. Additionally, cell content is written as a single JSON string with \n separators rather than as an array of strings like standard Jupyter produces — this creates noise in git diffs and can break when editing manually in JupyterLab.

NotebookEdit is suitable for quick structural edits: renaming a variable throughout the notebook, adding a Markdown heading, or reordering cells.

Jupyter MCP Server (Recommended for Analysis)

An open-source third-party MCP server that gives the agent full access to a running Jupyter kernel: executing cells, reading output (text, tables, charts), catching errors, and fixing them in the next cell. This is a true agentic loop inside the notebook.

Setup takes about 10 minutes:

# 1. Install the server
pip install jupyter-mcp-server

# 2. Start Jupyter
jupyter lab

# 3. Add to Claude Code
claude mcp add jupyter -- python -m jupyter_mcp_server

After this, Claude Code works autonomously: writes a cell → executes it → reads the traceback → fixes it → runs it again. This is exactly the loop you need for EDA.

Exploratory Data Analysis (EDA) Patterns

A typical EDA session with Jupyter MCP Server is built step by step — one question at a time.

Step 1 — Dataset overview:

Load data/transactions_2024.csv into pandas,
show the shape, column types, missing value rates,
and basic statistics for numeric fields.

Step 2 — Target question:

Top 10 categories by total transaction amount for Q4.
Build a bar chart with seaborn, sorted in descending order.

Step 3 — Anomalies:

Are there outliers in the amount column? Use the IQR method.
If so — show them on a boxplot and print their share of total rows.

The key point: "Analyze everything" is a poor EDA prompt. The agent starts generating random charts with no understanding of the goal. Ask a specific question, review the result, then formulate the next one.

For Data Engineers: dbt and ETL

Claude Code integrates naturally into an engineering stack.

dbt projects. The agent reads models and the sources schema, and understands the transformation context. A typical prompt:

Look at @models/staging/stg_orders.sql
and the corresponding @schema.yml.

Write an intermediate model int_orders_enriched that adds:
- the full customer name from dim_customers
- the product category from dim_products

Include column descriptions in schema.yml.

The agent generates the model, grain comments, and tests — in the style of the existing files alongside it.

Debugging ETL pipelines. If an Airflow DAG has failed, give the agent the log and the file path:

Here is the log from the failed DAG:
[paste error]

Find the cause in @dags/load_transactions.py and suggest a fix.

Auto-documentation. The agent writes SQL comments and dbt descriptions well — especially when good examples from well-documented models are nearby. Provide two or three samples and new models will follow the same style.

Limitations and Caveats

Access privileges. Connect the agent through a user with minimal permissions. For analytical tasks, SELECT and CREATE TEMP TABLE are sufficient. Giving the agent DROP, TRUNCATE, or UPDATE on a production database is a bad idea regardless of Claude Code's permission settings.

Large result sets consume context. If a query returns 500,000 rows and the agent tries to print them all to the conversation, context will run out quickly. Always add a LIMIT to exploratory queries, or ask the agent to aggregate the data rather than dumping raw rows. See Managing the Context Window for strategies.

SQL needs to be verified. The agent confidently generates queries that look correct but may contain logical errors: wrong join granularity, missing partition filters, unexpected deduplication. Always validate results against a small sample (LIMIT 100, WHERE date = '2024-01-01') before running on the full dataset.

NotebookEdit ≠ code execution. Do not expect cell execution results from the built-in tool — they aren't there. If you need a real "write → run → fix" loop, spend 10 minutes setting up the Jupyter MCP Server.