ML Experiments and Pipeline Management — Claude Code: Developer Reference

ML Experiments and Pipeline Workflows

The ML cycle is fundamentally different from regular software development: there is no "done" — only a metric, a hypothesis, an experiment, and another hypothesis. Switching between reading data, writing transformations, configuring tracking, and debugging pipelines takes a disproportionate amount of time — and this is precisely where Claude Code inserts an agentic layer. But this tool has a clear boundary: it can write code, read files, and run commands, but it does not see numbers in real time and does not interpret metrics for you.

Where the Agent Genuinely Accelerates the ML Cycle

Practitioners typically lose time not in training itself but around it: scaffolding configs, preparing data, writing repetitive boilerplate, debugging a failed pipeline step. This is where Claude Code can turn hours of routine work into minutes.

Data preparation. The agent is good at reading dataset structure and writing clean transformations:

Look at @data/raw/train.parquet and @src/features/config.yaml.

Write a script src/features/build_features.py that:
- normalizes numeric columns according to the list in the config
- encodes categorical columns via target encoding (use category_encoders)
- splits into train/val using a stratified split (stratify by label)
- saves to data/processed/ in parquet format

At the top of the script — assert checks for expected columns and types.

The agent writes such a script in the style of existing code nearby — if you provide examples from the project via @. This is the same "schema first" principle that worked with SQL, only now instead of a DB schema you have a dataset structure and a feature config.
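
To make "stratify by label" concrete, here is a dependency-free sketch of the split step. The name stratified_split and its parameters are illustrative assumptions, not part of the project. A generated script would more likely call sklearn.model_selection.train_test_split with stratify=, but the review question is the same either way: does each class keep its share in both parts?

```python
import random
from collections import defaultdict


def stratified_split(labels, val_fraction=0.2, seed=42):
    """Return (train_idx, val_idx) with per-class proportions preserved."""
    rng = random.Random(seed)  # local RNG, global random state is untouched
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, val_idx = [], []
    # Iterate classes in a fixed order so the result depends only on the seed.
    for label in sorted(by_class, key=repr):
        indices = by_class[label]
        rng.shuffle(indices)
        # At least one validation row per class; a class with a single row
        # therefore ends up in validation only.
        n_val = max(1, round(len(indices) * val_fraction))
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)
```

With 80 negative and 20 positive labels and val_fraction=0.2, the validation part receives 16 negatives and 4 positives, and repeating the call with the same seed returns the same indices.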

Experiment scaffolding. Creating a new experiment with MLflow is routine work that takes 20–30 minutes by hand (configs, logging, saving artifacts). The agent does it from a template:

The project already has experiments/baseline/ with MLflow tracking.
Create experiments/attention_pooling/ with the same structure:
- train.py with logging of hyperparameters, per-epoch metrics, and final val_f1
- config.yaml with model parameters
- README.md describing the hypothesis

Hypothesis: replacing mean pooling with attention pooling in the classifier will improve F1
despite a +15% increase in parameters.

Important: you write the hypothesis and the interpretation of results. The agent creates the infrastructure to test it.
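
The mechanical part of that scaffolding can be sketched as a small script. This is an assumption about how one might do it by hand, not what the agent runs internally; scaffold_experiment is a hypothetical helper, and only the file names come from the prompt above.

```python
import shutil
from pathlib import Path


def scaffold_experiment(baseline_dir, new_dir, hypothesis):
    """Copy train.py and config.yaml from a baseline and write a README stub."""
    baseline, target = Path(baseline_dir), Path(new_dir)
    if target.exists():
        raise FileExistsError(f"{target} already exists; refusing to overwrite")
    target.mkdir(parents=True)
    for name in ("train.py", "config.yaml"):
        shutil.copy2(baseline / name, target / name)
    readme = (
        f"# {target.name}\n\n"
        f"## Hypothesis\n\n{hypothesis}\n\n"
        "## Results\n\nTODO: written by a human after the run.\n"
    )
    (target / "README.md").write_text(readme, encoding="utf-8")
    return target
```

The README keeps the hypothesis and the results in separate sections, which mirrors the division of labor: the structure can be generated, the content of both sections is yours.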

Check yourself

The agent wrote a script to prepare data with target encoding. You ran it and got val_f1 = 0.89 — which looks good. What potential risk should you check before trusting that number?

Jupyter vs. marimo: What Works with the Agent

The previous article on Data, SQL and Analytics described two paths for Jupyter: the built-in NotebookEdit (without kernel access) and Jupyter MCP Server (with a full write → run → read-output cycle). In an ML context, one more recommendation is added.

marimo is a reactive alternative to Jupyter that stores notebooks as plain .py files. Claude Code works with it like any Python file: reads, edits, and runs it via bash. No JSON, no hidden cell states, and no special MCP server for the basic cycle. Because marimo derives execution order from the dependencies between cells, python notebook.py executes the notebook as an ordinary script without the "dirt" of manual cell execution order, while marimo run notebook.py serves it as a read-only web app.

# Installation
pip install marimo

# Running an existing notebook in Claude Code
claude
# then in the dialogue:
# > run python experiments/eda.py and show the output

For heavy EDA where you need to see charts in real time — Jupyter MCP Server is still the better choice. For ML scripts where reproducibility matters — marimo or plain .py files.

flowchart TD
    A[Data / raw] --> B[Feature preparation<br/>agent writes script]
    B --> C[data/processed/]
    C --> D[Experiment scaffolding<br/>agent creates structure]
    D --> E[train.py + config.yaml<br/>+ MLflow/W&B logging]
    E --> F[Launch training<br/>you / CI]
    F --> G[Artifacts and metrics<br/>MLflow / W&B]
    G --> H{Results analysis<br/>agent or you?}
    H -->|Structured analysis| I[claude -p on mlruns/]
    H -->|Interpretation| J[You: hypothesis selection]
    J --> D
    E --> K[Reproducibility<br/>agent checks seeds,<br/>configs, versions]
    style H fill:#f5a623,color:#000
    style J fill:#d0021b,color:#fff

The ML cycle with Claude Code: where the agent automates, where a human is needed

Quick review

Why is marimo chosen over Jupyter for ML work with Claude Code?

Experiment Tracking: Setup via the Agent

Setting up MLflow or Weights & Biases in a new project involves many small steps that are easy to delegate:

The project uses PyTorch and MLflow for tracking.
Currently train.py has no logging at all.

1. Add mlflow.start_run() with an automatic run name from the config
2. Log all fields from config.yaml as params
3. Log train_loss and val_loss every epoch
4. At the end — log_metric("best_val_f1", ...) and mlflow.pytorch.log_model()
5. Add mlflow>=2.10 to requirements.txt

Config structure example: @configs/baseline.yaml
train.py example: @experiments/baseline/train.py

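The agent writes the boilerplate and preserves the style of the existing code. When you review the result, check that the run is closed even if training fails: either the whole loop sits inside with mlflow.start_run():, or mlflow.end_run() is called in a finally block. This detail is easy to overlook, for people and agents alike. Then run the script yourself.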
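
The tracking steps from the prompt above can be sketched as follows. This is a minimal sketch under stated assumptions: train_with_tracking, flatten_config, and the run_epoch callback are hypothetical names, the config is assumed to be a nested dict loaded from config.yaml, and model logging via mlflow.pytorch.log_model() is left out to keep the example free of PyTorch. The tracker argument exists only so the logic can be exercised without an MLflow installation; by default the real mlflow module is used.

```python
def flatten_config(config, prefix=""):
    """Flatten nested dicts into dotted keys: {"model": {"dim": 8}} -> {"model.dim": 8}."""
    flat = {}
    for key, value in config.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten_config(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


def train_with_tracking(config, run_epoch, num_epochs, tracker=None):
    """Run a training loop inside a tracked run and return the best val_f1.

    run_epoch(epoch) is a hypothetical callback that returns a dict with
    the keys "train_loss", "val_loss", and "val_f1".
    """
    if tracker is None:
        import mlflow as tracker  # real MLflow unless a stand-in is injected

    best_val_f1 = float("-inf")
    # The context manager closes the run even if run_epoch raises.
    with tracker.start_run(run_name=config.get("run_name")):
        tracker.log_params(flatten_config(config))
        for epoch in range(num_epochs):
            metrics = run_epoch(epoch)
            tracker.log_metric("train_loss", metrics["train_loss"], step=epoch)
            tracker.log_metric("val_loss", metrics["val_loss"], step=epoch)
            best_val_f1 = max(best_val_f1, metrics["val_f1"])
        tracker.log_metric("best_val_f1", best_val_f1)
    return best_val_f1
```

Using with instead of a start_run()/end_run() pair means a crash in the middle of training still ends the run, and MLflow records it as failed rather than leaving it open.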

W&B is similar, just with different syntax. If the project already has one integrated script, give it to the agent as a template and ask it to port the tracking to a new experiment.

Check yourself

You asked the agent to add MLflow tracking to train.py. The agent inserted `mlflow.start_run()` at the beginning of the script and `mlflow.end_run()` at the end. What will happen if training crashes with an exception halfway through — and what is the correct fix?

Pipelines: Debugging and Extension

ML pipelines on Airflow, Prefect, or Metaflow tend to fail at the most unexpected moments. Claude Code handles two scenarios well.

Debugging a failed pipeline. Give the agent the log and the code:

Here is the log from a failed DAG in Airflow:
<paste traceback>

DAG file: @dags/training_pipeline.py
Task file: @src/pipeline/train_task.py

Find the root cause, propose a fix, and add error handling for this case.

The agent sees the context of both files, reads the traceback, and often pinpoints the root cause immediately — a wrong artifact path, a type mismatch, a missing dependency between tasks.

Adding a new step. If the pipeline needs to be extended — give the agent the existing DAG as a template:

Look at the structure of @dags/training_pipeline.py.
Add a new step evaluate_on_holdout after the train_model task:
- loads the best checkpoint from MLflow by run_id
- runs inference on @data/holdout.parquet
- saves a report to reports/holdout_metrics.json
- marks the task as failed if f1 < 0.75

Preserve the style: same retry patterns, same Airflow variables, same logging format.
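
The gating logic of such a step is worth reading closely, because it decides whether a model moves forward. Below is a sketch of the core of evaluate_on_holdout with Airflow, MLflow, and model inference stripped away. It assumes binary labels and takes predictions as plain lists; binary_f1 and the report fields are illustrative choices, while the report path and the 0.75 threshold come from the prompt above. Inside an Airflow task, an uncaught exception is what marks the task as failed.

```python
import json
from pathlib import Path


def binary_f1(y_true, y_pred, positive=1):
    """F1 for the positive class; 0.0 when there are no positives at all."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def evaluate_on_holdout(y_true, y_pred, report_path, min_f1=0.75):
    """Write the metrics report, then fail loudly if F1 is below the gate."""
    f1 = binary_f1(y_true, y_pred)
    report = {
        "f1": f1,
        "n_samples": len(y_true),
        "min_f1": min_f1,
        "passed": f1 >= min_f1,
    }
    path = Path(report_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    # The report is written before raising so a failed run still leaves evidence.
    path.write_text(json.dumps(report, indent=2), encoding="utf-8")
    if f1 < min_f1:
        raise ValueError(f"holdout f1 {f1:.3f} is below the threshold {min_f1}")
    return report
```

Writing the report before raising is a deliberate choice: when the gate trips, the numbers that tripped it are already on disk for whoever investigates.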

The @ symbol for loading files into context is the same mechanism described in CLAUDE.md and the Memory System. Here it works as "pass a style reference."

Reproducibility: the Agent as a Hygiene Checker

ML experiment reproducibility depends on many small details that are easy to miss: random seeds, library versions, fixed data ordering. You can ask Claude Code to walk through the code and find issues:

Check experiments/attention_pooling/train.py for reproducibility:
- Are all random seeds fixed? (torch, numpy, random, cuda)
- Are dependency versions pinned in requirements.txt?
- Is the data path not hardcoded in the code?
- Are results saved with a run_id rather than overwriting previous ones?

Apply the necessary fixes.

A typical result: the agent adds torch.backends.cudnn.deterministic = True, replaces a hardcoded path with Path(config.data_dir), and fills in exact versions in requirements.txt.
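
A seed helper of the kind the agent might add looks roughly like this. It is a sketch, not the canonical recipe: set_seed is a hypothetical name, and NumPy and PyTorch are seeded only if they are installed, so the function also works in a stripped-down environment.

```python
import random


def set_seed(seed):
    """Seed every RNG that is installed and return the names that were seeded."""
    random.seed(seed)
    seeded = ["random"]
    try:
        import numpy as np

        np.random.seed(seed)
        seeded.append("numpy")
    except ImportError:
        pass
    try:
        import torch

        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # ignored when CUDA is unavailable
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False  # autotuning can vary between runs
        seeded.append("torch")
    except ImportError:
        pass
    return seeded
```

Seeding alone does not guarantee bit-identical GPU results; some operations stay nondeterministic unless torch.use_deterministic_algorithms(True) is also set, which can cost speed. Whether that trade-off is worth it is a decision for you, not the agent.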

Configs instead of command-line arguments. If the project passes parameters through dozens of argparse arguments — that's a signal to switch to YAML configs (Hydra, OmegaConf). The agent can perform this refactoring:

Port @experiments/baseline/train.py from argparse to Hydra.
Configs should live in configs/, with the base config at configs/default.yaml.
Group configs: configs/model/, configs/data/.
Preserve all existing parameters.

After refactoring, each experiment run has a fixed config file — reproducing a run with the same parameters becomes trivial.

Check yourself

After switching from argparse to Hydra, configs are stored in configs/. A colleague runs your experiment with the command `python train.py lr=0.001 batch_size=32`. What changes in terms of reproducibility compared to argparse?

Headless Mode for Running Experiment Sweeps

When you need to run a series of configs — for example, a grid search over learning rate and batch size — you can use Claude Code's headless mode together with a shell script. More on headless mode in Headless Mode and CLI Scripting; here is just the ML pattern:

# Run results analysis after a series of experiments
claude -p "
Look at mlruns/ and find all runs tagged with experiment=attention_pooling.
Compare val_f1 by learning_rate and batch_size.
Output a table: lr | batch | best_val_f1 | best_epoch
Highlight the config with the best val_f1.
" --output-format json > results/sweep_analysis.json

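The agent reads the run records (params, metrics, tags) that MLflow's local file store keeps under mlruns/ and returns the comparison, so for a quick look there is no need to open the UI or write a pandas script. Two caveats: this works only when tracking writes to the local mlruns/ directory rather than a remote tracking server, and --output-format json wraps the agent's answer in a JSON envelope with run metadata, so the table itself arrives as text inside it.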
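
The analysis prompt above assumes the sweep has already run. Launching it can be as simple as generating one command per grid point. The sketch below assumes train.py accepts Hydra-style key=value overrides, as in the refactoring described earlier; sweep_commands and the grid values are illustrative.

```python
import itertools
import shlex


def sweep_commands(grid, script="train.py"):
    """Build one command per grid point using key=value overrides."""
    keys = sorted(grid)  # fixed key order keeps the command list reproducible
    commands = []
    for values in itertools.product(*(grid[k] for k in keys)):
        overrides = [f"{k}={v}" for k, v in zip(keys, values)]
        commands.append(["python", script, *overrides])
    return commands


if __name__ == "__main__":
    grid = {"lr": [0.001, 0.0003], "batch_size": [32, 64]}
    for cmd in sweep_commands(grid):
        print(shlex.join(cmd))  # pipe into a shell, or pass cmd to subprocess.run
```

If the project is already on Hydra, its built-in --multirun flag covers the same ground without a helper; an explicit list of commands is mainly useful when each run has to be submitted to a scheduler separately.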

Where Human Oversight Is Required

ML is a domain where misinterpretation is costly. Here is where the agent requires explicit review:

Metric selection. Claude Code will generate code for any metric you ask for — but choosing the right one (precision vs. recall with imbalanced classes, NDCG vs. MAP for ranking) is up to you. The agent does not know what matters most for your product.

Interpreting results. "val_loss stopped decreasing at epoch 8": is that a plateau where early stopping is appropriate, or the onset of overfitting? The agent can read the numbers but does not understand the context: how much data there is, how complex the task is, whether that val_f1 is acceptable for deployment.

Data leakage. The agent writes splits and transformations faithfully within the scope of what you described — but it does not always detect feature leakage from the future. Subtle leakage (e.g., target encoding applied before the split) requires human review.

Production data. The rule from the previous article applies doubly here: an agent with access to S3 or a remote database can modify data if asked carelessly. A separate read-only IAM user for the agent is not paranoia — it is standard practice.
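
Returning to the data leakage point: the target-encoding case is easiest to review when fitting and applying are separate functions, so that it is visible which rows the statistics came from. The sketch below is dependency-free and illustrative; the function names and the smoothing scheme are assumptions, and a generated script would more likely rely on category_encoders. The property to check is the same in both cases: the mapping is learned from training rows only and merely looked up for validation rows.

```python
from collections import defaultdict


def fit_target_encoding(categories, labels, smoothing=10.0):
    """Learn smoothed per-category target means from training rows only."""
    global_mean = sum(labels) / len(labels)
    sums, counts = defaultdict(float), defaultdict(int)
    for cat, y in zip(categories, labels):
        sums[cat] += y
        counts[cat] += 1
    # Rare categories are pulled toward the global mean by the smoothing term.
    mapping = {
        cat: (sums[cat] + smoothing * global_mean) / (counts[cat] + smoothing)
        for cat in counts
    }
    return mapping, global_mean


def apply_target_encoding(categories, mapping, global_mean):
    """Encode categories; unseen ones fall back to the training global mean."""
    return [mapping.get(cat, global_mean) for cat in categories]
```

In the correct order, the split happens first, fit_target_encoding sees only the training part, and apply_target_encoding is called on both parts. If the mapping is fitted on the full dataset before the split, validation labels flow into the features and the validation metric is inflated. This sketch does not address leakage within the training rows themselves, which is usually handled with out-of-fold encoding.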

Quick review

Which results should Claude Code not be left to interpret?
