ML Experiments and Pipeline Workflows
The ML cycle is fundamentally different from regular software development: there is no "done" — only a metric, a hypothesis, an experiment, and another hypothesis. Switching between reading data, writing transformations, configuring tracking, and debugging pipelines takes a disproportionate amount of time — and this is precisely where Claude Code inserts an agentic layer. But this tool has a clear boundary: it can write code, read files, and run commands, but it does not see numbers in real time and does not interpret metrics for you.
Where the Agent Genuinely Accelerates the ML Cycle
Practitioners typically lose time not in training itself but around it: scaffolding configs, preparing data, writing repetitive boilerplate, debugging a failed pipeline step. This is where Claude Code helps most, because each of these tasks is largely routine code that it can draft in minutes rather than hours.
Data preparation. The agent is good at reading dataset structure and writing clean transformations:
Look at @data/raw/train.parquet and @src/features/config.yaml.
Write a script src/features/build_features.py that:
- splits into train/val using a stratified split (stratify by label)
- normalizes numeric columns according to the list in the config, fitting statistics on train only
- encodes categorical columns via target encoding fitted on train only (use category_encoders)
- saves to data/processed/ in parquet format
At the top of the script: assert checks for expected columns and types.
The agent writes such a script in the style of existing code nearby, provided you supply examples from the project via @. This is the same "schema first" principle that worked with SQL, only now instead of a DB schema you have a dataset structure and a feature config.
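A minimal sketch of the checks and the split that such a prompt asks for, written in plain Python so it runs without pandas or scikit-learn. The EXPECTED_COLUMNS mapping, the column names, and both helper names are hypothetical; a real build_features.py would read the schema from the project config and use the project's own dataframe library.

```python
import random
from collections import defaultdict

# Hypothetical schema: column name -> expected Python type.
EXPECTED_COLUMNS = {"age": float, "city": str, "label": int}


def validate_rows(rows, expected=EXPECTED_COLUMNS):
    """Fail fast if a column is missing or holds an unexpected type."""
    for i, row in enumerate(rows):
        missing = set(expected) - set(row)
        assert not missing, f"row {i}: missing columns {sorted(missing)}"
        for name, expected_type in expected.items():
            assert isinstance(row[name], expected_type), (
                f"row {i}: {name} should be {expected_type.__name__}"
            )


def stratified_split(rows, label_key="label", val_fraction=0.2, seed=42):
    """Split rows into train/val, keeping label proportions roughly equal."""
    rng = random.Random(seed)  # local RNG so the split is reproducible
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    train, val = [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        n_val = max(1, round(len(group) * val_fraction))
        val.extend(group[:n_val])
        train.extend(group[n_val:])
    return train, val


# Toy data: 50 rows, 10 of them with label 1.
rows = [
    {"age": float(20 + i), "city": "A" if i % 2 else "B", "label": int(i % 5 == 0)}
    for i in range(50)
]
validate_rows(rows)
train_rows, val_rows = stratified_split(rows)
```

Splitting per label group is what keeps the rare class represented in validation; the fixed seed makes the split repeatable across runs.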
Experiment scaffolding. Creating a new experiment with MLflow is routine work that takes 20–30 minutes by hand (configs, logging, saving artifacts). The agent does it from a template:
The project already has experiments/baseline/ with MLflow tracking.
Create experiments/attention_pooling/ with the same structure:
- train.py with logging of hyperparameters, per-epoch metrics, and final val_f1
- config.yaml with model parameters
- README.md describing the hypothesis
Hypothesis: replacing mean pooling with attention pooling in the classifier will improve F1
despite a +15% increase in parameters.
Important: you write the hypothesis and the interpretation of results. The agent creates the infrastructure to test it.
Jupyter vs. marimo: What Works with the Agent
The previous article on Data, SQL and Analytics described two paths for Jupyter: the built-in NotebookEdit (without kernel access) and Jupyter MCP Server (with a full write → run → read-output cycle). In an ML context, one more recommendation is added.
marimo is a reactive replacement for Jupyter that stores notebooks as plain .py files. Claude Code works with it like any Python file: reads, edits, and runs it via bash. No JSON, no hidden cell states, and no special MCP server for the basic cycle. Because marimo executes cells in dependency order, running the notebook as a script (python notebook.py) or serving it as an app (marimo run notebook.py) gives the same result regardless of the order in which cells were written or edited.
# Installation
pip install marimo
# Running an existing notebook in Claude Code
claude
# then in the dialogue:
# > run python experiments/eda.py and show the output
For heavy EDA where you need to see charts in real time, Jupyter MCP Server is still the better choice. For ML scripts where reproducibility matters, use marimo or plain .py files.
flowchart TD
A[Data / raw] --> B[Feature preparation<br/>agent writes script]
B --> C[data/processed/]
C --> D[Experiment scaffolding<br/>agent creates structure]
D --> E[train.py + config.yaml<br/>+ MLflow/W&B logging]
E --> F[Launch training<br/>you / CI]
F --> G[Artifacts and metrics<br/>MLflow / W&B]
G --> H{Results analysis<br/>agent or you?}
H -->|Structured analysis| I[claude -p on mlruns/]
H -->|Interpretation| J[You: hypothesis selection]
J --> D
E --> K[Reproducibility<br/>agent checks seeds,<br/>configs, versions]
style H fill:#f5a623,color:#000
style J fill:#d0021b,color:#fff
Experiment Tracking: Setup via the Agent
Setting up MLflow or Weights & Biases in a new project involves many small steps that are easy to delegate:
The project uses PyTorch and MLflow for tracking.
Currently train.py has no logging at all.
1. Add mlflow.start_run() with an automatic run name from the config
2. Log all fields from config.yaml as params
3. Log train_loss and val_loss every epoch
4. At the end — log_metric("best_val_f1", ...) and mlflow.pytorch.log_model()
5. Add mlflow>=2.10 to requirements.txt
Config structure example: @configs/baseline.yaml
train.py example: @experiments/baseline/train.py
The agent will write the boilerplate, preserve the style of existing code, and usually remember to close the run (mlflow.end_run() inside a finally block, or a with mlflow.start_run() context manager), a detail humans often overlook. Afterward, review the code yourself and run it.
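As a rough illustration of what the resulting tracking code might look like, here is a sketch under stated assumptions: the config keys (model.name, optim.lr, optim.epochs) and the train_one_epoch callable are hypothetical, and model logging via mlflow.pytorch.log_model is left out to keep the example short. MLflow is imported inside the function only so that the flattening helper can be tried without MLflow installed.

```python
def flatten_params(config, prefix=""):
    """Flatten a nested config dict into dotted keys for mlflow.log_params."""
    flat = {}
    for key, value in config.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten_params(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


def train_with_tracking(config, train_one_epoch):
    """Run training and log params and per-epoch metrics to MLflow.

    train_one_epoch is a hypothetical callable:
    epoch -> (train_loss, val_loss, val_f1).
    """
    import mlflow  # local import: flatten_params stays usable without MLflow

    run_name = f"{config['model']['name']}-lr{config['optim']['lr']}"
    best_val_f1 = 0.0
    # The context manager ends the run even if training raises.
    with mlflow.start_run(run_name=run_name):
        mlflow.log_params(flatten_params(config))
        for epoch in range(config["optim"]["epochs"]):
            train_loss, val_loss, val_f1 = train_one_epoch(epoch)
            mlflow.log_metric("train_loss", train_loss, step=epoch)
            mlflow.log_metric("val_loss", val_loss, step=epoch)
            best_val_f1 = max(best_val_f1, val_f1)
        mlflow.log_metric("best_val_f1", best_val_f1)
    return best_val_f1


# Hypothetical config, shaped like a nested YAML file.
example_config = {
    "model": {"name": "baseline", "hidden": 256},
    "optim": {"lr": 0.001, "epochs": 3},
}
flat_params = flatten_params(example_config)
```

Flattening to dotted keys keeps nested YAML sections distinguishable in the MLflow UI, which helps when comparing runs by a single hyperparameter.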
W&B is similar, just with different syntax. If the project already has one integrated script, give it to the agent as a template and ask it to port the tracking to a new experiment.
Pipelines: Debugging and Extension
ML pipelines on Airflow, Prefect, or Metaflow tend to fail at the most unexpected moments. Claude Code handles two scenarios well.
Debugging a failed pipeline. Give the agent the log and the code:
Here is the log from a failed DAG in Airflow:
<paste traceback>
DAG file: @dags/training_pipeline.py
Task file: @src/pipeline/train_task.py
Find the root cause, propose a fix, and add error handling for this case.
The agent sees the context of both files, reads the traceback, and often pinpoints the root cause immediately: a wrong artifact path, a type mismatch, a missing dependency between tasks.
Adding a new step. If the pipeline needs to be extended — give the agent the existing DAG as a template:
Look at the structure of @dags/training_pipeline.py.
Add a new step evaluate_on_holdout after the train_model task:
- loads the best checkpoint from MLflow by run_id
- runs inference on @data/holdout.parquet
- saves a report to reports/holdout_metrics.json
- marks the task as failed if f1 < 0.75
Preserve the style: same retry patterns, same Airflow variables, same logging format.
The @ symbol for loading files into context is the same mechanism described in CLAUDE.md and the Memory System. Here it works as "pass a style reference."
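The core of such an evaluate_on_holdout step can be sketched independently of Airflow. The function names and the binary-label assumption below are hypothetical, and checkpoint loading and inference are omitted; the sketch only shows writing the report and failing on a low score. In Airflow, an uncaught exception raised from the task callable is what marks the task as failed.

```python
import json
import tempfile
from pathlib import Path


def f1_binary(y_true, y_pred):
    """F1 for the positive class (label 1); 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def evaluate_and_gate(y_true, y_pred, report_path, min_f1=0.75):
    """Write the holdout report, then raise if F1 is below the threshold."""
    f1 = f1_binary(y_true, y_pred)
    report_path = Path(report_path)
    report_path.parent.mkdir(parents=True, exist_ok=True)
    report_path.write_text(json.dumps({"f1": f1, "min_f1": min_f1}, indent=2))
    if f1 < min_f1:
        # Raised after the report is written, so the numbers survive the failure.
        raise ValueError(f"holdout f1 {f1:.3f} is below {min_f1}")
    return f1


# Toy labels and predictions; a temp directory stands in for reports/.
with tempfile.TemporaryDirectory() as tmp:
    score = evaluate_and_gate(
        [1, 1, 0, 0, 1],
        [1, 1, 0, 1, 1],
        Path(tmp) / "reports" / "holdout_metrics.json",
    )
```

Writing the report before the threshold check is a deliberate choice: a failed gate is exactly the case where someone will want to open the metrics file.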
Reproducibility: the Agent as a Hygiene Checker
ML experiment reproducibility depends on many small details that are easy to miss: random seeds, library versions, fixed data ordering. You can ask Claude Code to walk through the code and find issues:
Check experiments/attention_pooling/train.py for reproducibility:
- Are all random seeds fixed? (torch, numpy, random, cuda)
- Are dependency versions pinned in requirements.txt?
- Is the data path not hardcoded in the code?
- Are results saved with a run_id rather than overwriting previous ones?
Apply the necessary fixes.
A typical result: the agent adds torch.backends.cudnn.deterministic = True, replaces a hardcoded path with Path(config.data_dir), and fills in exact versions in requirements.txt.
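The seed checklist above usually collapses into one helper called at the top of train.py. This is a sketch, not the project's actual code: the name seed_everything is an assumption, and NumPy and PyTorch are treated as optional so the function also works in environments where they are not installed.

```python
import random


def seed_everything(seed):
    """Seed every RNG the training script may touch."""
    random.seed(seed)

    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed

    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # safe to call without a GPU
        # Trade some speed for repeatable cuDNN kernels.
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    except ImportError:
        pass  # PyTorch not installed; nothing to seed


seed_everything(42)
```

Even with all seeds fixed, some GPU operations remain nondeterministic, so bit-identical results across hardware are not guaranteed.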
Configs instead of command-line arguments. If the project passes parameters through dozens of argparse arguments — that's a signal to switch to YAML configs (Hydra, OmegaConf). The agent can perform this refactoring:
Port @experiments/baseline/train.py from argparse to Hydra.
Configs should live in configs/, with the base config at configs/default.yaml.
Group configs: configs/model/, configs/data/.
Preserve all existing parameters.
After refactoring, each experiment run has a fixed config file, so reproducing a run with the same parameters becomes trivial.
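Once parameters live in config files, a grid over them reduces to enumerating command-line overrides. The sketch below assumes Hydra-style key=value overrides; the keys optim.lr, data.batch_size, and run_name are hypothetical and would need to match the project's actual config groups.

```python
import itertools
import shlex


def sweep_commands(learning_rates, batch_sizes,
                   script="experiments/attention_pooling/train.py"):
    """Build one training command per (lr, batch_size) pair."""
    commands = []
    for lr, batch in itertools.product(learning_rates, batch_sizes):
        run_name = f"lr{lr}_bs{batch}"
        args = [
            "python", script,
            f"optim.lr={lr}",            # hypothetical config key
            f"data.batch_size={batch}",  # hypothetical config key
            f"run_name={run_name}",
        ]
        commands.append(shlex.join(args))  # shell-safe quoting
    return commands


commands = sweep_commands([1e-3, 3e-4], [32, 64])
for command in commands:
    print(command)
```

Encoding the hyperparameters in run_name makes the runs easy to tell apart later in the tracking UI or in mlruns/.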
Headless Mode for Running Experiment Sweeps
When you need to run a series of configs — for example, a grid search over learning rate and batch size — you can use Claude Code's headless mode together with a shell script. More on headless mode in Headless Mode and CLI Scripting; here is just the ML pattern:
# Run results analysis after a series of experiments
claude -p "
Look at mlruns/ and find all runs tagged with experiment=attention_pooling.
Compare val_f1 by learning_rate and batch_size.
Output a table: lr | batch | best_val_f1 | best_epoch
Highlight the config with the best val_f1.
" --output-format json > results/sweep_analysis.json
The agent reads MLflow artifacts directly from the filesystem and returns a structured result. No need to open the UI or write a pandas script for analysis.
Where Human Oversight Is Required
ML is a domain where misinterpretation is costly. Here is where the agent requires explicit review:
Metric selection. Claude Code will generate code for any metric you ask for — but choosing the right one (precision vs. recall with imbalanced classes, NDCG vs. MAP for ranking) is up to you. The agent does not know what matters most for your product.
Interpreting results. "val_loss stopped decreasing at epoch 8": is that the right point to stop training, or the onset of overfitting? The agent sees the numbers but does not understand the context: how much data there is, how complex the task is, whether that val_f1 is acceptable for deployment.
Data leakage. The agent writes splits and transformations faithfully within the scope of what you described — but it does not always detect feature leakage from the future. Subtle leakage (e.g., target encoding applied before the split) requires human review.
Production data. The rule from the previous article applies doubly here: an agent with access to S3 or a remote database can modify data if asked carelessly. A separate read-only IAM user for the agent is not paranoia — it is standard practice.
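The target-encoding leakage mentioned under Data leakage can be made concrete with a small sketch. It is written in plain Python instead of category_encoders, uses unsmoothed category means, and all names are hypothetical. The point it illustrates is that the mapping is learned from training rows only and then applied to validation rows.

```python
from collections import defaultdict


def fit_target_encoding(train_rows, column, label_key="label"):
    """Learn per-category label means from the training rows only."""
    sums, counts = defaultdict(float), defaultdict(int)
    for row in train_rows:
        sums[row[column]] += row[label_key]
        counts[row[column]] += 1
    global_mean = sum(sums.values()) / sum(counts.values())
    mapping = {category: sums[category] / counts[category] for category in counts}
    return mapping, global_mean


def apply_target_encoding(rows, column, mapping, global_mean):
    """Encode rows with the train-fitted mapping; unseen categories get the global mean."""
    return [mapping.get(row[column], global_mean) for row in rows]


# Toy data: city "C" appears only in validation.
train = [
    {"city": "A", "label": 1},
    {"city": "A", "label": 0},
    {"city": "B", "label": 1},
    {"city": "B", "label": 1},
]
val = [{"city": "A", "label": 0}, {"city": "C", "label": 1}]

mapping, global_mean = fit_target_encoding(train, "city")
val_encoded = apply_target_encoding(val, "city", mapping, global_mean)
```

When reviewing agent-written code, the thing to look for is the order of operations: if the encoder is fitted on the full dataset before the split, validation labels have already influenced the features.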
See also
- Data, SQL and Analytics — connecting databases via MCP, Jupyter MCP Server, and EDA patterns that underpin this section
- Subagents and Context Isolation — offloading parallel experiment runs to isolated subagents
- Headless Mode and CLI Scripting — claude -p for automating results analysis and CI integration
- CLAUDE.md and the Memory System — locking in experiment naming conventions, project structure, and config style
- Typical Workflows: Explore, Plan, Implement — the explore → plan → code pattern, which maps naturally onto the ML cycle
- Managing the Context Window — large training logs and MLflow artifacts fill the context quickly; strategies for managing it
- Practice: GitHub, Databases, and Web APIs via MCP — ready-made MCP configurations for data access in ML pipelines
Sources
- Building Data Pipelines with Claude Code: Engineering Reliable, Reproducible LLM Systems
- Using Claude Code with marimo
- Can Claude Code Analyze Jupyter Notebooks for Data Science? What It Actually Does – Kanaries
- Getting Started with Claude Code for Data Scientists – Dataquest
- ML Experiment Tracker: Claude Code Skill
- Best practices for Claude Code