What it is
Apache Airflow is a platform for authoring, scheduling, and monitoring workflows. It suits chains of dependent work: load data, transform it, validate it, export it, notify a team, and repeat on a schedule.
The project started at Airbnb in 2014 and later joined the Apache Software Foundation, where it is now a top-level project. Its core idea is to describe dependencies and schedules as Python code rather than as scattered cron commands.
What is inside
The repository contains the scheduler, web UI, executors, operators, sensors, integration providers, database migrations, tests, and documentation.
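The central concept is a DAG (directed acyclic graph): tasks are the nodes, dependencies are the edges, and no chain of dependencies may loop back on itself. Because Airflow records the state of every task run, it is especially useful when observability matters.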
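To make the DAG idea concrete, here is a minimal sketch using only Python's standard library. It is not how Airflow's scheduler works internally; it only illustrates how a dependency graph determines a valid run order. The task names are hypothetical and borrowed from the chain described earlier.

```python
from graphlib import TopologicalSorter

# Hypothetical task names; each key maps to the set of tasks it depends on.
dependencies = {
    "load": set(),
    "transform": {"load"},
    "validate": {"transform"},
    "export": {"validate"},
    "notify": {"export"},
}


def run_order(deps):
    """Return one valid order in which every task follows its upstream tasks."""
    return list(TopologicalSorter(deps).static_order())


order = run_order(dependencies)
print(order)
```

If the mapping contained a cycle (for example, two tasks that depend on each other), `static_order()` would raise `graphlib.CycleError`, which is the "acyclic" requirement in practice.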
How it is used
Data teams use Airflow for ETL/ELT, reports, model training, file loading, warehouse synchronization, and technical checks.
For very simple jobs, Airflow can be heavy. It needs a metadata database, a scheduler, an executor, and discipline in DAG code.
Strengths and limits
The strength is a mature dependency model and a broad integration ecosystem.
The limit is operational complexity: it must be upgraded, monitored, secured, and kept maintainable.
Airflow is especially useful when a team thinks in data lifecycles rather than isolated scripts: who owns a task, when it runs, which inputs it uses, how to retry failure, and where execution history is visible.
A mature installation needs more than DAGs. Naming rules, concurrency limits, secrets, alerts, provider management, and retry policy matter. Without them, the platform quickly becomes another source of late-night failures.
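As a rough sketch of what such conventions can look like, the settings below collect a retry policy, concurrency limits, and a naming rule in one place. The keys `retries`, `retry_delay`, `owner`, `max_active_runs`, and `max_active_tasks` are existing Airflow task and DAG parameters; the team name, the DAG id, the specific values, and the `<team>__<pipeline>` naming rule are assumptions made for illustration. The block deliberately does not import Airflow, so it runs anywhere.

```python
import re
from datetime import timedelta

# Hypothetical naming rule: <team>__<pipeline>, lowercase with underscores.
DAG_ID_PATTERN = re.compile(r"^[a-z0-9_]+__[a-z0-9_]+$")


def follows_naming_rule(dag_id):
    """Return True if the DAG id matches the (assumed) team naming rule."""
    return DAG_ID_PATTERN.fullmatch(dag_id) is not None


# Task-level defaults, usually passed to a DAG as default_args.
default_args = {
    "owner": "data_platform",             # hypothetical team name
    "retries": 3,                         # retry a failed task up to three times
    "retry_delay": timedelta(minutes=5),  # wait between attempts
}

# DAG-level keyword arguments, including concurrency limits.
dag_kwargs = {
    "dag_id": "warehouse__daily_sync",    # hypothetical DAG id
    "max_active_runs": 1,                 # no overlapping runs of this DAG
    "max_active_tasks": 4,                # cap on parallel tasks within the DAG
    "default_args": default_args,
}

print(follows_naming_rule(dag_kwargs["dag_id"]))
```

Keeping these values in a shared module, rather than repeating them in each DAG file, is one way to make the policy reviewable. Secrets and alerting are left out here because they depend on the installation.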
The repository is also a good example of an open platform with a large provider ecosystem. That ecosystem is powerful, but it means upgrades and compatibility checks become a regular part of operating Airflow.
This makes Airflow a platform decision, not just a Python library choice. The surrounding operating model matters as much as the DAG code.
Example
A minimal Airflow DAG
This shows the key idea: tasks are plain Python functions, and the dependency between them is expressed in code by passing the output of extract() to load().
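from airflow.decorators import dag, task
from pendulum import datetime


# One run per day from the start date; catchup=False skips runs for past dates.
@dag(start_date=datetime(2026, 1, 1), schedule="@daily", catchup=False)
def daily_report():
    @task
    def extract():
        # The return value is handed to downstream tasks via XCom.
        return {"rows": 120}

    @task
    def load(data):
        print(f"loaded {data['rows']} rows")

    # Passing extract()'s output to load() declares the dependency extract -> load.
    load(extract())


# Calling the decorated function creates the DAG object so Airflow can discover it.
daily_report()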