Apache Airflow — open source GitHub project

Apache Airflow is a platform for authoring, scheduling, and monitoring data-processing workflows.

What it is

Apache Airflow is a task orchestration platform. It is useful for chains of work: load data, transform it, validate it, export it, notify a team, and repeat on a schedule.

The project started at Airbnb and later became an Apache Software Foundation project. Its core idea is to define dependencies and schedules in Python code rather than in scattered cron entries.

What is inside

The repository contains the scheduler, web UI, executors, operators, sensors, integration providers, database migrations, tests, and documentation.

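The central concept is a DAG: a directed acyclic graph in which tasks are nodes and dependencies are edges. Because the graph has no cycles, there is always a valid order in which to run the tasks. Airflow records the state, logs, and timing of every task run, which makes it especially useful when observability matters.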
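
To make the DAG idea concrete, here is a minimal sketch in plain Python that does not depend on Airflow. The task names are hypothetical, borrowed from the chain of work described earlier, and the standard-library graphlib module stands in for the ordering work that a scheduler performs.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "validate": {"transform"},
    "export": {"validate"},
    "notify": {"export"},
}


def execution_order(deps):
    """Return one run order in which every task follows its upstream tasks."""
    return list(TopologicalSorter(deps).static_order())


def has_cycle(deps):
    """Report whether the graph contains a cycle, which would make it an invalid DAG."""
    try:
        execution_order(deps)
    except CycleError:
        return True
    return False


order = execution_order(dependencies)
print(order)
```

A graph in which two tasks depend on each other has no valid run order, which is why the "acyclic" part of the definition matters.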

How it is used

Data teams use Airflow for ETL/ELT, reports, model training, file loading, warehouse synchronization, and technical checks.

For very simple jobs, Airflow can be heavy. It needs a metadata database, a scheduler, an executor, and discipline in DAG code.

Strengths and limits

The strength is a mature dependency model and a broad integration ecosystem.

The limit is operational complexity: it must be upgraded, monitored, secured, and kept maintainable.

Airflow is especially useful when a team thinks in data lifecycles rather than isolated scripts: who owns a task, when it runs, which inputs it uses, how to retry failure, and where execution history is visible.
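
Ownership and retry behaviour are usually written down as shared task defaults. The sketch below is plain Python: the dictionary keys are standard Airflow operator arguments, but the values and the team name are assumptions chosen for illustration. The helper retry_schedule is hypothetical and only approximates exponential backoff by doubling; Airflow's own calculation differs in detail.

```python
from datetime import timedelta

# Assumed values for illustration; the keys are standard operator arguments.
default_args = {
    "owner": "data-platform",  # hypothetical team name
    "retries": 3,
    "retry_delay": timedelta(minutes=5),
    "retry_exponential_backoff": True,
    "max_retry_delay": timedelta(hours=1),
}


def retry_schedule(args):
    """Approximate the wait before each retry (hypothetical helper, simple doubling)."""
    base = args["retry_delay"]
    cap = args.get("max_retry_delay")
    waits = []
    for attempt in range(args["retries"]):
        # Double the base delay on each attempt when backoff is enabled.
        wait = base * (2 ** attempt) if args.get("retry_exponential_backoff") else base
        if cap is not None:
            wait = min(wait, cap)
        waits.append(wait)
    return waits


for number, wait in enumerate(retry_schedule(default_args), start=1):
    print(f"retry {number}: wait about {wait}")
```

In a real DAG file, a dictionary like this would typically be passed as default_args to the dag decorator so that every task inherits the same policy.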

A mature installation needs more than DAGs. Naming rules, concurrency limits, secrets, alerts, provider management, and retry policy matter. Without them, the platform quickly becomes another source of late-night failures.
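
Naming rules are easiest to keep when they are checked automatically. The following sketch assumes a hypothetical house convention of team__pipeline__frequency for DAG ids; the pattern, the allowed frequencies, and the function name are all illustrative and not part of Airflow.

```python
import re

# Hypothetical convention: <team>__<pipeline>__<frequency>, lowercase snake_case parts.
ALLOWED_FREQUENCIES = {"hourly", "daily", "weekly"}
PART = re.compile(r"[a-z][a-z0-9]*(?:_[a-z0-9]+)*")


def dag_id_problems(dag_id):
    """Return a list of rule violations for a DAG id; an empty list means it passes."""
    parts = dag_id.split("__")
    if len(parts) != 3:
        return [f"expected 3 parts separated by '__', got {len(parts)}"]
    team, pipeline, frequency = parts
    problems = []
    for label, value in (("team", team), ("pipeline", pipeline)):
        if not PART.fullmatch(value):
            problems.append(f"{label} part {value!r} is not lowercase snake_case")
    if frequency not in ALLOWED_FREQUENCIES:
        problems.append(f"unknown frequency {frequency!r}")
    return problems


for candidate in ("finance__daily_report__daily", "DailyReport"):
    print(candidate, dag_id_problems(candidate))
```

A check like this could run in continuous integration against every DAG file, so that violations are caught before deployment rather than during an incident.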

The repository is also a good example of an open platform with a large provider ecosystem. That ecosystem is powerful, but it means upgrades and compatibility checks become a regular part of operating Airflow.

This makes Airflow a platform decision, not just a Python library choice. The surrounding operating model matters as much as the DAG code.

Example

A minimal Airflow DAG

This shows the key idea: tasks are Python functions and dependencies are expressed in code.

Language: Python

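from airflow.decorators import dag, task
from pendulum import datetime

# Run once a day from the start date; catchup=False skips runs for past intervals.
@dag(start_date=datetime(2026, 1, 1), schedule="@daily", catchup=False)
def daily_report():
    @task
    def extract():
        # The return value is handed to downstream tasks through XCom.
        return {"rows": 120}

    @task
    def load(data):
        print(f"loaded {data['rows']} rows")

    # Passing extract's output to load declares the dependency extract -> load.
    load(extract())

# Calling the decorated function creates the DAG object so Airflow can discover it.
daily_report()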