Tutorials almost always set schedule_interval to @daily or a cron expression. This project’s DAG looks like this instead:
1 | with DAG( |
“Manual trigger + params” looks counter-cultural. In backfill-heavy workloads it’s the most pleasant pattern I’ve found.
What running it looks like
UI: click “Trigger DAG with config”, then:
1 | {"months": ["2023-05", "2023-06"]} |
Or CLI:
1 | airflow dags trigger nyc_taxi_pipeline \ |
The task reads it like this:
1 | def download_taxi_data(**context) -> str: |
Why not @monthly + catchup=True
The mainstream pattern is to schedule it monthly and let catchup backfill. That has hidden costs:
- Backfills explode: scheduler may launch 6 concurrent runs, then BigQuery rate-limits you
- “Just January and May, skip the rest” is awkward —
catchuponly knows ranges execution_datesemantics are heavy; even after Airflow 2.x renamed itlogical_date, it confuses newcomers
Make “which months” an explicit parameter and all three problems vanish.
When this pattern fits
- Upstream is discrete files (monthly, versioned), not an event stream
- Tasks are idempotent, so re-runs are safe (this project uses
CREATE OR REPLACE) - The team is small and no external system depends on the DAG running “on time”
Flip side: if the business needs “today’s report by 9 AM”, go back to cron + SLA.
Two supporting pieces
Piece 1: a working default
1 | DEFAULT_MONTHS = ["2023-01", "2023-02", "2023-03"] |
Triggering without params still gives a smoke-test run. Hugely friendly for new maintainers.
Piece 2: validate the params
1 | def _year_month(month: str) -> tuple[str, str]: |
The Airflow UI form is untyped. The first time a user types "2023-1", fail fast with a clear message.
One-liner
When your DAG’s shape is “one run = process one discrete batch of inputs”,
schedule_interval=None+paramsis simpler than cron +catchup, and it unifies backfill and routine runs through the same surface.