Maintainable by one person, reproducible, and with a defensible answer to “why not X?” at every layer.

One picture

flowchart TD
    TLC["TLC parquet (monthly)"]
    GCS["GCS data lake"]
    EXT["BigQuery external table"]
    OPT["BigQuery optimized table
(PARTITION + CLUSTER)"] MODELS["dbt models
dim_* / fact_trips / mart_*"] BI["Looker Studio"] TLC -->|Airflow| GCS GCS --> EXT EXT -->|CTAS| OPT OPT -->|dbt| MODELS MODELS --> BI

Four layers, each doing exactly one thing:

Layer Tool Job
IaC Terraform Make GCS / BigQuery / VM / SA / firewall reproducible
Orchestration Airflow (single VM) Monthly backfill, retries, cleanup
Warehouse BigQuery Turn parquet into a queryable, partitioned, clustered fact
Modeling dbt core Translate raw facts into business-readable mart tables

Why not the “more modern” options

The most common question is not “why these tools” but “why not Composer / Dataflow / Kafka”. An honest answer:

  • Composer: managed Airflow starts at roughly 10× the cost of one e2-standard-2 — not worth it for a solo project.
  • Dataflow / Pub/Sub: the upstream TLC data is monthly parquet files. “Stream” is a fake requirement here.
  • Snowflake / Redshift: BigQuery’s on-demand pricing and autoscaling are friendlier for low-frequency workloads.

The decision rule is one sentence: one person should be able to rebuild the whole thing within 30 minutes after something breaks.

Repo tour

What the rest of this series drills into

The next 16 posts go layer by layer: Terraform safety rails → DAG design → BigQuery partition/cluster → dbt modeling → testing → war stories → business insight. Each one focuses on a single concern and skips the architecture overview.