There’s a category of table that’s awkward to manage:

  • It’s real business data, not configuration
  • But it changes slowly — once a year, sometimes not at all
  • A dedicated ETL is overkill; hand-maintenance is forget-prone

NYC TLC’s taxi_zone_lookup.csv (265 rows mapping LocationID to borough/zone) is the canonical example. This project manages it as a dbt seed, and it’s been smoother than expected.

Where it lives

1
2
3
4
nyc_taxi_pipeline/dbt/
├── dbt_project.yml
└── seeds/
└── taxi_zone_lookup.csv

A single dbt seed uploads it to BigQuery:

1
2
dbt seed
# creates nyc_taxi_data.taxi_zone_lookup

Downstream models reference it just like any other model:

1
2
3
-- dim_zones.sql
LEFT JOIN {{ ref('taxi_zone_lookup') }} l
ON z.location_id = l.LocationID

Why this beats the alternatives

Approach Problem
Hand-create the table in BigQuery No one knows where it came from or what was edited
Put CSV in GCS + external table Extra layer, case-sensitivity surprises in CSV
Hardcode it in Python Need a release for every change
dbt seed Version-controlled, diff-able, one command to sync

The biggest win is the file is in Git. The hidden hazard with semi-static data isn’t that it changes — it’s that changes go unnoticed. Git is the cheapest audit log.

Good fit

  • ~1 to ~10,000 rows
  • Sub-monthly change rate
  • Source is hand-maintained or periodically downloaded
  • Examples: country/region mappings, FX tables, product taxonomy, tax rates

Bad fit

  • More than tens of thousands of rows (slow seed upload, Git diffs become meaningless)
  • High-frequency changes (write a real DAG)
  • Contains PII or credentials (don’t put it in Git)

Two habits worth pairing with seeds

Habit 1: declare column types in dbt_project.yml

CSV is untyped. Auto-detect happily turns LocationID into INT64 and silently strips leading zeros from codes:

1
2
3
4
5
6
7
8
seeds:
nyc_taxi_pipeline:
taxi_zone_lookup:
+column_types:
LocationID: int64
Borough: string
Zone: string
service_zone: string

Habit 2: test the seed in schema.yml

It’s the reference table — every downstream join depends on a unique key:

1
2
3
4
5
seeds:
- name: taxi_zone_lookup
columns:
- name: LocationID
tests: [unique, not_null]

A real reverse-validation story

A previous project hand-created this kind of table in BigQuery. An analyst once UPDATE‘d a few rows during testing and forgot to revert. Three weeks later someone noticed a borough name was wrong on a dashboard — and nobody could say what the “correct” version had been.

After moving to dbt seed, git log seeds/taxi_zone_lookup.csv is the version history and dbt seed --full-refresh is the rollback.

Files: nyc_taxi_pipeline/dbt/seeds/taxi_zone_lookup.csv, nyc_taxi_pipeline/dbt/models/dim_zones.sql