A dbt project starts to suffer at scale: the models/ folder has 100+ .sql files, newcomers don’t know where to read, and old hands don’t know which downstream they’ll break.
Three-layer modeling isn’t a rule — it’s a tool for fighting that entropy. This project only has 6 models, but the shape is already useful.
What each layer does
flowchart LR
SRC["source: BigQuery raw"]
FACT["fact_trips
clean + derive + incremental"]
DAILY["mart_daily_trips
temporal aggregation"]
HOURLY["mart_hourly_trips
hour-of-day aggregation"]
LOC["mart_location_trips
geo aggregation"]
DDT["dim_datetime"]
DZ["dim_zones"]
SRC --> FACT
FACT --> DAILY
FACT --> HOURLY
FACT --> LOC
DZ --> LOC
SRC -.-> DDT
SRC -.-> DZ
| Layer | Who reads it | Change frequency | Key constraint |
|---|---|---|---|
| source/staging | Only downstream models | When upstream changes | 1:1 mapping, minimal logic |
| dim / fact | Marts + analysts | When business rules change | Unique keys, incremental strategy |
| mart | BI tools, APIs | When report needs change | Denormalized, wide, tuned to query pattern |
fact_trips: the single source of truth
1 | {{ config( |
Three decisions worth calling out:
trip_idfrom a deterministic hash. Upstream has no natural primary key but does have “this combination of fields is unique”. SHA256 +unique_keymakes merge re-runs idempotent.- Derived columns materialized in the fact.
pickup_date,pickup_hour,is_weekend,trip_duration_minutesare used by every mart — compute them once instead of re-EXTRACT-ing in three places. - Data-quality filters live here, not in marts.
trip_distance > 0 AND total_amount > 0is the definition of “what counts as a valid trip”. Define it once.
mart tables: wide, clustered to the query, no joins
1 | -- mart_hourly_trips.sql |
One small thing: there is no JOIN dim_datetime in this mart. The fact already carries pickup_hour and is_weekend.
A mart should be the shape a BI tool can use with a single
SELECT *.
The only mart that does join a dim is mart_location_trips (to attach borough/zone labels). That kind of “dim provides labels” join is the one worth doing.
What this layering actually buys you
- Marts are leaves — adding or changing one does not propagate
- Changing the fact has a known blast radius — dbt graphs the affected marts
- Data bugs localize fast — mart wrong → check fact → check source. Three layers, three checkpoints.
- Materialization strategies can differ per layer — fact is incremental + merge, marts are full-rebuild tables (small data, simple semantics)
When it’s overkill
- Fewer than ~5 models total
- One analyst, no downstream consumers
dbt runfinishes in seconds anyway
Any of those, and three layers is over-engineering — flat is fine. This project sits right at the threshold: six models, but the clarity gain is already visible.
Files: nyc_taxi_pipeline/dbt/models/fact_trips.sql, nyc_taxi_pipeline/dbt/models/mart_hourly_trips.sql