BigQuery’s two big performance levers are partition and cluster. The docs are thorough. The question that doesn’t get answered cleanly in practice is:
“I have ten columns that might show up in
WHERE. Which do I pick?”
There’s no silver bullet, but there’s a routine. Here it is on this project’s real data.
Step 0: characterize the data
NYC yellow taxi data is ~3 M rows/month. Three columns dominate WHERE clauses:
tpep_pickup_datetime— time, filtered in nearly every report (by day or hour)PULocationID— pickup zone, 100×+ skew between hot zonespayment_type— low cardinality (5 values)
Step 1: pick the partition
Partitioning physically splits the table. Partition pruning is the cheapest filter BigQuery has.
Rules of thumb:
| Property | Good partition? |
|---|---|
| In nearly every query’s filter | ✓ |
| Moderate cardinality (< 4000)* | ✓ |
| Even distribution | ✓ |
| Value is known at write time | ✓ |
tpep_pickup_datetime hits all four → daily partition:
1 | PARTITION BY DATE(tpep_pickup_datetime) |
Note: the partition column must be truncated to a date (DATE(...)) — partitioning by raw TIMESTAMP would blow past the 4000-partition limit.
*BigQuery caps a single table at 4000 partitions: ~11 years at daily granularity, only ~5 months at hourly.
Step 2: pick the cluster
Clustering physically sorts within each partition. It doesn’t prune partitions, but it can skip large segments of an “I’m filtering on this column” scan.
Rules of thumb:
- Pick high-cardinality columns (clustering on low cardinality buys almost nothing)
- Pick columns that show up in
WHERE/JOINoften - Up to 4 columns, order-sensitive (only prefix matches are pruned)
This project clusters on PULocationID:
1 | CLUSTER BY PULocationID |
Reasoning:
- Cardinality ~265 — high enough
- Almost every location-based mart filters or groups on it
payment_typecardinality is 5 — clustering on it earns roughly nothing
Step 3: in mart layers, use a different cluster key
Note that fact_trips is CLUSTER BY pickup_location_id, but mart_hourly_trips flips to:
1 | {{ config( |
Why? The query patterns differ:
- fact_trips: detail queries — “what happened in zone X on day Y”
- mart_hourly_trips: BI reports — “what does the 24-hour curve look like”
The same underlying data can — and should — have different cluster keys at different layers. A side benefit of multi-layer dbt modeling.
Approximate savings
Take a typical query:
1 | SELECT SUM(total_amount) |
| Table design | Bytes scanned | Relative cost |
|---|---|---|
| Unpartitioned, unclustered | ~2 GB (full table) | 100% |
| Daily partition only | ~50 MB (7 days) | 2.5% |
| Partition + cluster on PULocationID | ~3 MB | 0.15% |
The cluster gain shows up most when the filter value is concentrated on a high-cardinality column.
Three counter-intuitive takeaways
- More partitions is not better. Hourly partitions are almost always a trap; daily is enough.
- More cluster columns is not better. Only the first one prunes aggressively; later ones decay fast.
- Same data, multiple cluster variants is fine. Maintain a fact-table cluster and a mart-table cluster independently.
Files: nyc_taxi_pipeline/airflow/nyc_taxi_pipeline.py, nyc_taxi_pipeline/dbt/models/fact_trips.sql