dbt’s four built-in generic tests — unique, not_null, accepted_values, relationships — look unremarkable. Each one maps to a specific real-world incident. This project’s schema.yml is short, but every entry has earned its place.
1. unique — catches “duplicate primary key”
1 | - name: fact_trips |
trip_id is a SHA256 over business fields:
1 | TO_HEX(SHA256(CONCAT( |
How duplicates appear: upstream accidentally posts two copies of the same month (human error). After incremental merge, two identical rows hash to the same trip_id → unique fails loudly.
Cost without this test: mart COUNT(*) doubles for that day; an analyst spends half a day explaining a “demand spike” that doesn’t exist.
2. not_null — catches “upstream silently changed”
1 | - name: yellow_tripdata |
Critically important on external sources. NYC TLC occasionally publishes a “revised” parquet where a column that was effectively non-null becomes nullable. They consider it backward-compatible — it’s a disaster for downstream joins.
Real catch: PULocationID came back NULL for some trips after TLC started including “street hail without GPS upload” records. Without the test, those rows produce NULL after joining dim_zones, then collapse into a phantom group in mart_location_trips.
3. accepted_values — catches “business semantics drift”
Not in schema.yml yet, but strongly recommended:
1 | - name: payment_type |
NYC TLC has expanded payment_type twice historically (added “voided trip”, “unknown”). New values don’t break any SQL — they just leak a “NULL/Other” bucket into every dashboard sliced by payment type. Analysts find it eventually, but only after the report is wrong.
With accepted_values, a new value → CI fails → you notice → you decide what to do with it. Active instead of reactive.
4. relationships — catches “missing dimension row”
1 | - name: fact_trips |
Every pickup_location_id in the fact should resolve in dim_zones. If taxi_zone_lookup.csv falls behind (TLC adds a new zone), the fact has orphan IDs.
Because mart_location_trips uses LEFT JOIN, orphans show up as borough = NULL — no error, but the dashboard sprouts a mysterious “unknown borough” group.
relationships flags this right after dbt run, instead of two days later when an analyst spots it on a chart.
A priority order for adoption
If you can’t add all of these at once, do them in this order:
unique + not_nullon every primary key — catches duplicationnot_nullon critical source fields — catches upstream driftrelationshipsbetween fact and dim — catches dim gapsaccepted_valueson low-cardinality columns — catches semantic drift
Four boxes ticked removes ~80% of the “weird data” tickets. The remaining 20% need custom generic tests — separate post.
One anti-pattern
Don’t put not_null on mart tables. Marts are aggregates; nulls are legal there (a zone with zero trips today may not appear at all, or may appear with AVG = NULL). not_null belongs at the source of facts, not the aggregate.