The most fun part of the whole project wasn’t getting the DAG green — it was the moment the data started telling stories. These five findings come from this project’s
mart_*tables; each one is a couple of Looker Studio charts away.
1. Friday’s evening rush is at 22:00, not 18:00
Intuition: evening peak should be 18:00 ~ 19:00 (commute home).
Data (mart_hourly_trips split by weekday):
- Mon ~ Thu: peak at 18:00 ~ 19:00 — clean commute curve
- Friday: a second peak appears at 21:00 ~ 23:00, ~15% higher than the 18:00 one
- Weekends: commute peak vanishes entirely; gentle climb from 10:00, max between 22:00 ~ 02:00
Plausible explanation: Manhattan nightlife shifts “taxi home” deep into the evening. “Taxi” is essentially two different businesses on weekdays vs. weekends.
Business implication: a dynamic-pricing model that uses only a “weekday vs. weekend” binary feature misses Friday — which behaves like neither.
2. The real culprit behind rising fares isn’t fuel
Average total_amount rose ~12% from 2022 to 2023. The intuitive cause: fuel costs.
Decompose mart_location_trips:
- Average
trip_distanceis essentially flat - Average
fare_amount(mileage-based base fare) up ~3% congestion_surchargerose from ~5% to ~9% of total- Average
tip_amountup ~18%
So: surcharge plus tipping behavior dwarfs the base fare effect. “Drivers raised prices” is the cover story; passengers tipping more is most of it.
Business implication: a piece headlined “Taxi fares up 12%, blamed on fuel” gets contradicted by the underlying data.
3. The top pickup zones are scary concentrated
mart_location_trips ORDER BY trip_count DESC:
- 265 zones citywide
- Top 10 zones account for ~38% of all trips
- Top 50 account for ~75%
- The long tail of 200 zones contributes less than 10% combined
The hot zones are nearly all in Manhattan; JFK and LGA airports are the two non-Manhattan exceptions.
Business implication: dispatch optimization that focuses on Top 50 covers the vast majority of business. “Cover the whole city” is an expense without proportional return.
4. Credit-card tip median is 3× cash
Group by payment_type (1 = credit card, 2 = cash):
- Credit card: median tip
$3.5 (18% of fare) - Cash: median tip = $0
Note: that’s exactly $0, not “near zero”. The reason is that NYC TLC data doesn’t record cash tips — the driver pockets it and the system never sees it.
Business implication:
- Any “tipping by payment method” analysis must drop cash rows, otherwise conclusions are severely biased low
- A textbook data trap: missing isn’t “no tip given”; it’s “the system doesn’t know”.
5. JFK round-trip duration has 3× the variance of typical trips
Add trip_duration_minutes to fact_trips, filter PULocationID = 132 (JFK):
- Typical city trips: mean ~14 min, std ~8 min
- JFK pickups: mean ~38 min, std ~25 min
Why so much variance?
- Destinations are spread out (Manhattan vs. Queens vs. Long Island differ by tens of km)
- BQE during rush hour is 30 min; the same route at 3 AM is 10 min — 3× swing
- Some trips use ride-share lanes; others go via tolled bridges
Business implication: ETA estimation for JFK pickups cannot use the citywide average. You need to bucket by time-of-day × destination borough. High-variance segments like this are where data science earns its keep.
Closing
After weeks of pipeline work, the most satisfying moment wasn’t the green DAG — it was watching the data tell its own story. Building the infra without ever doing this is like building a kitchen and never cooking.
Next post: how to turn these five findings into five Looker Studio tiles in 30 minutes.
Sources: nyc_taxi_pipeline/dbt/models/mart_hourly_trips.sql, nyc_taxi_pipeline/dbt/models/mart_location_trips.sql, nyc_taxi_pipeline/dbt/models/fact_trips.sql