The repo is named CabStream-ETL — and yet there’s no Kafka, no Pub/Sub, no Dataflow, not even managed Airflow (Composer). Everyone who sees the repo asks why.
Here’s the honest selection process.
Start with the upstream shape
NYC TLC publishes yellow taxi data as:
- Monthly parquet files
- Released about 1 to 2 months after the month ends
- Stable URL pattern (yellow_tripdata_2023-01.parquet)
- 50 to 200 MB per file
So the upstream itself is discrete, low-frequency, batch.
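To make "discrete, low-frequency, batch" concrete, here is a minimal sketch of what ingestion planning looks like when the upstream is one file per month. Only the file name pattern comes from the list above. The `BASE_URL` value, the helper names, and the default lag of three calendar months (a conservative reading of "about 1 to 2 months after the month ends") are assumptions for illustration.

```python
from datetime import date

# Assumption: placeholder host, not the real TLC download location.
BASE_URL = "https://example.com/trip-data"


def yellow_file_name(year: int, month: int) -> str:
    """Build the monthly file name, e.g. yellow_tripdata_2023-01.parquet."""
    return f"yellow_tripdata_{year:04d}-{month:02d}.parquet"


def months_back(today: date, lag_months: int) -> tuple[int, int]:
    """Return the (year, month) that lies lag_months before today's month."""
    index = today.year * 12 + (today.month - 1) - lag_months
    return index // 12, index % 12 + 1


def latest_expected_url(today: date, lag_months: int = 3) -> str:
    """URL of the newest month that has probably been published.

    Assumption: a lag of 3 calendar months is a conservative guess;
    a real job should still handle a file that is not there yet.
    """
    year, month = months_back(today, lag_months)
    return f"{BASE_URL}/{yellow_file_name(year, month)}"


if __name__ == "__main__":
    print(latest_expected_url(date(2023, 4, 10)))
```

The whole "schedule" is a date calculation and one download per month, which is why a message broker has nothing to do here.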
“Streaming” requires the upstream to actually produce a stream of events. If the upstream publishes one file a month, putting Kafka in front of it is just inventing extra work.
Killing the streaming options one by one
Kafka / Pub/Sub
Fits: high-frequency event streams (clickstream, IoT, order events).
This project:
- Upstream isn’t events — it’s parquet files. To use Kafka I’d have to write a “split files into events” producer. Pointless.
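- If I ever did need low-latency ingestion, BigQuery already supports streaming inserts directly, with no broker in between. And here there is no latency requirement to optimize for in the first place.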
Verdict: not introducing.
Dataflow
Fits: very large ETL (hundreds of TB), complex windowed aggregation, true stream processing.
This project:
- ~3 M rows/month. BigQuery CTAS finishes in seconds.
- No windowed aggregation (aggregation lives in dbt).
- Dataflow’s baseline worker count and cost don’t make sense at this size.
Verdict: not introducing.
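A quick back-of-the-envelope calculation shows how small this workload is. The ~3 M rows/month figure is from the list above. The 30-day month is a simplifying assumption.

```python
ROWS_PER_MONTH = 3_000_000           # figure quoted above
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # assumption: 30-day month = 2,592,000 s


def average_rows_per_second(rows_per_month: int,
                            seconds: int = SECONDS_PER_MONTH) -> float:
    """Average arrival rate if the monthly file were spread out as a stream."""
    return rows_per_month / seconds


if __name__ == "__main__":
    rate = average_rows_per_second(ROWS_PER_MONTH)
    print(f"{rate:.2f} rows/second on average")  # prints "1.16 rows/second on average"
```

Even if the monthly file were replayed as a continuous stream, it would average roughly one row per second, far below the scale where a distributed stream processor earns its keep.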
Composer (managed Airflow)
Fits: multi-person teams, strict SLAs, no appetite for running the scheduler.
This project:
- One person.
- No external SLA (a one-day dashboard delay is invisible).
- Composer baseline ~$300/month. Self-hosted e2-standard-2 is ~$30/month.
Verdict: not introducing.
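The cost gap is easy to quantify. The two monthly figures below are the rough estimates quoted above, not current list prices, and the function name is just for illustration.

```python
COMPOSER_MONTHLY_USD = 300     # rough baseline quoted above
SELF_HOSTED_MONTHLY_USD = 30   # rough e2-standard-2 estimate quoted above


def cost_gap(managed: float, self_hosted: float) -> dict:
    """Compare two monthly costs: absolute savings and the ratio between them."""
    monthly_savings = managed - self_hosted
    return {
        "monthly_savings_usd": monthly_savings,
        "annual_savings_usd": monthly_savings * 12,
        "cost_ratio": managed / self_hosted,
    }


if __name__ == "__main__":
    print(cost_gap(COMPOSER_MONTHLY_USD, SELF_HOSTED_MONTHLY_USD))
```

With these numbers the managed option costs about ten times as much, or roughly $3,240 more per year, for a scheduler that one person can run on a single VM.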
So why is the repo still called “CabStream”
Honest answer: I did start out wanting to make it streaming. The plan was Pub/Sub for synthetic events, Dataflow for windowed aggregation, BigQuery streaming insert.
Halfway into week one it became obvious that:
- 70% of my time would be spent manufacturing fake events rather than analyzing real business questions
- Every business query I cared about was offline analysis (by day, by zone) — second-level latency was irrelevant
- For the same (in fact lower) cost, the batch design let me build two more dashboards
After cutting the streaming layer, the overall complexity halved. I kept the name as a reminder — “looks cool” and “is actually needed” are different things.
A “do you really need streaming?” checklist
| Signal | Streaming? |
|---|---|
| Upstream is event-sourced (clicks, sensors, orders) | ✓ |
| Business needs second-to-minute latency | ✓ |
| Windowed / sessionized aggregation required | ✓ |
| Missing an event is unrecoverable (online features, risk control) | ✓ |
| Upstream is scheduled files | ✗ |
| Use case is BI dashboards | ✗ |
| “The boss says we should do streaming” | Not a signal |
You need four checks before considering streaming. Zero checks and you’re trading engineering complexity for a buzzword on a slide.
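The checklist can be written down as a small function. This is a sketch of the rule as stated: four checks means streaming is worth considering, zero means batch. The class and field names are made up for illustration, and since the checklist does not say what one to three checks mean, the sketch reports that case as undecided rather than guessing.

```python
from dataclasses import dataclass, astuple


@dataclass(frozen=True)
class Signals:
    """The four positive signals from the table (field names are hypothetical)."""
    event_sourced_upstream: bool
    needs_second_to_minute_latency: bool
    needs_windowed_aggregation: bool
    missed_event_unrecoverable: bool


def count_checks(signals: Signals) -> int:
    """Number of positive signals that hold."""
    return sum(astuple(signals))


def verdict(signals: Signals) -> str:
    checks = count_checks(signals)
    if checks == 4:
        return "consider streaming"
    if checks == 0:
        return "batch"
    # Assumption: the checklist is silent on partial matches.
    return "undecided"


if __name__ == "__main__":
    # This project: monthly files, daily dashboards, aggregation in dbt.
    this_project = Signals(False, False, False, False)
    print(verdict(this_project))  # prints "batch"
```

Signals such as "upstream is scheduled files" or "the boss says so" are deliberately not inputs: in the table they either count against streaming or do not count at all.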
So where is this project actually “modern”
- All infra in IaC (Terraform)
- Version-controlled SQL modeling (dbt)
- CI running tests + validate + parse
- Single service account (SA), least privilege, IP-restricted firewall
Those are the parts of “modern data stack” that survive over time — and they matter much more than streaming-vs-batch.
One-liner
Selection should run from “shape of the problem” → “tool”, not “tool I want to use” → “problem”. Not using Kafka isn’t embarrassing. Forcing it in is.