BigQuery offers two ways to turn parquet into a queryable table:
bq load/LoadJob— copy parquet into BigQuery’s managed storage in one shot- External table + CTAS — register a virtual table over GCS, then
CREATE TABLE AS SELECTinto a real one
This project takes path 2, via two DAG tasks:
1 | create_external_table_task = BigQueryCreateExternalTableOperator( |
Where CREATE_OPTIMIZED_SQL is:
1 | CREATE OR REPLACE TABLE `{PROJECT_ID}.{BIGQUERY_DATASET}.yellow_tripdata` |
Why bother, when LoadJob is one step?
Three reasons
1. Schema declared once, in code
1 | EXTERNAL_TABLE_SCHEMA = [ |
The external table schema is part of the DAG file, in Git. The day upstream silently adds a column or changes a type, external-table creation fails — far safer than LoadJob‘s schema auto-detection (which loves to flip column order or coerce empty columns to STRING).
2. Partition/cluster live in SQL, not in job config
1 | PARTITION BY DATE(tpep_pickup_datetime) |
With CTAS, partition expressions, cluster keys, and derived columns (DATE(...), LOWER(...)) can be computed inside SELECT. LoadJob‘s timePartitioning only accepts an existing column.
Real example: if the upstream timestamp arrives as a string, one PARSE_TIMESTAMP(...) in CTAS handles it. With LoadJob, you’d load first and UPDATE after.
3. “Re-run idempotently” is cheap
CREATE OR REPLACE TABLE ... AS SELECT ... is a single statement: all-or-nothing, no half-loaded state.
LoadJob with WRITE_TRUNCATE truncates first and then loads — fail mid-way on a big file and you’ve lost the previous good copy. With the two-step approach:
- External table creation fails → physical table untouched
- CTAS fails → physical table still holds the previous good version
When LoadJob is actually better
- Truly huge data (hundreds of TB single batch) — save the full-table SELECT scan
- No partition/cluster needed
- Schema is rock-stable
For this project’s monthly, GB-scale data, the clarity of the two-step approach beats the marginal slot savings.
A side benefit people forget
External tables are free to keep around (no storage charge — you pay only when queried). That means:
- During development, you can build just the external table and validate schema/data before writing the CTAS
- When something breaks, you can query the external table in isolation to localize the fault — is it bad source data, or bad CTAS logic?
That “layered debuggability” is not something LoadJob can give you.