Architecture & Ops

Medallion lakehouse pipeline from Bronze ingestion through Gold marts, with layered validation and atomic publishing.

Bronze immutable source Silver contracts + quarantine Gold BI marts Validation 3-layer gates

High-Level Data Flow

Bronze

Partitioned Parquet + _MANIFEST.json lineage

Silver

dbt-duckdb Base Silver + Polars Enriched

Gold

BigQuery marts + partition idempotency

Mermaid diagram
flowchart TD subgraph Bronze["Bronze Layer (GCS)"] B1[Partitioned Parquet] B2[_MANIFEST.json] end subgraph Validation["Three-Layer Validation"] V1[Bronze Quality] V2[Silver Quality] V3[Enriched Quality] end subgraph Silver["Silver Layer (Local/GCS)"] S1[Base Silver dbt-duckdb] S2[Dimension Snapshots customers, products] S3[Enriched Silver Polars transforms] end subgraph Gold["Gold Layer (BigQuery)"] G1[Fact Tables] G2[Dimension Tables] end Bronze -->|Validate| V1 V1 -->|Pass| S1 S1 -->|Generate| S2 S1 -->|Validate| V2 V2 -->|Pass| S3 S3 -->|Validate| V3 V3 -->|Pass| Gold S1 -.->|Quarantine| Q[Quarantine Tables]

Detailed Pipeline Flow

Mermaid diagram
flowchart LR subgraph Input["Bronze Input"] B1[Raw Parquet partitioned by ingest_dt] B2[Lineage Metadata _MANIFEST.json] end subgraph Dims["Dimension Snapshots"] D1{Dims Fresh?} D2[Snapshot Customers] D3[Snapshot Products] D4[_latest.json pointer] end subgraph BaseSilver["Base Silver dbt-duckdb"] BS1[Type Casting & Cleaning] BS2[Deduplication] BS3[PK/FK Validation] BS4[Quarantine Rejects] end subgraph EnrichedSilver["Enriched Silver Polars"] ES1[Cart Attribution] ES2[Customer LTV] ES3[Product Performance] ES4[Daily Metrics] ES5[10 transforms total] end subgraph GCS["GCS Publish"] GCS1[Staging Prefix _staging/run_id/] GCS2[Atomic Publish] GCS3[Manifest + Pointer] end subgraph BQ["BigQuery Load"] BQ1[Parquet Import WRITE_TRUNCATE] BQ2[Partition-level Idempotency] end subgraph GoldMarts["Gold Marts dbt-bigquery"] GM1[Fact: Orders] GM2[Fact: Customers] GM3[8 fact tables] end Input --> Dims D1 -->|Stale| D2 & D3 D2 & D3 --> D4 D1 -->|Fresh| BaseSilver D4 --> BaseSilver BaseSilver --> ES1 & ES2 & ES3 & ES4 & ES5 ES5 --> GCS1 GCS1 --> GCS2 GCS2 --> GCS3 GCS3 --> BQ1 BQ1 --> BQ2 BQ2 --> GM1 & GM2 & GM3 BS4 -.->|Rejected Rows| QT[Quarantine Tables]

Three-Layer Validation Framework

Bronze Validation

Manifest exists, row counts, schema conformance, coverage

Silver Validation

PK/FK integrity, quarantine rate, row loss thresholds

Enriched Validation

Business rules, min rows, null analysis

Airflow DAG Architecture

Dim Refresh Pipeline

Freshness gates, parallel snapshots, lightweight validation.

Silver to Gold Pipeline

Full Bronze → Gold with validation stages and optional BigQuery load.

Airflow DAG orchestration overview
Core pipeline DAG with Bronze → Silver → Gold dependencies.
Airflow task graph with quality gates
Quality gates embedded in task graph for early failure detection.

Partition-level recovery

Backfill single partitions without replaying entire tables.

Data quality enforcement

Fail fast on contract violations before downstream publish.

Layered orchestration

Independent Bronze, Base Silver, Enriched Silver, Gold schedules.

Spec-Driven Orchestration

Specs define tables, partitions, PK/FK rules, and transform dependencies for dynamic DAG generation.

base_silver:
  tables:
    - name: orders
      partition_key: ingestion_dt
      primary_key: [order_id]
      foreign_keys:
        - column: customer_id
          references: customers.customer_id

Observability & Monitoring

Audit Trail

Per-table JSON audits for SLA dashboards and alerts.

Validation Reports

Bronze, Silver, Enriched, and Dims snapshot reports.

Structured Logging

Local logs + Airflow task logs with enriched context.

Performance Characteristics

Base Silver

<2GB memory, under 2 minutes for 8 tables.

Enriched Silver

<6GB memory, under 5 minutes for 10 transforms.

Gold Marts

Warehouse execution under 3 minutes for 8 facts.