SQL Stories · Ecom Data Lake Extension
Purpose: Hydrate the SQL Stories ecosystem with production-style data lake drops. The
ecom-datalake-exten repository wraps the ecom_sales_data_generator engine with a CLI that
converts synthetic CSV output into partitioned Parquet, enriches files with lineage metadata, and publishes them to cloud storage.
Stack: Python · Click CLI · Parquet · Google Cloud Storage · Pytest · CI/CD
How It Extends SQL Stories
- Lake hydration layer: ships generator output into Hive-partitioned Parquet with ingest-date folders.
- Lineage baked in: every file carries event_id, batch_id, ingestion_ts, and source_file metadata (see the lineage check sketch at the end of this section).
- Shipping automation: CLI commands stage manifests, _SUCCESS markers, and checksum audits for downstream pipelines.
- Scenario ready: quick-run YAML configs produce historical backlogs that power story briefs, retention studies, and returns analysis.
This repo turns the synthetic generator into a lightweight data platform. Pair it with the Skills Builder notebooks or Portfolio Demo dashboards to show how ingestion, analysis, and storytelling connect.
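To make the lineage guarantee concrete, here is a minimal sketch (not part of the CLI) that scans an exported run and flags any Parquet file missing the four lineage columns. It assumes the export layout lives under output/raw as produced by ecomlake export-raw; the path is an assumption, the column names mirror the list above.

import pyarrow.parquet as pq
from pathlib import Path

LINEAGE_COLUMNS = {"event_id", "batch_id", "ingestion_ts", "source_file"}

# Walk every Parquet file under the export root (layout assumed from export-raw).
for parquet_file in Path("output/raw").rglob("*.parquet"):
    schema = pq.read_schema(parquet_file)          # reads the footer only, no data scan
    missing = LINEAGE_COLUMNS - set(schema.names)  # lineage columns the file lacks
    if missing:
        print(f"{parquet_file}: missing lineage columns {sorted(missing)}")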
Core CLI Workflow
1. Generate CSV Artifacts
ecomlake run-generator \
--config gen_config/ecom_sales_gen_quick.yaml \
--artifact-root artifacts \
--messiness-level medium \
--generator-src ../ecom_sales_data_generator/src
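The run lands a timestamped folder of CSVs under the artifact root (for example artifacts/raw_run_20251019T173945Z, referenced in the next step). A quick sanity check before exporting might look like the sketch below; the raw_run_* glob pattern is an assumption based on that folder name.

import pandas as pd
from pathlib import Path

# Pick the most recent raw run folder (pattern assumed from the example path in step 2).
latest_run = sorted(Path("artifacts").glob("raw_run_*"))[-1]

# Row-count each CSV the generator produced so empty or truncated files stand out.
for csv_path in sorted(latest_run.rglob("*.csv")):
    rows = len(pd.read_csv(csv_path))
    print(f"{csv_path.name}: {rows} rows")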
2. Export to Partitioned Parquet
ecomlake export-raw \
--source artifacts/raw_run_20251019T173945Z \
--target output/raw \
--ingest-date 2024-02-15 \
--target-size-mb 10
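Downstream consumers can then read the result as a Hive-partitioned dataset. A minimal sketch, assuming a single table rooted at output/raw and a partition column named ingest_date (inferred from the --ingest-date flag; check the actual folder names, and point the path at an individual table folder if the export writes one folder per table):

import pandas as pd

# Read only the 2024-02-15 partition; pyarrow prunes the other ingest-date folders.
df = pd.read_parquet(
    "output/raw",
    engine="pyarrow",
    filters=[("ingest_date", "=", "2024-02-15")],  # partition column name is an assumption
)
print(df.shape)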
3. Publish to Your Lake
ecomlake upload-raw \
--source output/raw \
--bucket gs://your-project-raw \
--prefix ecom/raw \
--ingest-date 2024-02-15
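After publishing, a short verification pass can confirm the drop is complete, for example by listing what landed under the prefix and checking for the _SUCCESS marker the CLI stages. This sketch uses the google-cloud-storage client directly; the bucket name drops the gs:// scheme, and the exact object layout under the prefix is an assumption based on the ingest date used above.

from google.cloud import storage

client = storage.Client()  # uses application-default credentials

BUCKET = "your-project-raw"                  # bucket name without the gs:// scheme
PREFIX = "ecom/raw/ingest_date=2024-02-15/"  # assumed object layout under the upload prefix

# List what actually landed under the partition prefix.
blobs = list(client.list_blobs(BUCKET, prefix=PREFIX))
print(f"{len(blobs)} objects under {PREFIX}")

# Confirm the _SUCCESS marker exists before pointing downstream jobs at the partition.
marker = client.bucket(BUCKET).blob(PREFIX + "_SUCCESS")
print("_SUCCESS present:", marker.exists())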
Why It Matters for Stakeholders
- Ops & Engineering: mirrors production ingestion patterns with dry-run support, credential management, and post-export hooks.
- Analytics: guarantees consistent table layouts for dashboards, notebooks, and downstream SQL views.
- Storycrafting: spin up multi-day contexts (promotions, return spikes, churn events) that feed SQL Stories briefs.