SQL Stories · Ecom Data Lake Extension
Purpose: Hydrate the SQL Stories ecosystem with production-style data lake drops. The
ecom-datalake-exten repository wraps the ecom_sales_data_generator engine with a CLI that
converts synthetic CSV output into partitioned Parquet, enriches files with lineage metadata, and publishes them to cloud storage.
Stack: Python · Click CLI · Parquet · Google Cloud Storage · Pytest · CI/CD
How It Extends SQL Stories
- Lake hydration layer: ships generator output into Hive-partitioned Parquet with ingest-date folders.
- Lineage baked in: every file carries event_id, batch_id, ingestion_ts, and source_file metadata (see the lineage check sketch at the end of this section).
- Shipping automation: CLI commands stage manifests, _SUCCESS markers, and checksum audits for downstream pipelines.
- Scenario ready: quick-run YAML configs produce historical backlogs that power story briefs, retention studies, and returns analysis.
This repo turns the synthetic generator into a lightweight data platform. Pair it with the Skills Builder notebooks or Portfolio Demo dashboards to show how ingestion, analysis, and storytelling connect.
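To make the lineage guarantee concrete, here is a minimal sketch (not part of the CLI) that scans an exported run and flags any Parquet file missing the four lineage columns. It assumes the export layout lives under output/raw as produced by ecomlake export-raw; the path is an assumption, the column names mirror the list above.

import pyarrow.parquet as pq
from pathlib import Path

LINEAGE_COLUMNS = {"event_id", "batch_id", "ingestion_ts", "source_file"}

# Walk every Parquet file under the export root (layout assumed from export-raw).
for parquet_file in Path("output/raw").rglob("*.parquet"):
    schema = pq.read_schema(parquet_file)          # reads the footer only, no data scan
    missing = LINEAGE_COLUMNS - set(schema.names)  # lineage columns the file lacks
    if missing:
        print(f"{parquet_file}: missing lineage columns {sorted(missing)}")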
Core CLI Workflow
1. Generate CSV Artifacts
ecomlake run-generator \
--config gen_config/ecom_sales_gen_quick.yaml \
--artifact-root artifacts \
--messiness-level medium \
--generator-src ../ecom_sales_data_generator/src
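The run lands a timestamped folder of CSVs under the artifact root (for example artifacts/raw_run_20251019T173945Z, referenced in the next step). A quick sanity check before exporting might look like the sketch below; the raw_run_* glob pattern is an assumption based on that folder name.

import pandas as pd
from pathlib import Path

# Pick the most recent raw run folder (pattern assumed from the example path in step 2).
latest_run = sorted(Path("artifacts").glob("raw_run_*"))[-1]

# Row-count each CSV the generator produced so empty or truncated files stand out.
for csv_path in sorted(latest_run.rglob("*.csv")):
    rows = len(pd.read_csv(csv_path))
    print(f"{csv_path.name}: {rows} rows")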
2. Export to Partitioned Parquet
ecomlake export-raw \
--source artifacts/raw_run_20251019T173945Z \
--target output/raw \
--ingest-date 2024-02-15 \
--target-size-mb 10
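Downstream consumers can then read the result as a Hive-partitioned dataset. A minimal sketch, assuming a single table rooted at output/raw and a partition column named ingest_date (inferred from the --ingest-date flag; check the actual folder names, and point the path at an individual table folder if the export writes one folder per table):

import pandas as pd

# Read only the 2024-02-15 partition; pyarrow prunes the other ingest-date folders.
df = pd.read_parquet(
    "output/raw",
    engine="pyarrow",
    filters=[("ingest_date", "=", "2024-02-15")],  # partition column name is an assumption
)
print(df.shape)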
3. Publish to Your Lake
ecomlake upload-raw \
--source output/raw \
--bucket gs://your-project-raw \
--prefix ecom/raw \
--ingest-date 2024-02-15
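After publishing, a short verification pass can confirm the drop is complete, for example by listing what landed under the prefix and checking for the _SUCCESS marker the CLI stages. This sketch uses the google-cloud-storage client directly; the bucket name drops the gs:// scheme, and the exact object layout under the prefix is an assumption based on the ingest date used above.

from google.cloud import storage

client = storage.Client()  # uses application-default credentials

BUCKET = "your-project-raw"                  # bucket name without the gs:// scheme
PREFIX = "ecom/raw/ingest_date=2024-02-15/"  # assumed object layout under the upload prefix

# List what actually landed under the partition prefix.
blobs = list(client.list_blobs(BUCKET, prefix=PREFIX))
print(f"{len(blobs)} objects under {PREFIX}")

# Confirm the _SUCCESS marker exists before pointing downstream jobs at the partition.
marker = client.bucket(BUCKET).blob(PREFIX + "_SUCCESS")
print("_SUCCESS present:", marker.exists())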
Why It Matters for Stakeholders
- Ops & Engineering: mirrors production ingestion patterns with dry-run support, credential management, and post-export hooks.
- Analytics: guarantees consistent table layouts for dashboards, notebooks, and downstream SQL views.
- Storycrafting: spin up multi-day contexts (promotions, return spikes, churn events) that feed SQL Stories briefs.