🧪 Analyst Toolkit Tutorial: Full Data Pipeline

This interactive notebook demonstrates the complete analyst pipeline on a synthetic Palmer Penguins dataset generated with the dirty birds synthetic data generator repository.

Each step in the pipeline is modular, YAML-configurable, and produces exports, plots, and certification-ready reports.

The toolkit is packaged with a pyproject.toml file and can be run from a script or a notebook.

🧰 Toolkit Architecture: 3-Way Modular Design

This pipeline is built around a flexible ETL framework with three usage modes:

  • 📓 Notebook Mode

    • Run individual modules or the full pipeline interactively
    • Supports HTML dashboards, widgets, and live previews
    • Ideal for iterative exploration, first-pass audits, and QA workflows
  • 🧡 CLI Mode

    • Execute the full pipeline using run_toolkit_pipeline.py
    • Controlled via a master YAML config
    • Exports all reports, checkpoints, and logs to disk
  • 🧪 Hybrid Mode

    • Develop in notebooks, deploy via scripts
    • Reuse the same configs across testing and production

The toolkit handles essential data cleaning and transformation tasks, enabling analysts to focus on:

  • Exploratory Data Analysis (EDA)
  • Investigating anomalies and data quality issues
  • Extracting actionable insights from certified data
In [1]:
# Load Configuration and Set Execution Context

from analyst_toolkit.m00_utils.config_loader import load_config

# Path to master config (modify if needed)
config_path = "config/run_toolkit_config.yaml"

# Load full configuration dictionary
config = load_config(config_path)

# Extract run-level settings
run_id = config.get("run_id", "default_run")
notebook_mode = config.get("notebook", True)

print(f"🔧 Config loaded | Run ID: {run_id} | Notebook Mode: {notebook_mode}")
🔧 Config loaded | Run ID: CLI_2_QA | Notebook Mode: True
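
The two config.get calls above fall back to defaults when a key is absent. A minimal sketch of that behaviour, using a plain dictionary as a stand-in for whatever load_config returns (the dictionary contents and the run_settings helper are illustrative assumptions, not part of the toolkit):

```python
# Hypothetical stand-in for the dictionary returned by load_config().
example_config = {
    "run_id": "CLI_2_QA",
    "pipeline_entry_path": "data/raw/synthetic_penguins_v3.5.csv",
    # "notebook" is deliberately omitted so the default is used.
}


def run_settings(config):
    """Return (run_id, notebook_mode) with the same defaults as the cell above."""
    run_id = config.get("run_id", "default_run")
    notebook_mode = config.get("notebook", True)
    return run_id, notebook_mode


print(run_settings(example_config))
print(run_settings({}))
```

With an empty config, both settings come from the defaults, so a missing key does not stop the notebook from running.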
In [2]:
# 📥 Load Raw Data from CSV

from analyst_toolkit.m00_utils.load_data import load_csv

# Load input path from the global config (or override manually)
input_path = config.get("pipeline_entry_path", "data/raw/synthetic_penguins_v3.5.csv")
print(f"📂 Loading data from: {input_path}")

# Load into DataFrame
df_raw = load_csv(input_path)
📂 Loading data from: data/raw/synthetic_penguins_v3.5.csv

🧪 Step 1: Run Initial Diagnostics (M01)

This module profiles the raw dataset for key structural and quality checks:

  • Memory, Shape, Dtypes
  • Missing Values & Skewness
  • Duplicate Detection
  • Sample Rows & Descriptive Stats

✅ All results are rendered in a collapsible dashboard with exportable reports.
You can toggle inline previews and export settings via the YAML config (diag_config_template.yaml).

🛠️ To modify thresholds or toggle sections, edit the config under diagnostics.settings.
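
As a rough illustration of what the missing-value and duplicate checks compute, here is a dependency-free sketch. The helper names are hypothetical and the toolkit's own implementation may differ; rows are plain dictionaries with None marking a missing value.

```python
def missing_pct(missing_count, total_rows):
    """Percentage of missing values, rounded to two decimals."""
    return round(100 * missing_count / total_rows, 2) if total_rows else 0.0


def profile_missing(rows, columns):
    """Map each column to (missing_count, missing_pct); None marks a missing value."""
    total = len(rows)
    report = {}
    for col in columns:
        missing = sum(1 for row in rows if row.get(col) is None)
        report[col] = (missing, missing_pct(missing, total))
    return report


def count_duplicate_rows(rows, columns):
    """Count rows that exactly repeat an earlier row across the given columns."""
    seen = set()
    duplicates = 0
    for row in rows:
        key = tuple(row.get(col) for col in columns)
        if key in seen:
            duplicates += 1
        seen.add(key)
    return duplicates


sample = [
    {"species": "Gentoo", "body_mass_g": 5890.0},
    {"species": "Gentoo", "body_mass_g": 5890.0},  # exact repeat of the first row
    {"species": "Adelie", "body_mass_g": None},
    {"species": None, "body_mass_g": 3500.0},
]
columns = ["species", "body_mass_g"]
print(profile_missing(sample, columns))
print(count_duplicate_rows(sample, columns))
```

The same arithmetic reproduces the figures in this notebook's profile: 2242 missing tag_id values out of 5541 rows gives missing_pct(2242, 5541) == 40.46.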

In [3]:
# 📊 M01: Data Diagnostics – Profile Structure & Shape

from analyst_toolkit.m00_utils.config_loader import load_config
from analyst_toolkit.m01_diagnostics.run_diag_pipeline import run_diag_pipeline

# --- Load module-specific config ---
diag_config_full = load_config("config/diag_config_template.yaml")

# --- Run Diagnostics Module ---
# We pass the df_raw loaded in the previous step.
# The global run_id and notebook_mode are used.
df_profiled = run_diag_pipeline(
    config=diag_config_full, # Pass the full config object
    df=df_raw,
    notebook=notebook_mode,
    run_id=run_id
)
Stage: M01 Data Diagnostics ✅ | Columns with Nulls: 15 | Duplicate Rows Found: 1.0 | Shape: 5541 Rows, 15 Columns
📈 Key Metrics

🔷 Shape

Rows Columns
5541 15

🧠 Memory Usage

Memory Usage
3.26 MB

♻️ Duplicate Summary

Duplicate Rows Duplicate %
1 0.02
Full Profile & Cardinality

🔢 High Cardinality

Column Unique Values
tag_id 2678
capture_date 1917
date_egg 1656
colony_id 19
Audit Remarks Key:
  • ✅ OK: Passed all configured quality checks.
  • ⚠️ High Skew: Skewness exceeds the configured threshold.
  • ⚠️ Unexpected Type: Data type does not match the expected type.

📚 Full Data Profile

Column Dtype Unique Values Audit Remarks Missing Count Missing %
tag_id object 2678 ✅ OK 2242 40.46
species object 5 ✅ OK 166 3.00
bill length (mm) float64 1984 ✅ OK 429 7.74
bill depth (mm) float64 862 ✅ OK 417 7.53
flipper_length_mm float64 1466 ✅ OK 451 8.14
body_mass_g float64 3328 ✅ OK 406 7.33
age_group object 7 ✅ OK 121 2.18
sex object 6 ✅ OK 2739 49.43
colony_id object 19 ✅ OK 405 7.31
island object 11 ✅ OK 584 10.54
capture_date object 1917 ✅ OK 534 9.64
health_status object 9 ✅ OK 554 10.00
study_name object 12 ✅ OK 563 10.16
clutch_completion object 2 ✅ OK 463 8.36
date_egg object 1656 ✅ OK 836 15.09
🔬 Quantitative Summary

🔢 Descriptive Statistics

Metric count mean std min 25% 50% 75% max skew kurtosis
bill length (mm) 5112.0 45.166682 5.666410 30.63 40.51 45.950 49.360 62.64 -0.145952 -0.606829
bill depth (mm) 5124.0 17.305377 2.231495 12.37 15.49 17.485 19.030 23.01 -0.111456 -0.897492
flipper_length_mm 5090.0 202.237800 14.342621 162.79 191.10 199.315 214.100 252.40 0.329099 -0.616376
body_mass_g 5135.0 3853.645265 898.232986 2376.56 3219.50 3742.000 4376.515 7378.33 0.616778 0.086446
📄 Preview of Duplicated Rows
tag_id species bill length (mm) bill depth (mm) flipper_length_mm body_mass_g age_group sex colony_id island capture_date health_status study_name clutch_completion date_egg
NaN Gentoo 48.99 14.11 220.9 5890.0 Adult Male Torgersen North Torgersen 2023-11-17 NaN PAPRI2023 Yes 2023-11-09
NaN Gentoo 48.99 14.11 220.9 5890.0 Adult Male Torgersen North Torgersen 2023-11-17 NaN PAPRI2023 Yes 2023-11-09
🔍 First Rows Preview

📋 First 5 Rows (.head)

tag_id species bill length (mm) bill depth (mm) flipper_length_mm body_mass_g age_group sex colony_id island capture_date health_status study_name clutch_completion date_egg
NaN Gentoo 48.99 14.11 220.9 5890.0 Adult Male Torgersen North Torgersen 2023-11-17 NaN PAPRI2023 Yes 2023-11-09
NaN Gentoo 48.99 14.11 220.9 5890.0 Adult Male Torgersen North Torgersen 2023-11-17 NaN PAPRI2023 Yes 2023-11-09
ADE-0001 Adelie 39.55 19.92 186.2 2500.0 Chick Male Biscoe West Biscoe 2024-13-03 Underweight PAPRI2022 Yes 2022-07-20
NaN Gentoo 48.23 13.00 NaN 4536.0 Adult Female Biscoe West NaN 2024-04-14 Healthy NaN Yes 2024-04-12
GEN-0001 Gentoo 46.22 13.91 212.8 2500.0 Juvenile Female Dream South Dream NaN Underweight PAPRI2020 Yes 2020-04-14

🛡️ Step 2: Run Schema & Content Validation (M02)

This module audits the dataset against a defined schema to catch issues early and guide cleaning steps:

  • Expected Columns & Dtypes
  • Allowed Categorical Values
  • Numeric Range Checks
  • Null Allowance (optional)

✅ All results are displayed in a styled validation dashboard with exportable reports.
You can define strict or flexible rules in the YAML config (validation_config_template.yaml).

🛠️ To adjust enforcement (e.g. halt-on-fail), set fail_on_error and update rules under validation.schema_validation.
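
To make the schema and categorical checks concrete, here is a small standalone sketch of the two comparisons. The function names are hypothetical and this is not the toolkit's code; the column names and species values are taken from this notebook's data.

```python
from collections import Counter


def check_schema(actual_columns, expected_columns):
    """Report expected columns that are absent and actual columns that are not expected."""
    actual, expected = set(actual_columns), set(expected_columns)
    return {
        "missing": sorted(expected - actual),
        "unexpected": sorted(actual - expected),
    }


def check_allowed_values(values, allowed):
    """Count non-null values that fall outside the allowed set."""
    allowed = set(allowed)
    return Counter(v for v in values if v is not None and v not in allowed)


actual = ["species", "bill length (mm)", "bill depth (mm)"]
expected = ["species", "bill_length_mm", "bill_depth_mm"]
print(check_schema(actual, expected))

species = ["Adelie", "Gentto", "adeleie", None, "Gentoo", "Gentto"]
print(check_allowed_values(species, ["Adelie", "Chinstrap", "Gentoo"]))
```

Nulls are skipped here on the assumption that null allowance is handled by a separate, optional rule, as the list above suggests.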

In [4]:
# 🛡️ M02: Schema & Content Validation – First Audit Pass

from analyst_toolkit.m00_utils.config_loader import load_config
from analyst_toolkit.m02_validation.run_validation_pipeline import run_validation_pipeline

# --- Load module-specific config ---
val_config_full = load_config("config/validation_config_template.yaml")

# --- Run Validation Module ---
df_validated = run_validation_pipeline(
    config=val_config_full,
    df=df_profiled,
    notebook=notebook_mode,
    run_id=run_id
)
Stage: M02 Data Validation ⚠️ | Checks Passed: 1/4 | Row Coverage: 36.62%
🔎 Validation Rules Summary
Validation Rule Description Status
Schema Conformity Verify column names match the expected schema. ⚠️ Fail (2 issues)
Dtype Enforcement Verify column data types match expectations. ⚠️ Fail (1 issues)
Categorical Values Verify values in categorical columns are within an allowed set. ⚠️ Fail (7 issues)
Numeric Ranges Verify values in numeric columns are within a defined range. ✅ Pass
Status Key:
  • ✅ Pass: The data conforms to this rule.
  • ⚠️ Fail: One or more issues were found. See drill-down for details.

Failure Details

⚠️ Drill-Down: Schema Conformity
Issue Type Columns
Missing bill_length_mm, bill depth_mm
Unexpected bill depth (mm), bill length (mm)
⚠️ Drill-Down: Dtype Enforcement
Column Expected Type Actual Type
flipper_length_mm int64 float64
⚠️ Drill-Down: Categorical Values

Rule Violated:

Values for column species must be in the allowed set.

Allowed Values:

['Adelie', 'Chinstrap', 'Gentoo']

Invalid Values Found:

Invalid Value Count
adeleie 148
Gentto 145

Rule Violated:

Values for column island must be in the allowed set.

Allowed Values:

['Dream', 'Biscoe', 'Torgersen', 'Cormorant', 'Shortcut']

Invalid Values Found:

Invalid Value Count
short cut 70
torg 61
unknown 59
bisco 55
cormor 47
dreamland 46

Rule Violated:

Values for column sex must be in the allowed set.

Allowed Values:

['male', 'female', 'UNKNOWN']

Invalid Values Found:

Invalid Value Count
Male 1308
Female 1227
F 83
? 74
M 61
Unknown 49

Rule Violated:

Values for column colony_id must be in the allowed set.

Allowed Values:

['Biscoe West', 'Cormorant East', 'Dream South', 'Shortcut Point', 'Torgersen North']

Invalid Values Found:

Invalid Value Count
cormorant NW 45
invalid_colony 36
Torgersen 35
Cormorant 34
biscoe 2 34
torgersen SE 31
TORGERSEN 4 30
short point 28
/Shortcut 26
Biscoe 25
dream island 24
Unknown 24
Dream Island 22
dream 19

Rule Violated:

Values for column age_group must be in the allowed set.

Allowed Values:

['Juvenile', 'Adult', 'Chick', 'UNKNOWN']

Invalid Values Found:

Invalid Value Count
juvenille 58
unk 48
ADLT 47
chik 29

Rule Violated:

Values for column health_status must be in the allowed set.

Allowed Values:

['Healthy', 'Critically Ill', 'Underweight', 'Unwell', 'Overweight', 'Unknown']

Invalid Values Found:

Invalid Value Count
critcal ill 36
Overwight 34
under weight 33
ok 30

Rule Violated:

Values for column study_name must be in the allowed set.

Allowed Values:

['PAPRI2019', 'PAPRI2020', 'PAPRI2021', 'PAPRI2022', 'PAPRI2023', 'PAPRI2024']

Invalid Values Found:

Invalid Value Count
PAPR12021 60
papri2024 58
STUDY_2022 57
PP2020 48
PAPR2023 46
PAPRI20X9 37

🧹 Step 3: Normalize & Standardize Data (M03)

This module performs rule-based cleaning and normalization to prepare the dataset for certification:

  • Column Renaming & Type Coercion
  • Value Mapping & Text Cleaning
  • Fuzzy Matching & Datetime Parsing

✅ Results are rendered in a structured dashboard with before/after comparisons and audit previews.
All rules and output paths are controlled via the YAML config (normalization_config_template.yaml).

🛠️ To adjust cleaning logic, modify the rules block (e.g. value_mappings, preview_columns, etc).
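
The fuzzy-matching step can be pictured with the standard library's difflib. This is only a sketch: the fuzzy_correct helper and the 0.8 cutoff are assumptions, and the toolkit's own scorer may rank candidates differently from difflib.

```python
from difflib import get_close_matches


def fuzzy_correct(value, allowed, cutoff=0.8):
    """Return the closest allowed value (case-insensitive), or None if nothing is close enough."""
    lookup = {item.lower(): item for item in allowed}
    matches = get_close_matches(value.lower(), list(lookup), n=1, cutoff=cutoff)
    return lookup[matches[0]] if matches else None


species = ["Adelie", "Chinstrap", "Gentoo"]
islands = ["Dream", "Biscoe", "Torgersen", "Cormorant", "Shortcut"]

for raw, allowed in [("Gentto", species), ("adeleie", species),
                     ("bisco", islands), ("short cut", islands), ("xyz", islands)]:
    print(f"{raw!r} -> {fuzzy_correct(raw, allowed)!r}")
```

Values with no sufficiently close candidate come back as None, so they can be routed to an explicit value mapping or left for review instead of being silently rewritten.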

In [5]:
# 🧹 M03: Data Normalization – Standardizing Key Fields

import logging
from analyst_toolkit.m00_utils.config_loader import load_config
from analyst_toolkit.m03_normalization.run_normalization_pipeline import run_normalization_pipeline

# --- Load Config ---
norm_config_full = load_config("config/normalization_config_template.yaml")

# --- Run Normalization Module ---
# Uses df_validated from the previous step and global run_id/notebook_mode.
df_normalized = run_normalization_pipeline(
    config=norm_config_full,
    df=df_validated,
    notebook=notebook_mode,
    run_id=run_id
)
Stage: M03 Data Normalization ✅ | Action Types: 5 | Total Transformations: 20
⚙️ Normalization Actions (Transform Log)

✏️ Columns Renamed (2)

Original Name New Name
bill length (mm) bill_length_mm
bill depth (mm) bill_depth_mm

🧹 Strings Cleaned (2)

Column Operation
clutch_completion standardize_text
sex standardize_text

📅 Datetimes Parsed (2)

Column Target Type
capture_date datetime64[ns]
date_egg datetime64[ns]

🧩 Values Mapped (7)

Column Mappings Applied
sex 7
species 1
island 1
colony_id 14
age_group 4
health_status 7
study_name 6

🤖 Fuzzy Matches (7)

Column Original Corrected Score
species Gentto Gentoo 83
species adeleie Adelie 92
island bisco Biscoe 91
island short cut Shortcut 94
island dreamland Dream 90
island cormor Cormorant 90
island torg Torgersen 90
📊 Column Value Analysis: Before & After
Column: sex
Normalized Values
Value Count
NaN 2739
MALE 1369
FEMALE 1310
UNKNOWN 123
Value Audit
Value Original Count Normalized Count
NaN 2739 2739
Male 1308 0
Female 1227 0
F 83 0
? 74 0
M 61 0
Unknown 49 0
MALE 0 1369
FEMALE 0 1310
UNKNOWN 0 123
Column: island
Normalized Values
Value Count
Torgersen 1405
Dream 1184
Biscoe 1084
Cormorant 715
NaN 584
Shortcut 510
UNKNOWN 59
Value Audit
Value Original Count Normalized Count
Torgersen 1344 1405
Dream 1138 1184
Biscoe 1029 1084
Cormorant 668 715
NaN 584 584
Shortcut 440 510
short cut 70 0
torg 61 0
unknown 59 0
bisco 55 0
cormor 47 0
dreamland 46 0
UNKNOWN 0 59
Column: species
Normalized Values
Value Count
Gentoo 1815
Adelie 1784
Chinstrap 1776
NaN 166
Value Audit
Value Original Count Normalized Count
Chinstrap 1776 1776
Gentoo 1670 1815
Adelie 1636 1784
NaN 166 166
adeleie 148 0
Gentto 145 0
Column: health_status
Normalized Values
Value Count
Healthy 2194
Underweight 1411
Overweight 733
NaN 554
Critical 323
Sick 296
UNKNOWN 30
Value Audit
Value Original Count Normalized Count
Healthy 2194 2194
Underweight 1378 1411
Overweight 699 733
NaN 554 554
Unwell 296 0
Critically Ill 287 0
critcal ill 36 0
Overwight 34 0
under weight 33 0
ok 30 0
Critical 0 323
Sick 0 296
UNKNOWN 0 30
Column: colony_id
Normalized Values
Value Count
Torgersen North 1490
Dream South 1216
Biscoe West 1092
Cormorant East 767
Shortcut Point 511
NaN 405
UNKNOWN 60
Value Audit
Value Original Count Normalized Count
Torgersen North 1394 1490
Dream South 1151 1216
Biscoe West 1033 1092
Cormorant East 688 767
Shortcut Point 457 511
NaN 405 405
cormorant NW 45 0
invalid_colony 36 0
Torgersen 35 0
Cormorant 34 0
biscoe 2 34 0
torgersen SE 31 0
TORGERSEN 4 30 0
short point 28 0
/Shortcut 26 0
Biscoe 25 0
Unknown 24 0
dream island 24 0
Dream Island 22 0
dream 19 0
Column: age_group
Normalized Values
Value Count
Adult 3822
Juvenile 1073
Chick 477
NaN 121
UNKNOWN 48
Value Audit
Value Original Count Normalized Count
Adult 3775 3822
Juvenile 1015 1073
Chick 448 477
NaN 121 121
juvenille 58 0
unk 48 0
ADLT 47 0
chik 29 0
UNKNOWN 0 48
Column: study_name
Normalized Values
Value Count
PAPRI2020 1122
PAPRI2021 1024
PAPRI2022 916
PAPRI2023 824
PAPRI2024 803
NaN 563
PAPRI2019 252
UNKNOWN 37
Value Audit
Value Original Count Normalized Count
PAPRI2020 1074 1122
PAPRI2021 964 1024
PAPRI2022 859 916
PAPRI2023 778 824
PAPRI2024 745 803
NaN 563 563
PAPRI2019 252 252
PAPR12021 60 0
papri2024 58 0
STUDY_2022 57 0
PP2020 48 0
PAPR2023 46 0
PAPRI20X9 37 0
UNKNOWN 0 37
Column: capture_date
Normalized Values
Value Count
NaT 915
2023-01-18 10
2024-05-09 10
2024-02-01 9
2023-06-12 8
2020-12-25 8
2022-11-15 8
2023-06-10 8
2023-03-22 8
2024-01-01 8
2022-08-04 8
2022-12-03 8
2024-06-19 8
2023-09-27 7
2022-09-28 7
2022-09-27 7
2023-10-22 7
2024-04-25 7
2023-07-25 7
2023-08-24 7
Value Audit
Value Original Count Normalized Count
NaN 534 915
9999-99-99 39 0
error 33 0
not-a-date 30 0
2023-01-18 10 10
2024-05-09 10 10
2024-02-01 9 9
2020-12-25 8 8
2022-08-04 8 8
2022-11-15 8 8
2022-12-03 8 8
2023-03-22 8 8
2023-06-10 8 8
2023-06-12 8 8
2024-01-01 8 8
2024-06-19 8 8
2020-07-02 7 7
2021-01-21 7 7
2022-01-09 7 7
2022-09-27 7 7
Column: date_egg
Normalized Values
Value Count
NaT 836
2019-12-11 13
2019-12-27 12
2020-10-11 11
2020-07-20 11
2019-12-17 11
2019-11-25 11
2020-06-25 11
2021-04-03 10
2021-04-16 10
2023-10-08 10
2021-07-05 9
2022-10-26 9
2021-01-06 9
2022-07-13 9
2022-02-07 9
2020-01-22 9
2021-08-30 9
2020-09-20 9
2020-01-17 9
Value Audit
Value Original Count Normalized Count
NaN 836 836
2019-12-11 13 13
2019-12-27 12 12
2019-11-25 11 11
2019-12-17 11 11
2020-06-25 11 11
2020-07-20 11 11
2020-10-11 11 11
2021-04-03 10 10
2021-04-16 10 10
2023-10-08 10 10
2020-01-17 9 9
2020-01-22 9 9
2020-02-26 9 9
2020-09-20 9 9
2021-01-06 9 9
2021-07-05 9 9
2021-08-30 9 9
2021-10-22 9 9
2022-02-07 9 9
Column: clutch_completion
Normalized Values
Value Count
yes 4314
no 764
NaN 463
Value Audit
Value Original Count Normalized Count
Yes 4314 0
No 764 0
NaN 463 463
yes 0 4314
no 0 764
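
The capture_date audit above shows unparseable strings such as 9999-99-99, error, and not-a-date ending up as NaT. A minimal standard-library sketch of that coercion follows, with None standing in for NaT; parse_date is a hypothetical helper, and the fixed %Y-%m-%d format is an assumption about the input. In pandas the comparable behaviour is pd.to_datetime(..., errors="coerce").

```python
from datetime import datetime


def parse_date(value, fmt="%Y-%m-%d"):
    """Parse a date string; return None when the value is missing or cannot be parsed."""
    if value is None:
        return None
    try:
        return datetime.strptime(value.strip(), fmt).date()
    except ValueError:
        return None


raw_dates = ["2023-11-17", "2024-13-03", "9999-99-99", "error", "not-a-date", None]
parsed = [parse_date(v) for v in raw_dates]
print(parsed)
print("unparseable or missing:", sum(p is None for p in parsed))
```

Note that 2024-13-03 (month 13) is rejected as well, which is one way the null count for a date column can grow after parsing.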

🛡️ Step 4: Certification Gate (M02)

This step reuses the Validation Module (M02) with a stricter configuration so that it acts as a quality gate. It is designed to halt the pipeline if violations are found:

  • ✅ All column names, data types, categorical values, and numeric ranges must pass
  • 🛑 fail_on_error: true triggers a hard stop on validation failure

📦 This step can be run at any point in the pipeline, not just at the end.
Use it wherever you want to certify a dataset snapshot or block further execution unless the data meets expectations.

✅ Results are rendered inline with full export support.
All certification rules live in the YAML config (certification_config_template.yaml).

🛠️ Adjust gatekeeping behavior by modifying schema rules or toggling fail_on_error.
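
The halt-on-failure idea can be sketched as a function that raises when any check fails and fail_on_error is set. The names CertificationError and certification_gate are hypothetical; the check outcomes mirror the first validation pass earlier in this notebook.

```python
class CertificationError(Exception):
    """Raised when the gate is configured to halt on failed checks."""


def certification_gate(check_results, fail_on_error=True):
    """Return the names of failed checks, or raise if failures should halt the run."""
    failed = [name for name, passed in check_results.items() if not passed]
    if failed and fail_on_error:
        raise CertificationError("Validation failed: " + ", ".join(failed))
    return failed


first_pass = {
    "Schema Conformity": False,
    "Dtype Enforcement": False,
    "Categorical Values": False,
    "Numeric Ranges": True,
}

# Lenient mode reports failures without stopping.
print(certification_gate(first_pass, fail_on_error=False))

# Strict mode stops the run by raising.
try:
    certification_gate(first_pass, fail_on_error=True)
except CertificationError as exc:
    print("halted:", exc)
```

Raising an exception, as opposed to returning a status, means later cells or pipeline stages cannot run on uncertified data by accident.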

In [6]:
# 🛡️ M02: Certification (Strict Validation Gatekeeper)

import logging
from analyst_toolkit.m00_utils.config_loader import load_config
from analyst_toolkit.m02_validation.run_validation_pipeline import run_validation_pipeline

# --- Load Certification Config ---
cert_config_full = load_config("config/certification_config_template.yaml")

# --- Run Final Certification Pass ---
logging.info("🚀 Starting Certification Gate (re-using M02)")

df_certified = run_validation_pipeline(
    config=cert_config_full,
    df=df_normalized,
    notebook=notebook_mode,
    run_id=run_id
)
Stage: M02 Data Validation ✅ | Checks Passed: 4/4 | Row Coverage: 100.0%
🔎 Validation Rules Summary
Validation Rule Description Status
Schema Conformity Verify column names match the expected schema. ✅ Pass
Dtype Enforcement Verify column data types match expectations. ✅ Pass
Categorical Values Verify values in categorical columns are within an allowed set. ✅ Pass
Numeric Ranges Verify values in numeric columns are within a defined range. ✅ Pass
Status Key:
  • ✅ Pass: The data conforms to this rule.
  • ⚠️ Fail: One or more issues were found. See drill-down for details.

🧹 Step 5: Deduplication (M04)

This module identifies and handles duplicate rows in the dataset, using the logic from m04_duplicates.

You can choose to:

  • 🔍 Flag duplicates for review
  • ✂️ Remove duplicates directly (default: keep first occurrence)

✅ Configurable logic lets you define:

  • Which columns to check for duplication (subset_columns)
  • Whether to flag or drop (mode: "flag" or "remove")
  • Columns to preview (hide IDs, timestamps, etc.)

📄 Results are displayed with an inline preview and summary plots.

🛠️ Adjust deduplication behavior in dups_config_template.yaml.
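
The difference between the two modes can be sketched on plain dictionaries. This assumes keep-first semantics, so only later occurrences are treated as duplicates; the toolkit may flag whole clusters instead. The deduplicate helper is hypothetical, and is_duplicate follows the column name that appears in the outlier preview later in this notebook.

```python
def deduplicate(rows, subset_columns, mode="flag"):
    """Flag or remove rows that repeat an earlier row on subset_columns (keep-first)."""
    if mode not in ("flag", "remove"):
        raise ValueError(f"Unknown mode: {mode}")
    seen = set()
    result = []
    for row in rows:
        key = tuple(row.get(col) for col in subset_columns)
        is_dup = key in seen
        seen.add(key)
        if mode == "flag":
            result.append({**row, "is_duplicate": is_dup})
        elif not is_dup:
            result.append(row)
    return result


rows = [
    {"tag_id": "ADE-0001", "species": "Adelie", "capture_date": None, "body_mass_g": 2500.0},
    {"tag_id": "ADE-0001", "species": "Adelie", "capture_date": None, "body_mass_g": 2477.78},
    {"tag_id": "GEN-0001", "species": "Gentoo", "capture_date": None, "body_mass_g": 2500.0},
]
subset = ["tag_id", "species", "capture_date"]

print([r["is_duplicate"] for r in deduplicate(rows, subset, mode="flag")])
print(len(deduplicate(rows, subset, mode="remove")))
```

Because only the subset columns form the key, two rows with different measurements still count as duplicates when tag_id, species, and capture_date agree.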

In [7]:
# ♻️ M04: Deduplication and Duplicates Handling

from analyst_toolkit.m00_utils.config_loader import load_config
from analyst_toolkit.m04_duplicates.run_dupes_pipeline import run_duplicates_pipeline
import logging

# --- Load Config ---
dupes_config_full = load_config("config/dups_config_template.yaml")


# --- Run Duplicates Module ---
df_deduped = run_duplicates_pipeline(
    config=dupes_config_full,
    df=df_certified,
    notebook=notebook_mode,
    run_id=run_id
)
Stage: M04 Deduplication ⚠️ | Rows Flagged: 1219 | Criteria: Based on `tag_id`, `species`, `capture_date`
📈 Summary of Changes
Metric Value
Total Row Count 5541
Duplicate Rows Flagged 1219
🔍 Duplicate Clusters Found
tag_id species bill_length_mm bill_depth_mm flipper_length_mm body_mass_g age_group sex colony_id island capture_date health_status study_name clutch_completion date_egg
ADE-0001 Adelie 39.55 19.92 186.20 2500.00 Chick MALE Biscoe West Biscoe NaT Underweight PAPRI2022 yes 2022-07-20
ADE-0001 Adelie 42.60 21.37 184.50 2477.78 Juvenile MALE Biscoe West Biscoe NaT Healthy PAPRI2022 yes 2022-07-20
ADE-0001 Adelie 38.70 20.78 202.74 2650.73 Juvenile MALE Biscoe West Biscoe NaT Underweight PAPRI2022 yes 2022-07-20
ADE-0013 Adelie 40.28 18.10 188.60 3224.00 Juvenile NaN NaN Cormorant NaT NaN PAPRI2022 yes 2022-06-18
ADE-0013 Adelie 41.51 19.31 182.31 3322.26 Adult NaN NaN Cormorant NaT Overweight PAPRI2022 yes 2022-06-18
ADE-0049 Adelie NaN 18.46 185.40 3326.00 Adult FEMALE Shortcut Point Shortcut NaT Healthy PAPRI2024 yes 2024-08-29
ADE-0049 Adelie NaN 17.77 176.49 3175.64 Adult FEMALE Shortcut Point Shortcut NaT Overweight PAPRI2024 yes 2024-08-29
ADE-0054 Adelie 42.06 17.93 NaN 4125.00 Adult MALE Biscoe West Biscoe NaT Overweight PAPRI2022 NaN 2022-10-28
ADE-0054 Adelie 42.53 18.07 NaN 4342.78 Adult MALE Biscoe West Biscoe NaT Critical PAPRI2022 NaN 2022-10-28
ADE-0073 Adelie 41.64 17.10 192.80 2500.00 Chick FEMALE Torgersen North NaN NaT Overweight PAPRI2023 yes 2023-02-24
ADE-0073 Adelie 41.30 16.87 206.44 2567.20 Juvenile FEMALE Torgersen North NaN NaT Underweight PAPRI2023 yes 2023-02-24
ADE-0076 Adelie 39.82 18.13 184.90 3642.00 Adult NaN Cormorant East Cormorant NaT NaN PAPRI2021 yes 2021-06-18
ADE-0076 Adelie 42.27 18.40 183.97 3753.19 Adult NaN Cormorant East Cormorant NaT NaN PAPRI2021 yes 2021-06-18
ADE-0076 Adelie 42.67 19.15 178.12 3569.29 Adult NaN Cormorant East Cormorant NaT NaN PAPRI2021 yes 2021-06-18
ADE-0119 Adelie 35.76 17.78 179.84 2716.83 NaN NaN Dream South Dream 2022-07-08 Healthy PAPRI2021 yes 2021-07-05
ADE-0119 Adelie 40.13 18.80 201.99 2983.99 NaN NaN Dream South Dream 2022-07-08 Healthy PAPRI2021 yes 2021-07-05
ADE-0137 Adelie 43.03 17.11 193.40 NaN Adult UNKNOWN Torgersen North Torgersen NaT Healthy PAPRI2021 yes 2021-03-21
ADE-0137 Adelie 46.46 16.39 186.41 NaN Adult UNKNOWN Torgersen North Torgersen NaT Underweight PAPRI2021 yes 2021-03-21
ADE-0155 Adelie 38.31 17.86 197.70 3512.00 Adult NaN Torgersen North Torgersen NaT Healthy PAPRI2023 no NaT
ADE-0155 Adelie 39.45 19.48 197.89 3527.76 Adult NaN Torgersen North Torgersen NaT Overweight PAPRI2023 no NaT

Step 6: Detect Outliers (M05)

This module (m05_detect_outliers) scans numeric columns for outliers using configurable logic:

  • Z-Score or IQR methods (per column or global default)
  • Adds binary flags (e.g., *_outlier) to the dataset if append_flags: true
  • Skips non-numeric or excluded fields via exclude_columns

📊 Interactive PlotViewer
If enabled, the PlotViewer renders boxplots, histograms, and violin plots inline,
giving a fast visual summary of where anomalies occur.

What's Exported:

  • ✅ df_outliers_flagged: DataFrame with new _outlier columns
  • ✅ detection_results: thresholds and summary tables
  • ✅ Plots: saved to exports/plots/outliers/{run_id}/
  • ✅ Report: XLSX or CSV, based on config

🛠️ Configure methods, thresholds, excluded columns, and plot types in outlier_config_template.yaml.

In [8]:
# M05: Detect Outliers and Plot Visuals

import logging
from analyst_toolkit.m00_utils.config_loader import load_config
from analyst_toolkit.m05_detect_outliers.run_detection_pipeline import run_outlier_detection_pipeline
from IPython.display import display

# --- Load module-specific config ---
outlier_config_full = load_config("config/outlier_config_template.yaml")

# The 'df_deduped' variable should be the output from your M04 Duplicates module
if 'df_deduped' in locals():
    df_outliers_flagged, detection_results = run_outlier_detection_pipeline(
        config=outlier_config_full,
        df=df_deduped,
        notebook=notebook_mode,
        run_id=run_id
    )
Stage: M05 Outlier Detection ⚠️ | Total Outliers Found: 19 | Columns Affected: 2
📋 Outlier Detection Log
column method outlier_count lower_bound upper_bound outlier_examples
bill_length_mm iqr 1 27.235000 62.635000 [62.64]
body_mass_g zscore 18 709.829815 6997.460715 [7000.0, 7000.0, 7000.0, 7000.0, 7000.0]
🔍 Preview of Rows Containing Outliers
tag_id species bill_length_mm bill_depth_mm flipper_length_mm body_mass_g age_group sex colony_id island capture_date health_status study_name clutch_completion date_egg is_duplicate
NaN Gentoo NaN 14.41 221.90 7000.00 Adult NaN Torgersen North Torgersen 2019-10-31 Healthy PAPRI2019 NaN NaT False
NaN NaN 47.68 17.62 NaN 7000.00 Adult NaN Torgersen North Torgersen 2021-08-17 Healthy PAPRI2021 NaN 2021-08-14 False
GEN-0041 Gentoo 45.63 14.13 213.20 7000.00 Juvenile FEMALE Dream South Dream 2021-12-02 Healthy PAPRI2021 NaN 2021-11-23 False
NaN Gentoo 46.39 13.84 206.30 7000.00 Adult NaN Cormorant East Cormorant 2022-10-26 Healthy PAPRI2022 NaN 2022-10-12 False
ADE-0182 Adelie 38.46 17.16 185.10 7000.00 Adult NaN Dream South Dream 2024-02-03 Overweight PAPRI2024 yes 2024-01-31 False
NaN Gentoo 49.36 13.00 224.10 7000.00 Adult NaN Torgersen North Torgersen NaT Healthy NaN no NaT True
NaN Gentoo 40.59 14.37 230.00 7000.00 Adult MALE NaN Biscoe NaT Healthy PAPRI2021 yes 2021-03-25 True
GEN-0301 Gentoo 44.56 16.48 212.70 7000.00 Adult MALE Biscoe West Biscoe 2022-12-12 Healthy PAPRI2022 no NaT False
NaN Gentoo 45.16 15.57 218.40 7000.00 Adult FEMALE NaN Cormorant 2021-07-30 Healthy PAPRI2021 yes 2021-07-17 False
GEN-0681 Gentoo 44.73 13.94 217.80 7000.00 Adult NaN Torgersen North Torgersen NaT Healthy PAPRI2022 yes 2022-11-07 True
GEN-0706 Gentoo 45.74 14.02 217.80 7000.00 Adult NaN Dream South Dream 2024-02-28 Healthy PAPRI2024 yes 2024-02-21 False
GEN-0743 Gentoo 49.05 14.49 213.20 7000.00 Adult FEMALE Dream South Dream NaT Healthy PAPRI2020 yes 2020-03-17 False
CHN-0860 Chinstrap 50.88 18.49 206.10 7000.00 Adult NaN Cormorant East Cormorant 2024-07-09 Overweight PAPRI2023 yes 2023-11-16 False
GEN-0974 Gentoo 50.57 15.89 220.00 7000.00 Adult NaN Torgersen North NaN 2021-01-05 NaN PAPRI2021 yes 2020-12-26 False
GEN-0681 Gentoo 47.77 13.84 222.73 7378.33 Adult NaN Torgersen North Torgersen NaT Overweight PAPRI2022 yes 2022-11-07 True
NaN Chinstrap 51.63 18.69 212.94 7128.38 Adult FEMALE Torgersen North Torgersen 2022-03-25 Overweight PAPRI2020 NaN 2020-03-12 False
NaN Gentoo 47.71 13.93 236.20 7085.98 Adult NaN Torgersen North Torgersen NaT Critical NaN no NaT True
CHN-0219 Chinstrap 62.64 18.00 204.26 2770.38 Juvenile NaN Torgersen North UNKNOWN 2020-10-22 Critical PAPRI2019 yes 2019-10-16 False
NaN Gentoo NaN 14.99 219.59 7128.48 Adult NaN Torgersen North Torgersen 2021-10-31 Healthy PAPRI2019 NaN NaT False
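
The bounds in the Outlier Detection Log above can be reproduced from the descriptive statistics reported in Step 1. The sketch below assumes a 1.5 × IQR fence and a z-score threshold of 3.5; both multipliers are inferred from the logged bounds, not read from the config, and the helper names are hypothetical.

```python
def iqr_bounds(q1, q3, k=1.5):
    """Lower and upper fences at k interquartile ranges beyond the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def zscore_bounds(mean, std, threshold=3.5):
    """Lower and upper bounds at `threshold` standard deviations from the mean."""
    return mean - threshold * std, mean + threshold * std


def flag_outliers(values, lower, upper):
    """True where a non-null value falls outside [lower, upper]."""
    return [v is not None and (v < lower or v > upper) for v in values]


# bill_length_mm: 25% = 40.51, 75% = 49.36 (from the descriptive statistics)
bill_lower, bill_upper = iqr_bounds(40.51, 49.36)
print(round(bill_lower, 3), round(bill_upper, 3))

# body_mass_g: mean = 3853.645265, std = 898.232986
mass_lower, mass_upper = zscore_bounds(3853.645265, 898.232986)
print(round(mass_lower, 3), round(mass_upper, 3))

print(flag_outliers([62.64, 45.0, None], bill_lower, bill_upper))
```

The bill length value 62.64 sits just above the upper fence of 62.635, which is why it is the single IQR outlier in the log.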

🧼 Step 7: Handle Outliers (M06)

This module (m06_outlier_handling) applies cleanup strategies to flagged outliers from the detection step:

  • Strategies include:
    • 'clip': Caps values to threshold bounds
    • 'median': Imputes using median
    • 'constant': Replaces with fixed value (e.g., -999)
    • 'none': Leaves values untouched (default)

⚙️ Strategy is configured per column or globally via __default__ and __global__.

What's Exported:

  • ✅ Cleaned DataFrame: df_handled
  • ✅ Handling report (XLSX/CSV)
  • ✅ Optional checkpoint joblib

🛠️ Adjust cleanup logic, output paths, or constant fill values in handling_config_template.yaml.
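
The four strategies can be sketched in a few lines. The handle_outliers helper is hypothetical, and it assumes the median is taken over all non-null values in the column, which may not match how the toolkit computes it.

```python
from statistics import median

STRATEGIES = ("clip", "median", "constant", "none")


def handle_outliers(values, lower, upper, strategy="none", constant=None):
    """Apply one handling strategy to values outside [lower, upper]; nulls pass through."""
    if strategy not in STRATEGIES:
        raise ValueError(f"Unknown strategy: {strategy}")
    present = [v for v in values if v is not None]
    fill = median(present) if strategy == "median" else constant
    handled = []
    for v in values:
        is_outlier = v is not None and (v < lower or v > upper)
        if not is_outlier or strategy == "none":
            handled.append(v)
        elif strategy == "clip":
            handled.append(min(max(v, lower), upper))  # cap to the nearest bound
        else:
            handled.append(fill)  # median or constant replacement
    return handled


print(handle_outliers([62.64, 45.0], 27.235, 62.635, strategy="clip"))
print(handle_outliers([1, 2, 3, 100], 0, 10, strategy="median"))
print(handle_outliers([1, 100], 0, 10, strategy="constant", constant=-999))
```

Clipping keeps the row's value close to the original, while median and constant replacement discard it entirely, so the choice depends on whether the extreme value is believed to be a measurement or an error.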

In [9]:
# 🧼 M06: Handle Outliers

import logging
from analyst_toolkit.m00_utils.config_loader import load_config
from analyst_toolkit.m06_outlier_handling.run_handling_pipeline import run_outlier_handling_pipeline

# --- Load module-specific config ---
handling_config_full = load_config("config/handling_config_template.yaml")

# Pass the entire detection_results dictionary, not its unpacked components.
df_handled = run_outlier_handling_pipeline(
    config=handling_config_full,
    df=df_outliers_flagged,
    detection_results=detection_results, # Pass the whole dictionary here
    notebook=notebook_mode,
    run_id=run_id
)
Stage: M06 Outlier Handling ⚠️ | Strategies Used: clip, median | Total Outliers Handled: 19
📋 Handling Actions Summary
strategy column outliers_handled details
clip bill_length_mm 1 Clipped 1 values to bounds.
median body_mass_g 18 Imputed 18 values with median (3742.00).
🔍 Details: Capped Values
Column Row_Index Original_Value Capped_Value
bill_length_mm 5164 62.64 62.635

🔧 Step 8: Impute Missing Values (M07)

This module (m07_imputation) fills missing (NaN) values using a column-specific strategy:

  • 'mean', 'median', or 'mode' for numeric/categorical inference
  • 'constant' for fixed fallback values (e.g., "UNKNOWN" or "1900-01-01")
  • Strategy is configured via the rules.strategies section in the YAML

📊 If enabled, comparison plots show how categorical columns changed post-imputation
(using the same PlotViewer system).

What's Exported:

  • ✅ Imputed DataFrame: df_imputed
  • ✅ Report: imputation log (XLSX/CSV)
  • ✅ Plots: before/after comparisons (if enabled)
  • ✅ Optional checkpoint joblib

🛠️ Configure logic and column-specific strategies in imputation_config_template.yaml.
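
A column-level view of the strategies, written with the standard library's statistics module. The impute helper is a hypothetical illustration and does not reproduce the toolkit's rules.strategies handling.

```python
from statistics import mean, median, mode


def impute(values, strategy, fill_value=None):
    """Fill None entries using one strategy; return (filled_values, fill_used)."""
    present = [v for v in values if v is not None]
    if strategy == "mean":
        fill = mean(present)
    elif strategy == "median":
        fill = median(present)
    elif strategy == "mode":
        fill = mode(present)  # most frequent value; also works for strings
    elif strategy == "constant":
        fill = fill_value
    else:
        raise ValueError(f"Unknown strategy: {strategy}")
    return [fill if v is None else v for v in values], fill


print(impute([1.0, None, 3.0], "mean"))
print(impute(["MALE", None, "FEMALE", "MALE"], "mode"))
print(impute(["Adelie", None], "constant", fill_value="UNKNOWN"))
```

Mode imputation assigns every missing entry to the single most frequent category, which can shift a column's distribution noticeably when many values are missing; a constant such as "UNKNOWN" keeps the gap visible.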

In [10]:
# 🔧 M07: Impute Data and Plot Summary Visuals

import logging
from analyst_toolkit.m00_utils.config_loader import load_config
from analyst_toolkit.m07_imputation.run_imputation_pipeline import run_imputation_pipeline

# Load the configuration for the imputation module
imputation_config_full = load_config("config/imputation_config_template.yaml")

df_imputed = run_imputation_pipeline(
    config=imputation_config_full,
    notebook=notebook_mode,
    df=df_handled,  # Pass the existing DataFrame here
    run_id=run_id
)
Stage: M07 Data Imputation ✅ | Total Values Filled: 11291 | Columns Affected: 15
📈 Imputation Summary & Null Audit

📋 Imputation Actions Log

Column Strategy Fill Value Nulls Filled
bill_length_mm mean 45.17 429
body_mass_g mean 3842.08 406
bill_depth_mm median 17.48 417
flipper_length_mm median 199.31 451
sex mode MALE 2739
tag_id constant UNKNOWN 2242
species constant UNKNOWN 166
age_group constant UNKNOWN 121
colony_id constant UNKNOWN 405
island constant UNKNOWN 584
study_name constant UNKNOWN 563
capture_date constant 1900-01-01 00:00:00 915
date_egg constant 1900-01-01 00:00:00 836
clutch_completion constant UNKNOWN 463
health_status constant UNKNOWN 554

🔍 Null Value Audit

Column Nulls Before Nulls After Nulls Filled
bill_length_mm 429 0 429
body_mass_g 406 0 406
bill_depth_mm 417 0 417
flipper_length_mm 451 0 451
sex 2739 0 2739
tag_id 2242 0 2242
species 166 0 166
age_group 121 0 121
colony_id 405 0 405
island 584 0 584
study_name 563 0 563
capture_date 915 0 915
date_egg 836 0 836
clutch_completion 463 0 463
health_status 554 0 554
📊 Categorical Shift Analysis
Column: sex
Normalized Values
Value Count
MALE 4108
FEMALE 1310
UNKNOWN 123
Value Audit (Before vs. After)
Value Original Count Imputed Count Change
NaN 2739 0 -2739
MALE 1369 4108 2739
FEMALE 1310 1310 0
UNKNOWN 123 123 0
Column: tag_id
Normalized Values
Value Count
UNKNOWN 2242
GEN-0271 5
ADE-0119 4
GEN-0143 4
ADE-0176 4
GEN-0751 4
GEN-0673 4
GEN-0433 4
GEN-0902 4
GEN-0106 4
Value Audit (Before vs. After)
Value Original Count Imputed Count Change
NaN 2242 0 -2242
GEN-0271 5 5 0
ADE-0119 4 4 0
ADE-0176 4 4 0
ADE-0203 4 4 0
CHN-0905 4 4 0
GEN-0054 4 4 0
GEN-0106 4 4 0
GEN-0143 4 4 0
GEN-0433 4 4 0
Column: species
Normalized Values
Value Count
Gentoo 1815
Adelie 1784
Chinstrap 1776
UNKNOWN 166
Value Audit (Before vs. After)
Value Original Count Imputed Count Change
Gentoo 1815 1815 0
Adelie 1784 1784 0
Chinstrap 1776 1776 0
NaN 166 0 -166
UNKNOWN 0 166 166
Column: age_group
Normalized Values
Value Count
Adult 3822
Juvenile 1073
Chick 477
UNKNOWN 169
Value Audit (Before vs. After)
Value Original Count Imputed Count Change
Adult 3822 3822 0
Juvenile 1073 1073 0
Chick 477 477 0
NaN 121 0 -121
UNKNOWN 48 169 121
Column: colony_id
Normalized Values
Value Count
Torgersen North 1490
Dream South 1216
Biscoe West 1092
Cormorant East 767
Shortcut Point 511
UNKNOWN 465
Value Audit (Before vs. After)
Value Original Count Imputed Count Change
Torgersen North 1490 1490 0
Dream South 1216 1216 0
Biscoe West 1092 1092 0
Cormorant East 767 767 0
Shortcut Point 511 511 0
NaN 405 0 -405
UNKNOWN 60 465 405
Column: island
Normalized Values
Value Count
Torgersen 1405
Dream 1184
Biscoe 1084
Cormorant 715
UNKNOWN 643
Shortcut 510
Value Audit (Before vs. After)
Value Original Count Imputed Count Change
Torgersen 1405 1405 0
Dream 1184 1184 0
Biscoe 1084 1084 0
Cormorant 715 715 0
NaN 584 0 -584
Shortcut 510 510 0
UNKNOWN 59 643 584
Column: study_name
Normalized Values
Value Count
PAPRI2020 1122
PAPRI2021 1024
PAPRI2022 916
PAPRI2023 824
PAPRI2024 803
UNKNOWN 600
PAPRI2019 252
Value Audit (Before vs. After)
Value Original Count Imputed Count Change
PAPRI2020 1122 1122 0
PAPRI2021 1024 1024 0
PAPRI2022 916 916 0
PAPRI2023 824 824 0
PAPRI2024 803 803 0
NaN 563 0 -563
PAPRI2019 252 252 0
UNKNOWN 37 600 563
Column: clutch_completion
Normalized Values
Value Count
yes 4314
no 764
UNKNOWN 463
Value Audit (Before vs. After)
Value Original Count Imputed Count Change
yes 4314 4314 0
no 764 764 0
NaN 463 0 -463
UNKNOWN 0 463 463
Column: health_status
Normalized Values
Value Count
Healthy 2194
Underweight 1411
Overweight 733
UNKNOWN 584
Critical 323
Sick 296
Value Audit (Before vs. After)
Value Original Count Imputed Count Change
Healthy 2194 2194 0
Underweight 1411 1411 0
Overweight 733 733 0
NaN 554 0 -554
Critical 323 323 0
Sick 296 296 0
UNKNOWN 30 584 554
⚠️ Remaining Nulls Found

The following columns still contain null values after imputation:

Column Remaining Nulls
bill_length_mm_iqr_outlier 429
body_mass_g_zscore_outlier 406
[Interactive widget: Imputation Visualizations]
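
The Imputation Actions Log above lists four strategy types: mean, median, mode, and constant. The following is a minimal sketch of how such per-column rules could be applied with pandas. It is not the toolkit's implementation: the `impute` function, the `rules` dictionary layout, and the toy data are assumptions made for illustration.

```python
import pandas as pd

# Toy frame standing in for the penguin data (values are illustrative only).
df = pd.DataFrame({
    "bill_length_mm": [39.1, None, 46.5, 50.0],
    "bill_depth_mm": [18.7, 17.4, None, 15.0],
    "sex": ["MALE", None, "FEMALE", "MALE"],
    "species": ["Adelie", "Gentoo", None, "Gentoo"],
})

# Hypothetical rule format: a strategy name, or a (strategy, value) pair
# for constant fills.
rules = {
    "bill_length_mm": "mean",
    "bill_depth_mm": "median",
    "sex": "mode",
    "species": ("constant", "UNKNOWN"),
}


def impute(frame, rules):
    """Fill nulls column by column and return (imputed frame, actions log)."""
    out = frame.copy()
    log = []
    for col, rule in rules.items():
        strategy, value = rule if isinstance(rule, tuple) else (rule, None)
        if strategy == "mean":
            fill = out[col].mean()
        elif strategy == "median":
            fill = out[col].median()
        elif strategy == "mode":
            fill = out[col].mode().iloc[0]  # most frequent non-null value
        elif strategy == "constant":
            fill = value
        else:
            raise ValueError(f"Unknown strategy: {strategy}")
        nulls_before = int(out[col].isna().sum())
        out[col] = out[col].fillna(fill)
        nulls_after = int(out[col].isna().sum())
        log.append({
            "Column": col,
            "Strategy": strategy,
            "Fill Value": fill,
            "Nulls Filled": nulls_before - nulls_after,
        })
    return out, pd.DataFrame(log)


imputed, actions = impute(df, rules)
print(actions)
```

Mode imputation can shift category balance sharply. In the audit above, MALE rises from 1369 to 4108 once the 2739 nulls in sex are filled, which is the kind of change the Categorical Shift Analysis is designed to surface.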

🧩 Behind the Scenes: Utility & Visual Modules

Several specialized support modules power the Analyst Toolkit pipeline behind the scenes.
Most of these are not called directly in the notebook (the config and data loaders are the exceptions), but they are crucial to the system's flexibility and polish:

🧰 m00_utils/

  • config_loader.py: Robust loader with support for environment paths and nested YAMLs
  • load_data.py: Abstracted CSV/Joblib loader with encoding fallback
  • export_utils.py: Modular export system for saving reports and checkpoints
  • rendering_utils.py: Styled HTML table generator for dashboard outputs

📊 m08_visuals/

  • distributions.py: Boxplots, histograms, and violin plots for outlier detection
  • summary_plots.py: Heatmaps, missingness matrices, and dtype summaries
  • plot_viewer.py: Interactive PlotViewer widget for inspecting flagged values and category shifts

⚙️ These modules enable notebook-mode display, CLI compatibility, YAML-driven plotting, and clean HTML export dashboards.

📁 Explore these utilities in the /src/ directory to understand how the toolkit remains modular, extensible, and production-grade.
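
As an illustration of the "encoding fallback" idea mentioned for load_data.py, here is a minimal standard-library sketch. It is not the toolkit's code: the function name `load_csv_with_fallback`, the encoding order, and the dictionary-row return type are assumptions (the real loader presumably returns a DataFrame).

```python
import csv
import io
import tempfile
from pathlib import Path


def load_csv_with_fallback(path, encodings=("utf-8-sig", "cp1252", "latin-1")):
    """Return (rows, encoding) using the first encoding that decodes the file.

    utf-8-sig comes first because it reads UTF-8 with or without a BOM;
    latin-1 comes last because it accepts any byte sequence.
    """
    raw = Path(path).read_bytes()
    for encoding in encodings:
        try:
            text = raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # try the next candidate encoding
        rows = list(csv.DictReader(io.StringIO(text, newline="")))
        return rows, encoding
    raise ValueError(f"Could not decode {path} with any of {encodings}")


# Demo: a file saved as cp1252 is not valid UTF-8, so the loader falls back.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "penguins_cp1252.csv"
    sample.write_bytes("island,note\nBiscoe,café\n".encode("cp1252"))
    rows, used = load_csv_with_fallback(sample)

print(used, rows)
```

Returning the encoding alongside the rows makes it easy to log which fallback was needed, which helps when tracing garbled characters back to a source file.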

🎬 Step 9: Final Auditing and Certification (M10)

This final module performs a comprehensive audit of the cleaned dataset and applies strict quality checks before certification.

It serves as the final quality gate and includes:

  • ✅ Final Edits: Drop or rename columns, coerce dtypes as needed
  • ✅ Certification Check: Applies validation rules with fail_on_error: true to enforce schema, dtypes, and content requirements
  • ✅ Lifecycle Comparison: Compares raw vs final structure, nulls, and column presence
  • ✅ Capstone Report: Renders a complete dashboard summarizing pipeline impact and status

🛡️ If any rule is violated (e.g., unexpected nulls or schema mismatch), the system halts and logs failure details for debugging.

📁 What's Exported:

  • Final Audit Report (XLSX and Joblib)
  • Final Certified Dataset (CSV and Joblib)
  • Inline dashboard with all results

🛠️ Customize certification rules, null restrictions, or output paths in final_audit_config_template.yaml.

🎉 Once this step passes, your dataset is ready for production use or modeling pipelines.
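
To make the quality-gate behavior concrete, the sketch below shows one way a check with a fail_on_error switch could work. It is an assumption-laden illustration, not the M10 implementation: `certify`, `CertificationError`, and the `expected_columns` argument are hypothetical names, and only schema and null checks are covered.

```python
import pandas as pd


class CertificationError(Exception):
    """Raised when a rule fails and fail_on_error is enabled (hypothetical)."""


def certify(df, expected_columns, fail_on_error=True):
    """Check column conformity and nulls; return (passed, failures)."""
    actual = list(df.columns)
    failures = {
        "unexpected_columns": [c for c in actual if c not in expected_columns],
        "missing_columns": [c for c in expected_columns if c not in actual],
        "columns_with_nulls": [c for c in actual if df[c].isna().any()],
    }
    passed = not any(failures.values())
    if not passed and fail_on_error:
        # Halt instead of returning, mirroring a strict gate.
        raise CertificationError(f"Certification failed: {failures}")
    return passed, failures


# Toy data with one column the schema does not expect.
df = pd.DataFrame({
    "tag_id": ["ADE-0001", "GEN-0001"],
    "species": ["Adelie", "Gentoo"],
    "is_duplicate": [True, False],
})
expected = ["tag_id", "species"]

# Lenient mode reports the failures without raising.
passed, failures = certify(df, expected, fail_on_error=False)
print(passed, failures)

# Strict mode raises on the same input.
try:
    certify(df, expected, fail_on_error=True)
except CertificationError as err:
    print(err)
```

Running in lenient mode first is a convenient way to see every violation at once before switching the gate to strict.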

In [11]:
# 🎬 M10: Final Auditing and Certification

from analyst_toolkit.m10_final_audit.final_audit_pipeline import run_final_audit_pipeline
from analyst_toolkit.m00_utils.config_loader import load_config

# --- Load Config ---
final_audit_config_full = load_config("config/final_audit_config_template.yaml")

# --- Run Final Audit ---
# The final audit pipeline expects the full config dictionary, as it may perform
# validation using rules from a separate block.
df_final_clean = run_final_audit_pipeline(
    config=final_audit_config_full,
    df=df_imputed,  # Pass the existing DataFrame here
    notebook=notebook_mode,
    run_id=run_id
)
❌ CERTIFICATION FAILED
⚠️ Failure Details

🚦 Failures Schema Conformity

Issue Type Columns
Unexpected is_duplicate
📈 Pipeline Summary

📊 Pipeline Status

Metric Value
Final Pipeline Status ❌ CERTIFICATION FAILED
Certification Rules Passed False
Null Value Audit Passed True

🛠️ Final Edits Log

Action Details
drop_columns Removed: ['body_mass_g_zscore_outlier', 'bill_length_mm_iqr_outlier']
🔬 Final Data Profile

🧬 Data Lifecycle

Metric Value
Initial Rows 5541
Final Rows 5541
Initial Columns 15
Final Columns 16
Audit Remarks Key:
  • ✅ OK: Passed all configured quality checks.
  • ⚠️ High Skew: Skewness exceeds threshold.
  • ⚠️ Unexpected Type: Data type mismatch.

📚 Data Dictionary / Schema

Column Dtype Unique Values Audit Remarks Missing Count Missing %
tag_id object 2679 ✅ OK 0 0.0
species object 4 ✅ OK 0 0.0
bill_length_mm float64 1985 ✅ OK 0 0.0
bill_depth_mm float64 863 ✅ OK 0 0.0
flipper_length_mm float64 1467 ✅ OK 0 0.0
body_mass_g float64 3324 ✅ OK 0 0.0
age_group object 4 ✅ OK 0 0.0
sex object 3 ✅ OK 0 0.0
colony_id object 6 ✅ OK 0 0.0
island object 6 ✅ OK 0 0.0
capture_date datetime64[ns] 1746 ✅ OK 0 0.0
health_status object 6 ✅ OK 0 0.0
study_name object 7 ✅ OK 0 0.0
clutch_completion object 3 ✅ OK 0 0.0
date_egg datetime64[ns] 1657 ✅ OK 0 0.0
is_duplicate bool 2 ✅ OK 0 0.0
🔢 Descriptive Statistics
Metric count mean std min 25% 50% 75% max skew kurtosis
bill_length_mm 5541.0 45.166681 5.442593 30.63 40.98 45.240 49.07 62.635000 -0.151954 -0.405922
bill_depth_mm 5541.0 17.318895 2.146392 12.37 15.65 17.485 18.92 23.010000 -0.134672 -0.725335
flipper_length_mm 5541.0 201.999903 13.769645 162.79 191.80 199.315 213.00 252.400000 0.392703 -0.397019
body_mass_g 5541.0 3842.084375 845.336672 2376.56 3264.00 3806.000 4266.00 6965.072934 0.552218 0.015302
📄 Data Preview (.head)
tag_id species bill_length_mm bill_depth_mm flipper_length_mm body_mass_g age_group sex colony_id island capture_date health_status study_name clutch_completion date_egg is_duplicate
UNKNOWN Gentoo 48.99 14.11 220.900 5890.0 Adult MALE Torgersen North Torgersen 2023-11-17 UNKNOWN PAPRI2023 yes 2023-11-09 True
UNKNOWN Gentoo 48.99 14.11 220.900 5890.0 Adult MALE Torgersen North Torgersen 2023-11-17 UNKNOWN PAPRI2023 yes 2023-11-09 True
ADE-0001 Adelie 39.55 19.92 186.200 2500.0 Chick MALE Biscoe West Biscoe 1900-01-01 Underweight PAPRI2022 yes 2022-07-20 True
UNKNOWN Gentoo 48.23 13.00 199.315 4536.0 Adult FEMALE Biscoe West UNKNOWN 2024-04-14 Healthy UNKNOWN yes 2024-04-12 False
GEN-0001 Gentoo 46.22 13.91 212.800 2500.0 Juvenile FEMALE Dream South Dream 1900-01-01 Underweight PAPRI2020 yes 2020-04-14 True
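
The run above fails certification because is_duplicate is reported as an unexpected column, and the Data Lifecycle table shows the column count growing from 15 to 16. The sketch below illustrates a lifecycle comparison that would reveal such an added column. The `lifecycle_summary` helper and the toy frames are hypothetical and are not the toolkit's API.

```python
import pandas as pd


def lifecycle_summary(raw: pd.DataFrame, final: pd.DataFrame) -> dict:
    """Compare raw and final frames by shape and column presence."""
    raw_cols, final_cols = set(raw.columns), set(final.columns)
    return {
        "Initial Rows": len(raw),
        "Final Rows": len(final),
        "Initial Columns": raw.shape[1],
        "Final Columns": final.shape[1],
        "Added Columns": sorted(final_cols - raw_cols),
        "Dropped Columns": sorted(raw_cols - final_cols),
    }


# Toy stand-ins for the raw and post-pipeline frames.
raw = pd.DataFrame({
    "tag_id": ["ADE-0001", "GEN-0001"],
    "species": ["Adelie", "Gentoo"],
})
final = raw.assign(is_duplicate=[True, False])

summary = lifecycle_summary(raw, final)
print(summary)

# One possible remedy: drop pipeline-added helper columns before certifying.
final_for_certification = final.drop(columns=summary["Added Columns"])
print(list(final_for_certification.columns))
```

In the toolkit itself, the equivalent fix would presumably be to list is_duplicate under the drop_columns final edit (which already removes the two outlier flag columns) or to add it to the expected schema, then rerun the final audit.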

🧭 What’s Next?ΒΆ

Congratulations β€” you’ve now completed a full walkthrough of the Analyst Toolkit pipeline using synthetic Palmer Penguins data!

Here are some suggested next steps:

  1. 🔍 Explore Outputs

    • Review the exported reports and plots in the exports/ folder
    • Inspect final audit and certification summaries
  2. 🧪 Test with Other Datasets

    • Replace the penguin dataset with your own CSV in the YAML configs
    • Adjust schema, value, and range rules accordingly
  3. 📓 Use the Full Pipeline Script

    • Try running run_toolkit_pipeline.py in CLI or notebook mode for a full end-to-end execution
    • Config: config/run_toolkit_config.yaml
  4. 🛠️ Customize Modules

    • Add new modules (e.g., feature engineering, modeling)
    • Use your own diagnostic thresholds or imputation logic
  5. 🚀 Package or Deploy

    • Deploy the toolkit in production (Airflow, Papermill, GitHub Actions, etc.)
    • Or package it as a Python module for reuse