🧪 Analyst Toolkit Tutorial: Full Data Pipeline
This interactive notebook demonstrates the complete analyst pipeline using a synthetic Palmer Penguins dataset generated with the dirty birds data synthetic data generator repository.
Each step in the pipeline is modular, YAML-configurable, and produces exports, plots, and certification-ready reports.
The toolkit is packaged via pyproject.toml and can be run via script or notebook.
🧰 Toolkit Architecture: 3-Way Modular Design
This pipeline is built around a flexible ETL framework with three usage modes:
Notebook Mode
- Run individual modules or the full pipeline interactively
- Supports HTML dashboards, widgets, and live previews
- Ideal for iterative exploration, first-pass audits, and QA workflows
🧵 CLI Mode
- Execute the full pipeline using run_toolkit_pipeline.py
- Controlled via a master YAML config
- Exports all reports, checkpoints, and logs to disk
🧪 Hybrid Mode
- Develop in notebooks, deploy via scripts
- Reuse the same configs across testing and production
The toolkit handles essential data cleaning and transformation tasks, enabling analysts to focus on:
- Exploratory Data Analysis (EDA)
- Investigating anomalies and data quality issues
- Extracting actionable insights from certified data
# Load Configuration and Set Execution Context
from analyst_toolkit.m00_utils.config_loader import load_config
# Path to master config (modify if needed)
config_path = "config/run_toolkit_config.yaml"
# Load full configuration dictionary
config = load_config(config_path)
# Extract run-level settings
run_id = config.get("run_id", "default_run")
notebook_mode = config.get("notebook", True)
print(f"Config loaded | Run ID: {run_id} | Notebook Mode: {notebook_mode}")
Config loaded | Run ID: CLI_2_QA | Notebook Mode: True
# Load Raw Data from CSV
from analyst_toolkit.m00_utils.load_data import load_csv
# Load input path from the global config (or override manually)
input_path = config.get("pipeline_entry_path", "data/raw/synthetic_penguins_v3.5.csv")
print(f"Loading data from: {input_path}")
# Load into DataFrame
df_raw = load_csv(input_path)
Loading data from: data/raw/synthetic_penguins_v3.5.csv
🧪 Step 1: Run Initial Diagnostics (M01)
This module profiles the raw dataset for key structural and quality checks:
- Memory, Shape, Dtypes
- Missing Values & Skewness
- Duplicate Detection
- Sample Rows & Descriptive Stats
✅ All results are rendered in a collapsible dashboard with exportable reports.
You can toggle inline previews and export settings via the YAML config (diag_config_template.yaml).
🛠️ To modify thresholds or toggle sections, edit the config under diagnostics.settings.
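The missing-value and duplicate checks listed above can be approximated in a few lines. The sketch below is not the toolkit's implementation: it uses plain Python on a list of dict rows so it stays dependency-free, and the helper names missing_summary and duplicate_count are hypothetical.

```python
def missing_summary(rows, columns):
    """Return {column: (missing_count, missing_pct)} for a list of dict rows."""
    total = len(rows)
    summary = {}
    for col in columns:
        missing = sum(1 for row in rows if row.get(col) is None)
        pct = round(100 * missing / total, 2) if total else 0.0
        summary[col] = (missing, pct)
    return summary


def duplicate_count(rows, columns):
    """Count rows that exactly repeat an earlier row across the given columns."""
    seen = set()
    dupes = 0
    for row in rows:
        key = tuple(row.get(col) for col in columns)
        if key in seen:
            dupes += 1
        else:
            seen.add(key)
    return dupes


# Tiny illustrative sample; None stands in for NaN.
sample_rows = [
    {"tag_id": None, "species": "Gentoo", "body_mass_g": 5890.0},
    {"tag_id": None, "species": "Gentoo", "body_mass_g": 5890.0},
    {"tag_id": "ADE-0001", "species": "Adelie", "body_mass_g": 2500.0},
    {"tag_id": "GEN-0001", "species": "Gentoo", "body_mass_g": 2500.0},
]
sample_columns = ["tag_id", "species", "body_mass_g"]

print(missing_summary(sample_rows, sample_columns))
print(duplicate_count(sample_rows, sample_columns))
```

The same percentage formula reproduces the figures in the profile output: tag_id is missing in 2242 of 5541 rows, and 100 * 2242 / 5541 rounds to 40.46.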
# M01: Data Diagnostics, Profile Structure & Shape
from analyst_toolkit.m00_utils.config_loader import load_config
from analyst_toolkit.m01_diagnostics.run_diag_pipeline import run_diag_pipeline
# --- Load module-specific config ---
diag_config_full = load_config("config/diag_config_template.yaml")
# --- Run Diagnostics Module ---
# We pass the df_raw loaded in the previous step.
# The global run_id and notebook_mode are used.
df_profiled = run_diag_pipeline(
    config=diag_config_full,  # Pass the full config object
    df=df_raw,
    notebook=notebook_mode,
    run_id=run_id
)
Key Metrics
Shape
Rows | Columns |
---|---|
5541 | 15 |
Memory Usage
Memory Usage |
---|
3.26 MB |
♻️ Duplicate Summary
Duplicate Rows | Duplicate % |
---|---|
1 | 0.02 |
Full Profile & Cardinality
High Cardinality
Column | Unique Values |
---|---|
tag_id | 2678 |
capture_date | 1917 |
date_egg | 1656 |
colony_id | 19 |
- ✅ OK: Passed all configured quality checks.
- ⚠️ High Skew: Skewness exceeds the configured threshold.
- ⚠️ Unexpected Type: Data type does not match the expected type.
Full Data Profile
Column | Dtype | Unique Values | Audit Remarks | Missing Count | Missing % |
---|---|---|---|---|---|
tag_id | object | 2678 | ✅ OK | 2242 | 40.46 |
species | object | 5 | ✅ OK | 166 | 3.00 |
bill length (mm) | float64 | 1984 | ✅ OK | 429 | 7.74 |
bill depth (mm) | float64 | 862 | ✅ OK | 417 | 7.53 |
flipper_length_mm | float64 | 1466 | ✅ OK | 451 | 8.14 |
body_mass_g | float64 | 3328 | ✅ OK | 406 | 7.33 |
age_group | object | 7 | ✅ OK | 121 | 2.18 |
sex | object | 6 | ✅ OK | 2739 | 49.43 |
colony_id | object | 19 | ✅ OK | 405 | 7.31 |
island | object | 11 | ✅ OK | 584 | 10.54 |
capture_date | object | 1917 | ✅ OK | 534 | 9.64 |
health_status | object | 9 | ✅ OK | 554 | 10.00 |
study_name | object | 12 | ✅ OK | 563 | 10.16 |
clutch_completion | object | 2 | ✅ OK | 463 | 8.36 |
date_egg | object | 1656 | ✅ OK | 836 | 15.09 |
Quantitative Summary
Descriptive Statistics
Metric | count | mean | std | min | 25% | 50% | 75% | max | skew | kurtosis |
---|---|---|---|---|---|---|---|---|---|---|
bill length (mm) | 5112.0 | 45.166682 | 5.666410 | 30.63 | 40.51 | 45.950 | 49.360 | 62.64 | -0.145952 | -0.606829 |
bill depth (mm) | 5124.0 | 17.305377 | 2.231495 | 12.37 | 15.49 | 17.485 | 19.030 | 23.01 | -0.111456 | -0.897492 |
flipper_length_mm | 5090.0 | 202.237800 | 14.342621 | 162.79 | 191.10 | 199.315 | 214.100 | 252.40 | 0.329099 | -0.616376 |
body_mass_g | 5135.0 | 3853.645265 | 898.232986 | 2376.56 | 3219.50 | 3742.000 | 4376.515 | 7378.33 | 0.616778 | 0.086446 |
Preview of Duplicated Rows
tag_id | species | bill length (mm) | bill depth (mm) | flipper_length_mm | body_mass_g | age_group | sex | colony_id | island | capture_date | health_status | study_name | clutch_completion | date_egg |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
NaN | Gentoo | 48.99 | 14.11 | 220.9 | 5890.0 | Adult | Male | Torgersen North | Torgersen | 2023-11-17 | NaN | PAPRI2023 | Yes | 2023-11-09 |
NaN | Gentoo | 48.99 | 14.11 | 220.9 | 5890.0 | Adult | Male | Torgersen North | Torgersen | 2023-11-17 | NaN | PAPRI2023 | Yes | 2023-11-09 |
First Rows Preview
First 5 Rows (.head)
tag_id | species | bill length (mm) | bill depth (mm) | flipper_length_mm | body_mass_g | age_group | sex | colony_id | island | capture_date | health_status | study_name | clutch_completion | date_egg |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
NaN | Gentoo | 48.99 | 14.11 | 220.9 | 5890.0 | Adult | Male | Torgersen North | Torgersen | 2023-11-17 | NaN | PAPRI2023 | Yes | 2023-11-09 |
NaN | Gentoo | 48.99 | 14.11 | 220.9 | 5890.0 | Adult | Male | Torgersen North | Torgersen | 2023-11-17 | NaN | PAPRI2023 | Yes | 2023-11-09 |
ADE-0001 | Adelie | 39.55 | 19.92 | 186.2 | 2500.0 | Chick | Male | Biscoe West | Biscoe | 2024-13-03 | Underweight | PAPRI2022 | Yes | 2022-07-20 |
NaN | Gentoo | 48.23 | 13.00 | NaN | 4536.0 | Adult | Female | Biscoe West | NaN | 2024-04-14 | Healthy | NaN | Yes | 2024-04-12 |
GEN-0001 | Gentoo | 46.22 | 13.91 | 212.8 | 2500.0 | Juvenile | Female | Dream South | Dream | NaN | Underweight | PAPRI2020 | Yes | 2020-04-14 |
🛡️ Step 2: Run Schema & Content Validation (M02)
This module audits the dataset against a defined schema to catch issues early and guide cleaning steps:
- Expected Columns & Dtypes
- Allowed Categorical Values
- Numeric Range Checks
- Null Allowance (optional)
✅ All results are displayed in a styled validation dashboard with exportable reports.
You can define strict or flexible rules in the YAML config (validation_config_template.yaml).
🛠️ To adjust enforcement (e.g. halt-on-fail), set fail_on_error and update rules under validation.schema_validation.
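To make the rule types above concrete, here is a minimal sketch of schema, categorical, and range checks over a list of dict rows. It is an illustration under stated assumptions, not the M02 code: validate_rows and its report keys are hypothetical names, and the body_mass_g range used in the example is invented for demonstration.

```python
from collections import Counter


def validate_rows(rows, expected_columns, allowed_values, numeric_ranges):
    """Check column names, categorical values, and numeric ranges."""
    actual_columns = set()
    for row in rows:
        actual_columns.update(row)

    report = {
        "missing_columns": sorted(set(expected_columns) - actual_columns),
        "unexpected_columns": sorted(actual_columns - set(expected_columns)),
        "invalid_categories": {},
        "out_of_range": {},
    }

    # Count every non-null value that falls outside the allowed set.
    for column, allowed in allowed_values.items():
        bad = Counter(
            row[column]
            for row in rows
            if row.get(column) is not None and row[column] not in allowed
        )
        if bad:
            report["invalid_categories"][column] = dict(bad)

    # Collect non-null numeric values outside the inclusive [low, high] range.
    for column, (low, high) in numeric_ranges.items():
        bad_values = [
            row[column]
            for row in rows
            if row.get(column) is not None and not (low <= row[column] <= high)
        ]
        if bad_values:
            report["out_of_range"][column] = bad_values

    return report


sample_rows = [
    {"species": "Adelie", "bill length (mm)": 39.55, "body_mass_g": 2500.0},
    {"species": "Gentto", "bill length (mm)": 48.23, "body_mass_g": 4536.0},
    {"species": "adeleie", "bill length (mm)": None, "body_mass_g": 3200.0},
]
report = validate_rows(
    sample_rows,
    expected_columns=["species", "bill_length_mm", "body_mass_g"],
    allowed_values={"species": {"Adelie", "Chinstrap", "Gentoo"}},
    numeric_ranges={"body_mass_g": (1000.0, 10000.0)},  # hypothetical range
)
print(report)
```

Here the raw column name "bill length (mm)" shows up as unexpected while bill_length_mm is reported missing, which mirrors the kind of schema conformity failure the validation dashboard reports.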
# 🛡️ M02: Schema & Content Validation, First Audit Pass
from analyst_toolkit.m00_utils.config_loader import load_config
from analyst_toolkit.m02_validation.run_validation_pipeline import run_validation_pipeline
# --- Load module-specific config ---
val_config_full = load_config("config/validation_config_template.yaml")
# --- Run Validation Module ---
df_validated = run_validation_pipeline(
    config=val_config_full,
    df=df_profiled,
    notebook=notebook_mode,
    run_id=run_id
)
Validation Rules Summary
Validation Rule | Description | Status |
---|---|---|
Schema Conformity | Verify column names match the expected schema. | ⚠️ Fail (2 issues) |
Dtype Enforcement | Verify column data types match expectations. | ⚠️ Fail (1 issues) |
Categorical Values | Verify values in categorical columns are within an allowed set. | ⚠️ Fail (7 issues) |
Numeric Ranges | Verify values in numeric columns are within a defined range. | ✅ Pass |
- ✅ Pass: The data conforms to this rule.
- ⚠️ Fail: One or more issues were found. See drill-down for details.
Failure Details
⚠️ Drill-Down: Schema Conformity
Issue Type | Columns |
---|---|
Missing | bill_length_mm, bill depth_mm |
Unexpected | bill depth (mm), bill length (mm) |
⚠️ Drill-Down: Dtype Enforcement
Column | Expected Type | Actual Type |
---|---|---|
flipper_length_mm | int64 | float64 |
⚠️ Drill-Down: Categorical Values
Rule Violated: Values for column species must be in the allowed set.
Allowed Values:
['Adelie', 'Chinstrap', 'Gentoo']
Invalid Values Found:
Invalid Value | Count |
---|---|
adeleie | 148 |
Gentto | 145 |
Rule Violated: Values for column island must be in the allowed set.
Allowed Values:
['Dream', 'Biscoe', 'Torgersen', 'Cormorant', 'Shortcut']
Invalid Values Found:
Invalid Value | Count |
---|---|
short cut | 70 |
torg | 61 |
unknown | 59 |
bisco | 55 |
cormor | 47 |
dreamland | 46 |
Rule Violated: Values for column sex must be in the allowed set.
Allowed Values:
['male', 'female', 'UNKNOWN']
Invalid Values Found:
Invalid Value | Count |
---|---|
Male | 1308 |
Female | 1227 |
F | 83 |
? | 74 |
M | 61 |
Unknown | 49 |
Rule Violated: Values for column colony_id must be in the allowed set.
Allowed Values:
['Biscoe West', 'Cormorant East', 'Dream South', 'Shortcut Point', 'Torgersen North']
Invalid Values Found:
Invalid Value | Count |
---|---|
cormorant NW | 45 |
invalid_colony | 36 |
Torgersen | 35 |
Cormorant | 34 |
biscoe 2 | 34 |
torgersen SE | 31 |
TORGERSEN 4 | 30 |
short point | 28 |
/Shortcut | 26 |
Biscoe | 25 |
dream island | 24 |
Unknown | 24 |
Dream Island | 22 |
dream | 19 |
Rule Violated: Values for column age_group must be in the allowed set.
Allowed Values:
['Juvenile', 'Adult', 'Chick', 'UNKNOWN']
Invalid Values Found:
Invalid Value | Count |
---|---|
juvenille | 58 |
unk | 48 |
ADLT | 47 |
chik | 29 |
Rule Violated: Values for column health_status must be in the allowed set.
Allowed Values:
['Healthy', 'Critically Ill', 'Underweight', 'Unwell', 'Overweight', 'Unknown']
Invalid Values Found:
Invalid Value | Count |
---|---|
critcal ill | 36 |
Overwight | 34 |
under weight | 33 |
ok | 30 |
Rule Violated: Values for column study_name must be in the allowed set.
Allowed Values:
['PAPRI2019', 'PAPRI2020', 'PAPRI2021', 'PAPRI2022', 'PAPRI2023', 'PAPRI2024']
Invalid Values Found:
Invalid Value | Count |
---|---|
PAPR12021 | 60 |
papri2024 | 58 |
STUDY_2022 | 57 |
PP2020 | 48 |
PAPR2023 | 46 |
PAPRI20X9 | 37 |
🧹 Step 3: Normalize & Standardize Data (M03)
This module performs rule-based cleaning and normalization to prepare the dataset for certification:
- Column Renaming & Type Coercion
- Value Mapping & Text Cleaning
- Fuzzy Matching & Datetime Parsing
✅ Results are rendered in a structured dashboard with before/after comparisons and audit previews.
All rules and output paths are controlled via the YAML config (normalization_config_template.yaml).
🛠️ To adjust cleaning logic, modify the rules block (e.g. value_mappings, preview_columns, etc).
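As an illustration of the fuzzy-matching idea, the sketch below corrects a misspelled category against an allowed list using the standard library's difflib.SequenceMatcher. This is a swapped-in technique for demonstration: the source does not say which scorer the toolkit uses, and the function name fuzzy_correct and the 0.8 cutoff are assumptions.

```python
from difflib import SequenceMatcher


def fuzzy_correct(value, allowed, cutoff=0.8):
    """Return (corrected_value, score) using a case-insensitive similarity ratio.

    The value is replaced by the best-scoring allowed entry only when the
    ratio reaches the cutoff; otherwise it is returned unchanged.
    """
    best, best_score = None, 0.0
    for candidate in allowed:
        score = SequenceMatcher(None, value.lower(), candidate.lower()).ratio()
        if score > best_score:
            best, best_score = candidate, score
    if best is not None and best_score >= cutoff:
        return best, round(best_score * 100)
    return value, round(best_score * 100)


species = ["Adelie", "Chinstrap", "Gentoo"]
islands = ["Dream", "Biscoe", "Torgersen", "Cormorant", "Shortcut"]

for raw, allowed in [("Gentto", species), ("adeleie", species),
                     ("bisco", islands), ("unknown", islands)]:
    print(raw, "->", fuzzy_correct(raw, allowed))
```

A plain ratio like this is strict about length differences, so truncated or padded entries such as torg or dreamland fall below the 0.8 cutoff here. The toolkit's transform log corrects those as well, which suggests it uses a different, possibly partial-match, scorer.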
# 🧹 M03: Data Normalization, Standardizing Key Fields
import logging
from analyst_toolkit.m00_utils.config_loader import load_config
from analyst_toolkit.m03_normalization.run_normalization_pipeline import run_normalization_pipeline
# --- Load Config ---
norm_config_full = load_config("config/normalization_config_template.yaml")
# --- Run Normalization Module ---
# Uses df_validated from the previous step and global run_id/notebook_mode.
df_normalized = run_normalization_pipeline(
    config=norm_config_full,
    df=df_validated,
    notebook=notebook_mode,
    run_id=run_id
)
Normalization Actions (Transform Log)
Columns Renamed (2)
Original Name | New Name |
---|---|
bill length (mm) | bill_length_mm |
bill depth (mm) | bill_depth_mm |
🧹 Strings Cleaned (2)
Column | Operation |
---|---|
clutch_completion | standardize_text |
sex | standardize_text |
Datetimes Parsed (2)
Column | Target Type |
---|---|
capture_date | datetime64[ns] |
date_egg | datetime64[ns] |
🧩 Values Mapped (7)
Column | Mappings Applied |
---|---|
sex | 7 |
species | 1 |
island | 1 |
colony_id | 14 |
age_group | 4 |
health_status | 7 |
study_name | 6 |
Fuzzy Matches (7)
Column | Original | Corrected | Score |
---|---|---|---|
species | Gentto | Gentoo | 83 |
species | adeleie | Adelie | 92 |
island | bisco | Biscoe | 91 |
island | short cut | Shortcut | 94 |
island | dreamland | Dream | 90 |
island | cormor | Cormorant | 90 |
island | torg | Torgersen | 90 |
Column Value Analysis: Before & After
Column: sex
Value | Count |
---|---|
NaN | 2739 |
MALE | 1369 |
FEMALE | 1310 |
UNKNOWN | 123 |
Value | Original Count | Normalized Count |
---|---|---|
NaN | 2739 | 2739 |
Male | 1308 | 0 |
Female | 1227 | 0 |
F | 83 | 0 |
? | 74 | 0 |
M | 61 | 0 |
Unknown | 49 | 0 |
MALE | 0 | 1369 |
FEMALE | 0 | 1310 |
UNKNOWN | 0 | 123 |
Column: island
Value | Count |
---|---|
Torgersen | 1405 |
Dream | 1184 |
Biscoe | 1084 |
Cormorant | 715 |
NaN | 584 |
Shortcut | 510 |
UNKNOWN | 59 |
Value | Original Count | Normalized Count |
---|---|---|
Torgersen | 1344 | 1405 |
Dream | 1138 | 1184 |
Biscoe | 1029 | 1084 |
Cormorant | 668 | 715 |
NaN | 584 | 584 |
Shortcut | 440 | 510 |
short cut | 70 | 0 |
torg | 61 | 0 |
unknown | 59 | 0 |
bisco | 55 | 0 |
cormor | 47 | 0 |
dreamland | 46 | 0 |
UNKNOWN | 0 | 59 |
Column: species
Value | Count |
---|---|
Gentoo | 1815 |
Adelie | 1784 |
Chinstrap | 1776 |
NaN | 166 |
Value | Original Count | Normalized Count |
---|---|---|
Chinstrap | 1776 | 1776 |
Gentoo | 1670 | 1815 |
Adelie | 1636 | 1784 |
NaN | 166 | 166 |
adeleie | 148 | 0 |
Gentto | 145 | 0 |
Column: health_status
Value | Count |
---|---|
Healthy | 2194 |
Underweight | 1411 |
Overweight | 733 |
NaN | 554 |
Critical | 323 |
Sick | 296 |
UNKNOWN | 30 |
Value | Original Count | Normalized Count |
---|---|---|
Healthy | 2194 | 2194 |
Underweight | 1378 | 1411 |
Overweight | 699 | 733 |
NaN | 554 | 554 |
Unwell | 296 | 0 |
Critically Ill | 287 | 0 |
critcal ill | 36 | 0 |
Overwight | 34 | 0 |
under weight | 33 | 0 |
ok | 30 | 0 |
Critical | 0 | 323 |
Sick | 0 | 296 |
UNKNOWN | 0 | 30 |
Column: colony_id
Value | Count |
---|---|
Torgersen North | 1490 |
Dream South | 1216 |
Biscoe West | 1092 |
Cormorant East | 767 |
Shortcut Point | 511 |
NaN | 405 |
UNKNOWN | 60 |
Value | Original Count | Normalized Count |
---|---|---|
Torgersen North | 1394 | 1490 |
Dream South | 1151 | 1216 |
Biscoe West | 1033 | 1092 |
Cormorant East | 688 | 767 |
Shortcut Point | 457 | 511 |
NaN | 405 | 405 |
cormorant NW | 45 | 0 |
invalid_colony | 36 | 0 |
Torgersen | 35 | 0 |
Cormorant | 34 | 0 |
biscoe 2 | 34 | 0 |
torgersen SE | 31 | 0 |
TORGERSEN 4 | 30 | 0 |
short point | 28 | 0 |
/Shortcut | 26 | 0 |
Biscoe | 25 | 0 |
Unknown | 24 | 0 |
dream island | 24 | 0 |
Dream Island | 22 | 0 |
dream | 19 | 0 |
Column: age_group
Value | Count |
---|---|
Adult | 3822 |
Juvenile | 1073 |
Chick | 477 |
NaN | 121 |
UNKNOWN | 48 |
Value | Original Count | Normalized Count |
---|---|---|
Adult | 3775 | 3822 |
Juvenile | 1015 | 1073 |
Chick | 448 | 477 |
NaN | 121 | 121 |
juvenille | 58 | 0 |
unk | 48 | 0 |
ADLT | 47 | 0 |
chik | 29 | 0 |
UNKNOWN | 0 | 48 |
Column: study_name
Value | Count |
---|---|
PAPRI2020 | 1122 |
PAPRI2021 | 1024 |
PAPRI2022 | 916 |
PAPRI2023 | 824 |
PAPRI2024 | 803 |
NaN | 563 |
PAPRI2019 | 252 |
UNKNOWN | 37 |
Value | Original Count | Normalized Count |
---|---|---|
PAPRI2020 | 1074 | 1122 |
PAPRI2021 | 964 | 1024 |
PAPRI2022 | 859 | 916 |
PAPRI2023 | 778 | 824 |
PAPRI2024 | 745 | 803 |
NaN | 563 | 563 |
PAPRI2019 | 252 | 252 |
PAPR12021 | 60 | 0 |
papri2024 | 58 | 0 |
STUDY_2022 | 57 | 0 |
PP2020 | 48 | 0 |
PAPR2023 | 46 | 0 |
PAPRI20X9 | 37 | 0 |
UNKNOWN | 0 | 37 |
Column: capture_date
Value | Count |
---|---|
NaT | 915 |
2023-01-18 | 10 |
2024-05-09 | 10 |
2024-02-01 | 9 |
2023-06-12 | 8 |
2020-12-25 | 8 |
2022-11-15 | 8 |
2023-06-10 | 8 |
2023-03-22 | 8 |
2024-01-01 | 8 |
2022-08-04 | 8 |
2022-12-03 | 8 |
2024-06-19 | 8 |
2023-09-27 | 7 |
2022-09-28 | 7 |
2022-09-27 | 7 |
2023-10-22 | 7 |
2024-04-25 | 7 |
2023-07-25 | 7 |
2023-08-24 | 7 |
Value | Original Count | Normalized Count |
---|---|---|
NaN | 534 | 915 |
9999-99-99 | 39 | 0 |
error | 33 | 0 |
not-a-date | 30 | 0 |
2023-01-18 | 10 | 10 |
2024-05-09 | 10 | 10 |
2024-02-01 | 9 | 9 |
2020-12-25 | 8 | 8 |
2022-08-04 | 8 | 8 |
2022-11-15 | 8 | 8 |
2022-12-03 | 8 | 8 |
2023-03-22 | 8 | 8 |
2023-06-10 | 8 | 8 |
2023-06-12 | 8 | 8 |
2024-01-01 | 8 | 8 |
2024-06-19 | 8 | 8 |
2020-07-02 | 7 | 7 |
2021-01-21 | 7 | 7 |
2022-01-09 | 7 | 7 |
2022-09-27 | 7 | 7 |
Column: date_egg
Value | Count |
---|---|
NaT | 836 |
2019-12-11 | 13 |
2019-12-27 | 12 |
2020-10-11 | 11 |
2020-07-20 | 11 |
2019-12-17 | 11 |
2019-11-25 | 11 |
2020-06-25 | 11 |
2021-04-03 | 10 |
2021-04-16 | 10 |
2023-10-08 | 10 |
2021-07-05 | 9 |
2022-10-26 | 9 |
2021-01-06 | 9 |
2022-07-13 | 9 |
2022-02-07 | 9 |
2020-01-22 | 9 |
2021-08-30 | 9 |
2020-09-20 | 9 |
2020-01-17 | 9 |
Value | Original Count | Normalized Count |
---|---|---|
NaN | 836 | 836 |
2019-12-11 | 13 | 13 |
2019-12-27 | 12 | 12 |
2019-11-25 | 11 | 11 |
2019-12-17 | 11 | 11 |
2020-06-25 | 11 | 11 |
2020-07-20 | 11 | 11 |
2020-10-11 | 11 | 11 |
2021-04-03 | 10 | 10 |
2021-04-16 | 10 | 10 |
2023-10-08 | 10 | 10 |
2020-01-17 | 9 | 9 |
2020-01-22 | 9 | 9 |
2020-02-26 | 9 | 9 |
2020-09-20 | 9 | 9 |
2021-01-06 | 9 | 9 |
2021-07-05 | 9 | 9 |
2021-08-30 | 9 | 9 |
2021-10-22 | 9 | 9 |
2022-02-07 | 9 | 9 |
Column: clutch_completion
Value | Count |
---|---|
yes | 4314 |
no | 764 |
NaN | 463 |
Value | Original Count | Normalized Count |
---|---|---|
Yes | 4314 | 0 |
No | 764 | 0 |
NaN | 463 | 463 |
yes | 0 | 4314 |
no | 0 | 764 |
🛡️ Step 4: Certification Gate (M02)
This step re-uses the Validation Module (M02), but with a stricter configuration to act as a quality gate. It is designed to halt the pipeline if violations are found:
- ✅ All column names, data types, categorical values, and numeric ranges must pass
- fail_on_error: true triggers a hard stop on validation failure
This step can be run at any point in the pipeline, not just the end.
Use it wherever you want to certify a dataset snapshot or block further execution unless data meets expectations.
✅ Results are rendered inline with full export support.
All certification rules live in the YAML config (certification_config_template.yaml).
🛠️ Adjust gatekeeping behavior by modifying schema rules or toggling fail_on_error.
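The gate behavior described above comes down to one decision: raise when any rule fails and fail_on_error is set, otherwise report and continue. The following is a minimal sketch of that logic; CertificationError and certification_gate are hypothetical names, and how the toolkit actually halts execution is not shown in this notebook.

```python
class CertificationError(Exception):
    """Raised when the gate is strict and at least one rule failed."""


def certification_gate(rule_results, fail_on_error=True):
    """Return the names of failed rules, or raise if the gate is strict.

    rule_results maps a rule name to True (pass) or False (fail).
    """
    failed = [name for name, passed in rule_results.items() if not passed]
    if failed and fail_on_error:
        raise CertificationError("Certification failed: " + ", ".join(failed))
    return failed


results = {
    "Schema Conformity": True,
    "Dtype Enforcement": True,
    "Categorical Values": False,
    "Numeric Ranges": True,
}

# Lenient mode reports the failure and lets the pipeline continue.
print(certification_gate(results, fail_on_error=False))

# Strict mode stops the run at the first certification attempt that fails.
try:
    certification_gate(results, fail_on_error=True)
except CertificationError as exc:
    print(exc)
```

Raising an exception rather than returning a status code means downstream cells or script steps cannot run on uncertified data by accident.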
# 🛡️ M02: Certification (Strict Validation Gatekeeper)
import logging
from analyst_toolkit.m00_utils.config_loader import load_config
from analyst_toolkit.m02_validation.run_validation_pipeline import run_validation_pipeline
# --- Load Certification Config ---
cert_config_full = load_config("config/certification_config_template.yaml")
# --- Run Final Certification Pass ---
logging.info("Starting Certification Gate (re-using M02)")
df_certified = run_validation_pipeline(
    config=cert_config_full,
    df=df_normalized,
    notebook=notebook_mode,
    run_id=run_id
)
Validation Rules Summary
Validation Rule | Description | Status |
---|---|---|
Schema Conformity | Verify column names match the expected schema. | ✅ Pass |
Dtype Enforcement | Verify column data types match expectations. | ✅ Pass |
Categorical Values | Verify values in categorical columns are within an allowed set. | ✅ Pass |
Numeric Ranges | Verify values in numeric columns are within a defined range. | ✅ Pass |
- ✅ Pass: The data conforms to this rule.
- ⚠️ Fail: One or more issues were found. See drill-down for details.
🧹 Step 5: Deduplication (M04)
This module identifies and handles duplicate rows in the dataset, using the logic from m04_duplicates.
You can choose to:
- Flag duplicates for review
- Remove duplicates directly (default: keep first occurrence)
✅ Configurable logic lets you define:
- Which columns to check for duplication (subset_columns)
- Whether to flag or drop (mode: "flag" or "remove")
- Columns to preview (hide IDs, timestamps, etc.)
Results are displayed with an inline preview and summary plots.
🛠️ Adjust deduplication behavior in dups_config_template.yaml.
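The two modes can be sketched as follows. This is an illustrative approximation over a list of dict rows, not the m04_duplicates code: handle_duplicates is a hypothetical helper, and it assumes that only repeat occurrences are flagged while the first occurrence is kept, which may differ from how the toolkit counts duplicate clusters.

```python
def handle_duplicates(rows, subset_columns, mode="flag"):
    """Flag or remove rows that repeat an earlier row on subset_columns."""
    if mode not in ("flag", "remove"):
        raise ValueError(f"unknown mode: {mode!r}")

    seen = set()
    result = []
    for row in rows:
        key = tuple(row.get(col) for col in subset_columns)
        is_duplicate = key in seen
        seen.add(key)
        if mode == "flag":
            # Keep every row and record the verdict in a new column.
            result.append({**row, "is_duplicate": is_duplicate})
        elif not is_duplicate:
            # mode == "remove": keep only the first occurrence of each key.
            result.append(dict(row))
    return result


sample_rows = [
    {"tag_id": "ADE-0001", "body_mass_g": 2500.00},
    {"tag_id": "ADE-0001", "body_mass_g": 2477.78},
    {"tag_id": "ADE-0013", "body_mass_g": 3224.00},
]

flagged = handle_duplicates(sample_rows, ["tag_id"], mode="flag")
deduped = handle_duplicates(sample_rows, ["tag_id"], mode="remove")
print([row["is_duplicate"] for row in flagged])
print(len(deduped))
```

Choosing a narrow subset such as tag_id treats repeated measurements of the same bird as duplicates even when the measurements differ, so the choice of subset_columns changes what gets flagged.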
# ♻️ M04: Deduplication and Duplicates Handling
from analyst_toolkit.m00_utils.config_loader import load_config
from analyst_toolkit.m04_duplicates.run_dupes_pipeline import run_duplicates_pipeline
import logging
# --- Load Config ---
dupes_config_full = load_config("config/dups_config_template.yaml")
# --- Run Duplicates Module ---
df_deduped = run_duplicates_pipeline(
    config=dupes_config_full,
    df=df_certified,
    notebook=notebook_mode,
    run_id=run_id
)
Summary of Changes
Metric | Value |
---|---|
Total Row Count | 5541 |
Duplicate Rows Flagged | 1219 |
Duplicate Clusters Found
tag_id | species | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | age_group | sex | colony_id | island | capture_date | health_status | study_name | clutch_completion | date_egg |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
ADE-0001 | Adelie | 39.55 | 19.92 | 186.20 | 2500.00 | Chick | MALE | Biscoe West | Biscoe | NaT | Underweight | PAPRI2022 | yes | 2022-07-20 |
ADE-0001 | Adelie | 42.60 | 21.37 | 184.50 | 2477.78 | Juvenile | MALE | Biscoe West | Biscoe | NaT | Healthy | PAPRI2022 | yes | 2022-07-20 |
ADE-0001 | Adelie | 38.70 | 20.78 | 202.74 | 2650.73 | Juvenile | MALE | Biscoe West | Biscoe | NaT | Underweight | PAPRI2022 | yes | 2022-07-20 |
ADE-0013 | Adelie | 40.28 | 18.10 | 188.60 | 3224.00 | Juvenile | NaN | NaN | Cormorant | NaT | NaN | PAPRI2022 | yes | 2022-06-18 |
ADE-0013 | Adelie | 41.51 | 19.31 | 182.31 | 3322.26 | Adult | NaN | NaN | Cormorant | NaT | Overweight | PAPRI2022 | yes | 2022-06-18 |
ADE-0049 | Adelie | NaN | 18.46 | 185.40 | 3326.00 | Adult | FEMALE | Shortcut Point | Shortcut | NaT | Healthy | PAPRI2024 | yes | 2024-08-29 |
ADE-0049 | Adelie | NaN | 17.77 | 176.49 | 3175.64 | Adult | FEMALE | Shortcut Point | Shortcut | NaT | Overweight | PAPRI2024 | yes | 2024-08-29 |
ADE-0054 | Adelie | 42.06 | 17.93 | NaN | 4125.00 | Adult | MALE | Biscoe West | Biscoe | NaT | Overweight | PAPRI2022 | NaN | 2022-10-28 |
ADE-0054 | Adelie | 42.53 | 18.07 | NaN | 4342.78 | Adult | MALE | Biscoe West | Biscoe | NaT | Critical | PAPRI2022 | NaN | 2022-10-28 |
ADE-0073 | Adelie | 41.64 | 17.10 | 192.80 | 2500.00 | Chick | FEMALE | Torgersen North | NaN | NaT | Overweight | PAPRI2023 | yes | 2023-02-24 |
ADE-0073 | Adelie | 41.30 | 16.87 | 206.44 | 2567.20 | Juvenile | FEMALE | Torgersen North | NaN | NaT | Underweight | PAPRI2023 | yes | 2023-02-24 |
ADE-0076 | Adelie | 39.82 | 18.13 | 184.90 | 3642.00 | Adult | NaN | Cormorant East | Cormorant | NaT | NaN | PAPRI2021 | yes | 2021-06-18 |
ADE-0076 | Adelie | 42.27 | 18.40 | 183.97 | 3753.19 | Adult | NaN | Cormorant East | Cormorant | NaT | NaN | PAPRI2021 | yes | 2021-06-18 |
ADE-0076 | Adelie | 42.67 | 19.15 | 178.12 | 3569.29 | Adult | NaN | Cormorant East | Cormorant | NaT | NaN | PAPRI2021 | yes | 2021-06-18 |
ADE-0119 | Adelie | 35.76 | 17.78 | 179.84 | 2716.83 | NaN | NaN | Dream South | Dream | 2022-07-08 | Healthy | PAPRI2021 | yes | 2021-07-05 |
ADE-0119 | Adelie | 40.13 | 18.80 | 201.99 | 2983.99 | NaN | NaN | Dream South | Dream | 2022-07-08 | Healthy | PAPRI2021 | yes | 2021-07-05 |
ADE-0137 | Adelie | 43.03 | 17.11 | 193.40 | NaN | Adult | UNKNOWN | Torgersen North | Torgersen | NaT | Healthy | PAPRI2021 | yes | 2021-03-21 |
ADE-0137 | Adelie | 46.46 | 16.39 | 186.41 | NaN | Adult | UNKNOWN | Torgersen North | Torgersen | NaT | Underweight | PAPRI2021 | yes | 2021-03-21 |
ADE-0155 | Adelie | 38.31 | 17.86 | 197.70 | 3512.00 | Adult | NaN | Torgersen North | Torgersen | NaT | Healthy | PAPRI2023 | no | NaT |
ADE-0155 | Adelie | 39.45 | 19.48 | 197.89 | 3527.76 | Adult | NaN | Torgersen North | Torgersen | NaT | Overweight | PAPRI2023 | no | NaT |
Step 6: Detect Outliers (M05)
This module (m05_detect_outliers) scans numeric columns for outliers using configurable logic:
- Z-Score or IQR methods (per column or global default)
- Adds binary flags (e.g., *_outlier) to the dataset if append_flags: true
- Skips non-numeric or excluded fields via exclude_columns
Interactive PlotViewer
If enabled, the PlotViewer renders boxplots, histograms, and violin plots inline, giving a fast visual summary of where anomalies occur.
What's Exported:
- ✅ df_outliers_flagged: DataFrame with new _outlier columns
- ✅ detection_results: thresholds and summary tables
- ✅ Plots: saved to exports/plots/outliers/{run_id}/
- ✅ Report: XLSX or CSV, based on config
🛠️ Configure methods, thresholds, excluded columns, and plot types in outlier_config_template.yaml.
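Both detection methods reduce to computing a lower and an upper bound per column. The sketch below shows the standard formulas; the function names are hypothetical, and the multipliers (1.5 for IQR, 3.5 for z-score) are inferred from the numbers this notebook reports rather than read from the config.

```python
def iqr_bounds(q1, q3, k=1.5):
    """Bounds at k interquartile ranges below Q1 and above Q3."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def zscore_bounds(mean, std, z=3.5):
    """Bounds at z standard deviations on either side of the mean."""
    return mean - z * std, mean + z * std


def flag_outliers(values, lower, upper):
    """Return one boolean per value; missing values (None) are never flagged."""
    return [v is not None and (v < lower or v > upper) for v in values]


# Quartiles of bill length (mm) from the descriptive statistics table.
bill_lower, bill_upper = iqr_bounds(40.51, 49.36)
print(bill_lower, bill_upper)

# Mean and standard deviation of body_mass_g from the same table.
mass_lower, mass_upper = zscore_bounds(3853.645265, 898.232986)
print(mass_lower, mass_upper)

print(flag_outliers([62.64, 45.95, None], bill_lower, bill_upper))
```

With these inputs the IQR bounds come out at about 27.235 and 62.635, and the z-score bounds at about 709.83 and 6997.46, which agrees with the bounds in the detection log.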
# M05: Detect Outliers and Plot Visuals
import logging
from analyst_toolkit.m00_utils.config_loader import load_config
from analyst_toolkit.m05_detect_outliers.run_detection_pipeline import run_outlier_detection_pipeline
from IPython.display import display
# --- Load module-specific config ---
outlier_config_full = load_config("config/outlier_config_template.yaml")
# The 'df_deduped' variable should be the output from your M04 Duplicates module
if 'df_deduped' in locals():
    df_outliers_flagged, detection_results = run_outlier_detection_pipeline(
        config=outlier_config_full,
        df=df_deduped,
        notebook=notebook_mode,
        run_id=run_id
    )
Outlier Detection Log
column | method | outlier_count | lower_bound | upper_bound | outlier_examples |
---|---|---|---|---|---|
bill_length_mm | iqr | 1 | 27.235000 | 62.635000 | [62.64] |
body_mass_g | zscore | 18 | 709.829815 | 6997.460715 | [7000.0, 7000.0, 7000.0, 7000.0, 7000.0] |
Preview of Rows Containing Outliers
tag_id | species | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | age_group | sex | colony_id | island | capture_date | health_status | study_name | clutch_completion | date_egg | is_duplicate |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
NaN | Gentoo | NaN | 14.41 | 221.90 | 7000.00 | Adult | NaN | Torgersen North | Torgersen | 2019-10-31 | Healthy | PAPRI2019 | NaN | NaT | False |
NaN | NaN | 47.68 | 17.62 | NaN | 7000.00 | Adult | NaN | Torgersen North | Torgersen | 2021-08-17 | Healthy | PAPRI2021 | NaN | 2021-08-14 | False |
GEN-0041 | Gentoo | 45.63 | 14.13 | 213.20 | 7000.00 | Juvenile | FEMALE | Dream South | Dream | 2021-12-02 | Healthy | PAPRI2021 | NaN | 2021-11-23 | False |
NaN | Gentoo | 46.39 | 13.84 | 206.30 | 7000.00 | Adult | NaN | Cormorant East | Cormorant | 2022-10-26 | Healthy | PAPRI2022 | NaN | 2022-10-12 | False |
ADE-0182 | Adelie | 38.46 | 17.16 | 185.10 | 7000.00 | Adult | NaN | Dream South | Dream | 2024-02-03 | Overweight | PAPRI2024 | yes | 2024-01-31 | False |
NaN | Gentoo | 49.36 | 13.00 | 224.10 | 7000.00 | Adult | NaN | Torgersen North | Torgersen | NaT | Healthy | NaN | no | NaT | True |
NaN | Gentoo | 40.59 | 14.37 | 230.00 | 7000.00 | Adult | MALE | NaN | Biscoe | NaT | Healthy | PAPRI2021 | yes | 2021-03-25 | True |
GEN-0301 | Gentoo | 44.56 | 16.48 | 212.70 | 7000.00 | Adult | MALE | Biscoe West | Biscoe | 2022-12-12 | Healthy | PAPRI2022 | no | NaT | False |
NaN | Gentoo | 45.16 | 15.57 | 218.40 | 7000.00 | Adult | FEMALE | NaN | Cormorant | 2021-07-30 | Healthy | PAPRI2021 | yes | 2021-07-17 | False |
GEN-0681 | Gentoo | 44.73 | 13.94 | 217.80 | 7000.00 | Adult | NaN | Torgersen North | Torgersen | NaT | Healthy | PAPRI2022 | yes | 2022-11-07 | True |
GEN-0706 | Gentoo | 45.74 | 14.02 | 217.80 | 7000.00 | Adult | NaN | Dream South | Dream | 2024-02-28 | Healthy | PAPRI2024 | yes | 2024-02-21 | False |
GEN-0743 | Gentoo | 49.05 | 14.49 | 213.20 | 7000.00 | Adult | FEMALE | Dream South | Dream | NaT | Healthy | PAPRI2020 | yes | 2020-03-17 | False |
CHN-0860 | Chinstrap | 50.88 | 18.49 | 206.10 | 7000.00 | Adult | NaN | Cormorant East | Cormorant | 2024-07-09 | Overweight | PAPRI2023 | yes | 2023-11-16 | False |
GEN-0974 | Gentoo | 50.57 | 15.89 | 220.00 | 7000.00 | Adult | NaN | Torgersen North | NaN | 2021-01-05 | NaN | PAPRI2021 | yes | 2020-12-26 | False |
GEN-0681 | Gentoo | 47.77 | 13.84 | 222.73 | 7378.33 | Adult | NaN | Torgersen North | Torgersen | NaT | Overweight | PAPRI2022 | yes | 2022-11-07 | True |
NaN | Chinstrap | 51.63 | 18.69 | 212.94 | 7128.38 | Adult | FEMALE | Torgersen North | Torgersen | 2022-03-25 | Overweight | PAPRI2020 | NaN | 2020-03-12 | False |
NaN | Gentoo | 47.71 | 13.93 | 236.20 | 7085.98 | Adult | NaN | Torgersen North | Torgersen | NaT | Critical | NaN | no | NaT | True |
CHN-0219 | Chinstrap | 62.64 | 18.00 | 204.26 | 2770.38 | Juvenile | NaN | Torgersen North | UNKNOWN | 2020-10-22 | Critical | PAPRI2019 | yes | 2019-10-16 | False |
NaN | Gentoo | NaN | 14.99 | 219.59 | 7128.48 | Adult | NaN | Torgersen North | Torgersen | 2021-10-31 | Healthy | PAPRI2019 | NaN | NaT | False |
🧼 Step 7: Handle Outliers (M06)
This module (m06_outlier_handling) applies cleanup strategies to flagged outliers from the detection step:
- Strategies include:
  - 'clip': Caps values to threshold bounds
  - 'median': Imputes using median
  - 'constant': Replaces with fixed value (e.g., -999)
  - 'none': Leaves values untouched (default)
Strategy is configured per column or globally via __default__ and __global__.
What's Exported:
- ✅ Cleaned DataFrame: df_handled
- ✅ Handling report (XLSX/CSV)
- ✅ Optional checkpoint joblib
🛠️ Adjust cleanup logic, output paths, or constant fill values in handling_config_template.yaml.
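The four strategies can be sketched on a single column as shown below. This is a simplified illustration, not the m06_outlier_handling code: handle_outliers is a hypothetical name, and taking the median over all non-missing values (outliers included) is an assumption.

```python
from statistics import median

STRATEGIES = ("clip", "median", "constant", "none")


def handle_outliers(values, lower, upper, strategy="none", constant=None):
    """Apply one cleanup strategy to values outside [lower, upper].

    Missing values (None) and in-range values are passed through unchanged.
    """
    if strategy not in STRATEGIES:
        raise ValueError(f"unknown strategy: {strategy!r}")

    present = [v for v in values if v is not None]
    # Assumption: the median is taken over all non-missing values.
    fill = median(present) if strategy == "median" and present else None

    handled = []
    for v in values:
        if v is None or lower <= v <= upper or strategy == "none":
            handled.append(v)
        elif strategy == "clip":
            handled.append(min(max(v, lower), upper))
        elif strategy == "median":
            handled.append(fill)
        else:  # strategy == "constant"
            handled.append(constant)
    return handled


print(handle_outliers([45.95, 62.64], 27.235, 62.635, strategy="clip"))
print(handle_outliers([2500.0, 3742.0, 7000.0], 709.83, 6997.46, strategy="median"))
print(handle_outliers([2500.0, 7000.0], 709.83, 6997.46, strategy="constant", constant=-999))
```

Clipping preserves the fact that a value was extreme while bounding its influence, whereas median replacement discards the magnitude entirely, so the right choice depends on whether the outlier is a plausible measurement or a data-entry artifact.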
# 🧼 M06: Handle Outliers
import logging
from analyst_toolkit.m00_utils.config_loader import load_config
from analyst_toolkit.m06_outlier_handling.run_handling_pipeline import run_outlier_handling_pipeline
# --- Load module-specific config ---
handling_config_full = load_config("config/handling_config_template.yaml")
# Pass the entire detection_results dictionary, not its unpacked components.
df_handled = run_outlier_handling_pipeline(
    config=handling_config_full,
    df=df_outliers_flagged,
    detection_results=detection_results,  # Pass the whole dictionary here
    notebook=notebook_mode,
    run_id=run_id
)
Handling Actions Summary
strategy | column | outliers_handled | details |
---|---|---|---|
clip | bill_length_mm | 1 | Clipped 1 values to bounds. |
median | body_mass_g | 18 | Imputed 18 values with median (3742.00). |
Details: Capped Values
Column | Row_Index | Original_Value | Capped_Value |
---|---|---|---|
bill_length_mm | 5164 | 62.64 | 62.635 |
Step 8: Impute Missing Values (M07)
This module (m07_imputation) fills missing (NaN) values using a column-specific strategy:
- 'mean', 'median', or 'mode' for numeric/categorical inference
- 'constant' for fixed fallback values (e.g., "UNKNOWN" or "1900-01-01")
- Strategy is configured via the rules.strategies section in the YAML
If enabled, comparison plots show how categorical columns changed post-imputation (using the same PlotViewer system).
What's Exported:
- ✅ Imputed DataFrame: df_imputed
- ✅ Report: imputation log (XLSX/CSV)
- ✅ Plots: before/after comparisons (if enabled)
- ✅ Optional checkpoint joblib
🛠️ Configure logic and column-specific strategies in imputation_config_template.yaml.
# M07: Impute Data and Plot Summary Visuals
import logging
from analyst_toolkit.m00_utils.config_loader import load_config
from analyst_toolkit.m07_imputation.run_imputation_pipeline import run_imputation_pipeline
# Load the configuration for the imputation module
imputation_config_full = load_config("config/imputation_config_template.yaml")
df_imputed = run_imputation_pipeline(
    config=imputation_config_full,
    notebook=notebook_mode,
    df=df_handled,  # Output of the outlier handling step (M06)
    run_id=run_id,
)
Imputation Summary & Null Audit
Imputation Actions Log
Column | Strategy | Fill Value | Nulls Filled |
---|---|---|---|
bill_length_mm | mean | 45.17 | 429 |
body_mass_g | mean | 3842.08 | 406 |
bill_depth_mm | median | 17.48 | 417 |
flipper_length_mm | median | 199.31 | 451 |
sex | mode | MALE | 2739 |
tag_id | constant | UNKNOWN | 2242 |
species | constant | UNKNOWN | 166 |
age_group | constant | UNKNOWN | 121 |
colony_id | constant | UNKNOWN | 405 |
island | constant | UNKNOWN | 584 |
study_name | constant | UNKNOWN | 563 |
capture_date | constant | 1900-01-01 00:00:00 | 915 |
date_egg | constant | 1900-01-01 00:00:00 | 836 |
clutch_completion | constant | UNKNOWN | 463 |
health_status | constant | UNKNOWN | 554 |
Null Value Audit
Column | Nulls Before | Nulls After | Nulls Filled |
---|---|---|---|
bill_length_mm | 429 | 0 | 429 |
body_mass_g | 406 | 0 | 406 |
bill_depth_mm | 417 | 0 | 417 |
flipper_length_mm | 451 | 0 | 451 |
sex | 2739 | 0 | 2739 |
tag_id | 2242 | 0 | 2242 |
species | 166 | 0 | 166 |
age_group | 121 | 0 | 121 |
colony_id | 405 | 0 | 405 |
island | 584 | 0 | 584 |
study_name | 563 | 0 | 563 |
capture_date | 915 | 0 | 915 |
date_egg | 836 | 0 | 836 |
clutch_completion | 463 | 0 | 463 |
health_status | 554 | 0 | 554 |
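An audit table of this shape can be reproduced with a few lines of pandas. The sketch below is an assumption about how such a comparison might be computed, not the toolkit's own code; the null_audit helper and the sample frames are made up for the example.

```python
import pandas as pd


def null_audit(before, after):
    """Compare per-column null counts before and after imputation."""
    audit = pd.DataFrame({
        "Nulls Before": before.isna().sum(),
        "Nulls After": after.isna().sum(),
    })
    audit["Nulls Filled"] = audit["Nulls Before"] - audit["Nulls After"]
    return audit


# Made-up sample data; None marks a missing value.
before = pd.DataFrame({
    "island": ["Dream", None, None],
    "body_mass_g": [3500.0, None, 4200.0],
})
after = before.fillna({"island": "UNKNOWN", "body_mass_g": 3850.0})

audit = null_audit(before, after)
print(audit)
```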
Categorical Shift Analysis
Column: sex
Value | Count |
---|---|
MALE | 4108 |
FEMALE | 1310 |
UNKNOWN | 123 |
Value | Original Count | Imputed Count | Change |
---|---|---|---|
NaN | 2739 | 0 | -2739 |
MALE | 1369 | 4108 | 2739 |
FEMALE | 1310 | 1310 | 0 |
UNKNOWN | 123 | 123 | 0 |
Column: tag_id
Value | Count |
---|---|
UNKNOWN | 2242 |
GEN-0271 | 5 |
ADE-0119 | 4 |
GEN-0143 | 4 |
ADE-0176 | 4 |
GEN-0751 | 4 |
GEN-0673 | 4 |
GEN-0433 | 4 |
GEN-0902 | 4 |
GEN-0106 | 4 |
Value | Original Count | Imputed Count | Change |
---|---|---|---|
NaN | 2242 | 0 | -2242 |
GEN-0271 | 5 | 5 | 0 |
ADE-0119 | 4 | 4 | 0 |
ADE-0176 | 4 | 4 | 0 |
ADE-0203 | 4 | 4 | 0 |
CHN-0905 | 4 | 4 | 0 |
GEN-0054 | 4 | 4 | 0 |
GEN-0106 | 4 | 4 | 0 |
GEN-0143 | 4 | 4 | 0 |
GEN-0433 | 4 | 4 | 0 |
Column: species
Value | Count |
---|---|
Gentoo | 1815 |
Adelie | 1784 |
Chinstrap | 1776 |
UNKNOWN | 166 |
Value | Original Count | Imputed Count | Change |
---|---|---|---|
Gentoo | 1815 | 1815 | 0 |
Adelie | 1784 | 1784 | 0 |
Chinstrap | 1776 | 1776 | 0 |
NaN | 166 | 0 | -166 |
UNKNOWN | 0 | 166 | 166 |
Column: age_group
Value | Count |
---|---|
Adult | 3822 |
Juvenile | 1073 |
Chick | 477 |
UNKNOWN | 169 |
Value | Original Count | Imputed Count | Change |
---|---|---|---|
Adult | 3822 | 3822 | 0 |
Juvenile | 1073 | 1073 | 0 |
Chick | 477 | 477 | 0 |
NaN | 121 | 0 | -121 |
UNKNOWN | 48 | 169 | 121 |
Column: colony_id
Value | Count |
---|---|
Torgersen North | 1490 |
Dream South | 1216 |
Biscoe West | 1092 |
Cormorant East | 767 |
Shortcut Point | 511 |
UNKNOWN | 465 |
Value | Original Count | Imputed Count | Change |
---|---|---|---|
Torgersen North | 1490 | 1490 | 0 |
Dream South | 1216 | 1216 | 0 |
Biscoe West | 1092 | 1092 | 0 |
Cormorant East | 767 | 767 | 0 |
Shortcut Point | 511 | 511 | 0 |
NaN | 405 | 0 | -405 |
UNKNOWN | 60 | 465 | 405 |
Column: island
Value | Count |
---|---|
Torgersen | 1405 |
Dream | 1184 |
Biscoe | 1084 |
Cormorant | 715 |
UNKNOWN | 643 |
Shortcut | 510 |
Value | Original Count | Imputed Count | Change |
---|---|---|---|
Torgersen | 1405 | 1405 | 0 |
Dream | 1184 | 1184 | 0 |
Biscoe | 1084 | 1084 | 0 |
Cormorant | 715 | 715 | 0 |
NaN | 584 | 0 | -584 |
Shortcut | 510 | 510 | 0 |
UNKNOWN | 59 | 643 | 584 |
Column: study_name
Value | Count |
---|---|
PAPRI2020 | 1122 |
PAPRI2021 | 1024 |
PAPRI2022 | 916 |
PAPRI2023 | 824 |
PAPRI2024 | 803 |
UNKNOWN | 600 |
PAPRI2019 | 252 |
Value | Original Count | Imputed Count | Change |
---|---|---|---|
PAPRI2020 | 1122 | 1122 | 0 |
PAPRI2021 | 1024 | 1024 | 0 |
PAPRI2022 | 916 | 916 | 0 |
PAPRI2023 | 824 | 824 | 0 |
PAPRI2024 | 803 | 803 | 0 |
NaN | 563 | 0 | -563 |
PAPRI2019 | 252 | 252 | 0 |
UNKNOWN | 37 | 600 | 563 |
Column: clutch_completion
Value | Count |
---|---|
yes | 4314 |
no | 764 |
UNKNOWN | 463 |
Value | Original Count | Imputed Count | Change |
---|---|---|---|
yes | 4314 | 4314 | 0 |
no | 764 | 764 | 0 |
NaN | 463 | 0 | -463 |
UNKNOWN | 0 | 463 | 463 |
Column: health_status
Value | Count |
---|---|
Healthy | 2194 |
Underweight | 1411 |
Overweight | 733 |
UNKNOWN | 584 |
Critical | 323 |
Sick | 296 |
Value | Original Count | Imputed Count | Change |
---|---|---|---|
Healthy | 2194 | 2194 | 0 |
Underweight | 1411 | 1411 | 0 |
Overweight | 733 | 733 | 0 |
NaN | 554 | 0 | -554 |
Critical | 323 | 323 | 0 |
Sick | 296 | 296 | 0 |
UNKNOWN | 30 | 584 | 554 |
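The before/after comparison tables above boil down to counting each category twice and taking the difference. The following sketch shows that idea with the standard library only; it is an illustration under assumptions (the categorical_shift helper is hypothetical, and None stands in for a missing value), not how PlotViewer or m07 compute it.

```python
from collections import Counter


def categorical_shift(original, imputed):
    """Return (value, original count, imputed count, change) rows."""
    # None stands in for a missing value and is reported as "NaN".
    before = Counter("NaN" if v is None else v for v in original)
    after = Counter("NaN" if v is None else v for v in imputed)
    rows = []
    for value in sorted(set(before) | set(after)):
        rows.append((value, before[value], after[value],
                     after[value] - before[value]))
    return rows


# Made-up sample column, imputed with a constant.
original = ["yes", "no", None, "yes", None]
imputed = ["UNKNOWN" if v is None else v for v in original]

shift = categorical_shift(original, imputed)
for row in shift:
    print(row)
```

The changes always sum to zero, since imputation relabels rows without adding or removing any.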
Remaining Nulls Found
The following columns still contain null values after imputation:
Column | Remaining Nulls |
---|---|
bill_length_mm_iqr_outlier | 429 |
body_mass_g_zscore_outlier | 406 |
Behind the Scenes: Utility & Visual Modules
Several specialized support modules power the Analyst Toolkit pipeline behind the scenes.
Apart from the config and data loaders in m00_utils, these are not called directly in the notebook, but they are crucial to the system's flexibility and polish:
m00_utils/
- config_loader.py: Robust loader with support for environment paths and nested YAMLs
- load_data.py: Abstracted CSV/Joblib loader with encoding fallback
- export_utils.py: Modular export system for saving reports and checkpoints
- rendering_utils.py: Styled HTML table generator for dashboard outputs
m08_visuals/
- distributions.py: Boxplots, histograms, and violin plots for outlier detection
- summary_plots.py: Heatmaps, missingness matrices, and dtype summaries
- plot_viewer.py: Interactive PlotViewer widget for inspecting flagged values and category shifts
These modules enable notebook-mode display, CLI compatibility, YAML-driven plotting, and clean HTML export dashboards.
Explore these utilities in the /src/ directory to understand how the toolkit remains modular, extensible, and production-grade.
Step 9: Final Auditing and Certification (M10)
This final module performs a comprehensive audit of the cleaned dataset and applies strict quality checks before certification.
It serves as the final quality gate and includes:
- Final Edits: Drop or rename columns, coerce dtypes as needed
- Certification Check: Applies validation rules with fail_on_error: true to enforce schema, dtypes, and content requirements
- Lifecycle Comparison: Compares raw vs final structure, nulls, and column presence
- Capstone Report: Renders a complete dashboard summarizing pipeline impact and status
If any rule is violated (e.g., unexpected nulls or schema mismatch), the system halts and logs failure details for debugging.
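A schema conformity gate of this kind can be sketched in a few lines. The example below is an assumption about the general technique, not the m10 implementation: check_schema, its return shape, and the column lists are hypothetical, and only the fail_on_error name is taken from the description above. It reproduces the kind of failure reported in this run, where an extra is_duplicate column is flagged as unexpected.

```python
def check_schema(actual_columns, expected_columns, fail_on_error=True):
    """Hypothetical gate: compare actual column names against an expected schema."""
    actual, expected = set(actual_columns), set(expected_columns)
    issues = {
        "Unexpected": sorted(actual - expected),
        "Missing": sorted(expected - actual),
    }
    passed = not issues["Unexpected"] and not issues["Missing"]
    if not passed and fail_on_error:
        # Halt instead of letting a non-conforming dataset through.
        raise ValueError(f"Schema conformity failed: {issues}")
    return passed, issues


# Made-up column lists for illustration.
expected = ["tag_id", "species", "body_mass_g"]
actual = ["tag_id", "species", "body_mass_g", "is_duplicate"]

passed, issues = check_schema(actual, expected, fail_on_error=False)
print(passed, issues)
```

Under this sketch, dropping the extra column in the final edits step, or adding it to the expected schema, would let the check pass.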
What's Exported:
- Final Audit Report (XLSX and Joblib)
- Final Certified Dataset (CSV and Joblib)
- Inline dashboard with all results
Customize certification rules, null restrictions, or output paths in final_audit_config_template.yaml.
Once this step passes, your dataset is ready for production use or modeling pipelines.
# M10: Final Auditing and Certification
from analyst_toolkit.m10_final_audit.final_audit_pipeline import run_final_audit_pipeline
from analyst_toolkit.m00_utils.config_loader import load_config
# --- Load Config ---
final_audit_config_full = load_config("config/final_audit_config_template.yaml")
# --- Run Final Audit ---
# The final audit pipeline expects the full config dictionary, as it may perform
# validation using rules from a separate block.
df_final_clean = run_final_audit_pipeline(
    config=final_audit_config_full,
    df=df_imputed,  # Output of the imputation step (M07)
    notebook=notebook_mode,
    run_id=run_id,
)
Failure Details
Failures: Schema Conformity
Issue Type | Columns |
---|---|
Unexpected | is_duplicate |
Pipeline Summary
Pipeline Status
Metric | Value |
---|---|
Final Pipeline Status | CERTIFICATION FAILED |
Certification Rules Passed | False |
Null Value Audit Passed | True |
Final Edits Log
Action | Details |
---|---|
drop_columns | Removed: ['body_mass_g_zscore_outlier', 'bill_length_mm_iqr_outlier'] |
Final Data Profile
Data Lifecycle
Metric | Value |
---|---|
Initial Rows | 5541 |
Final Rows | 5541 |
Initial Columns | 15 |
Final Columns | 16 |
- OK: Passed all configured quality checks.
- High Skew: Skewness exceeds threshold.
- Unexpected Type: Data type mismatch.
Data Dictionary / Schema
Column | Dtype | Unique Values | Audit Remarks | Missing Count | Missing % |
---|---|---|---|---|---|
tag_id | object | 2679 | OK | 0 | 0.0 |
species | object | 4 | OK | 0 | 0.0 |
bill_length_mm | float64 | 1985 | OK | 0 | 0.0 |
bill_depth_mm | float64 | 863 | OK | 0 | 0.0 |
flipper_length_mm | float64 | 1467 | OK | 0 | 0.0 |
body_mass_g | float64 | 3324 | OK | 0 | 0.0 |
age_group | object | 4 | OK | 0 | 0.0 |
sex | object | 3 | OK | 0 | 0.0 |
colony_id | object | 6 | OK | 0 | 0.0 |
island | object | 6 | OK | 0 | 0.0 |
capture_date | datetime64[ns] | 1746 | OK | 0 | 0.0 |
health_status | object | 6 | OK | 0 | 0.0 |
study_name | object | 7 | OK | 0 | 0.0 |
clutch_completion | object | 3 | OK | 0 | 0.0 |
date_egg | datetime64[ns] | 1657 | OK | 0 | 0.0 |
is_duplicate | bool | 2 | OK | 0 | 0.0 |
Descriptive Statistics
Metric | count | mean | std | min | 25% | 50% | 75% | max | skew | kurtosis |
---|---|---|---|---|---|---|---|---|---|---|
bill_length_mm | 5541.0 | 45.166681 | 5.442593 | 30.63 | 40.98 | 45.240 | 49.07 | 62.635000 | -0.151954 | -0.405922 |
bill_depth_mm | 5541.0 | 17.318895 | 2.146392 | 12.37 | 15.65 | 17.485 | 18.92 | 23.010000 | -0.134672 | -0.725335 |
flipper_length_mm | 5541.0 | 201.999903 | 13.769645 | 162.79 | 191.80 | 199.315 | 213.00 | 252.400000 | 0.392703 | -0.397019 |
body_mass_g | 5541.0 | 3842.084375 | 845.336672 | 2376.56 | 3264.00 | 3806.000 | 4266.00 | 6965.072934 | 0.552218 | 0.015302 |
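A table with these columns can be approximated in pandas by extending describe() with skewness and kurtosis. This is a sketch under the assumption that the toolkit relies on pandas' default estimators, which the notebook does not confirm; describe_with_shape and the sample data are hypothetical.

```python
import pandas as pd


def describe_with_shape(df):
    """Summary statistics for numeric columns, plus skew and kurtosis."""
    numeric = df.select_dtypes(include="number")
    stats = numeric.describe().T          # one row per column
    stats["skew"] = numeric.skew()        # sample skewness
    stats["kurtosis"] = numeric.kurt()    # excess kurtosis
    return stats


# Made-up sample data: one symmetric column, one with a long right tail.
sample = pd.DataFrame({
    "symmetric": [1.0, 2.0, 3.0, 4.0, 5.0],
    "right_tailed": [1.0, 1.0, 2.0, 2.0, 10.0],
})

stats = describe_with_shape(sample)
print(stats)
```

A positive skew, as for body_mass_g above, indicates a longer right tail. A check like the High Skew remark would compare the absolute value against a configured threshold.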
Data Preview (.head)
tag_id | species | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | age_group | sex | colony_id | island | capture_date | health_status | study_name | clutch_completion | date_egg | is_duplicate |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
UNKNOWN | Gentoo | 48.99 | 14.11 | 220.900 | 5890.0 | Adult | MALE | Torgersen North | Torgersen | 2023-11-17 | UNKNOWN | PAPRI2023 | yes | 2023-11-09 | True |
UNKNOWN | Gentoo | 48.99 | 14.11 | 220.900 | 5890.0 | Adult | MALE | Torgersen North | Torgersen | 2023-11-17 | UNKNOWN | PAPRI2023 | yes | 2023-11-09 | True |
ADE-0001 | Adelie | 39.55 | 19.92 | 186.200 | 2500.0 | Chick | MALE | Biscoe West | Biscoe | 1900-01-01 | Underweight | PAPRI2022 | yes | 2022-07-20 | True |
UNKNOWN | Gentoo | 48.23 | 13.00 | 199.315 | 4536.0 | Adult | FEMALE | Biscoe West | UNKNOWN | 2024-04-14 | Healthy | UNKNOWN | yes | 2024-04-12 | False |
GEN-0001 | Gentoo | 46.22 | 13.91 | 212.800 | 2500.0 | Juvenile | FEMALE | Dream South | Dream | 1900-01-01 | Underweight | PAPRI2020 | yes | 2020-04-14 | True |
What's Next?
Congratulations! You've now completed a full walkthrough of the Analyst Toolkit pipeline using synthetic Palmer Penguins data.
Here are some suggested next steps:
Explore Outputs
- Review the exported reports and plots in the exports/ folder
- Inspect final audit and certification summaries
Test with Other Datasets
- Replace the penguin dataset with your own CSV in the YAML configs
- Adjust schema, value, and range rules accordingly
Use the Full Pipeline Script
- Try running run_toolkit_pipeline.py in CLI or notebook mode for a full end-to-end execution
- Config: config/run_toolkit_config.yaml
Customize Modules
- Add new modules (e.g., feature engineering, modeling)
- Use your own diagnostic thresholds or imputation logic
Package or Deploy
- Deploy the toolkit in production (Airflow, Papermill, GitHub Actions, etc.)
- Or package it as a Python module for reuse