🧪 Full Pipeline Execution (Notebook Mode)
This notebook demonstrates full pipeline execution using the master controller script `run_toolkit_pipeline.py`.
- Controlled via: `config/run_toolkit_config.yaml`
- Executes all pipeline modules in sequence (M01–M10)
- Outputs: Dashboards, reports, plots, and the final certified dataset
- ✅ Set `notebook: true` in the YAML to enable inline dashboards
📂 Final outputs are exported to the `exports/` and `data/processed/` directories.
📎 Notes & Use Cases
🧭 Notes
- Fully modular pipeline execution from raw to certified clean data
- Configurable behavior using a single YAML file
- Can be executed interactively (with displays) or headlessly (silent mode)
💼 Use Cases
- End-to-end QA audits for new or synthetic datasets
- Validating preprocessing logic during exploratory workflows
- Certifying pipeline output before downstream modeling
- Showcasing toolkit capabilities in interviews or portfolio reviews
🔁 Alternate Modes
- Set `notebook: false` in the YAML to run this notebook silently (ideal for automation or CI).
- Run the pipeline as a CLI script outside notebooks with:
python run_toolkit_pipeline.py --config config/run_toolkit_config.yaml
from analyst_toolkit.run_toolkit_pipeline import run_full_pipeline
final_df = run_full_pipeline(config_path="config/run_toolkit_config.yaml")
2025-08-05 12:47:35,316 - INFO - --- Loading Master Orchestration Config from config/run_toolkit_config.yaml ---
2025-08-05 12:47:35,317 - INFO - --- 🚚 Loading initial data from data/raw/synthetic_penguins_v3.5.csv ---
2025-08-05 12:47:35,323 - INFO - --- 🚀 Starting Module: DIAGNOSTICS ---
📈 Key Metrics
🔷 Shape
Rows | Columns |
---|---|
5541 | 15 |
🧠 Memory Usage
Memory Usage |
---|
3.26 MB |
♻️ Duplicate Summary
Duplicate Rows | Duplicate % |
---|---|
1 | 0.02 |
📝 Full Profile & Cardinality
🔢 High Cardinality
Column | Unique Values |
---|---|
tag_id | 2678 |
capture_date | 1917 |
date_egg | 1656 |
colony_id | 19 |
study_name | 12 |
island | 11 |
- ✅ OK: Passed all configured quality checks.
- ⚠️ High Skew: Skewness exceeds the configured threshold.
- ⚠️ Unexpected Type: Data type does not match the expected type.
📚 Full Data Profile
Column | Dtype | Unique Values | Audit Remarks | Missing Count | Missing % |
---|---|---|---|---|---|
tag_id | object | 2678 | ✅ OK | 2242 | 40.46 |
species | object | 5 | ✅ OK | 166 | 3.00 |
bill length (mm) | float64 | 1984 | ✅ OK | 429 | 7.74 |
bill depth (mm) | float64 | 862 | ✅ OK | 417 | 7.53 |
flipper_length_mm | float64 | 1466 | ✅ OK | 451 | 8.14 |
body_mass_g | float64 | 3328 | ✅ OK | 406 | 7.33 |
age_group | object | 7 | ✅ OK | 121 | 2.18 |
sex | object | 6 | ✅ OK | 2739 | 49.43 |
colony_id | object | 19 | ✅ OK | 405 | 7.31 |
island | object | 11 | ✅ OK | 584 | 10.54 |
capture_date | object | 1917 | ✅ OK | 534 | 9.64 |
health_status | object | 9 | ✅ OK | 554 | 10.00 |
study_name | object | 12 | ✅ OK | 563 | 10.16 |
clutch_completion | object | 2 | ✅ OK | 463 | 8.36 |
date_egg | object | 1656 | ✅ OK | 836 | 15.09 |
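The Missing % column above is the missing count divided by the total row count (5541), rounded to two decimals. A minimal sketch of that arithmetic follows, assuming rows are held as plain dictionaries with `None` for missing values; `missing_profile` is a hypothetical helper for illustration, not the toolkit's own function.

```python
from typing import Any


def missing_profile(rows: list[dict[str, Any]]) -> dict[str, tuple[int, float]]:
    """Return {column: (missing_count, missing_pct)} for a list of row dicts."""
    total = len(rows)
    columns = list(rows[0].keys()) if rows else []
    profile = {}
    for col in columns:
        missing = sum(1 for row in rows if row.get(col) is None)
        profile[col] = (missing, round(100 * missing / total, 2))
    return profile


# Reproduce two entries of the profile table from their counts alone.
TOTAL_ROWS = 5541
tag_id_pct = round(100 * 2242 / TOTAL_ROWS, 2)   # tag_id row
species_pct = round(100 * 166 / TOTAL_ROWS, 2)   # species row
```

The two module-level values correspond to the `tag_id` and `species` rows of the table.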
🔬 Quantitative Summary
🔢 Descriptive Statistics
Metric | count | mean | std | min | 25% | 50% | 75% | max | skew | kurtosis |
---|---|---|---|---|---|---|---|---|---|---|
bill length (mm) | 5112.0 | 45.166682 | 5.666410 | 30.63 | 40.51 | 45.950 | 49.360 | 62.64 | -0.145952 | -0.606829 |
bill depth (mm) | 5124.0 | 17.305377 | 2.231495 | 12.37 | 15.49 | 17.485 | 19.030 | 23.01 | -0.111456 | -0.897492 |
flipper_length_mm | 5090.0 | 202.237800 | 14.342621 | 162.79 | 191.10 | 199.315 | 214.100 | 252.40 | 0.329099 | -0.616376 |
body_mass_g | 5135.0 | 3853.645265 | 898.232986 | 2376.56 | 3219.50 | 3742.000 | 4376.515 | 7378.33 | 0.616778 | 0.086446 |
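The skew and kurtosis columns can be read as bias-corrected sample estimates. The sketch below assumes the toolkit reports the same estimators pandas uses by default (adjusted Fisher-Pearson skewness and excess kurtosis); the source does not state this, so treat the formulas as an illustration rather than the toolkit's implementation.

```python
import math


def sample_skew(values: list[float]) -> float:
    """Adjusted Fisher-Pearson skewness (bias-corrected); needs n >= 3."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    g1 = m3 / m2 ** 1.5
    return math.sqrt(n * (n - 1)) / (n - 2) * g1


def sample_excess_kurtosis(values: list[float]) -> float:
    """Bias-corrected excess kurtosis (about 0 for normal data); needs n >= 4."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    g2 = m4 / m2 ** 2 - 3
    return (n - 1) / ((n - 2) * (n - 3)) * ((n + 1) * g2 + 6)
```

A positive skew, as for `body_mass_g`, indicates a longer right tail; a negative excess kurtosis indicates lighter tails than a normal distribution.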
📄 Preview of Duplicated Rows
tag_id | species | bill length (mm) | bill depth (mm) | flipper_length_mm | body_mass_g | age_group | sex | colony_id | island | capture_date | health_status | study_name | clutch_completion | date_egg |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
NaN | Gentoo | 48.99 | 14.11 | 220.9 | 5890.0 | Adult | Male | Torgersen North | Torgersen | 2023-11-17 | NaN | PAPRI2023 | Yes | 2023-11-09 |
NaN | Gentoo | 48.99 | 14.11 | 220.9 | 5890.0 | Adult | Male | Torgersen North | Torgersen | 2023-11-17 | NaN | PAPRI2023 | Yes | 2023-11-09 |
🔍 First Rows Preview
📋 First 5 Rows (.head)
tag_id | species | bill length (mm) | bill depth (mm) | flipper_length_mm | body_mass_g | age_group | sex | colony_id | island | capture_date | health_status | study_name | clutch_completion | date_egg |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
NaN | Gentoo | 48.99 | 14.11 | 220.9 | 5890.0 | Adult | Male | Torgersen North | Torgersen | 2023-11-17 | NaN | PAPRI2023 | Yes | 2023-11-09 |
NaN | Gentoo | 48.99 | 14.11 | 220.9 | 5890.0 | Adult | Male | Torgersen North | Torgersen | 2023-11-17 | NaN | PAPRI2023 | Yes | 2023-11-09 |
ADE-0001 | Adelie | 39.55 | 19.92 | 186.2 | 2500.0 | Chick | Male | Biscoe West | Biscoe | 2024-13-03 | Underweight | PAPRI2022 | Yes | 2022-07-20 |
NaN | Gentoo | 48.23 | 13.00 | NaN | 4536.0 | Adult | Female | Biscoe West | NaN | 2024-04-14 | Healthy | NaN | Yes | 2024-04-12 |
GEN-0001 | Gentoo | 46.22 | 13.91 | 212.8 | 2500.0 | Juvenile | Female | Dream South | Dream | NaN | Underweight | PAPRI2020 | Yes | 2020-04-14 |
[Interactive widget: Visual Profile]
🔎 Validation Rules Summary
Validation Rule | Description | Status |
---|---|---|
Schema Conformity | Verify column names match the expected schema. | ⚠️ Fail (2 issues) |
Dtype Enforcement | Verify column data types match expectations. | ⚠️ Fail (1 issues) |
Categorical Values | Verify values in categorical columns are within an allowed set. | ⚠️ Fail (7 issues) |
Numeric Ranges | Verify values in numeric columns are within a defined range. | ✅ Pass |
- ✅ Pass: The data conforms to this rule.
- ⚠️ Fail: One or more issues were found. See drill-down for details.
Failure Details
⚠️ Drill-Down: Schema Conformity
Issue Type | Columns |
---|---|
Missing | bill_depth_mm, bill_length_mm |
Unexpected | bill length (mm), bill depth (mm) |
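The schema check reduces to two set differences: expected names absent from the data ("Missing") and data columns absent from the schema ("Unexpected"). A minimal sketch, with `schema_diff` as a hypothetical helper name:

```python
def schema_diff(expected: list[str], actual: list[str]) -> dict[str, list[str]]:
    """Compare expected and actual column names, returning both mismatches."""
    expected_set, actual_set = set(expected), set(actual)
    return {
        "missing": sorted(expected_set - actual_set),      # in schema, not in data
        "unexpected": sorted(actual_set - expected_set),   # in data, not in schema
    }


# Illustrative subset of the columns involved in the failure above.
result = schema_diff(
    expected=["bill_length_mm", "bill_depth_mm", "flipper_length_mm"],
    actual=["bill length (mm)", "bill depth (mm)", "flipper_length_mm"],
)
```

Both issues disappear once the two columns are renamed during normalization.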
⚠️ Drill-Down: Dtype Enforcement
Column | Expected Type | Actual Type |
---|---|---|
flipper_length_mm | int64 | float64 |
⚠️ Drill-Down: Categorical Values
Rule Violated: Values for column `species` must be in the allowed set.
Allowed Values:
['Adelie', 'Chinstrap', 'Gentoo']
Invalid Values Found:
Invalid Value | Count |
---|---|
adeleie | 148 |
Gentto | 145 |
Rule Violated: Values for column `island` must be in the allowed set.
Allowed Values:
['Dream', 'Biscoe', 'Torgersen', 'Cormorant', 'Shortcut']
Invalid Values Found:
Invalid Value | Count |
---|---|
short cut | 70 |
torg | 61 |
unknown | 59 |
bisco | 55 |
cormor | 47 |
dreamland | 46 |
Rule Violated: Values for column `sex` must be in the allowed set.
Allowed Values:
['male', 'female', 'UNKNOWN']
Invalid Values Found:
Invalid Value | Count |
---|---|
Male | 1308 |
Female | 1227 |
F | 83 |
? | 74 |
M | 61 |
Unknown | 49 |
Rule Violated: Values for column `colony_id` must be in the allowed set.
Allowed Values:
['Biscoe West', 'Cormorant East', 'Dream South', 'Shortcut Point', 'Torgersen North']
Invalid Values Found:
Invalid Value | Count |
---|---|
cormorant NW | 45 |
invalid_colony | 36 |
Torgersen | 35 |
Cormorant | 34 |
biscoe 2 | 34 |
torgersen SE | 31 |
TORGERSEN 4 | 30 |
short point | 28 |
/Shortcut | 26 |
Biscoe | 25 |
dream island | 24 |
Unknown | 24 |
Dream Island | 22 |
dream | 19 |
Rule Violated: Values for column `age_group` must be in the allowed set.
Allowed Values:
['Juvenile', 'Adult', 'Chick', 'UNKNOWN']
Invalid Values Found:
Invalid Value | Count |
---|---|
juvenille | 58 |
unk | 48 |
ADLT | 47 |
chik | 29 |
Rule Violated: Values for column `health_status` must be in the allowed set.
Allowed Values:
['Healthy', 'Critically Ill', 'Underweight', 'Unwell', 'Overweight', 'Unknown']
Invalid Values Found:
Invalid Value | Count |
---|---|
critcal ill | 36 |
Overwight | 34 |
under weight | 33 |
ok | 30 |
Rule Violated: Values for column `study_name` must be in the allowed set.
Allowed Values:
['PAPRI2019', 'PAPRI2020', 'PAPRI2021', 'PAPRI2022', 'PAPRI2023', 'PAPRI2024']
Invalid Values Found:
Invalid Value | Count |
---|---|
PAPR12021 | 60 |
papri2024 | 58 |
STUDY_2022 | 57 |
PP2020 | 48 |
PAPR2023 | 46 |
PAPRI20X9 | 37 |
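Each categorical drill-down above is a count of non-null values that fall outside the allowed set. The sketch below assumes a case-sensitive comparison, which matches the `sex` results (where `Male` is flagged against the allowed value `male`); `invalid_value_counts` is a hypothetical name.

```python
from collections import Counter
from typing import Optional


def invalid_value_counts(
    values: list[Optional[str]], allowed: set[str]
) -> list[tuple[str, int]]:
    """Count non-null values outside the allowed set, most frequent first."""
    counts = Counter(v for v in values if v is not None and v not in allowed)
    return counts.most_common()


# Small illustrative sample, not the real column.
sample = ["Adelie", "adeleie", "Gentto", "adeleie", None, "Gentoo"]
violations = invalid_value_counts(sample, {"Adelie", "Chinstrap", "Gentoo"})
```

Nulls are skipped here because the report tracks them separately in the data profile.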
⚙️ Normalization Actions (Transform Log)
✏️ Columns Renamed (2)
Original Name | New Name |
---|---|
bill length (mm) | bill_length_mm |
bill depth (mm) | bill_depth_mm |
🧹 Strings Cleaned (2)
Column | Operation |
---|---|
clutch_completion | standardize_text |
sex | standardize_text |
📅 Datetimes Parsed (2)
Column | Target Type |
---|---|
capture_date | datetime64[ns] |
date_egg | datetime64[ns] |
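Parsing to `datetime64[ns]` turns unparseable strings into nulls, which is why the `capture_date` comparison later in this output rises from 534 to 915 missing values (entries such as `9999-99-99`, `error`, and `not-a-date` all become NaT). A minimal standard-library sketch of that coercion, assuming a `YYYY-MM-DD` input format; in pandas the usual equivalent is `pd.to_datetime(series, errors="coerce")`, though the source does not show which call the toolkit makes.

```python
from datetime import date, datetime
from typing import Optional


def parse_iso_date(raw: Optional[str]) -> Optional[date]:
    """Parse YYYY-MM-DD text; return None for missing or invalid input."""
    if raw is None:
        return None
    try:
        return datetime.strptime(raw.strip(), "%Y-%m-%d").date()
    except ValueError:
        # Impossible dates (month 13, 9999-99-99) and free text land here.
        return None
```

Note that `2024-13-03`, which appears in the first-rows preview, is rejected because month 13 does not exist.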
🧩 Values Mapped (7)
Column | Mappings Applied |
---|---|
sex | 7 |
species | 1 |
island | 1 |
colony_id | 14 |
age_group | 4 |
health_status | 7 |
study_name | 6 |
🤖 Fuzzy Matches (7)
Column | Original | Corrected | Score |
---|---|---|---|
species | Gentto | Gentoo | 83 |
species | adeleie | Adelie | 92 |
island | bisco | Biscoe | 91 |
island | short cut | Shortcut | 94 |
island | dreamland | Dream | 90 |
island | cormor | Cormorant | 90 |
island | torg | Torgersen | 90 |
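Fuzzy matching maps a misspelled value to the closest allowed value when a similarity score clears a threshold. The sketch below uses the standard library's `difflib.SequenceMatcher` as a stand-in; the toolkit's actual scorer and threshold are not shown in the source, and the 80-point cutoff here is an assumption. The simple ratio reproduces the two `species` scores above, but rows such as `dreamland` → `Dream` (score 90) suggest the toolkit uses a different, probably partial-match-aware, scorer.

```python
from difflib import SequenceMatcher
from typing import Optional


def best_match(
    value: str, allowed: list[str], threshold: int = 80
) -> Optional[tuple[str, int]]:
    """Return (closest allowed value, 0-100 score), or None if below threshold.

    Comparison is case-insensitive; `allowed` must be non-empty.
    """
    scored = [
        (cand, round(100 * SequenceMatcher(None, value.lower(), cand.lower()).ratio()))
        for cand in allowed
    ]
    candidate, score = max(scored, key=lambda pair: pair[1])
    return (candidate, score) if score >= threshold else None


SPECIES = ["Adelie", "Chinstrap", "Gentoo"]
gentto = best_match("Gentto", SPECIES)
adeleie = best_match("adeleie", SPECIES)
```

Returning `None` below the threshold leaves the decision about unmatched values (for example, mapping them to `UNKNOWN`) to an explicit rule.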
📊 Column Value Analysis: Before & After
Column: sex
Value | Count |
---|---|
NaN | 2739 |
MALE | 1369 |
FEMALE | 1310 |
UNKNOWN | 123 |
Value | Original Count | Normalized Count |
---|---|---|
NaN | 2739 | 2739 |
Male | 1308 | 0 |
Female | 1227 | 0 |
F | 83 | 0 |
? | 74 | 0 |
M | 61 | 0 |
Unknown | 49 | 0 |
MALE | 0 | 1369 |
FEMALE | 0 | 1310 |
UNKNOWN | 0 | 123 |
Column: island
Value | Count |
---|---|
Torgersen | 1405 |
Dream | 1184 |
Biscoe | 1084 |
Cormorant | 715 |
NaN | 584 |
Shortcut | 510 |
UNKNOWN | 59 |
Value | Original Count | Normalized Count |
---|---|---|
Torgersen | 1344 | 1405 |
Dream | 1138 | 1184 |
Biscoe | 1029 | 1084 |
Cormorant | 668 | 715 |
NaN | 584 | 584 |
Shortcut | 440 | 510 |
short cut | 70 | 0 |
torg | 61 | 0 |
unknown | 59 | 0 |
bisco | 55 | 0 |
cormor | 47 | 0 |
dreamland | 46 | 0 |
UNKNOWN | 0 | 59 |
Column: species
Value | Count |
---|---|
Gentoo | 1815 |
Adelie | 1784 |
Chinstrap | 1776 |
NaN | 166 |
Value | Original Count | Normalized Count |
---|---|---|
Chinstrap | 1776 | 1776 |
Gentoo | 1670 | 1815 |
Adelie | 1636 | 1784 |
NaN | 166 | 166 |
adeleie | 148 | 0 |
Gentto | 145 | 0 |
Column: health_status
Value | Count |
---|---|
Healthy | 2194 |
Underweight | 1411 |
Overweight | 733 |
NaN | 554 |
Critical | 323 |
Sick | 296 |
UNKNOWN | 30 |
Value | Original Count | Normalized Count |
---|---|---|
Healthy | 2194 | 2194 |
Underweight | 1378 | 1411 |
Overweight | 699 | 733 |
NaN | 554 | 554 |
Unwell | 296 | 0 |
Critically Ill | 287 | 0 |
critcal ill | 36 | 0 |
Overwight | 34 | 0 |
under weight | 33 | 0 |
ok | 30 | 0 |
Critical | 0 | 323 |
Sick | 0 | 296 |
UNKNOWN | 0 | 30 |
Column: colony_id
Value | Count |
---|---|
Torgersen North | 1490 |
Dream South | 1216 |
Biscoe West | 1092 |
Cormorant East | 767 |
Shortcut Point | 511 |
NaN | 405 |
UNKNOWN | 60 |
Value | Original Count | Normalized Count |
---|---|---|
Torgersen North | 1394 | 1490 |
Dream South | 1151 | 1216 |
Biscoe West | 1033 | 1092 |
Cormorant East | 688 | 767 |
Shortcut Point | 457 | 511 |
NaN | 405 | 405 |
cormorant NW | 45 | 0 |
invalid_colony | 36 | 0 |
Torgersen | 35 | 0 |
Cormorant | 34 | 0 |
biscoe 2 | 34 | 0 |
torgersen SE | 31 | 0 |
TORGERSEN 4 | 30 | 0 |
short point | 28 | 0 |
/Shortcut | 26 | 0 |
Biscoe | 25 | 0 |
Unknown | 24 | 0 |
dream island | 24 | 0 |
Dream Island | 22 | 0 |
dream | 19 | 0 |
Column: age_group
Value | Count |
---|---|
Adult | 3822 |
Juvenile | 1073 |
Chick | 477 |
NaN | 121 |
UNKNOWN | 48 |
Value | Original Count | Normalized Count |
---|---|---|
Adult | 3775 | 3822 |
Juvenile | 1015 | 1073 |
Chick | 448 | 477 |
NaN | 121 | 121 |
juvenille | 58 | 0 |
unk | 48 | 0 |
ADLT | 47 | 0 |
chik | 29 | 0 |
UNKNOWN | 0 | 48 |
Column: study_name
Value | Count |
---|---|
PAPRI2020 | 1122 |
PAPRI2021 | 1024 |
PAPRI2022 | 916 |
PAPRI2023 | 824 |
PAPRI2024 | 803 |
NaN | 563 |
PAPRI2019 | 252 |
UNKNOWN | 37 |
Value | Original Count | Normalized Count |
---|---|---|
PAPRI2020 | 1074 | 1122 |
PAPRI2021 | 964 | 1024 |
PAPRI2022 | 859 | 916 |
PAPRI2023 | 778 | 824 |
PAPRI2024 | 745 | 803 |
NaN | 563 | 563 |
PAPRI2019 | 252 | 252 |
PAPR12021 | 60 | 0 |
papri2024 | 58 | 0 |
STUDY_2022 | 57 | 0 |
PP2020 | 48 | 0 |
PAPR2023 | 46 | 0 |
PAPRI20X9 | 37 | 0 |
UNKNOWN | 0 | 37 |
Column: capture_date
Value | Count |
---|---|
NaT | 915 |
2023-01-18 | 10 |
2024-05-09 | 10 |
2024-02-01 | 9 |
2023-06-12 | 8 |
2020-12-25 | 8 |
2022-11-15 | 8 |
2023-06-10 | 8 |
2023-03-22 | 8 |
2024-01-01 | 8 |
2022-08-04 | 8 |
2022-12-03 | 8 |
2024-06-19 | 8 |
2023-09-27 | 7 |
2022-09-28 | 7 |
2022-09-27 | 7 |
2023-10-22 | 7 |
2024-04-25 | 7 |
2023-07-25 | 7 |
2023-08-24 | 7 |
Value | Original Count | Normalized Count |
---|---|---|
NaN | 534 | 915 |
9999-99-99 | 39 | 0 |
error | 33 | 0 |
not-a-date | 30 | 0 |
2023-01-18 | 10 | 10 |
2024-05-09 | 10 | 10 |
2024-02-01 | 9 | 9 |
2020-12-25 | 8 | 8 |
2022-08-04 | 8 | 8 |
2022-11-15 | 8 | 8 |
2022-12-03 | 8 | 8 |
2023-03-22 | 8 | 8 |
2023-06-10 | 8 | 8 |
2023-06-12 | 8 | 8 |
2024-01-01 | 8 | 8 |
2024-06-19 | 8 | 8 |
2020-07-02 | 7 | 7 |
2021-01-21 | 7 | 7 |
2022-01-09 | 7 | 7 |
2022-09-27 | 7 | 7 |
Column: date_egg
Value | Count |
---|---|
NaT | 836 |
2019-12-11 | 13 |
2019-12-27 | 12 |
2020-10-11 | 11 |
2020-07-20 | 11 |
2019-12-17 | 11 |
2019-11-25 | 11 |
2020-06-25 | 11 |
2021-04-03 | 10 |
2021-04-16 | 10 |
2023-10-08 | 10 |
2021-07-05 | 9 |
2022-10-26 | 9 |
2021-01-06 | 9 |
2022-07-13 | 9 |
2022-02-07 | 9 |
2020-01-22 | 9 |
2021-08-30 | 9 |
2020-09-20 | 9 |
2020-01-17 | 9 |
Value | Original Count | Normalized Count |
---|---|---|
NaN | 836 | 836 |
2019-12-11 | 13 | 13 |
2019-12-27 | 12 | 12 |
2019-11-25 | 11 | 11 |
2019-12-17 | 11 | 11 |
2020-06-25 | 11 | 11 |
2020-07-20 | 11 | 11 |
2020-10-11 | 11 | 11 |
2021-04-03 | 10 | 10 |
2021-04-16 | 10 | 10 |
2023-10-08 | 10 | 10 |
2020-01-17 | 9 | 9 |
2020-01-22 | 9 | 9 |
2020-02-26 | 9 | 9 |
2020-09-20 | 9 | 9 |
2021-01-06 | 9 | 9 |
2021-07-05 | 9 | 9 |
2021-08-30 | 9 | 9 |
2021-10-22 | 9 | 9 |
2022-02-07 | 9 | 9 |
Column: clutch_completion
Value | Count |
---|---|
yes | 4314 |
no | 764 |
NaN | 463 |
Value | Original Count | Normalized Count |
---|---|---|
Yes | 4314 | 0 |
No | 764 | 0 |
NaN | 463 | 463 |
yes | 0 | 4314 |
no | 0 | 764 |
🔎 Validation Rules Summary
Validation Rule | Description | Status |
---|---|---|
Schema Conformity | Verify column names match the expected schema. | ✅ Pass |
Dtype Enforcement | Verify column data types match expectations. | ✅ Pass |
Categorical Values | Verify values in categorical columns are within an allowed set. | ✅ Pass |
Numeric Ranges | Verify values in numeric columns are within a defined range. | ✅ Pass |
- ✅ Pass: The data conforms to this rule.
- ⚠️ Fail: One or more issues were found. See drill-down for details.
📈 Summary of Changes
Metric | Value |
---|---|
Original Row Count | 5541 |
Deduplicated Row Count | 5540 |
Rows Removed | 1 |
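The single removed row is the exact duplicate previewed in the diagnostics step. A minimal sketch of exact-row deduplication that keeps the first occurrence, assuming rows are dictionaries with hashable values and `None` for missing entries (hypothetical helper, not the toolkit's code):

```python
from typing import Any, Hashable


def drop_exact_duplicates(rows: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Keep the first occurrence of each fully identical row, preserving order."""
    seen: set[tuple[Hashable, ...]] = set()
    unique = []
    for row in rows:
        # Build an order-independent key from the row's column/value pairs.
        key = tuple((col, row[col]) for col in sorted(row))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# One duplicate in 5541 rows gives the 0.02 % reported in the diagnostics.
duplicate_pct = round(100 * 1 / 5541, 2)
```

Using `None` rather than float NaN for missing values matters here, because two NaN values do not compare equal and would hide duplicates.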
🔍 Duplicate Clusters Found
tag_id | species | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | age_group | sex | colony_id | island | capture_date | health_status | study_name | clutch_completion | date_egg |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
NaN | Gentoo | 48.99 | 14.11 | 220.9 | 5890.0 | Adult | MALE | Torgersen North | Torgersen | 2023-11-17 | NaN | PAPRI2023 | yes | 2023-11-09 |
NaN | Gentoo | 48.99 | 14.11 | 220.9 | 5890.0 | Adult | MALE | Torgersen North | Torgersen | 2023-11-17 | NaN | PAPRI2023 | yes | 2023-11-09 |
[Interactive widget: Visual Summary]
📋 Outlier Detection Log
column | method | outlier_count | lower_bound | upper_bound | outlier_examples |
---|---|---|---|---|---|
bill_length_mm | iqr | 1 | 27.235000 | 62.63500 | [62.64] |
body_mass_g | zscore | 18 | 710.701428 | 6995.79582 | [7000.0, 7000.0, 7000.0, 7000.0, 7000.0] |
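The logged bounds can be checked by hand. With the `bill_length_mm` quartiles from the descriptive statistics (40.51 and 49.36) and the conventional 1.5 × IQR fence, the bounds come out at the values in the log. The z-score threshold is not stated in the source; the `body_mass_g` bounds are consistent with roughly 3.5 standard deviations around the mean, which is an inference rather than a documented setting.

```python
def iqr_bounds(q1: float, q3: float, k: float = 1.5) -> tuple[float, float]:
    """Lower and upper fences at k * IQR beyond the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def zscore_bounds(mean: float, std: float, threshold: float) -> tuple[float, float]:
    """Lower and upper bounds at `threshold` standard deviations from the mean."""
    return mean - threshold * std, mean + threshold * std


# Quartiles taken from the bill length row of the descriptive statistics.
lower, upper = iqr_bounds(40.51, 49.36)
```

The maximum bill length of 62.64 sits just above the upper fence of 62.635, which explains the single IQR outlier.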
🔍 Preview of Rows Containing Outliers
tag_id | species | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | age_group | sex | colony_id | island | capture_date | health_status | study_name | clutch_completion | date_egg |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
NaN | Gentoo | NaN | 14.41 | 221.90 | 7000.00 | Adult | NaN | Torgersen North | Torgersen | 2019-10-31 | Healthy | PAPRI2019 | NaN | NaT |
NaN | NaN | 47.68 | 17.62 | NaN | 7000.00 | Adult | NaN | Torgersen North | Torgersen | 2021-08-17 | Healthy | PAPRI2021 | NaN | 2021-08-14 |
GEN-0041 | Gentoo | 45.63 | 14.13 | 213.20 | 7000.00 | Juvenile | FEMALE | Dream South | Dream | 2021-12-02 | Healthy | PAPRI2021 | NaN | 2021-11-23 |
NaN | Gentoo | 46.39 | 13.84 | 206.30 | 7000.00 | Adult | NaN | Cormorant East | Cormorant | 2022-10-26 | Healthy | PAPRI2022 | NaN | 2022-10-12 |
ADE-0182 | Adelie | 38.46 | 17.16 | 185.10 | 7000.00 | Adult | NaN | Dream South | Dream | 2024-02-03 | Overweight | PAPRI2024 | yes | 2024-01-31 |
NaN | Gentoo | 49.36 | 13.00 | 224.10 | 7000.00 | Adult | NaN | Torgersen North | Torgersen | NaT | Healthy | NaN | no | NaT |
NaN | Gentoo | 40.59 | 14.37 | 230.00 | 7000.00 | Adult | MALE | NaN | Biscoe | NaT | Healthy | PAPRI2021 | yes | 2021-03-25 |
GEN-0301 | Gentoo | 44.56 | 16.48 | 212.70 | 7000.00 | Adult | MALE | Biscoe West | Biscoe | 2022-12-12 | Healthy | PAPRI2022 | no | NaT |
NaN | Gentoo | 45.16 | 15.57 | 218.40 | 7000.00 | Adult | FEMALE | NaN | Cormorant | 2021-07-30 | Healthy | PAPRI2021 | yes | 2021-07-17 |
GEN-0681 | Gentoo | 44.73 | 13.94 | 217.80 | 7000.00 | Adult | NaN | Torgersen North | Torgersen | NaT | Healthy | PAPRI2022 | yes | 2022-11-07 |
GEN-0706 | Gentoo | 45.74 | 14.02 | 217.80 | 7000.00 | Adult | NaN | Dream South | Dream | 2024-02-28 | Healthy | PAPRI2024 | yes | 2024-02-21 |
GEN-0743 | Gentoo | 49.05 | 14.49 | 213.20 | 7000.00 | Adult | FEMALE | Dream South | Dream | NaT | Healthy | PAPRI2020 | yes | 2020-03-17 |
CHN-0860 | Chinstrap | 50.88 | 18.49 | 206.10 | 7000.00 | Adult | NaN | Cormorant East | Cormorant | 2024-07-09 | Overweight | PAPRI2023 | yes | 2023-11-16 |
GEN-0974 | Gentoo | 50.57 | 15.89 | 220.00 | 7000.00 | Adult | NaN | Torgersen North | NaN | 2021-01-05 | NaN | PAPRI2021 | yes | 2020-12-26 |
GEN-0681 | Gentoo | 47.77 | 13.84 | 222.73 | 7378.33 | Adult | NaN | Torgersen North | Torgersen | NaT | Overweight | PAPRI2022 | yes | 2022-11-07 |
NaN | Chinstrap | 51.63 | 18.69 | 212.94 | 7128.38 | Adult | FEMALE | Torgersen North | Torgersen | 2022-03-25 | Overweight | PAPRI2020 | NaN | 2020-03-12 |
NaN | Gentoo | 47.71 | 13.93 | 236.20 | 7085.98 | Adult | NaN | Torgersen North | Torgersen | NaT | Critical | NaN | no | NaT |
CHN-0219 | Chinstrap | 62.64 | 18.00 | 204.26 | 2770.38 | Juvenile | NaN | Torgersen North | UNKNOWN | 2020-10-22 | Critical | PAPRI2019 | yes | 2019-10-16 |
NaN | Gentoo | NaN | 14.99 | 219.59 | 7128.48 | Adult | NaN | Torgersen North | Torgersen | 2021-10-31 | Healthy | PAPRI2019 | NaN | NaT |
[Interactive widget: Outlier Visualizations]
📋 Handling Actions Summary
strategy | column | outliers_handled | details |
---|---|---|---|
clip | bill_length_mm | 1 | Clipped 1 values to bounds. |
median | body_mass_g | 18 | Imputed 18 values with median (3742.00). |
🔍 Details: Capped Values
Column | Row_Index | Original_Value | Capped_Value |
---|---|---|---|
bill_length_mm | 5164 | 62.64 | 62.635 |
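The two handling strategies differ in what replaces an outlier: `clip` moves the value to the nearest bound, while `median` overwrites it with the column median. A minimal sketch follows; the helper names are hypothetical, and computing the median over all non-null values (outliers included) is an assumption, chosen because the logged fill value of 3742.00 equals the pre-handling 50% quantile.

```python
import statistics
from typing import Optional


def clip_value(value: float, lower: float, upper: float) -> float:
    """Move a value to the nearest bound if it falls outside [lower, upper]."""
    return max(lower, min(value, upper))


def replace_outliers_with_median(
    values: list[Optional[float]], lower: float, upper: float
) -> list[Optional[float]]:
    """Replace out-of-bounds values with the median of all non-null values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [
        median if v is not None and not (lower <= v <= upper) else v
        for v in values
    ]


# The capped bill length from the table above.
capped = clip_value(62.64, 27.235, 62.635)
```

Clipping preserves the fact that a value was extreme, whereas median replacement discards it entirely, so the choice per column is a modelling decision.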
📈 Imputation Summary & Null Audit
📋 Imputation Actions Log
Column | Strategy | Fill Value | Nulls Filled |
---|---|---|---|
bill_length_mm | mean | 45.17 | 429 |
body_mass_g | mean | 3841.69 | 406 |
bill_depth_mm | median | 17.49 | 417 |
flipper_length_mm | median | 199.31 | 451 |
sex | mode | MALE | 2739 |
tag_id | constant | UNKNOWN | 2241 |
species | constant | UNKNOWN | 166 |
age_group | constant | UNKNOWN | 121 |
colony_id | constant | UNKNOWN | 405 |
island | constant | UNKNOWN | 584 |
study_name | constant | UNKNOWN | 563 |
capture_date | constant | 1900-01-01 00:00:00 | 915 |
date_egg | constant | 1900-01-01 00:00:00 | 836 |
clutch_completion | constant | UNKNOWN | 463 |
health_status | constant | UNKNOWN | 553 |
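The four strategies in the log (mean, median, mode, constant) differ only in how the fill value is chosen. A minimal sketch using the standard library, assuming `None` marks a missing value; `impute` is a hypothetical helper and not the toolkit's API.

```python
import statistics
from typing import Any, Optional


def impute(values: list[Optional[Any]], strategy: str, constant: Any = None) -> list[Any]:
    """Fill None entries using the named strategy."""
    observed = [v for v in values if v is not None]
    if strategy == "mean":
        fill = statistics.fmean(observed)
    elif strategy == "median":
        fill = statistics.median(observed)
    elif strategy == "mode":
        fill = statistics.mode(observed)  # most frequent observed value
    elif strategy == "constant":
        fill = constant
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]
```

Mode imputation is worth a second look on sparse columns: in this run it assigns all 2739 missing `sex` values to `MALE`, raising that category from 1368 to 4107 rows in the shift analysis that follows.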
🔍 Null Value Audit
Column | Nulls Before | Nulls After | Nulls Filled |
---|---|---|---|
bill_length_mm | 429 | 0 | 429 |
body_mass_g | 406 | 0 | 406 |
bill_depth_mm | 417 | 0 | 417 |
flipper_length_mm | 451 | 0 | 451 |
sex | 2739 | 0 | 2739 |
tag_id | 2241 | 0 | 2241 |
species | 166 | 0 | 166 |
age_group | 121 | 0 | 121 |
colony_id | 405 | 0 | 405 |
island | 584 | 0 | 584 |
study_name | 563 | 0 | 563 |
capture_date | 915 | 0 | 915 |
date_egg | 836 | 0 | 836 |
clutch_completion | 463 | 0 | 463 |
health_status | 553 | 0 | 553 |
📊 Categorical Shift Analysis
Column: sex
Value | Count |
---|---|
MALE | 4107 |
FEMALE | 1310 |
UNKNOWN | 123 |
Value | Original Count | Imputed Count | Change |
---|---|---|---|
NaN | 2739 | 0 | -2739 |
MALE | 1368 | 4107 | 2739 |
FEMALE | 1310 | 1310 | 0 |
UNKNOWN | 123 | 123 | 0 |
Column: tag_id
Value | Count |
---|---|
UNKNOWN | 2241 |
GEN-0271 | 5 |
ADE-0119 | 4 |
GEN-0143 | 4 |
ADE-0176 | 4 |
GEN-0751 | 4 |
GEN-0673 | 4 |
GEN-0433 | 4 |
GEN-0902 | 4 |
GEN-0106 | 4 |
Value | Original Count | Imputed Count | Change |
---|---|---|---|
NaN | 2241 | 0 | -2241 |
GEN-0271 | 5 | 5 | 0 |
ADE-0119 | 4 | 4 | 0 |
ADE-0176 | 4 | 4 | 0 |
ADE-0203 | 4 | 4 | 0 |
CHN-0905 | 4 | 4 | 0 |
GEN-0054 | 4 | 4 | 0 |
GEN-0106 | 4 | 4 | 0 |
GEN-0143 | 4 | 4 | 0 |
GEN-0433 | 4 | 4 | 0 |
Column: species
Value | Count |
---|---|
Gentoo | 1814 |
Adelie | 1784 |
Chinstrap | 1776 |
UNKNOWN | 166 |
Value | Original Count | Imputed Count | Change |
---|---|---|---|
Gentoo | 1814 | 1814 | 0 |
Adelie | 1784 | 1784 | 0 |
Chinstrap | 1776 | 1776 | 0 |
NaN | 166 | 0 | -166 |
UNKNOWN | 0 | 166 | 166 |
Column: age_group
Value | Count |
---|---|
Adult | 3821 |
Juvenile | 1073 |
Chick | 477 |
UNKNOWN | 169 |
Value | Original Count | Imputed Count | Change |
---|---|---|---|
Adult | 3821 | 3821 | 0 |
Juvenile | 1073 | 1073 | 0 |
Chick | 477 | 477 | 0 |
NaN | 121 | 0 | -121 |
UNKNOWN | 48 | 169 | 121 |
Column: colony_id
Value | Count |
---|---|
Torgersen North | 1489 |
Dream South | 1216 |
Biscoe West | 1092 |
Cormorant East | 767 |
Shortcut Point | 511 |
UNKNOWN | 465 |
Value | Original Count | Imputed Count | Change |
---|---|---|---|
Torgersen North | 1489 | 1489 | 0 |
Dream South | 1216 | 1216 | 0 |
Biscoe West | 1092 | 1092 | 0 |
Cormorant East | 767 | 767 | 0 |
Shortcut Point | 511 | 511 | 0 |
NaN | 405 | 0 | -405 |
UNKNOWN | 60 | 465 | 405 |
Column: island
Value | Count |
---|---|
Torgersen | 1404 |
Dream | 1184 |
Biscoe | 1084 |
Cormorant | 715 |
UNKNOWN | 643 |
Shortcut | 510 |
Value | Original Count | Imputed Count | Change |
---|---|---|---|
Torgersen | 1404 | 1404 | 0 |
Dream | 1184 | 1184 | 0 |
Biscoe | 1084 | 1084 | 0 |
Cormorant | 715 | 715 | 0 |
NaN | 584 | 0 | -584 |
Shortcut | 510 | 510 | 0 |
UNKNOWN | 59 | 643 | 584 |
Column: study_name
Value | Count |
---|---|
PAPRI2020 | 1122 |
PAPRI2021 | 1024 |
PAPRI2022 | 916 |
PAPRI2023 | 823 |
PAPRI2024 | 803 |
UNKNOWN | 600 |
PAPRI2019 | 252 |
Value | Original Count | Imputed Count | Change |
---|---|---|---|
PAPRI2020 | 1122 | 1122 | 0 |
PAPRI2021 | 1024 | 1024 | 0 |
PAPRI2022 | 916 | 916 | 0 |
PAPRI2023 | 823 | 823 | 0 |
PAPRI2024 | 803 | 803 | 0 |
NaN | 563 | 0 | -563 |
PAPRI2019 | 252 | 252 | 0 |
UNKNOWN | 37 | 600 | 563 |
Column: clutch_completion
Value | Count |
---|---|
yes | 4313 |
no | 764 |
UNKNOWN | 463 |
Value | Original Count | Imputed Count | Change |
---|---|---|---|
yes | 4313 | 4313 | 0 |
no | 764 | 764 | 0 |
NaN | 463 | 0 | -463 |
UNKNOWN | 0 | 463 | 463 |
Column: health_status
Value | Count |
---|---|
Healthy | 2194 |
Underweight | 1411 |
Overweight | 733 |
UNKNOWN | 583 |
Critical | 323 |
Sick | 296 |
Value | Original Count | Imputed Count | Change |
---|---|---|---|
Healthy | 2194 | 2194 | 0 |
Underweight | 1411 | 1411 | 0 |
Overweight | 733 | 733 | 0 |
NaN | 553 | 0 | -553 |
Critical | 323 | 323 | 0 |
Sick | 296 | 296 | 0 |
UNKNOWN | 30 | 583 | 553 |
⚠️ Remaining Nulls Found
The following columns still contain null values after imputation:
Column | Remaining Nulls |
---|---|
bill_length_mm_iqr_outlier | 429 |
body_mass_g_zscore_outlier | 406 |
[Interactive widget: Imputation Visualizations]
📈 Pipeline Summary
📊 Pipeline Status
Metric | Value |
---|---|
Final Pipeline Status | ✅ PIPELINE CERTIFIED |
Certification Rules Passed | True |
Null Value Audit Passed | True |
🛠️ Final Edits Log
Action | Details |
---|---|
drop_columns | Removed: ['body_mass_g_zscore_outlier', 'bill_length_mm_iqr_outlier'] |
🔬 Final Data Profile
🧬 Data Lifecycle
Metric | Value |
---|---|
Initial Rows | 5541 |
Final Rows | 5540 |
Initial Columns | 15 |
Final Columns | 15 |
- ✅ OK: Passed all configured quality checks.
- ⚠️ High Skew: Skewness exceeds threshold.
- ⚠️ Unexpected Type: Data type mismatch.
📚 Data Dictionary / Schema
Column | Dtype | Unique Values | Audit Remarks | Missing Count | Missing % |
---|---|---|---|---|---|
tag_id | object | 2679 | ✅ OK | 0 | 0.0 |
species | object | 4 | ✅ OK | 0 | 0.0 |
bill_length_mm | float64 | 1985 | ✅ OK | 0 | 0.0 |
bill_depth_mm | float64 | 862 | ✅ OK | 0 | 0.0 |
flipper_length_mm | float64 | 1466 | ✅ OK | 0 | 0.0 |
body_mass_g | float64 | 3324 | ✅ OK | 0 | 0.0 |
age_group | object | 4 | ✅ OK | 0 | 0.0 |
sex | object | 3 | ✅ OK | 0 | 0.0 |
colony_id | object | 6 | ✅ OK | 0 | 0.0 |
island | object | 6 | ✅ OK | 0 | 0.0 |
capture_date | datetime64[ns] | 1746 | ✅ OK | 0 | 0.0 |
health_status | object | 6 | ✅ OK | 0 | 0.0 |
study_name | object | 7 | ✅ OK | 0 | 0.0 |
clutch_completion | object | 3 | ✅ OK | 0 | 0.0 |
date_egg | datetime64[ns] | 1657 | ✅ OK | 0 | 0.0 |
🔢 Descriptive Statistics
Metric | count | mean | std | min | 25% | 50% | 75% | max | skew | kurtosis |
---|---|---|---|---|---|---|---|---|---|---|
bill_length_mm | 5540.0 | 45.165933 | 5.442842 | 30.63 | 40.9775 | 45.235 | 49.07 | 62.635000 | -0.151610 | -0.406055 |
bill_depth_mm | 5540.0 | 17.319850 | 2.146182 | 12.37 | 15.6575 | 17.490 | 18.92 | 23.010000 | -0.135465 | -0.724695 |
flipper_length_mm | 5540.0 | 201.996085 | 13.768625 | 162.79 | 191.8000 | 199.310 | 213.00 | 252.400000 | 0.393223 | -0.395981 |
body_mass_g | 5540.0 | 3841.685482 | 844.964960 | 2376.56 | 3263.7500 | 3806.000 | 4263.75 | 6965.072934 | 0.551892 | 0.015965 |
📄 Data Preview (.head)
tag_id | species | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | age_group | sex | colony_id | island | capture_date | health_status | study_name | clutch_completion | date_egg |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
UNKNOWN | Gentoo | 48.99 | 14.11 | 220.90 | 5890.0 | Adult | MALE | Torgersen North | Torgersen | 2023-11-17 | UNKNOWN | PAPRI2023 | yes | 2023-11-09 |
ADE-0001 | Adelie | 39.55 | 19.92 | 186.20 | 2500.0 | Chick | MALE | Biscoe West | Biscoe | 1900-01-01 | Underweight | PAPRI2022 | yes | 2022-07-20 |
UNKNOWN | Gentoo | 48.23 | 13.00 | 199.31 | 4536.0 | Adult | FEMALE | Biscoe West | UNKNOWN | 2024-04-14 | Healthy | UNKNOWN | yes | 2024-04-12 |
GEN-0001 | Gentoo | 46.22 | 13.91 | 212.80 | 2500.0 | Juvenile | FEMALE | Dream South | Dream | 1900-01-01 | Underweight | PAPRI2020 | yes | 2020-04-14 |
UNKNOWN | Chinstrap | 49.02 | 16.22 | 192.20 | 3735.0 | Adult | MALE | Biscoe West | Biscoe | 2022-10-03 | Healthy | PAPRI2022 | yes | 2022-10-02 |
🛠️ Next Steps
This notebook demonstrates the full analyst pipeline using notebook mode. The following enhancements are planned or encouraged for production workflows:
✅ CLI and Automation
- Use the CLI version for scheduled or automated runs:
python run_toolkit_pipeline.py --config config/run_toolkit_config.yaml
- Integrate into GitHub Actions or cron jobs for continuous data QA
- Swap YAML configs to support different datasets or audit targets
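For scheduled runs, a small wrapper can call the CLI once per config and report failures to the scheduler. This is a sketch under assumptions: it presumes the script is run from the project root and exits with a non-zero status on failure, neither of which the source confirms, and `build_command`, `run_pipeline`, and `run_all` are hypothetical names.

```python
import subprocess
import sys


def build_command(config_path: str) -> list[str]:
    """Build the documented CLI invocation for one config file."""
    return [sys.executable, "run_toolkit_pipeline.py", "--config", config_path]


def run_pipeline(config_path: str) -> int:
    """Run the pipeline CLI for one config and return its exit code."""
    return subprocess.run(build_command(config_path), check=False).returncode


def run_all(config_paths: list[str]) -> list[str]:
    """Run every config in turn and return those that exited non-zero."""
    return [path for path in config_paths if run_pipeline(path) != 0]
```

A cron job or CI step could call `run_all([...])` and exit with an error whenever the returned list is non-empty.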
🚀 Planned Iterations
- Add a dynamic changelog to follow data end to end.
- Extend to a namespace and add additional modules:
  - ML Module Evaluation Suite
  - Visual EDA Suite
- Optional integration with cloud storage (GCS / S3) for inputs and outputs
- Create a streamlined CLI onboarding script (e.g., init_pipeline.py) to scaffold configs
📦 Packaging Notes
- The toolkit is TOML-packaged and installable as a local Python module
- Follows modular design to support interactive, notebook, and script-based workflows