🧪 Full Pipeline Execution (Notebook Mode)

This notebook demonstrates the full pipeline execution using the master controller script run_toolkit_pipeline.py.

  • Controlled via: config/run_toolkit_config.yaml
  • Executes all pipeline modules in sequence (M01–M10)
  • Outputs: Dashboards, reports, plots, and the final certified dataset
  • ✅ Set notebook: true in the YAML to enable inline dashboards

📂 Final outputs are exported to the exports/ and data/processed/ directories.


📎 Notes & Use Cases

🧭 Notes

  • Fully modular pipeline execution from raw to certified clean data
  • Configurable behavior using a single YAML file
  • Can be executed interactively (with displays) or headlessly (silent mode)

💼 Use Cases

  • End-to-end QA audits for new or synthetic datasets
  • Validating preprocessing logic during exploratory workflows
  • Certifying pipeline output before downstream modeling
  • Showcasing toolkit capabilities in interviews or portfolio reviews

🔁 Alternate Modes

  • Set notebook: false in the YAML to run this notebook silently (ideal for automation or CI).
  • Run the pipeline as a CLI script outside notebooks with:

    python run_toolkit_pipeline.py --config config/run_toolkit_config.yaml

In [1]:
from analyst_toolkit.run_toolkit_pipeline import run_full_pipeline

final_df = run_full_pipeline(config_path="config/run_toolkit_config.yaml")
2025-08-05 12:47:35,316 - INFO - --- Loading Master Orchestration Config from config/run_toolkit_config.yaml ---
2025-08-05 12:47:35,317 - INFO - --- 🚚 Loading initial data from data/raw/synthetic_penguins_v3.5.csv ---
2025-08-05 12:47:35,323 - INFO - --- 🚀 Starting Module: DIAGNOSTICS ---
Stage: M01 Data Diagnostics ✅ | Columns with Nulls: 15 | Duplicate Rows Found: 1.0 | Shape: 5541 Rows, 15 Columns
📈 Key Metrics

🔷 Shape

Rows Columns
5541 15

🧠 Memory Usage

Memory Usage
3.26 MB

♻️ Duplicate Summary

Duplicate Rows Duplicate %
1 0.02
📝 Full Profile & Cardinality

🔢 High Cardinality

Column Unique Values
tag_id 2678
capture_date 1917
date_egg 1656
colony_id 19
study_name 12
island 11
Audit Remarks Key:
  • ✅ OK: Passed all configured quality checks.
  • ⚠️ High Skew: Skewness exceeds the configured threshold.
  • ⚠️ Unexpected Type: Data type does not match the expected type.

📚 Full Data Profile

Column Dtype Unique Values Audit Remarks Missing Count Missing %
tag_id object 2678 ✅ OK 2242 40.46
species object 5 ✅ OK 166 3.00
bill length (mm) float64 1984 ✅ OK 429 7.74
bill depth (mm) float64 862 ✅ OK 417 7.53
flipper_length_mm float64 1466 ✅ OK 451 8.14
body_mass_g float64 3328 ✅ OK 406 7.33
age_group object 7 ✅ OK 121 2.18
sex object 6 ✅ OK 2739 49.43
colony_id object 19 ✅ OK 405 7.31
island object 11 ✅ OK 584 10.54
capture_date object 1917 ✅ OK 534 9.64
health_status object 9 ✅ OK 554 10.00
study_name object 12 ✅ OK 563 10.16
clutch_completion object 2 ✅ OK 463 8.36
date_egg object 1656 ✅ OK 836 15.09
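
The Missing Count and Missing % columns above are simple per-column tallies. The sketch below shows that arithmetic on toy rows using only the standard library; the function name `missing_profile` and the use of `None` for missing values are assumptions made for illustration, not the toolkit's API.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[object]]


def missing_profile(rows: List[Row], columns: List[str]) -> Dict[str, Dict[str, float]]:
    """Return missing count and missing % per column (None means missing)."""
    total = len(rows)
    profile: Dict[str, Dict[str, float]] = {}
    for col in columns:
        missing = sum(1 for row in rows if row.get(col) is None)
        pct = round(100 * missing / total, 2) if total else 0.0
        profile[col] = {"missing_count": missing, "missing_pct": pct}
    return profile


# Toy rows (hypothetical values, not a sample of the real dataset).
rows = [
    {"tag_id": None, "species": "Gentoo"},
    {"tag_id": "ADE-0001", "species": "Adelie"},
    {"tag_id": None, "species": None},
    {"tag_id": "GEN-0001", "species": "Gentoo"},
]
profile = missing_profile(rows, ["tag_id", "species"])
print(profile)

# The same arithmetic reproduces the figure reported for tag_id above:
# 2242 missing values out of 5541 rows.
print(round(100 * 2242 / 5541, 2))  # 40.46
```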
🔬 Quantitative Summary

🔢 Descriptive Statistics

Metric count mean std min 25% 50% 75% max skew kurtosis
bill length (mm) 5112.0 45.166682 5.666410 30.63 40.51 45.950 49.360 62.64 -0.145952 -0.606829
bill depth (mm) 5124.0 17.305377 2.231495 12.37 15.49 17.485 19.030 23.01 -0.111456 -0.897492
flipper_length_mm 5090.0 202.237800 14.342621 162.79 191.10 199.315 214.100 252.40 0.329099 -0.616376
body_mass_g 5135.0 3853.645265 898.232986 2376.56 3219.50 3742.000 4376.515 7378.33 0.616778 0.086446
📄 Preview of Duplicated Rows
tag_id species bill length (mm) bill depth (mm) flipper_length_mm body_mass_g age_group sex colony_id island capture_date health_status study_name clutch_completion date_egg
NaN Gentoo 48.99 14.11 220.9 5890.0 Adult Male Torgersen North Torgersen 2023-11-17 NaN PAPRI2023 Yes 2023-11-09
NaN Gentoo 48.99 14.11 220.9 5890.0 Adult Male Torgersen North Torgersen 2023-11-17 NaN PAPRI2023 Yes 2023-11-09
🔍 First Rows Preview

📋 First 5 Rows (.head)

tag_id species bill length (mm) bill depth (mm) flipper_length_mm body_mass_g age_group sex colony_id island capture_date health_status study_name clutch_completion date_egg
NaN Gentoo 48.99 14.11 220.9 5890.0 Adult Male Torgersen North Torgersen 2023-11-17 NaN PAPRI2023 Yes 2023-11-09
NaN Gentoo 48.99 14.11 220.9 5890.0 Adult Male Torgersen North Torgersen 2023-11-17 NaN PAPRI2023 Yes 2023-11-09
ADE-0001 Adelie 39.55 19.92 186.2 2500.0 Chick Male Biscoe West Biscoe 2024-13-03 Underweight PAPRI2022 Yes 2022-07-20
NaN Gentoo 48.23 13.00 NaN 4536.0 Adult Female Biscoe West NaN 2024-04-14 Healthy NaN Yes 2024-04-12
GEN-0001 Gentoo 46.22 13.91 212.8 2500.0 Juvenile Female Dream South Dream NaN Underweight PAPRI2020 Yes 2020-04-14
[Interactive widget not rendered: Visual Profile]
Stage: M02 Data Validation ⚠️ | Checks Passed: 1/4 | Row Coverage: 36.62%
🔎 Validation Rules Summary
Validation Rule Description Status
Schema Conformity Verify column names match the expected schema. ⚠️ Fail (2 issues)
Dtype Enforcement Verify column data types match expectations. ⚠️ Fail (1 issues)
Categorical Values Verify values in categorical columns are within an allowed set. ⚠️ Fail (7 issues)
Numeric Ranges Verify values in numeric columns are within a defined range. ✅ Pass
Status Key:
  • ✅ Pass: The data conforms to this rule.
  • ⚠️ Fail: One or more issues were found. See drill-down for details.

Failure Details

⚠️ Drill-Down: Schema Conformity
Issue Type Columns
Missing bill depth_mm, bill_length_mm
Unexpected bill length (mm), bill depth (mm)
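
A schema conformity check of this kind can be expressed as two set differences. The sketch below is a minimal illustration under that assumption; `schema_diff` is a hypothetical helper, and the expected column list is inferred from the renames applied later in M03 rather than read from the actual config.

```python
from typing import Dict, Iterable, List


def schema_diff(actual: Iterable[str], expected: Iterable[str]) -> Dict[str, List[str]]:
    """Compare actual column names against an expected schema."""
    actual_set, expected_set = set(actual), set(expected)
    return {
        "missing": sorted(expected_set - actual_set),      # expected but absent
        "unexpected": sorted(actual_set - expected_set),   # present but not expected
    }


# Column names as loaded from the raw file (subset shown).
actual_columns = ["tag_id", "species", "bill length (mm)", "bill depth (mm)", "flipper_length_mm"]
# Assumed expected schema (subset), inferred from the M03 rename log.
expected_columns = ["tag_id", "species", "bill_length_mm", "bill_depth_mm", "flipper_length_mm"]

diff = schema_diff(actual_columns, expected_columns)
print(diff)
```

Once the two columns are renamed, both lists in the result become empty, which corresponds to the rule passing on the second validation run.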
⚠️ Drill-Down: Dtype Enforcement
Column Expected Type Actual Type
flipper_length_mm int64 float64
⚠️ Drill-Down: Categorical Values

Rule Violated:

Values for column species must be in the allowed set.

Allowed Values:

['Adelie', 'Chinstrap', 'Gentoo']

Invalid Values Found:

Invalid Value Count
adeleie 148
Gentto 145

Rule Violated:

Values for column island must be in the allowed set.

Allowed Values:

['Dream', 'Biscoe', 'Torgersen', 'Cormorant', 'Shortcut']

Invalid Values Found:

Invalid Value Count
short cut 70
torg 61
unknown 59
bisco 55
cormor 47
dreamland 46

Rule Violated:

Values for column sex must be in the allowed set.

Allowed Values:

['male', 'female', 'UNKNOWN']

Invalid Values Found:

Invalid Value Count
Male 1308
Female 1227
F 83
? 74
M 61
Unknown 49

Rule Violated:

Values for column colony_id must be in the allowed set.

Allowed Values:

['Biscoe West', 'Cormorant East', 'Dream South', 'Shortcut Point', 'Torgersen North']

Invalid Values Found:

Invalid Value Count
cormorant NW 45
invalid_colony 36
Torgersen 35
Cormorant 34
biscoe 2 34
torgersen SE 31
TORGERSEN 4 30
short point 28
/Shortcut 26
Biscoe 25
dream island 24
Unknown 24
Dream Island 22
dream 19

Rule Violated:

Values for column age_group must be in the allowed set.

Allowed Values:

['Juvenile', 'Adult', 'Chick', 'UNKNOWN']

Invalid Values Found:

Invalid Value Count
juvenille 58
unk 48
ADLT 47
chik 29

Rule Violated:

Values for column health_status must be in the allowed set.

Allowed Values:

['Healthy', 'Critically Ill', 'Underweight', 'Unwell', 'Overweight', 'Unknown']

Invalid Values Found:

Invalid Value Count
critcal ill 36
Overwight 34
under weight 33
ok 30

Rule Violated:

Values for column study_name must be in the allowed set.

Allowed Values:

['PAPRI2019', 'PAPRI2020', 'PAPRI2021', 'PAPRI2022', 'PAPRI2023', 'PAPRI2024']

Invalid Values Found:

Invalid Value Count
PAPR12021 60
papri2024 58
STUDY_2022 57
PP2020 48
PAPR2023 46
PAPRI20X9 37
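
Each "Invalid Values Found" table above is a count of values that fall outside the allowed set. A minimal sketch of that check follows, assuming missing values are skipped rather than reported as invalid (consistent with the tables above, where NaN never appears). `invalid_values` is a hypothetical name, and the sample values are toy data.

```python
from collections import Counter
from typing import Iterable, List, Optional, Tuple


def invalid_values(values: Iterable[Optional[str]], allowed: Iterable[str]) -> List[Tuple[str, int]]:
    """Count values outside the allowed set, most frequent first. None is skipped."""
    allowed_set = set(allowed)
    counts = Counter(v for v in values if v is not None and v not in allowed_set)
    return counts.most_common()


# Allowed set for the sex column, as shown in the validation output.
allowed_sex = ["male", "female", "UNKNOWN"]
# Toy sample of raw values (not the real column).
sample = ["Male", "Female", "male", "F", "Male", None, "?"]

report = invalid_values(sample, allowed_sex)
for value, count in report:
    print(value, count)
```

Because the comparison is case-sensitive, "Male" and "Female" count as violations until M03 standardizes them.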

Stage: M03 Data Normalization ✅ | Action Types: 5 | Total Transformations: 20
⚙️ Normalization Actions (Transform Log)

✏️ Columns Renamed (2)

Original Name New Name
bill length (mm) bill_length_mm
bill depth (mm) bill_depth_mm

🧹 Strings Cleaned (2)

Column Operation
clutch_completion standardize_text
sex standardize_text

📅 Datetimes Parsed (2)

Column Target Type
capture_date datetime64[ns]
date_egg datetime64[ns]

🧩 Values Mapped (7)

Column Mappings Applied
sex 7
species 1
island 1
colony_id 14
age_group 4
health_status 7
study_name 6

🤖 Fuzzy Matches (7)

Column Original Corrected Score
species Gentto Gentoo 83
species adeleie Adelie 92
island bisco Biscoe 91
island short cut Shortcut 94
island dreamland Dream 90
island cormor Cormorant 90
island torg Torgersen 90
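
To illustrate the idea behind the fuzzy matches, the sketch below scores a raw value against each allowed value and keeps the best candidate above a cutoff. It uses `difflib.SequenceMatcher` from the standard library as a stand-in; the toolkit's actual matcher and threshold are not shown in this notebook, so scores from this sketch will not generally agree with the table above. `best_match` and `min_score` are hypothetical names.

```python
from difflib import SequenceMatcher
from typing import Iterable, Optional, Tuple


def best_match(value: str, allowed: Iterable[str], min_score: int = 80) -> Optional[Tuple[str, int]]:
    """Return (closest allowed value, score 0-100), or None if below min_score."""
    best: Optional[Tuple[str, int]] = None
    for candidate in allowed:
        # Case-insensitive similarity ratio, scaled to 0-100.
        ratio = SequenceMatcher(None, value.lower(), candidate.lower()).ratio()
        score = round(100 * ratio)
        if best is None or score > best[1]:
            best = (candidate, score)
    if best is not None and best[1] >= min_score:
        return best
    return None


species = ["Adelie", "Chinstrap", "Gentoo"]
print(best_match("Gentto", species))
print(best_match("adeleie", species))
print(best_match("invalid_colony", species))  # None: nothing is close enough
```

Returning `None` below the cutoff matters: values with no plausible match should fall through to an explicit mapping (for example to UNKNOWN) instead of being silently forced into the nearest category.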
📊 Column Value Analysis: Before & After
Column: sex
Normalized Values
Value Count
NaN 2739
MALE 1369
FEMALE 1310
UNKNOWN 123
Value Audit
Value Original Count Normalized Count
NaN 2739 2739
Male 1308 0
Female 1227 0
F 83 0
? 74 0
M 61 0
Unknown 49 0
MALE 0 1369
FEMALE 0 1310
UNKNOWN 0 123
Column: island
Normalized Values
Value Count
Torgersen 1405
Dream 1184
Biscoe 1084
Cormorant 715
NaN 584
Shortcut 510
UNKNOWN 59
Value Audit
Value Original Count Normalized Count
Torgersen 1344 1405
Dream 1138 1184
Biscoe 1029 1084
Cormorant 668 715
NaN 584 584
Shortcut 440 510
short cut 70 0
torg 61 0
unknown 59 0
bisco 55 0
cormor 47 0
dreamland 46 0
UNKNOWN 0 59
Column: species
Normalized Values
Value Count
Gentoo 1815
Adelie 1784
Chinstrap 1776
NaN 166
Value Audit
Value Original Count Normalized Count
Chinstrap 1776 1776
Gentoo 1670 1815
Adelie 1636 1784
NaN 166 166
adeleie 148 0
Gentto 145 0
Column: health_status
Normalized Values
Value Count
Healthy 2194
Underweight 1411
Overweight 733
NaN 554
Critical 323
Sick 296
UNKNOWN 30
Value Audit
Value Original Count Normalized Count
Healthy 2194 2194
Underweight 1378 1411
Overweight 699 733
NaN 554 554
Unwell 296 0
Critically Ill 287 0
critcal ill 36 0
Overwight 34 0
under weight 33 0
ok 30 0
Critical 0 323
Sick 0 296
UNKNOWN 0 30
Column: colony_id
Normalized Values
Value Count
Torgersen North 1490
Dream South 1216
Biscoe West 1092
Cormorant East 767
Shortcut Point 511
NaN 405
UNKNOWN 60
Value Audit
Value Original Count Normalized Count
Torgersen North 1394 1490
Dream South 1151 1216
Biscoe West 1033 1092
Cormorant East 688 767
Shortcut Point 457 511
NaN 405 405
cormorant NW 45 0
invalid_colony 36 0
Torgersen 35 0
Cormorant 34 0
biscoe 2 34 0
torgersen SE 31 0
TORGERSEN 4 30 0
short point 28 0
/Shortcut 26 0
Biscoe 25 0
Unknown 24 0
dream island 24 0
Dream Island 22 0
dream 19 0
Column: age_group
Normalized Values
Value Count
Adult 3822
Juvenile 1073
Chick 477
NaN 121
UNKNOWN 48
Value Audit
Value Original Count Normalized Count
Adult 3775 3822
Juvenile 1015 1073
Chick 448 477
NaN 121 121
juvenille 58 0
unk 48 0
ADLT 47 0
chik 29 0
UNKNOWN 0 48
Column: study_name
Normalized Values
Value Count
PAPRI2020 1122
PAPRI2021 1024
PAPRI2022 916
PAPRI2023 824
PAPRI2024 803
NaN 563
PAPRI2019 252
UNKNOWN 37
Value Audit
Value Original Count Normalized Count
PAPRI2020 1074 1122
PAPRI2021 964 1024
PAPRI2022 859 916
PAPRI2023 778 824
PAPRI2024 745 803
NaN 563 563
PAPRI2019 252 252
PAPR12021 60 0
papri2024 58 0
STUDY_2022 57 0
PP2020 48 0
PAPR2023 46 0
PAPRI20X9 37 0
UNKNOWN 0 37
Column: capture_date
Normalized Values
Value Count
NaT 915
2023-01-18 10
2024-05-09 10
2024-02-01 9
2023-06-12 8
2020-12-25 8
2022-11-15 8
2023-06-10 8
2023-03-22 8
2024-01-01 8
2022-08-04 8
2022-12-03 8
2024-06-19 8
2023-09-27 7
2022-09-28 7
2022-09-27 7
2023-10-22 7
2024-04-25 7
2023-07-25 7
2023-08-24 7
Value Audit
Value Original Count Normalized Count
NaN 534 915
9999-99-99 39 0
error 33 0
not-a-date 30 0
2023-01-18 10 10
2024-05-09 10 10
2024-02-01 9 9
2020-12-25 8 8
2022-08-04 8 8
2022-11-15 8 8
2022-12-03 8 8
2023-03-22 8 8
2023-06-10 8 8
2023-06-12 8 8
2024-01-01 8 8
2024-06-19 8 8
2020-07-02 7 7
2021-01-21 7 7
2022-01-09 7 7
2022-09-27 7 7
Column: date_egg
Normalized Values
Value Count
NaT 836
2019-12-11 13
2019-12-27 12
2020-10-11 11
2020-07-20 11
2019-12-17 11
2019-11-25 11
2020-06-25 11
2021-04-03 10
2021-04-16 10
2023-10-08 10
2021-07-05 9
2022-10-26 9
2021-01-06 9
2022-07-13 9
2022-02-07 9
2020-01-22 9
2021-08-30 9
2020-09-20 9
2020-01-17 9
Value Audit
Value Original Count Normalized Count
NaN 836 836
2019-12-11 13 13
2019-12-27 12 12
2019-11-25 11 11
2019-12-17 11 11
2020-06-25 11 11
2020-07-20 11 11
2020-10-11 11 11
2021-04-03 10 10
2021-04-16 10 10
2023-10-08 10 10
2020-01-17 9 9
2020-01-22 9 9
2020-02-26 9 9
2020-09-20 9 9
2021-01-06 9 9
2021-07-05 9 9
2021-08-30 9 9
2021-10-22 9 9
2022-02-07 9 9
Column: clutch_completion
Normalized Values
Value Count
yes 4314
no 764
NaN 463
Value Audit
Value Original Count Normalized Count
Yes 4314 0
No 764 0
NaN 463 463
yes 0 4314
no 0 764
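
The Value Audit tables above pair each value's count before normalization with its count after. A minimal sketch of that comparison, assuming two parallel lists of values; `value_audit` is a hypothetical helper, and plain lowercasing is used only as a stand-in for the toolkit's standardize_text step.

```python
from collections import Counter
from typing import List, Optional, Sequence, Tuple

AuditRow = Tuple[Optional[str], int, int]


def value_audit(original: Sequence[Optional[str]], normalized: Sequence[Optional[str]]) -> List[AuditRow]:
    """Return (value, original count, normalized count), largest original count first."""
    before, after = Counter(original), Counter(normalized)
    values = sorted(set(before) | set(after), key=lambda v: (-before[v], str(v)))
    return [(v, before[v], after[v]) for v in values]


# Toy column (not the real data); None marks a missing value.
original = ["Yes", "No", "Yes", None]
# Stand-in normalization: lowercase strings, leave missing values alone.
normalized = [v.lower() if isinstance(v, str) else v for v in original]

audit = value_audit(original, normalized)
for row in audit:
    print(row)
```

A quick sanity check on any such audit is that both count columns sum to the same total, since normalization relabels values without adding or dropping rows.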
Stage: M02 Data Validation ✅ | Checks Passed: 4/4 | Row Coverage: 100.0%
🔎 Validation Rules Summary
Validation Rule Description Status
Schema Conformity Verify column names match the expected schema. ✅ Pass
Dtype Enforcement Verify column data types match expectations. ✅ Pass
Categorical Values Verify values in categorical columns are within an allowed set. ✅ Pass
Numeric Ranges Verify values in numeric columns are within a defined range. ✅ Pass
Status Key:
  • ✅ Pass: The data conforms to this rule.
  • ⚠️ Fail: One or more issues were found. See drill-down for details.
Stage: M04 Deduplication ⚠️ | Rows Removed: 1 | Criteria: Based on all columns
📈 Summary of Changes
Metric Value
Original Row Count 5541
Deduplicated Row Count 5540
Rows Removed 1
🔍 Duplicate Clusters Found
tag_id species bill_length_mm bill_depth_mm flipper_length_mm body_mass_g age_group sex colony_id island capture_date health_status study_name clutch_completion date_egg
NaN Gentoo 48.99 14.11 220.9 5890.0 Adult MALE Torgersen North Torgersen 2023-11-17 NaN PAPRI2023 yes 2023-11-09
NaN Gentoo 48.99 14.11 220.9 5890.0 Adult MALE Torgersen North Torgersen 2023-11-17 NaN PAPRI2023 yes 2023-11-09
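
Deduplication "based on all columns" keeps the first occurrence of each fully identical row. The sketch below shows one way to do that with the standard library, assuming rows are dictionaries with hashable values and `None` for missing data; `drop_exact_duplicates` is a hypothetical helper, not the toolkit's function.

```python
from typing import Dict, List, Sequence, Tuple

Row = Dict[str, object]


def drop_exact_duplicates(rows: List[Row], columns: Sequence[str]) -> Tuple[List[Row], List[Row]]:
    """Keep the first occurrence of each row; return (kept, removed)."""
    seen = set()
    kept: List[Row] = []
    removed: List[Row] = []
    for row in rows:
        key = tuple(row.get(col) for col in columns)  # compare on every listed column
        if key in seen:
            removed.append(row)
        else:
            seen.add(key)
            kept.append(row)
    return kept, removed


columns = ["tag_id", "species", "bill_length_mm", "body_mass_g"]
# Subset of columns; the first two rows mirror the duplicate pair shown above.
rows = [
    {"tag_id": None, "species": "Gentoo", "bill_length_mm": 48.99, "body_mass_g": 5890.0},
    {"tag_id": None, "species": "Gentoo", "bill_length_mm": 48.99, "body_mass_g": 5890.0},
    {"tag_id": "ADE-0001", "species": "Adelie", "bill_length_mm": 39.55, "body_mass_g": 2500.0},
]

kept, removed = drop_exact_duplicates(rows, columns)
print(len(rows), len(kept), len(removed))
```

Treating two missing values as equal is a deliberate choice here: it is what allows the pair above, which has no tag_id or health_status, to be recognized as a duplicate.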
[Interactive widget not rendered: Visual Summary]
Stage: M05 Outlier Detection ⚠️ | Total Outliers Found: 19 | Columns Affected: 2
📋 Outlier Detection Log
column method outlier_count lower_bound upper_bound outlier_examples
bill_length_mm iqr 1 27.235000 62.63500 [62.64]
body_mass_g zscore 18 710.701428 6995.79582 [7000.0, 7000.0, 7000.0, 7000.0, 7000.0]
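
The bounds in the log can be read as standard fence calculations. The sketch below assumes the common 1.5 × IQR rule for the iqr method and a mean ± threshold × std rule for the zscore method; the multiplier and threshold are assumptions, since the config values are not shown here. With the quartiles reported for bill length (mm) in the M01 descriptive statistics (40.51 and 49.36), the 1.5 × IQR rule reproduces the logged bounds.

```python
from typing import Tuple


def iqr_bounds(q1: float, q3: float, k: float = 1.5) -> Tuple[float, float]:
    """Lower/upper fences at k * IQR beyond the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def zscore_bounds(mean: float, std: float, threshold: float = 3.0) -> Tuple[float, float]:
    """Lower/upper bounds at `threshold` standard deviations from the mean."""
    return mean - threshold * std, mean + threshold * std


def is_outlier(value: float, bounds: Tuple[float, float]) -> bool:
    lower, upper = bounds
    return value < lower or value > upper


# Quartiles for bill length taken from the M01 descriptive statistics.
bill_bounds = iqr_bounds(40.51, 49.36)
print(round(bill_bounds[0], 3), round(bill_bounds[1], 3))  # 27.235 62.635
print(is_outlier(62.64, bill_bounds))                      # True
```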
🔍 Preview of Rows Containing Outliers
tag_id species bill_length_mm bill_depth_mm flipper_length_mm body_mass_g age_group sex colony_id island capture_date health_status study_name clutch_completion date_egg
NaN Gentoo NaN 14.41 221.90 7000.00 Adult NaN Torgersen North Torgersen 2019-10-31 Healthy PAPRI2019 NaN NaT
NaN NaN 47.68 17.62 NaN 7000.00 Adult NaN Torgersen North Torgersen 2021-08-17 Healthy PAPRI2021 NaN 2021-08-14
GEN-0041 Gentoo 45.63 14.13 213.20 7000.00 Juvenile FEMALE Dream South Dream 2021-12-02 Healthy PAPRI2021 NaN 2021-11-23
NaN Gentoo 46.39 13.84 206.30 7000.00 Adult NaN Cormorant East Cormorant 2022-10-26 Healthy PAPRI2022 NaN 2022-10-12
ADE-0182 Adelie 38.46 17.16 185.10 7000.00 Adult NaN Dream South Dream 2024-02-03 Overweight PAPRI2024 yes 2024-01-31
NaN Gentoo 49.36 13.00 224.10 7000.00 Adult NaN Torgersen North Torgersen NaT Healthy NaN no NaT
NaN Gentoo 40.59 14.37 230.00 7000.00 Adult MALE NaN Biscoe NaT Healthy PAPRI2021 yes 2021-03-25
GEN-0301 Gentoo 44.56 16.48 212.70 7000.00 Adult MALE Biscoe West Biscoe 2022-12-12 Healthy PAPRI2022 no NaT
NaN Gentoo 45.16 15.57 218.40 7000.00 Adult FEMALE NaN Cormorant 2021-07-30 Healthy PAPRI2021 yes 2021-07-17
GEN-0681 Gentoo 44.73 13.94 217.80 7000.00 Adult NaN Torgersen North Torgersen NaT Healthy PAPRI2022 yes 2022-11-07
GEN-0706 Gentoo 45.74 14.02 217.80 7000.00 Adult NaN Dream South Dream 2024-02-28 Healthy PAPRI2024 yes 2024-02-21
GEN-0743 Gentoo 49.05 14.49 213.20 7000.00 Adult FEMALE Dream South Dream NaT Healthy PAPRI2020 yes 2020-03-17
CHN-0860 Chinstrap 50.88 18.49 206.10 7000.00 Adult NaN Cormorant East Cormorant 2024-07-09 Overweight PAPRI2023 yes 2023-11-16
GEN-0974 Gentoo 50.57 15.89 220.00 7000.00 Adult NaN Torgersen North NaN 2021-01-05 NaN PAPRI2021 yes 2020-12-26
GEN-0681 Gentoo 47.77 13.84 222.73 7378.33 Adult NaN Torgersen North Torgersen NaT Overweight PAPRI2022 yes 2022-11-07
NaN Chinstrap 51.63 18.69 212.94 7128.38 Adult FEMALE Torgersen North Torgersen 2022-03-25 Overweight PAPRI2020 NaN 2020-03-12
NaN Gentoo 47.71 13.93 236.20 7085.98 Adult NaN Torgersen North Torgersen NaT Critical NaN no NaT
CHN-0219 Chinstrap 62.64 18.00 204.26 2770.38 Juvenile NaN Torgersen North UNKNOWN 2020-10-22 Critical PAPRI2019 yes 2019-10-16
NaN Gentoo NaN 14.99 219.59 7128.48 Adult NaN Torgersen North Torgersen 2021-10-31 Healthy PAPRI2019 NaN NaT
[Interactive widget not rendered: Outlier Visualizations]
Stage: M06 Outlier Handling ⚠️ | Strategies Used: clip, median | Total Outliers Handled: 19
📋 Handling Actions Summary
strategy column outliers_handled details
clip bill_length_mm 1 Clipped 1 values to bounds.
median body_mass_g 18 Imputed 18 values with median (3742.00).
🔍 Details: Capped Values
Column Row_Index Original_Value Capped_Value
bill_length_mm 5164 62.64 62.635
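
The two strategies in the summary differ in what replaces an outlier: clip moves the value to the nearest bound, while median substitutes the column median. A minimal sketch of both follows, assuming the median is taken over all non-missing values in the column; the helper names are hypothetical, and the body mass list is toy data.

```python
from statistics import median
from typing import List, Optional


def clip(value: float, lower: float, upper: float) -> float:
    """Move a value to the nearest bound if it falls outside [lower, upper]."""
    return max(lower, min(value, upper))


def replace_outliers_with_median(values: List[Optional[float]], lower: float, upper: float) -> List[Optional[float]]:
    """Replace out-of-bounds values with the column median; keep missing values as-is."""
    present = [v for v in values if v is not None]
    fill = median(present)
    return [fill if v is not None and not (lower <= v <= upper) else v for v in values]


# The single clipped value from the log above.
print(clip(62.64, 27.235, 62.635))  # 62.635

# Toy body mass values (not the real column), using the logged z-score bounds.
masses = [3000.0, 3500.0, 4000.0, 7000.0, None]
handled = replace_outliers_with_median(masses, 710.701428, 6995.79582)
print(handled)
```

Clipping preserves the fact that a value was extreme, whereas median replacement discards it entirely, which is a reasonable choice for sentinel-like values such as the repeated 7000.0 readings.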
Stage: M07 Data Imputation ✅ | Total Values Filled: 11289 | Columns Affected: 15
📈 Imputation Summary & Null Audit

📋 Imputation Actions Log

Column Strategy Fill Value Nulls Filled
bill_length_mm mean 45.17 429
body_mass_g mean 3841.69 406
bill_depth_mm median 17.49 417
flipper_length_mm median 199.31 451
sex mode MALE 2739
tag_id constant UNKNOWN 2241
species constant UNKNOWN 166
age_group constant UNKNOWN 121
colony_id constant UNKNOWN 405
island constant UNKNOWN 584
study_name constant UNKNOWN 563
capture_date constant 1900-01-01 00:00:00 915
date_egg constant 1900-01-01 00:00:00 836
clutch_completion constant UNKNOWN 463
health_status constant UNKNOWN 553

🔍 Null Value Audit

Column Nulls Before Nulls After Nulls Filled
bill_length_mm 429 0 429
body_mass_g 406 0 406
bill_depth_mm 417 0 417
flipper_length_mm 451 0 451
sex 2739 0 2739
tag_id 2241 0 2241
species 166 0 166
age_group 121 0 121
colony_id 405 0 405
island 584 0 584
study_name 563 0 563
capture_date 915 0 915
date_egg 836 0 836
clutch_completion 463 0 463
health_status 553 0 553
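
The strategies in the actions log (mean, median, mode, constant) all follow the same pattern: compute one fill value from the non-missing entries, then substitute it for every null. The sketch below illustrates that pattern with the standard library; `impute` is a hypothetical helper, and `None` stands in for NaN/NaT.

```python
from statistics import mean, median, mode
from typing import List, Optional, Tuple


def impute(values: List[Optional[object]], strategy: str, constant: object = None) -> Tuple[List[object], object, int]:
    """Fill None entries; return (filled values, fill value, nulls filled)."""
    present = [v for v in values if v is not None]
    if strategy == "mean":
        fill = mean(present)
    elif strategy == "median":
        fill = median(present)
    elif strategy == "mode":
        fill = mode(present)  # most frequent value
    elif strategy == "constant":
        fill = constant
    else:
        raise ValueError(f"Unknown strategy: {strategy}")
    nulls = sum(1 for v in values if v is None)
    return [fill if v is None else v for v in values], fill, nulls


# Toy columns (not the real data).
print(impute([1.0, None, 3.0], "mean"))
print(impute(["MALE", "FEMALE", None, "MALE"], "mode"))
print(impute(["Gentoo", None], "constant", constant="UNKNOWN"))
```

Mode imputation deserves care on columns with many nulls: in the run above it assigns MALE to all 2739 missing sex values, which shifts the category balance far more than a constant such as UNKNOWN would.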
📊 Categorical Shift Analysis
Column: sex
Normalized Values
Value Count
MALE 4107
FEMALE 1310
UNKNOWN 123
Value Audit (Before vs. After)
Value Original Count Imputed Count Change
NaN 2739 0 -2739
MALE 1368 4107 2739
FEMALE 1310 1310 0
UNKNOWN 123 123 0
Column: tag_id
Normalized Values
Value Count
UNKNOWN 2241
GEN-0271 5
ADE-0119 4
GEN-0143 4
ADE-0176 4
GEN-0751 4
GEN-0673 4
GEN-0433 4
GEN-0902 4
GEN-0106 4
Value Audit (Before vs. After)
Value Original Count Imputed Count Change
NaN 2241 0 -2241
GEN-0271 5 5 0
ADE-0119 4 4 0
ADE-0176 4 4 0
ADE-0203 4 4 0
CHN-0905 4 4 0
GEN-0054 4 4 0
GEN-0106 4 4 0
GEN-0143 4 4 0
GEN-0433 4 4 0
Column: species
Normalized Values
Value Count
Gentoo 1814
Adelie 1784
Chinstrap 1776
UNKNOWN 166
Value Audit (Before vs. After)
Value Original Count Imputed Count Change
Gentoo 1814 1814 0
Adelie 1784 1784 0
Chinstrap 1776 1776 0
NaN 166 0 -166
UNKNOWN 0 166 166
Column: age_group
Normalized Values
Value Count
Adult 3821
Juvenile 1073
Chick 477
UNKNOWN 169
Value Audit (Before vs. After)
Value Original Count Imputed Count Change
Adult 3821 3821 0
Juvenile 1073 1073 0
Chick 477 477 0
NaN 121 0 -121
UNKNOWN 48 169 121
Column: colony_id
Normalized Values
Value Count
Torgersen North 1489
Dream South 1216
Biscoe West 1092
Cormorant East 767
Shortcut Point 511
UNKNOWN 465
Value Audit (Before vs. After)
Value Original Count Imputed Count Change
Torgersen North 1489 1489 0
Dream South 1216 1216 0
Biscoe West 1092 1092 0
Cormorant East 767 767 0
Shortcut Point 511 511 0
NaN 405 0 -405
UNKNOWN 60 465 405
Column: island
Normalized Values
Value Count
Torgersen 1404
Dream 1184
Biscoe 1084
Cormorant 715
UNKNOWN 643
Shortcut 510
Value Audit (Before vs. After)
Value Original Count Imputed Count Change
Torgersen 1404 1404 0
Dream 1184 1184 0
Biscoe 1084 1084 0
Cormorant 715 715 0
NaN 584 0 -584
Shortcut 510 510 0
UNKNOWN 59 643 584
Column: study_name
Normalized Values
Value Count
PAPRI2020 1122
PAPRI2021 1024
PAPRI2022 916
PAPRI2023 823
PAPRI2024 803
UNKNOWN 600
PAPRI2019 252
Value Audit (Before vs. After)
Value Original Count Imputed Count Change
PAPRI2020 1122 1122 0
PAPRI2021 1024 1024 0
PAPRI2022 916 916 0
PAPRI2023 823 823 0
PAPRI2024 803 803 0
NaN 563 0 -563
PAPRI2019 252 252 0
UNKNOWN 37 600 563
Column: clutch_completion
Normalized Values
Value Count
yes 4313
no 764
UNKNOWN 463
Value Audit (Before vs. After)
Value Original Count Imputed Count Change
yes 4313 4313 0
no 764 764 0
NaN 463 0 -463
UNKNOWN 0 463 463
Column: health_status
Normalized Values
Value Count
Healthy 2194
Underweight 1411
Overweight 733
UNKNOWN 583
Critical 323
Sick 296
Value Audit (Before vs. After)
Value Original Count Imputed Count Change
Healthy 2194 2194 0
Underweight 1411 1411 0
Overweight 733 733 0
NaN 553 0 -553
Critical 323 323 0
Sick 296 296 0
UNKNOWN 30 583 553
⚠️ Remaining Nulls Found

The following columns still contain null values after imputation:

Column Remaining Nulls
bill_length_mm_iqr_outlier 429
body_mass_g_zscore_outlier 406
[Interactive widget not rendered: Imputation Visualizations]
✅ PIPELINE CERTIFIED
📈 Pipeline Summary

📊 Pipeline Status

Metric Value
Final Pipeline Status ✅ PIPELINE CERTIFIED
Certification Rules Passed True
Null Value Audit Passed True
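
The status table suggests that certification combines two gates: the validation rules and a null audit. The sketch below shows one way such a gate could be composed. It is an illustration only; the function names and the failure label are hypothetical, and the toolkit's real certification logic is not shown in this notebook.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[object]]


def null_audit_passed(rows: List[Row]) -> bool:
    """True when no cell in any row is missing (None)."""
    return all(value is not None for row in rows for value in row.values())


def certification_status(rules_passed: bool, rows: List[Row]) -> str:
    """Certify only when both the rule checks and the null audit pass."""
    if rules_passed and null_audit_passed(rows):
        return "PIPELINE CERTIFIED"
    return "CERTIFICATION FAILED"  # hypothetical label for the failing case


# Toy rows (not the real data).
clean_rows = [{"species": "Gentoo", "sex": "MALE"}, {"species": "Adelie", "sex": "FEMALE"}]
dirty_rows = [{"species": "Gentoo", "sex": None}]

print(certification_status(True, clean_rows))
print(certification_status(True, dirty_rows))
```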

🛠️ Final Edits Log

Action Details
drop_columns Removed: ['body_mass_g_zscore_outlier', 'bill_length_mm_iqr_outlier']
🔬 Final Data Profile

🧬 Data Lifecycle

Metric Value
Initial Rows 5541
Final Rows 5540
Initial Columns 15
Final Columns 15
Audit Remarks Key:
  • ✅ OK: Passed all configured quality checks.
  • ⚠️ High Skew: Skewness exceeds threshold.
  • ⚠️ Unexpected Type: Data type mismatch.

📚 Data Dictionary / Schema

Column Dtype Unique Values Audit Remarks Missing Count Missing %
tag_id object 2679 ✅ OK 0 0.0
species object 4 ✅ OK 0 0.0
bill_length_mm float64 1985 ✅ OK 0 0.0
bill_depth_mm float64 862 ✅ OK 0 0.0
flipper_length_mm float64 1466 ✅ OK 0 0.0
body_mass_g float64 3324 ✅ OK 0 0.0
age_group object 4 ✅ OK 0 0.0
sex object 3 ✅ OK 0 0.0
colony_id object 6 ✅ OK 0 0.0
island object 6 ✅ OK 0 0.0
capture_date datetime64[ns] 1746 ✅ OK 0 0.0
health_status object 6 ✅ OK 0 0.0
study_name object 7 ✅ OK 0 0.0
clutch_completion object 3 ✅ OK 0 0.0
date_egg datetime64[ns] 1657 ✅ OK 0 0.0
🔢 Descriptive Statistics
Metric count mean std min 25% 50% 75% max skew kurtosis
bill_length_mm 5540.0 45.165933 5.442842 30.63 40.9775 45.235 49.07 62.635000 -0.151610 -0.406055
bill_depth_mm 5540.0 17.319850 2.146182 12.37 15.6575 17.490 18.92 23.010000 -0.135465 -0.724695
flipper_length_mm 5540.0 201.996085 13.768625 162.79 191.8000 199.310 213.00 252.400000 0.393223 -0.395981
body_mass_g 5540.0 3841.685482 844.964960 2376.56 3263.7500 3806.000 4263.75 6965.072934 0.551892 0.015965
📄 Data Preview (.head)
tag_id species bill_length_mm bill_depth_mm flipper_length_mm body_mass_g age_group sex colony_id island capture_date health_status study_name clutch_completion date_egg
UNKNOWN Gentoo 48.99 14.11 220.90 5890.0 Adult MALE Torgersen North Torgersen 2023-11-17 UNKNOWN PAPRI2023 yes 2023-11-09
ADE-0001 Adelie 39.55 19.92 186.20 2500.0 Chick MALE Biscoe West Biscoe 1900-01-01 Underweight PAPRI2022 yes 2022-07-20
UNKNOWN Gentoo 48.23 13.00 199.31 4536.0 Adult FEMALE Biscoe West UNKNOWN 2024-04-14 Healthy UNKNOWN yes 2024-04-12
GEN-0001 Gentoo 46.22 13.91 212.80 2500.0 Juvenile FEMALE Dream South Dream 1900-01-01 Underweight PAPRI2020 yes 2020-04-14
UNKNOWN Chinstrap 49.02 16.22 192.20 3735.0 Adult MALE Biscoe West Biscoe 2022-10-03 Healthy PAPRI2022 yes 2022-10-02

🛠️ Next Steps

This notebook demonstrates the full analyst pipeline using notebook mode. The following enhancements are planned or encouraged for production workflows:

✅ CLI and Automation

  • Use the CLI version for scheduled or automated runs:

    python run_toolkit_pipeline.py --config config/run_toolkit_config.yaml
    
  • Integrate into GitHub Actions or cron jobs for continuous data QA

  • Swap YAML configs to support different datasets or audit targets
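
For scheduled runs, the CLI command above can be wrapped in a small Python launcher that a cron job or CI step calls. The sketch below builds the same command shown in this notebook and runs it with `subprocess`. The helper names are hypothetical, and it assumes the script exits with a non-zero code on failure, which this notebook does not confirm.

```python
import subprocess
import sys
from typing import List


def build_command(config_path: str, script: str = "run_toolkit_pipeline.py") -> List[str]:
    """Build the CLI invocation shown above, using the current Python interpreter."""
    return [sys.executable, script, "--config", config_path]


def run_headless(config_path: str) -> int:
    """Run the pipeline as a subprocess and return its exit code."""
    completed = subprocess.run(build_command(config_path), check=False)
    return completed.returncode


# Swapping the YAML path is enough to target a different dataset or audit.
command = build_command("config/run_toolkit_config.yaml")
print(" ".join(command[1:]))
```

A scheduler would call `run_headless(...)` and treat a non-zero return value as a failed QA run, under the exit-code assumption stated above.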

🚀 Planned Iterations

  • Add a dynamic changelog to follow data end to end.
  • Extend to a namespace and add additional modules:
    • ML Module Evaluation Suite
    • Visual EDA Suite
  • Optional integration with cloud storage (GCS / S3) for inputs and outputs
  • Create a streamlined CLI onboarding script (e.g., init_pipeline.py) to scaffold configs

📦 Packaging Notes

  • The toolkit is TOML-packaged and installable as a local Python module
  • Follows modular design to support interactive, notebook, and script-based workflows