💡 Recommended Environment:
Run this notebook in the model_eval_suite Conda environment for best results.
See setup instructions in the Usage Guide.

⚠️ If you're running this outside Conda, you can install dependencies manually: Uncomment the line below to install from the root requirements file.

# !pip install -r ../../requirements.txt

🧪 Model Evaluation Suite Demo Notebook¶

This notebook demonstrates how to use the Model Evaluation Suite to:

  • Prepare and validate input data.
  • Run a full modeling pipeline using YAML configuration files.
  • Log models and artifacts with MLflow.
  • Evaluate a production candidate against a holdout dataset.
  • Optionally compare against a baseline model for performance drift or uplift.
📦 Project Structure

This notebook expects the following directories and files to exist:

  • config/: contains user-defined YAML configuration files.
  • data/holdout_data/: contains the holdout CSV used in validation.
  • mlruns/: MLflow tracking output.
⚙️ Workflow Overview
  1. Prep Data (Optional) – If you're starting from raw CSVs, run data prep to split them into train/test/holdout sets and cache the results.
  2. Run Experiment – Train model(s) as defined in the YAML config using run_experiment.
  3. Validate Champion – Evaluate the registered MLflow model against the holdout set using validate_champion.

YAML-driven configuration allows for full modularity, reproducibility, and MLflow registry integration.
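
For orientation, the three stages map onto the three package entrypoints demonstrated later in this notebook. A minimal end-to-end sketch (using the demo's own config paths; substitute your own) looks like this:

from model_eval_suite import prep_data, run_experiment, validate_champion

# 1. Optional: split raw data into train/test/holdout sets
prep_data(config_path="config/data_prep.yaml")

# 2. Train, evaluate, and register a model as defined in its YAML config
run_experiment(user_config_path="config/classifier/xgboost.yaml")

# 3. Validate the registered champion against the holdout set
validate_champion(config_path="config/xgb_validation.yaml")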

📜 Configuration Setup¶

This notebook is driven by modular YAML configuration files, which serve as the central control system for the evaluation suite.


These YAMLs are edited upfront to define the behavior of each stage in the pipeline. See config_resources/ in the repository for further guidance.

The YAML configuration governs:

  • Filepaths for all inputs and outputs (train/test/holdout, plots, reports, logs)
  • Model architecture, hyperparameters, and estimator type
  • Preprocessing and feature engineering behavior
  • Optional diagnostics modules (e.g., VIF, SHAP, permutation importance)
  • MLflow tracking settings (URI, run tags, experiment names)
  • Model type and parameters
  • Plotting controls and dashboard rendering options
  • Evaluation behavior: segmentation columns, scoring metrics, baseline model comparison

Prebuilt templates are provided. You can download them from config_resources/config.zip and use them as-is or customize them to match your workflow.
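
If you prefer to inspect or tweak a template programmatically rather than by hand, a minimal sketch (assuming PyYAML is installed and a template such as config/classifier/logreg.yaml has been unpacked; the output filename below is illustrative) is:

import yaml

# Load a prebuilt template, inspect its top-level blocks, and save a customized copy
with open("config/classifier/logreg.yaml") as f:
    cfg = yaml.safe_load(f)

print(list(cfg.keys()))  # top-level blocks; exact keys depend on the template

with open("config/classifier/logreg_custom.yaml", "w") as f:
    yaml.safe_dump(cfg, f, sort_keys=False)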

🔧 Custom Feature Engineering¶

This suite supports plug-and-play custom transformers via the feature_engineering block in your YAML.


Your transformer should follow scikit-learn's fit/transform API and be referenced like this:

feature_engineering:
  run: true
  module: "my_project.custom_features"
  class_name: "MyFeatureTransformer"

See docs/feature_engineering.md for a full example.
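
For a rough idea of the contract, a minimal transformer might look like the sketch below (the class name matches the YAML snippet above; the columns and derived feature are purely illustrative):

from sklearn.base import BaseEstimator, TransformerMixin

class MyFeatureTransformer(BaseEstimator, TransformerMixin):
    """Illustrative transformer that adds one engineered ratio feature."""

    def fit(self, X, y=None):
        # Nothing to learn in this example; return self per the sklearn contract
        return self

    def transform(self, X):
        X = X.copy()
        # Placeholder columns -- replace with features from your own dataset
        X["feature_ratio"] = X["feature_a"] / (X["feature_b"] + 1)
        return X

Because it respects the fit/transform interface, a transformer like this can be slotted into the suite's pipeline without changes to the runner code.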

📚 Dashboard Guidance¶

📉 Pre-Model Diagnostics Dashboard¶

This optional module runs before any model training or validation occurs. It provides key insights into the integrity and statistical structure of your input data. It is driven by the pre_model_diagnostics block in your YAML and is best used in notebook workflows.

  • Overview
    Summary of the input dataset, schema, and basic shape metadata.

  • Missingness
    Tabulates and visualizes missing values by column, with percent missing and optional flag encoding hints.

  • Collinearity
    Includes:

    • Pearson correlation heatmap
    • Variance Inflation Factor (VIF) plot to detect multicollinearity risks
  • Distribution Quality
    Visualizes skewness and potential distribution anomalies:

    • Target column distribution
    • Numerical feature histograms
    • Outlier detection using IQR boxplots
  • Evaluation Plots (via PlotViewer)
    This tab includes all advanced diagnostics plots:

    • VIF plot
    • Pearson heatmap
    • Outlier boxplots
    • Feature-wise skew distributions

These diagnostics are critical for spotting leakage, encoding flaws, and redundancy before any modeling occurs.
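
As a point of reference, a standalone sketch of the kind of VIF check the Collinearity tab performs (assuming pandas and statsmodels are installed; this is not the suite's own code) is:

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Numeric columns from the demo's train split; drop the target column before interpreting
X = pd.read_csv("data/dev_data/train_data.csv").select_dtypes("number").dropna()
X = sm.add_constant(X)  # add an intercept so VIFs are computed against centered features

vif = pd.DataFrame({
    "feature": X.columns,
    "VIF": [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
})
print(vif.sort_values("VIF", ascending=False))  # values above ~5-10 flag multicollinearity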

📊 Model Evaluation Dashboard¶

This dashboard provides an interactive summary of the model trained in the experimental run. It visualizes performance on the test set and includes explainability tools to support model diagnostics and stakeholder communication.

Summary
  • High-level performance metrics (e.g., R², MAE for regression or Accuracy, F1 for classification)
  • If cross-validation is enabled, a boxplot of fold-level scores is included
  • If a baseline model is configured:
    • Delta scores are annotated beside the champion’s metrics
    • The same metrics are displayed for the baseline model
    • Any drop or improvement relative to the current champion is highlighted
Importance
  • Feature importance scores from:
    • SHAP bar charts (if SHAP enabled)
    • Coefficients (for linear models)
    • Permutation importance (if enabled)
  • Useful for debugging and stakeholder reporting
Explainability
  • SHAP Impact Summary Plot for understanding global feature effects
  • Omitted if SHAP is disabled in your config
Plotviewers

Model Performance Plots

  • Interactive evaluation visuals via the plot viewer widget:
    • ROC & PR curves (classification)
    • Residuals, prediction vs. truth (regression)
    • Confusion matrix, threshold plots, calibration, etc.

Distribution Plots

  • Always included
  • Shows feature distributions in the holdout set
  • Supports quick detection of skew, class imbalance, or feature leakage
Metadata
  • Full configuration summary:
    • YAML config snapshot
    • Model and version from MLflow
    • Holdout dataset used
    • Run ID and export paths
Alerts
  • Automated audit system that surfaces:
    • Warning thresholds (e.g., F1 below expected)
    • Cross-validation variance anomalies
    • Drift against baseline scores

📌 Core Imports¶

You can access the main runners directly from the package thanks to a clean interface exposed via __init__.py. These entrypoints allow you to run each stage of the pipeline from a single import.

In [1]:
from model_eval_suite import run_experiment, validate_champion, prep_data
In [2]:
import os
# Change the working directory to the project root
os.chdir('..')

📤 Data Preparation¶

If you're starting from raw CSVs, you can use the suite's built-in preprocessing tool, data_prep.py, to split the data into training, testing, and holdout sets.

Skip this step if you've already created your train, test, and holdout CSVs.

In [3]:
prep_data(config_path="config/data_prep.yaml")
Loading raw data from: data/input_data/salifort_50k.csv
Performing initial holdout split...
Performing train/test split on development data...
✅ Train data saved to: data/dev_data/train_data.csv (30000 rows)
✅ Test data saved to: data/dev_data/test_data.csv (10000 rows)
✅ Holdout data saved to: data/holdout_data/holdout_data.csv (10000 rows)

⚙️ Model Experiment Runs (Demo)¶

This demo walks through multiple model runs using the salifort_50k dataset. Although the dataset is best suited to classification, it is used for both classification and regression pipelines to demonstrate flexibility and YAML-driven control.

We run the following models using the evaluation suite:

🔍 Classifier Models¶

  • Gaussian Naive Bayes
    Config: config/classifier/guas_nb.yaml

  • Logistic Regression
    Config: config/classifier/logreg.yaml

  • XGBoost Classifier
    Config: config/classifier/xgboost.yaml

📈 Regressor Models¶

  • Linear Regression
    Config: config/regressor/linreg.yaml

  • XGBoost Regressor
    Config: config/regressor/xgboost_reg.yaml

Each model triggers:

  • An optional Pre-Model Diagnostics Dashboard (if enabled in YAML)
  • A complete Evaluation Dashboard with explainability, distributions, and exportable artifacts

At the end of the demo, we use the champion validation system to validate and crown the two XGBoost models — one for classification, one for regression.

All behavior is controlled by the YAML configs. See the config/ directory or config_resources/config.zip for template downloads.

In [4]:
# ========== Naive Bayes ==========
run_experiment(user_config_path="config/classifier/guas_nb.yaml")
Registered model 'gnb_demo_01' already exists. Creating a new version of this model...
Created version '4' of model 'gnb_demo_01'.
⚠️ SHAP explainer creation failed: The passed model is not callable and cannot be analyzed directly with the given masker! Model: GaussianNB()
✅ Run complete: `gnb_demo_01`
📁 Artifacts saved to:
   - Plots:   exports/plots/gnb_demo_01
   - Reports: exports/reports/gnb_demo_01
📦 MLflow model: `gnb_demo_01`

--- Rendering Dashboards ---

⚠️ SHAP Error Handling¶


Some models — such as SVC, SVR, or GaussianNB — do not expose traditional feature importance attributes or are incompatible with SHAP explainability tools.

This suite handles such situations gracefully:

  • The SHAP tab will be skipped silently if no compatible features are found.
  • A warning will be logged (but not treated as a failure).
  • All other evaluation plots and metrics will still render normally.

This ensures that the workflow remains robust even for models with limited explainability tooling.
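
Conceptually, the guard looks like the sketch below (not the suite's actual code): explainer creation is attempted, and any failure is downgraded to a warning so the rest of the dashboard still renders.

import shap

def try_build_explainer(model, X_background):
    """Return a SHAP explainer when the model is supported, else None."""
    try:
        return shap.Explainer(model, X_background)
    except Exception as exc:
        # Downgrade to a warning; the SHAP tab is simply skipped downstream
        print(f"⚠️ SHAP explainer creation failed: {exc}")
        return None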

In [5]:
# ========== Logistic Regression ==========
run_experiment(user_config_path="config/classifier/logreg.yaml")
Registered model 'logreg_baseline_01' already exists. Creating a new version of this model...
Created version '4' of model 'logreg_baseline_01'.
✅ Run complete: `logreg_baseline_01`
📁 Artifacts saved to:
   - Plots:   exports/plots/logreg_baseline_01
   - Reports: exports/reports/logreg_baseline_01
📦 MLflow model: `logreg_baseline_01`

--- Rendering Dashboards ---
In [6]:
# ========== XGBoost Classifier ==========

run_experiment(user_config_path="config/classifier/xgboost.yaml")
Fitting 5 folds for each of 108 candidates, totalling 540 fits
Registered model 'xgb_demo_01' already exists. Creating a new version of this model...
Created version '4' of model 'xgb_demo_01'.
✅ Run complete: `xgb_demo_01`
📁 Artifacts saved to:
   - Plots:   exports/plots/xgb_demo_01
   - Reports: exports/reports/xgb_demo_01
📦 MLflow model: `xgb_demo_01`

--- Rendering Dashboards ---

🔁 Cross-Validation Insight¶


If hyperparameter tuning via cross-validation is enabled in the config (hyperparameter_tuning.run: true), the dashboard will include an additional boxplot in the Summary tab.

This plot visualizes the distribution of CV scores across folds for the best-performing parameter set, offering a quick diagnostic of stability and performance variance.
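
The same view can be reproduced outside the dashboard from any fitted scikit-learn search object; the self-contained sketch below (toy data, not the demo's pipeline) pulls the per-fold scores for the best candidate out of cv_results_:

import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Toy search standing in for the tuned pipeline configured in the YAML
X, y = make_classification(n_samples=300, random_state=0)
search = GridSearchCV(LogisticRegression(max_iter=500), {"C": [0.1, 1.0, 10.0]}, cv=5)
search.fit(X, y)

# Per-fold test scores for the best-performing parameter set
best = search.best_index_
fold_scores = [search.cv_results_[f"split{i}_test_score"][best] for i in range(search.n_splits_)]

plt.boxplot(fold_scores)
plt.ylabel("CV fold score")
plt.title("Fold scores for best parameter set")
plt.show()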

In [7]:
# ========== Linear Regression ==========
run_experiment(user_config_path="config/regressor/linreg.yaml")
Registered model 'linreg_demo_01' already exists. Creating a new version of this model...
Created version '4' of model 'linreg_demo_01'.
✅ Run complete: `linreg_demo_01`
📁 Artifacts saved to:
   - Plots:   exports/plots/linreg_demo_01
   - Reports: exports/reports/linreg_demo_01
📦 MLflow model: `linreg_demo_01`

--- Rendering Dashboards ---

🚨 Automated Alert Auditing¶


The validation dashboard includes an Alerts tab that surfaces automated audit checks on your model's performance.

These alerts are designed to flag potential concerns such as:

  • Very low precision or recall
  • High class imbalance
  • Overfitting indicators (e.g., large delta between train/test scores)
  • Underwhelming performance against a baseline (if provided)

This system provides a lightweight, interpretable review of model quality without requiring custom code or manual thresholding.
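
A stripped-down illustration of this kind of audit is sketched below; the metric names and thresholds are arbitrary examples, not the suite's defaults.

def audit_metrics(metrics, baseline=None):
    """Return human-readable alerts for a dict of scores (illustrative thresholds only)."""
    alerts = []
    if metrics.get("f1", 1.0) < 0.6:
        alerts.append(f"Low F1 score: {metrics['f1']:.3f}")
    if metrics.get("train_score", 0.0) - metrics.get("test_score", 0.0) > 0.10:
        alerts.append("Possible overfitting: train/test gap exceeds 0.10")
    if baseline:
        for name, base_value in baseline.items():
            if name in metrics and metrics[name] < base_value:
                alerts.append(f"{name} underperforms the baseline ({metrics[name]:.3f} < {base_value:.3f})")
    return alerts

print(audit_metrics({"f1": 0.55, "train_score": 0.95, "test_score": 0.80}, baseline={"f1": 0.60}))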

In [8]:
# ========== XGBoost Regressor ==========
run_experiment(user_config_path="config/regressor/xgboost_reg.yaml")
Fitting 5 folds for each of 108 candidates, totalling 540 fits
Registered model 'xgbreg_demo_01' already exists. Creating a new version of this model...
Created version '4' of model 'xgbreg_demo_01'.
✅ Run complete: `xgbreg_demo_01`
📁 Artifacts saved to:
   - Plots:   exports/plots/xgbreg_demo_01
   - Reports: exports/reports/xgbreg_demo_01
📦 MLflow model: `xgbreg_demo_01`

--- Rendering Dashboards ---

🏆 Champion Model Validation¶

This section evaluates a registered MLflow model (your champion) against a holdout dataset using a dedicated validation YAML configuration.

Key Features¶

  • Uses its own standalone YAML file (separate from training experiments)
  • Accepts an optional baseline model for drift detection or performance benchmarking
  • Automatically generates:
    • Confidence interval plots (if applicable)
    • Baseline comparison deltas (if a baseline model is provided)
    • Alert audits for performance degradation or instability
  • Produces a complete interactive dashboard with:
    • Summary metrics and cross-validation visualizations
    • Explainability and feature importance plots
    • Distribution visualizations for target and predictions
    • Full configuration and environment metadata
  • Tags the evaluated model in the MLflow Registry using your specified production_tag

📍 This workflow is ideal for pre-deployment validation, regression testing, and model promotion decisions.

Validation Configurations Used in This Demo¶

  • config/xgb_validation.yaml – XGBoost classifier
  • config/xgb_reg_validation.yaml – XGBoost regressor
In [9]:
# ========== Validate Classifier Champion Model ==========
validate_champion(config_path="config/xgb_validation.yaml")
--- 🚀 Starting Champion Model Validation: xgb_demo_01_production_validation ---
Loading model from: models:/xgb_demo_01/1
Loading holdout data from: data/holdout_data/holdout_data.csv
Loading baseline model from: models:/logreg_baseline_01/1
Evaluating baseline model...
Detected task type: classification
Generating final assessment plots...
Exporting validation artifacts...
Tagging model version with status: 'Production-Candidate'
--- ✅ Validation Complete for xgb_demo_01 v1 ---
--- 📊 Rendering Validation Dashboard ---
In [10]:
# ========== Validate Regressor Champion Model ==========
validate_champion(config_path="config/xgb_reg_validation.yaml")
--- 🚀 Starting Champion Model Validation: xgbreg_demo_01_production_validation ---
Loading model from: models:/xgbreg_demo_01/1
Loading holdout data from: data/holdout_data/holdout_data.csv
Loading baseline model from: models:/linreg_demo_01/1
Evaluating baseline model...
Detected task type: regression
Generating final assessment plots...
⚠️ Prediction interval plot is only available for ensemble models like RandomForestRegressor.
Exporting validation artifacts...
Tagging model version with status: 'Production-Candidate'
--- ✅ Validation Complete for xgbreg_demo_01 v1 ---
--- 📊 Rendering Validation Dashboard ---

✅ Wrap-Up and Next Steps¶

You’ve now run multiple models through the full suite — from preprocessing and diagnostics to evaluation and champion validation.

This notebook demonstrates the flexibility of the system, including:

  • YAML-driven configuration at every stage
  • Reusable pipelines for both classification and regression tasks
  • Support for custom feature engineering and hyperparameter tuning
  • Interactive dashboards for diagnostics and final reporting
  • MLflow integration for model tracking and registry updates

Next Steps¶

  • Test additional models by duplicating a config YAML.
  • Customize features using your own transformers or fe_config modules.
  • Enable advanced diagnostics, SHAP, and permutation importance as needed.
  • Package and deploy validated models via the MLflow registry.

For more examples and config templates, explore the config_resources/ folder or the full README.md.

Questions or suggestions? Feel free to submit an issue or feature request in the repository.