Debugging & Troubleshooting

Learn systematic approaches to debug, diagnose, and optimize machine learning models and pipelines.

🎯 Learning Objectives

By the end of this phase, you will:

✅ Diagnose common ML model failures and performance issues
✅ Use debugging tools and techniques for ML workflows
✅ Profile and optimize model performance (speed, memory)
✅ Analyze and fix data-related problems
✅ Implement error handling and monitoring
✅ Debug deep learning models effectively
✅ Troubleshoot deployment and production issues

📚 What You’ll Learn

1. Model Debugging Fundamentals

The ML debugging workflow
Common failure modes and symptoms
Debugging checklist and best practices
Logging and instrumentation

2. Data Issues Diagnosis

Data quality problems (missing, duplicates, outliers)
Label errors and class imbalance
Distribution shift detection
Feature correlation issues

3. Performance Optimization

Profiling CPU and memory usage
Identifying bottlenecks
Optimization techniques (vectorization, caching)
GPU utilization monitoring

4. Model-Specific Debugging

Gradient vanishing/exploding
Overfitting vs underfitting diagnosis
Learning curve analysis
Activation and weight monitoring

5. Error Analysis Framework

Systematic error categorization
Confusion matrix deep dive
Per-class error analysis
Failure case collection

🗂️ Module Structure

Notebooks

02_debugging_workflow.ipynb
- ML debugging methodology
- Sanity checks and baseline models
- Common pitfalls checklist
- Debugging tools overview
- Duration: 60-90 minutes
03_data_issues.ipynb
- Data quality checks (null values, duplicates, outliers)
- Label noise detection
- Distribution shift analysis
- Feature validation
- Duration: 60-90 minutes
04_performance_profiling.ipynb
- CPU profiling with cProfile and line_profiler
- Memory profiling with memory_profiler
- Bottleneck identification
- Optimization strategies
- Duration: 90-120 minutes
05_model_debugging.ipynb
- Learning curves and convergence
- Gradient monitoring
- Weight initialization issues
- Overfitting/underfitting diagnosis
- Duration: 90-120 minutes
06_error_analysis.ipynb
- Systematic error categorization
- Per-class performance analysis
- Failure case analysis
- Improvement strategies
- Duration: 60-90 minutes

Supporting Materials

08_assignment.md - Comprehensive debugging project
10_challenges.md - 7 progressive debugging challenges
15_pre-quiz.md - Baseline knowledge assessment
14_post-quiz.md - Final knowledge verification

🛠️ Tools & Libraries

Profiling Tools


import cProfile          # CPU profiling
import line_profiler     # Line-by-line profiling
import memory_profiler   # Memory usage tracking
import py-spy            # Sampling profiler

Debugging Tools


import pdb              # Python debugger
import logging          # Structured logging
import warnings         # Warning management
import traceback        # Stack trace analysis

Visualization Tools


import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import tensorboard      # For deep learning

Analysis Libraries


import pandas as pd
import numpy as np
import scikit-learn
from scipy import stats

📋 Prerequisites

Before starting this phase, you should:

✅ Complete Phase 16 (Model Evaluation & Metrics)
✅ Understand model training basics
✅ Be familiar with Python debugging
✅ Know NumPy and pandas fundamentals

🎓 Learning Path

Week 1: Fundamentals & Data Issues

Day 1-2: Complete Notebook 1 (Debugging Workflow)
Day 3-4: Complete Notebook 2 (Data Issues)
Day 5: Practice Challenges 1-2

Week 2: Performance & Model Debugging

Day 1-2: Complete Notebook 3 (Performance Profiling)
Day 3-4: Complete Notebook 4 (Model Debugging)
Day 5: Practice Challenges 3-4

Week 3: Error Analysis & Integration

Day 1-2: Complete Notebook 5 (Error Analysis)
Day 3-4: Complete Assignment
Day 5: Practice Challenges 5-7, Final Review

🚀 Quick Start

Installation


# Install profiling tools
pip install line-profiler memory-profiler py-spy
 
# Install debugging utilities
pip install ipdb pytest-sugar
 
# Install visualization tools
pip install matplotlib seaborn plotly tensorboard
 
# Install ML libraries
pip install scikit-learn pandas numpy scipy

Jupyter Extensions


# Install useful Jupyter extensions
pip install jupyter_contrib_nbextensions
jupyter contrib nbextension install --user
 
# Enable profiling extension
pip install jupyter-resource-usage

🎯 Common Debugging Scenarios

Scenario 1: Model Not Learning

Symptoms: Loss not decreasing, random performance Checklist:

✅ Check learning rate (too high/low?)
✅ Verify data loading (labels shuffled?)
✅ Inspect gradients (vanishing/exploding?)
✅ Test with smaller dataset first

Scenario 2: Poor Validation Performance

Symptoms: Train accuracy high, validation accuracy low Checklist:

✅ Overfitting? Add regularization
✅ Data leakage? Check feature engineering
✅ Distribution shift? Analyze train/val split
✅ Insufficient data? Try augmentation

Scenario 3: Slow Training

Symptoms: Training takes forever Checklist:

✅ Profile code to find bottlenecks
✅ Check data loading (I/O bound?)
✅ Optimize preprocessing (use vectorization)
✅ Use GPU acceleration if available

Scenario 4: Memory Errors

Symptoms: Out of memory crashes Checklist:

✅ Profile memory usage
✅ Reduce batch size
✅ Use gradient accumulation
✅ Clear cache between iterations

📊 Debugging Workflow Diagram

🔍 Key Debugging Techniques

1. Sanity Checks


# Check data shapes
print(f"X_train: {X_train.shape}, y_train: {y_train.shape}")
 
# Verify label distribution
print(y_train.value_counts())
 
# Test on single batch
model.fit(X_train[:32], y_train[:32])

2. Baseline Models


# Always start with simple baseline
from sklearn.dummy import DummyClassifier
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_train, y_train)
print(f"Baseline accuracy: {baseline.score(X_test, y_test):.3f}")

3. Logging


import logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
 
logger.info(f"Training started with {len(X_train)} samples")
logger.warning(f"Found {null_count} missing values")

4. Visualization


# Plot learning curves
plt.plot(history['train_loss'], label='Train')
plt.plot(history['val_loss'], label='Validation')
plt.legend()
plt.show()
 
# Inspect predictions
pd.DataFrame({'Actual': y_test, 'Predicted': y_pred}).head(20)

⚠️ Common Pitfalls

❌ Data leakage (using test data in preprocessing)
❌ Label mismatch (features and labels not aligned)
❌ Scaling issues (forgot to normalize/standardize)
❌ Class imbalance (not handling imbalanced classes)

❌ Wrong loss function (MSE for classification)
❌ Learning rate issues (too high causes divergence)
❌ Random seed not set (can’t reproduce results)
❌ No validation set (overfitting to test set)

❌ Shape mismatches (batch dimension issues)
❌ Gradient accumulation (not zeroing gradients)
❌ Mode confusion (model in train vs eval mode)
❌ Memory leaks (holding references to tensors)

📈 Success Metrics

After completing this phase, you should be able to:

✅ Systematically debug ML models using structured approach
✅ Identify and fix data quality issues
✅ Profile and optimize code for 2-10x speedup
✅ Diagnose overfitting, underfitting, and convergence problems
✅ Perform thorough error analysis
✅ Use debugging tools (pdb, profilers, loggers)
✅ Handle common production issues

🎯 Assessment

Pre-Quiz: Baseline knowledge check (10 questions)
Notebooks: 5 interactive debugging exercises
Assignment: Debug and optimize a broken ML pipeline
Challenges: 7 real-world debugging scenarios
Post-Quiz: Comprehensive assessment (18 questions)

Passing Criteria: 70% on post-quiz, complete assignment

📚 Additional Resources

Documentation

Articles

“Debugging Machine Learning Models” - Papers with Code
“A Recipe for Training Neural Networks” - Andrej Karpathy
“Troubleshooting Deep Neural Networks” - Josh Tobin

Tools

TensorBoard: Visualization for deep learning
Weights & Biases: Experiment tracking
Neptune.ai: ML metadata store

What Comes Next

After this phase:

Phase 18: Low-Code AI Tools (Gradio, Streamlit)
Phase 19: AI Safety & Red Teaming
Advanced: MLOps and Model Monitoring

💡 Tips for Success

Be Systematic: Follow the debugging workflow, don’t guess randomly
Start Simple: Test with minimal data first, then scale up
Document Everything: Keep notes on what you tried and results
Use Version Control: Commit working code before experimenting
Ask for Help: Share reproducible examples when stuck

Happy Debugging! Remember: Every bug is an opportunity to learn! 🐛→🦋