Skip to Content
16 Model Evaluation08 Challenges

Challenges: Model Evaluation & Metrics

Complete these progressive challenges to master model evaluation techniques!


Challenge 1: Imbalanced Classification Metrics ⭐⭐

Difficulty: Beginner
Time: 30-45 minutes
Topics: Classification metrics, imbalanced data

Task

You’re building a fraud detection system where only 1% of transactions are fraudulent.

Dataset:

  • 10,000 transactions
  • 100 fraudulent (1%)
  • 9,900 legitimate (99%)

Your Tasks:

  1. Create a “dummy” classifier that always predicts “Not Fraud”

    • Calculate accuracy, precision, recall, F1
    • Explain why high accuracy is misleading
  2. Build a better classifier (any algorithm)

    • Use appropriate metrics (F1, ROC-AUC, PR-AUC)
    • Create confusion matrix visualization
    • Calculate precision@K for different K values
  3. Compare the two classifiers

    • Which metric best shows improvement?
    • What threshold would you recommend?

Success Criteria

  • Dummy classifier implemented
  • Demonstrate accuracy paradox
  • Better classifier with F1 > 0.50
  • ROC and PR curves created
  • Threshold analysis completed
  • Written justification (200+ words)

Learning Objectives

  • Understanding accuracy limitations
  • Choosing metrics for imbalanced data
  • Threshold optimization
  • Precision-recall trade-offs

Challenge 2: Regression Error Analysis ⭐⭐⭐

Difficulty: Intermediate
Time: 1-2 hours
Topics: Regression metrics, residual analysis

Task

Build a house price prediction model and perform comprehensive error analysis.

Dataset: Use California Housing or Boston Housing dataset

Your Tasks:

  1. Train 3 regression models:

    • Linear Regression
    • Random Forest
    • Gradient Boosting
  2. Calculate metrics:

    • MAE, RMSE, R², MAPE
    • Compare MAE/RMSE ratio (detect outliers)
    • Calculate by price range (low/mid/high)
  3. Residual analysis:

    • Plot residuals vs predicted
    • Check normality (Q-Q plot, Shapiro-Wilk test)
    • Identify heteroscedasticity
    • Find worst predictions
  4. Error breakdown:

    • Errors by neighborhood/location
    • Errors by price range
    • Identify systematic errors

Success Criteria

  • 3 models trained and compared
  • All metrics calculated
  • Residual plots created (4+ plots)
  • Outliers identified and analyzed
  • Systematic errors documented
  • Model improvement recommendations (300+ words)

Learning Objectives

  • Regression metric selection
  • Residual diagnostics
  • Outlier detection
  • Model debugging

Challenge 3: LLM Output Evaluation ⭐⭐⭐

Difficulty: Intermediate
Time: 2-3 hours
Topics: BLEU, ROUGE, BERTScore, semantic similarity

Task

Compare different LLM outputs for a summarization task.

Dataset: Create or use:

  • CNN/DailyMail summaries
  • XSum dataset
  • Or generate 20+ article-summary pairs

Your Tasks:

  1. Generate summaries from 3 different approaches:

    • Extractive (select key sentences)
    • Rule-based (heuristics)
    • LLM-based (GPT/Claude if available, or use pre-generated)
  2. Calculate metrics:

    • BLEU (1-gram through 4-gram)
    • ROUGE (ROUGE-1, ROUGE-2, ROUGE-L)
    • BERTScore (if possible)
  3. Analysis:

    • Which metric correlates best with quality?
    • Find examples where BLEU is misleading
    • Compare lexical (BLEU/ROUGE) vs semantic (BERTScore)
  4. Human evaluation:

    • Create rubric (fluency, coherence, relevance)
    • Evaluate 10 summaries manually
    • Compare automated vs human scores

Success Criteria

  • 3 summarization approaches implemented
  • BLEU and ROUGE scores calculated
  • BERTScore calculated (or alternative semantic metric)
  • Human evaluation completed (10+ samples)
  • Correlation analysis between metrics
  • Findings report (400+ words)

Learning Objectives

  • LLM evaluation techniques
  • Metric limitations
  • Semantic vs lexical matching
  • Human evaluation design

Challenge 4: Bias Detection & Measurement ⭐⭐⭐⭐

Difficulty: Advanced
Time: 3-4 hours
Topics: Fairness metrics, bias detection, group analysis

Task

Audit a hiring/lending model for bias across protected groups.

Dataset: Use:

  • UCI Adult Income dataset
  • German Credit dataset
  • COMPAS recidivism data (if available)
  • Or synthetic dataset with known bias

Your Tasks:

  1. Data analysis:

    • Document class distribution by protected group
    • Statistical tests for independence
    • Feature correlation with protected attributes
  2. Train biased model:

    • Any classifier
    • Evaluate overall performance
    • Calculate group-wise metrics
  3. Fairness metrics:

    • Demographic parity difference/ratio
    • Equalized odds difference
    • Equal opportunity difference
    • Check 80% rule
  4. Disparate impact analysis:

    • Confusion matrices by group
    • FPR and FNR by group
    • Precision and recall by group
    • Visualize disparities
  5. Bias mitigation:

    • Implement 2 mitigation techniques
    • Compare fairness before/after
    • Document accuracy-fairness trade-off

Success Criteria

  • Comprehensive bias audit completed
  • 5+ fairness metrics calculated
  • Group-wise performance analyzed
  • Statistical significance tested
  • 2 mitigation techniques applied
  • Trade-off analysis documented
  • Report with recommendations (600+ words)

Learning Objectives

  • Fairness metric calculation
  • Bias detection in practice
  • Mitigation techniques
  • Accuracy-fairness trade-offs
  • Ethical AI considerations

Challenge 5: Statistical Model Comparison ⭐⭐⭐⭐

Difficulty: Advanced
Time: 3-4 hours
Topics: Cross-validation, statistical tests, significance testing

Task

Rigorously compare 5+ models with statistical validation.

Dataset: Any classification or regression dataset (1000+ samples)

Your Tasks:

  1. Model training:

    • Train 5 different model types
    • Use stratified 10-fold cross-validation
    • Track all metrics across folds
  2. Statistical testing:

    • Paired t-tests (all pairwise comparisons)
    • McNemar’s test (classification)
    • Create significance matrix
    • Bonferroni correction for multiple comparisons
  3. Confidence intervals:

    • Calculate 95% CI for each model
    • Bootstrap confidence intervals
    • Visualize with error bars
  4. Power analysis:

    • Calculate statistical power
    • Determine minimum sample size
    • Sensitivity analysis
  5. Learning curves:

    • Plot for all models
    • Identify overfitting/underfitting
    • Recommend training data size

Success Criteria

  • 5+ models compared
  • 10-fold cross-validation used
  • Statistical tests performed (10+ comparisons)
  • Significance matrix created
  • Confidence intervals calculated
  • Learning curves generated
  • Power analysis completed
  • Detailed methodology report (500+ words)

Learning Objectives

  • Rigorous model comparison
  • Statistical hypothesis testing
  • Multiple testing corrections
  • Power and sample size analysis
  • Scientific method in ML

Challenge 6: A/B Testing Simulation ⭐⭐⭐⭐⭐

Difficulty: Expert
Time: 4-6 hours
Topics: A/B testing, production evaluation, sequential testing

Task

Design and simulate a complete A/B test for model deployment.

Scenario:

  • Current model (A) in production
  • New model (B) to test
  • Simulate 10,000 user interactions

Your Tasks:

  1. Experimental design:

    • Define primary and secondary metrics
    • Calculate required sample size
    • Design randomization scheme
    • Set up guardrail metrics
  2. Simulation:

    • Generate synthetic user interactions
    • Randomly assign to A or B (50/50)
    • Track metrics over time
    • Simulate various scenarios (B wins, loses, tie)
  3. Sequential analysis:

    • Implement sequential probability ratio test
    • Early stopping rules
    • Monitor p-values over time
    • Handle peeking problem
  4. Results analysis:

    • Statistical significance test
    • Confidence intervals for lift
    • Heterogeneous treatment effects (if applicable)
    • Cost-benefit analysis
  5. Monitoring dashboard:

    • Create visualizations for stakeholders
    • Real-time metric tracking
    • Decision framework
    • Rollout plan

Success Criteria

  • Sample size calculation correct
  • A/B test simulation implemented
  • Sequential testing applied
  • 3+ scenarios tested (win/lose/tie)
  • Statistical analysis complete
  • Dashboard mockup created
  • Rollout plan documented
  • Comprehensive report (800+ words)

Learning Objectives

  • A/B test design
  • Sequential hypothesis testing
  • Production ML evaluation
  • Stakeholder communication
  • Decision-making under uncertainty

Challenge 7: Multi-Objective Model Selection ⭐⭐⭐⭐⭐

Difficulty: Expert
Time: 4-6 hours
Topics: Pareto optimality, trade-off analysis, decision making

Task

Select the best model when objectives conflict (accuracy vs fairness vs speed).

Dataset: Any real-world dataset with protected attributes

Your Tasks:

  1. Train diverse model zoo (8+ models):

    • Various complexity levels
    • Different algorithms
    • Measure: accuracy, fairness, speed, memory, interpretability
  2. Pareto frontier:

    • Identify Pareto-optimal models
    • Visualize in 2D/3D
    • Eliminate dominated models
  3. Multi-criteria decision analysis:

    • Weighted sum approach
    • TOPSIS method
    • Analytic Hierarchy Process (AHP)
  4. Sensitivity analysis:

    • Test different weight configurations
    • Identify robust choices
    • Scenario planning (accuracy-focused, fairness-focused, balanced)
  5. Stakeholder analysis:

    • Define 3 stakeholder profiles
    • Recommend model for each
    • Document trade-offs

Success Criteria

  • 8+ models trained
  • 5+ objectives measured
  • Pareto frontier identified
  • 3 MCDA methods applied
  • Sensitivity analysis complete
  • Stakeholder recommendations made
  • Interactive visualization created
  • Decision framework documented (1000+ words)

Learning Objectives

  • Multi-objective optimization
  • Pareto optimality
  • Decision analysis techniques
  • Stakeholder management
  • Real-world ML deployment

🏆 Challenge Completion Tracker

ChallengeStatusDateNotes
1. Imbalanced Classification
2. Regression Error Analysis
3. LLM Output Evaluation
4. Bias Detection
5. Statistical Comparison
6. A/B Testing
7. Multi-Objective Selection

💡 General Tips

  1. Start simple: Begin with basic versions, then enhance
  2. Document everything: Explain your choices and interpret results
  3. Visualize: Create clear, professional plots
  4. Test edge cases: Don’t just test the happy path
  5. Seek feedback: Share results with peers or mentors

Complete all 7 challenges to become a model evaluation expert! 🎯

Last updated on