Challenges: Model Evaluation & Metrics
Complete these progressive challenges to master model evaluation techniques!
Challenge 1: Imbalanced Classification Metrics ⭐⭐
Difficulty: Beginner
Time: 30-45 minutes
Topics: Classification metrics, imbalanced data
Task
You’re building a fraud detection system where only 1% of transactions are fraudulent.
Dataset:
- 10,000 transactions
- 100 fraudulent (1%)
- 9,900 legitimate (99%)
Your Tasks:
- Create a “dummy” classifier that always predicts “Not Fraud”
  - Calculate accuracy, precision, recall, F1 (precision is undefined when no positives are predicted; report it as 0 and say so)
  - Explain why high accuracy is misleading
- Build a better classifier (any algorithm)
  - Use appropriate metrics (F1, ROC-AUC, PR-AUC)
  - Create a confusion matrix visualization
  - Calculate precision@K for different K values
- Compare the two classifiers
  - Which metric best shows the improvement?
  - What threshold would you recommend?
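The accuracy paradox in the first task can be checked by hand before any library is involved. The sketch below is a minimal pure-Python illustration, not a reference implementation: the helper names (`confusion_counts`, `classification_metrics`, `precision_at_k`) are made up for this example, and reporting an undefined precision as 0.0 is an assumed convention.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 = fraud."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def classification_metrics(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    # Assumed convention: an undefined ratio (0/0) is reported as 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def precision_at_k(y_true, scores, k):
    """Fraction of true positives among the k highest-scored items."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    return sum(label for _, label in ranked[:k]) / k


# The dataset from the task: 100 fraudulent, 9,900 legitimate transactions.
y_true = [1] * 100 + [0] * 9900
dummy_pred = [0] * len(y_true)  # always predicts "Not Fraud"
dummy_metrics = classification_metrics(y_true, dummy_pred)
print(dummy_metrics)
```

On this dataset the dummy classifier reaches 99% accuracy while recall and F1 are both zero, which is the gap the written justification should explain.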
Success Criteria
- Dummy classifier implemented
- Demonstrate accuracy paradox
- Better classifier with F1 > 0.50
- ROC and PR curves created
- Threshold analysis completed
- Written justification (200+ words)
Learning Objectives
- Understanding accuracy limitations
- Choosing metrics for imbalanced data
- Threshold optimization
- Precision-recall trade-offs
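Threshold optimization and the precision-recall trade-off can be explored with a simple sweep. The following is a rough sketch under stated assumptions: the labels and scores are toy values invented for illustration, the function names are hypothetical, and maximizing F1 is only one possible selection rule (a fraud team may prefer a cost-weighted rule instead).

```python
def f1_at_threshold(y_true, scores, threshold):
    """F1 when every score >= threshold is flagged as fraud."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # precision or recall is zero (or undefined)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(y_true, scores, candidates):
    """Candidate threshold with the highest F1 (first one wins on ties)."""
    return max(candidates, key=lambda t: f1_at_threshold(y_true, scores, t))


# Toy validation data, made up for illustration only.
toy_labels = [0, 0, 0, 0, 1, 0, 1, 1]
toy_scores = [0.05, 0.1, 0.2, 0.35, 0.4, 0.6, 0.7, 0.9]
candidates = [0.1, 0.3, 0.5, 0.7]

for t in candidates:
    print(t, round(f1_at_threshold(toy_labels, toy_scores, t), 3))
print("best:", best_threshold(toy_labels, toy_scores, candidates))
```

Pick the threshold on a validation split rather than the test set, otherwise the reported F1 is optimistically biased.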
Challenge 2: Regression Error Analysis ⭐⭐⭐
Difficulty: Intermediate
Time: 1-2 hours
Topics: Regression metrics, residual analysis
Task
Build a house price prediction model and perform comprehensive error analysis.
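Dataset: Use the California Housing dataset. (Boston Housing was removed from scikit-learn in version 1.2 because of ethical concerns about one of its features, so avoid it for new work.)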
Your Tasks:
- Train 3 regression models:
  - Linear Regression
  - Random Forest
  - Gradient Boosting
- Calculate metrics:
  - MAE, RMSE, R², MAPE
  - Compare the RMSE/MAE ratio (RMSE is never smaller than MAE; a ratio well above 1 means a few large errors dominate, which points to outliers)
  - Calculate each metric by price range (low/mid/high)
- Residual analysis:
  - Plot residuals vs predicted values
  - Check normality (Q-Q plot, Shapiro-Wilk test)
  - Identify heteroscedasticity
  - Find the worst predictions
- Error breakdown:
  - Errors by neighborhood/location
  - Errors by price range
  - Identify systematic errors
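The metrics in the second task are short enough to compute by hand, which helps when checking library output. The block below is a minimal sketch: `regression_metrics` is a hypothetical helper name, the prices are invented toy values, and MAPE assumes every true value is nonzero.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE, R^2, MAPE (%) and the RMSE/MAE ratio."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot
    # MAPE is undefined when a true value is 0; prices are assumed positive.
    mape = 100 * sum(abs(e) / abs(t) for e, t in zip(errors, y_true)) / n
    ratio = rmse / mae if mae > 0 else float("nan")
    return {"mae": mae, "rmse": rmse, "r2": r2,
            "mape": mape, "rmse_mae_ratio": ratio}


# Toy prices in thousands of dollars (made up for illustration).
prices = [100, 200, 300, 400]
even_errors = regression_metrics(prices, [110, 190, 310, 390])
one_outlier = regression_metrics(prices, [100, 200, 300, 480])
print(even_errors["rmse_mae_ratio"], one_outlier["rmse_mae_ratio"])
```

When every prediction is off by the same amount the ratio is 1; when a single prediction carries all of the error the ratio rises to 2 in this toy case, which is the signal the ratio comparison is meant to surface.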
Success Criteria
- 3 models trained and compared
- All metrics calculated
- Residual plots created (4+ plots)
- Outliers identified and analyzed
- Systematic errors documented
- Model improvement recommendations (300+ words)
Learning Objectives
- Regression metric selection
- Residual diagnostics
- Outlier detection
- Model debugging
Challenge 3: LLM Output Evaluation ⭐⭐⭐
Difficulty: Intermediate
Time: 2-3 hours
Topics: BLEU, ROUGE, BERTScore, semantic similarity
Task
Compare different LLM outputs for a summarization task.
Dataset: Create or use:
- CNN/DailyMail summaries
- XSum dataset
- Or generate 20+ article-summary pairs
Your Tasks:
- Generate summaries from 3 different approaches:
  - Extractive (select key sentences)
  - Rule-based (heuristics)
  - LLM-based (GPT/Claude if available, or use pre-generated)
- Calculate metrics:
  - BLEU (1-gram through 4-gram)
  - ROUGE (ROUGE-1, ROUGE-2, ROUGE-L)
  - BERTScore (if possible)
- Analysis:
  - Which metric correlates best with human quality judgments?
  - Find examples where BLEU is misleading
  - Compare lexical (BLEU/ROUGE) vs semantic (BERTScore) metrics
- Human evaluation:
  - Create a rubric (fluency, coherence, relevance)
  - Evaluate 10 summaries manually
  - Compare automated vs human scores
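To see what the lexical metrics actually count, it can help to write a stripped-down ROUGE before reaching for a package. This is a simplified sketch, not the official ROUGE scorer: it assumes lowercase whitespace tokenization with no stemming, and the function names and example sentences are invented for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def prf(overlap, cand_total, ref_total):
    """Precision, recall and F1 from an overlap count."""
    precision = overlap / cand_total if cand_total else 0.0
    recall = overlap / ref_total if ref_total else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def rouge_n(reference, candidate, n=1):
    ref = Counter(ngrams(reference.lower().split(), n))
    cand = Counter(ngrams(candidate.lower().split(), n))
    overlap = sum((ref & cand).values())  # clipped n-gram matches
    return prf(overlap, sum(cand.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(reference, candidate):
    ref, cand = reference.lower().split(), candidate.lower().split()
    return prf(lcs_length(ref, cand), len(cand), len(ref))


reference = "the cat sat on the mat"
near_copy = "the cat lay on the mat"
paraphrase = "a feline rested upon the rug"

print(rouge_n(reference, near_copy, 1), rouge_l(reference, near_copy))
print(rouge_n(reference, paraphrase, 1))
```

The paraphrase keeps the meaning but shares only one token with the reference, so its lexical score collapses. Cases like this are good candidates for the lexical-versus-semantic comparison.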
Success Criteria
- 3 summarization approaches implemented
- BLEU and ROUGE scores calculated
- BERTScore calculated (or alternative semantic metric)
- Human evaluation completed (10+ samples)
- Correlation analysis between metrics
- Findings report (400+ words)
Learning Objectives
- LLM evaluation techniques
- Metric limitations
- Semantic vs lexical matching
- Human evaluation design
Challenge 4: Bias Detection & Measurement ⭐⭐⭐⭐
Difficulty: Advanced
Time: 3-4 hours
Topics: Fairness metrics, bias detection, group analysis
Task
Audit a hiring/lending model for bias across protected groups.
Dataset: Use:
- UCI Adult Income dataset
- German Credit dataset
- COMPAS recidivism data (if available)
- Or synthetic dataset with known bias
Your Tasks:
- Data analysis:
  - Document class distribution by protected group
  - Statistical tests for independence
  - Feature correlation with protected attributes
- Train a baseline model (expected to show bias):
  - Any classifier
  - Evaluate overall performance
  - Calculate group-wise metrics
- Fairness metrics:
  - Demographic parity difference/ratio
  - Equalized odds difference
  - Equal opportunity difference
  - Check the 80% (four-fifths) rule: the ratio of selection rates between groups should be at least 0.8
- Disparate impact analysis:
  - Confusion matrices by group
  - FPR and FNR by group
  - Precision and recall by group
  - Visualize disparities
- Bias mitigation:
  - Implement 2 mitigation techniques
  - Compare fairness before/after
  - Document the accuracy-fairness trade-off
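The fairness metrics listed above all reduce to comparing a handful of per-group rates. The sketch below shows one way to compute them in plain Python. It is illustrative only: the function names and the two-group toy data are invented, and taking the equalized odds difference as the larger of the TPR gap and the FPR gap is one common convention rather than the only definition.

```python
def rate(values):
    return sum(values) / len(values) if values else float("nan")


def group_rates(y_true, y_pred, groups):
    """Selection rate, TPR and FPR for each group label."""
    rates = {}
    for g in sorted(set(groups)):
        rows = [(t, p) for t, p, grp in zip(y_true, y_pred, groups) if grp == g]
        rates[g] = {
            "selection_rate": rate([p for _, p in rows]),
            "tpr": rate([p for t, p in rows if t == 1]),
            "fpr": rate([p for t, p in rows if t == 0]),
        }
    return rates


def fairness_summary(rates):
    def gap(key):
        values = [r[key] for r in rates.values()]
        return max(values) - min(values)

    selection = [r["selection_rate"] for r in rates.values()]
    dp_ratio = min(selection) / max(selection) if max(selection) > 0 else float("nan")
    return {
        "demographic_parity_difference": gap("selection_rate"),
        "demographic_parity_ratio": dp_ratio,
        "equal_opportunity_difference": gap("tpr"),
        # Assumed convention: the worse of the TPR gap and the FPR gap.
        "equalized_odds_difference": max(gap("tpr"), gap("fpr")),
        "passes_80_percent_rule": dp_ratio >= 0.8,
    }


# Toy audit data (made up): 8 applicants in each of two groups.
labels = [1, 1, 1, 1, 0, 0, 0, 0] * 2
preds = [1, 1, 1, 0, 1, 0, 0, 0] + [1, 1, 0, 0, 0, 0, 0, 0]
group = ["A"] * 8 + ["B"] * 8

summary = fairness_summary(group_rates(labels, preds, group))
print(summary)
```

In the toy data group B is selected half as often as group A, so the selection-rate ratio is 0.5 and the 80% rule fails.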
Success Criteria
- Comprehensive bias audit completed
- 5+ fairness metrics calculated
- Group-wise performance analyzed
- Statistical significance tested
- 2 mitigation techniques applied
- Trade-off analysis documented
- Report with recommendations (600+ words)
Learning Objectives
- Fairness metric calculation
- Bias detection in practice
- Mitigation techniques
- Accuracy-fairness trade-offs
- Ethical AI considerations
Challenge 5: Statistical Model Comparison ⭐⭐⭐⭐
Difficulty: Advanced
Time: 3-4 hours
Topics: Cross-validation, statistical tests, significance testing
Task
Rigorously compare 5+ models with statistical validation.
Dataset: Any classification or regression dataset (1000+ samples)
Your Tasks:
- Model training:
  - Train 5 different model types
  - Use stratified 10-fold cross-validation (plain 10-fold for regression, where stratification does not apply)
  - Track all metrics across folds
- Statistical testing:
  - Paired t-tests on per-fold scores (all pairwise comparisons)
  - McNemar’s test (classification only)
  - Create a significance matrix
  - Bonferroni correction for multiple comparisons
- Confidence intervals:
  - Calculate a 95% CI for each model
  - Bootstrap confidence intervals
  - Visualize with error bars
- Power analysis:
  - Calculate statistical power
  - Determine the minimum sample size
  - Sensitivity analysis
- Learning curves:
  - Plot for all models
  - Identify overfitting/underfitting
  - Recommend a training data size
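Two of the statistical steps above, McNemar’s test and the Bonferroni correction, fit in a few lines of standard-library Python. This is a sketch under assumptions: it uses the exact (binomial) form of McNemar’s test rather than the chi-square approximation, and the function names and toy predictions are invented for illustration.

```python
from math import comb


def mcnemar_exact(y_true, pred_a, pred_b):
    """Two-sided exact McNemar p-value from the discordant pairs."""
    # b: model A right, model B wrong; c: model A wrong, model B right.
    b = sum(1 for t, a, m in zip(y_true, pred_a, pred_b) if a == t and m != t)
    c = sum(1 for t, a, m in zip(y_true, pred_a, pred_b) if a != t and m == t)
    n = b + c
    if n == 0:
        return 1.0  # the models never disagree on correctness
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def bonferroni_alpha(alpha, n_models):
    """Per-comparison significance level for all pairwise comparisons."""
    n_comparisons = comb(n_models, 2)
    return alpha / n_comparisons


# Toy example (made up): A is right on all 12 cases, B misses 8 of them.
truth = [1] * 12
model_a = [1] * 12
model_b = [0] * 8 + [1] * 4

p_value = mcnemar_exact(truth, model_a, model_b)
threshold = bonferroni_alpha(0.05, n_models=5)
print(p_value, threshold, p_value < threshold)
```

With 5 models there are 10 pairwise comparisons, so the per-comparison level drops from 0.05 to 0.005. The toy p-value of about 0.0078 would pass an uncorrected test but not the corrected one, which is the kind of result the significance matrix should make visible.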
Success Criteria
- 5+ models compared
- 10-fold cross-validation used
- Statistical tests performed (10+ comparisons)
- Significance matrix created
- Confidence intervals calculated
- Learning curves generated
- Power analysis completed
- Detailed methodology report (500+ words)
Learning Objectives
- Rigorous model comparison
- Statistical hypothesis testing
- Multiple testing corrections
- Power and sample size analysis
- Scientific method in ML
Challenge 6: A/B Testing Simulation ⭐⭐⭐⭐⭐
Difficulty: Expert
Time: 4-6 hours
Topics: A/B testing, production evaluation, sequential testing
Task
Design and simulate a complete A/B test for model deployment.
Scenario:
- Current model (A) in production
- New model (B) to test
- Simulate 10,000 user interactions
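Whether 10,000 interactions are enough depends on the baseline rate and the lift you want to detect. The sketch below uses the standard normal-approximation formula for comparing two proportions. The conversion rates are hypothetical placeholders, not values from this scenario, and `sample_size_per_arm` is an illustrative name.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(p_a, p_b, alpha=0.05, power=0.8):
    """Approximate users needed in each arm for a two-sided test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p_a * (1 - p_a) + p_b * (1 - p_b)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p_a - p_b) ** 2)


# Hypothetical rates: model A converts 10% of users, model B 12% or 11%.
large_lift = sample_size_per_arm(0.10, 0.12)
small_lift = sample_size_per_arm(0.10, 0.11)
print(large_lift, small_lift)
```

Under these made-up rates a two-point lift needs a bit under 4,000 users per arm, which fits inside 10,000 interactions split 50/50, while a one-point lift needs several times more. Halving the detectable lift roughly quadruples the required sample.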
Your Tasks:
- Experimental design:
  - Define primary and secondary metrics
  - Calculate the required sample size
  - Design the randomization scheme
  - Set up guardrail metrics
- Simulation:
  - Generate synthetic user interactions
  - Randomly assign to A or B (50/50)
  - Track metrics over time
  - Simulate various scenarios (B wins, loses, tie)
- Sequential analysis:
  - Implement a sequential probability ratio test (SPRT)
  - Define early stopping rules
  - Monitor p-values over time
  - Handle the peeking problem (repeatedly checking a fixed-sample p-value inflates the false-positive rate)
- Results analysis:
  - Statistical significance test
  - Confidence intervals for lift
  - Heterogeneous treatment effects (if applicable)
  - Cost-benefit analysis
- Monitoring dashboard:
  - Create visualizations for stakeholders
  - Real-time metric tracking
  - Decision framework
  - Rollout plan
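For the sequential analysis step, a Wald-style sequential probability ratio test is one way to build in early stopping. The following is a deliberately simplified sketch: it tests a single stream of model B outcomes against a fixed baseline rate `p0` (as if A’s rate were known exactly), which a real two-arm test would not assume. The rates, error levels and function name are all illustrative.

```python
import math


def sprt_bernoulli(observations, p0, p1, alpha=0.05, beta=0.2):
    """Return (decision, n_used) for H0: p = p0 versus H1: p = p1."""
    upper = math.log((1 - beta) / alpha)   # cross it: accept H1
    lower = math.log(beta / (1 - alpha))   # cross it: accept H0
    gain_success = math.log(p1 / p0)
    gain_failure = math.log((1 - p1) / (1 - p0))
    llr = 0.0  # running log-likelihood ratio
    for n, converted in enumerate(observations, start=1):
        llr += gain_success if converted else gain_failure
        if llr >= upper:
            return "accept_h1", n
        if llr <= lower:
            return "accept_h0", n
    return "continue", len(observations)


# Extreme toy streams (made up) to show both stopping boundaries.
print(sprt_bernoulli([1] * 10, p0=0.1, p1=0.3))
print(sprt_bernoulli([0] * 20, p0=0.1, p1=0.3))
print(sprt_bernoulli([1, 0], p0=0.1, p1=0.3))
```

Because the stopping boundaries are fixed in advance from `alpha` and `beta`, checking after every observation is part of the design, not a source of inflated false positives. This is the contrast to draw with naive p-value peeking.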
Success Criteria
- Sample size calculation correct
- A/B test simulation implemented
- Sequential testing applied
- 3+ scenarios tested (win/lose/tie)
- Statistical analysis complete
- Dashboard mockup created
- Rollout plan documented
- Comprehensive report (800+ words)
Learning Objectives
- A/B test design
- Sequential hypothesis testing
- Production ML evaluation
- Stakeholder communication
- Decision-making under uncertainty
Challenge 7: Multi-Objective Model Selection ⭐⭐⭐⭐⭐
Difficulty: Expert
Time: 4-6 hours
Topics: Pareto optimality, trade-off analysis, decision making
Task
Select the best model when objectives conflict (accuracy vs fairness vs speed).
Dataset: Any real-world dataset with protected attributes
Your Tasks:
- Train a diverse model zoo (8+ models):
  - Various complexity levels
  - Different algorithms
  - Measure: accuracy, fairness, speed, memory, interpretability
- Pareto frontier:
  - Identify Pareto-optimal models
  - Visualize in 2D/3D
  - Eliminate dominated models
- Multi-criteria decision analysis:
  - Weighted sum approach
  - TOPSIS method
  - Analytic Hierarchy Process (AHP)
- Sensitivity analysis:
  - Test different weight configurations
  - Identify robust choices
  - Scenario planning (accuracy-focused, fairness-focused, balanced)
- Stakeholder analysis:
  - Define 3 stakeholder profiles
  - Recommend a model for each
  - Document trade-offs
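The Pareto frontier and the weighted sum approach can be prototyped on a small score table before training the full model zoo. This sketch makes several assumptions: the model names and scores are invented, every objective is oriented so that higher is better (speed and fairness would need to be converted to that form), and objectives are min-max normalized before weighting.

```python
def dominates(a, b):
    """True if a is at least as good as b everywhere and better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(models):
    """Names of models that no other model dominates."""
    return [
        name for name, scores in models.items()
        if not any(dominates(other, scores)
                   for other_name, other in models.items() if other_name != name)
    ]


def weighted_scores(models, weights):
    """Weighted sum of min-max normalized objectives for each model."""
    columns = list(zip(*models.values()))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    result = {}
    for name, scores in models.items():
        normalized = [(v - lo) / (hi - lo) if hi > lo else 0.0
                      for v, lo, hi in zip(scores, lows, highs)]
        result[name] = sum(w * v for w, v in zip(weights, normalized))
    return result


# Hypothetical (accuracy, fairness, speed) scores, all higher-is-better.
zoo = {
    "logreg": (0.82, 0.95, 0.90),
    "random_forest": (0.88, 0.85, 0.40),
    "gbm": (0.90, 0.80, 0.30),
    "small_tree": (0.80, 0.90, 0.85),
    "mlp": (0.87, 0.80, 0.35),
}

front = pareto_front(zoo)
accuracy_focused = weighted_scores(zoo, (0.8, 0.1, 0.1))
fairness_focused = weighted_scores(zoo, (0.1, 0.8, 0.1))
print(front)
print(max(accuracy_focused, key=accuracy_focused.get))
print(max(fairness_focused, key=fairness_focused.get))
```

The recommended model changes with the weights even though the frontier stays the same. That is why the sensitivity analysis and stakeholder profiles are separate tasks from identifying the frontier.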
Success Criteria
- 8+ models trained
- 5+ objectives measured
- Pareto frontier identified
- 3 MCDA methods applied
- Sensitivity analysis complete
- Stakeholder recommendations made
- Interactive visualization created
- Decision framework documented (1000+ words)
Learning Objectives
- Multi-objective optimization
- Pareto optimality
- Decision analysis techniques
- Stakeholder management
- Real-world ML deployment
🏆 Challenge Completion Tracker
| Challenge | Status | Date | Notes |
|---|---|---|---|
| 1. Imbalanced Classification | ⬜ | | |
| 2. Regression Error Analysis | ⬜ | | |
| 3. LLM Output Evaluation | ⬜ | | |
| 4. Bias Detection | ⬜ | | |
| 5. Statistical Comparison | ⬜ | | |
| 6. A/B Testing | ⬜ | | |
| 7. Multi-Objective Selection | ⬜ | | |
💡 General Tips
- Start simple: Begin with basic versions, then enhance
- Document everything: Explain your choices and interpret results
- Visualize: Create clear, professional plots
- Test edge cases: Don’t just test the happy path
- Seek feedback: Share results with peers or mentors
Complete all 7 challenges to become a model evaluation expert! 🎯