Introduction to Statistical Learning with Python (ISLP)

A comprehensive collection of Jupyter notebooks covering the foundational concepts and advanced techniques in statistical learning and machine learning, based on “An Introduction to Statistical Learning” adapted for Python.

Use this folder when you want a statistics-first view of machine learning rather than a purely optimization-first or deep-learning-first path. It is especially useful for classical modeling, validation, interpretability, and statistical decision making.

📚 Overview

This series provides hands-on implementations of statistical learning methods with Python, featuring:

  • 13 comprehensive chapters covering fundamental to advanced topics
  • Theory + Practice: Mathematical formulations with executable code
  • Real datasets: Practical examples using classic ML datasets
  • Visualizations: Professional plots for understanding concepts
  • Exercises: Practice problems for each chapter
  • 100+ additional exercises: See PRACTICE_EXERCISES.ipynb

🗂️ Chapter Guide

Chapter 1: Introduction

File: 01_introduction.ipynb (NEW!)

Topics:

  • What is statistical learning?
  • Supervised vs unsupervised learning
  • Real-world applications
  • The ML workflow
  • Model assessment metrics
  • Overfitting vs underfitting

Demonstrations:

  • California Housing (regression example)
  • Breast Cancer (classification example)
  • Iris Clustering (unsupervised example)
  • Train-test split and overfitting visualization
  • Comprehensive metrics comparison

Key Concepts:

  • The learning framework: Y = f(X) + ε
  • Bias-variance trade-off
  • Train-test split importance
  • Regression vs classification metrics
  • Supervised vs unsupervised paradigms

Practice: 8 comprehensive exercises covering all intro concepts


Chapter 2: Statistical Learning

File: 02_statistical_learning.ipynb (25KB)

Topics:

  • Supervised vs unsupervised learning
  • Regression vs classification
  • Bias-variance trade-off
  • Training vs test error
  • Model assessment and selection

Key Concepts:

  • Reducible vs irreducible error
  • Overfitting and underfitting
  • Cross-validation basics
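
The gap between training and test error can be seen in a small simulation. The sketch below is not taken from the notebooks: the sine function, noise level, and sample sizes are assumptions chosen for illustration. It simulates Y = f(X) + ε and fits polynomials of increasing degree with NumPy.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Hypothetical "true" function, chosen only for this illustration
    return np.sin(2 * np.pi * x)

# Simulate Y = f(X) + eps with Gaussian noise (the irreducible error)
x_train = rng.uniform(0, 1, 30)
y_train = f(x_train) + rng.normal(0, 0.3, 30)
x_test = rng.uniform(0, 1, 200)
y_test = f(x_test) + rng.normal(0, 0.3, 200)

def mse(coefs, x, y):
    """Mean squared error of a fitted polynomial on (x, y)."""
    return float(np.mean((np.polyval(coefs, x) - y) ** 2))

results = {}
for degree in (1, 3, 9):
    coefs = np.polyfit(x_train, y_train, degree)  # least-squares polynomial fit
    results[degree] = (mse(coefs, x_train, y_train), mse(coefs, x_test, y_test))
    print(f"degree={degree}: train MSE={results[degree][0]:.3f}, "
          f"test MSE={results[degree][1]:.3f}")
```

Training error can only go down as the degree grows, because each polynomial family contains the previous one. Test error has no such guarantee, which is why it is the quantity to watch when choosing flexibility.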

Chapter 3: Linear Regression

File: 03_linear_regression.ipynb (40KB)

Topics:

  • Simple linear regression
  • Multiple linear regression
  • Least squares estimation
  • Hypothesis testing (t-tests, F-tests)
  • R² and adjusted R²
  • Residual analysis

Demonstrations:

  • Boston Housing dataset
  • Advertising dataset
  • Confidence vs prediction intervals
  • Diagnostic plots

Key Formulas:

$$\hat{\beta} = (X^\top X)^{-1} X^\top y$$
$$\mathrm{RSS} = \sum_i (y_i - \hat{y}_i)^2$$
$$R^2 = 1 - \mathrm{RSS}/\mathrm{TSS}$$
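
The formulas above can be checked numerically. This is a minimal sketch on synthetic data (the coefficients and noise level are made up for illustration, not taken from the notebook): it solves the normal equations with NumPy and computes RSS, TSS, and R².

```python
import numpy as np

rng = np.random.default_rng(42)
n, p = 100, 3
X = rng.normal(size=(n, p))
true_beta = np.array([2.0, -1.0, 0.5])  # hypothetical coefficients
y = 4.0 + X @ true_beta + rng.normal(scale=0.5, size=n)

# Add an intercept column, then solve (X'X) beta = X'y
X_design = np.column_stack([np.ones(n), X])
beta_hat = np.linalg.solve(X_design.T @ X_design, X_design.T @ y)

y_hat = X_design @ beta_hat
rss = float(np.sum((y - y_hat) ** 2))      # residual sum of squares
tss = float(np.sum((y - y.mean()) ** 2))   # total sum of squares
r2 = 1 - rss / tss

print("beta_hat:", np.round(beta_hat, 3))
print("R^2:", round(r2, 3))
```

Solving the linear system is used here instead of forming the explicit inverse, which is the usual numerically safer choice; `np.linalg.lstsq` gives the same estimate.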

Chapter 4: Classification

File: 04_classification.ipynb (32KB)

Topics:

  • Logistic regression
  • Linear Discriminant Analysis (LDA)
  • Quadratic Discriminant Analysis (QDA)
  • Naive Bayes
  • K-Nearest Neighbors (KNN)

Demonstrations:

  • Binary classification (Default dataset)
  • Multi-class classification (Iris)
  • Decision boundaries
  • Confusion matrices
  • ROC curves and AUC

Metrics:

  • Accuracy, precision, recall, F1-score
  • Sensitivity and specificity
  • Classification error rate
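
All of the metrics listed above follow from the four cells of a binary confusion matrix. The sketch below uses plain Python and made-up labels to show how they relate; the function name `classification_metrics` is hypothetical.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute common binary classification metrics from label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0        # also called sensitivity
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": accuracy,
        "error_rate": 1 - accuracy,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
    }

# Made-up labels: 4 positives, 6 negatives
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```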

Chapter 5: Resampling Methods

File: 05_resampling_methods.ipynb (37KB)

Topics:

  • Cross-validation (LOOCV, k-fold)
  • Bootstrap
  • Model selection
  • Uncertainty estimation

Demonstrations:

  • k-fold CV for polynomial degree selection
  • LOOCV vs k-fold comparison
  • Bootstrap confidence intervals
  • Bootstrap standard errors

Applications:

  • Estimating test error
  • Model comparison
  • Parameter uncertainty
  • Sample size effects
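
As an illustration of bootstrap standard errors and confidence intervals, the sketch below resamples a synthetic sample with replacement and compares the bootstrap standard error of the mean with the textbook formula s/√n. The exponential data and the number of resamples are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.exponential(scale=2.0, size=200)  # synthetic, skewed data

n_boot = 2000
boot_means = np.empty(n_boot)
for b in range(n_boot):
    # Resample the observed data with replacement, same size as the original
    resample = rng.choice(sample, size=sample.size, replace=True)
    boot_means[b] = resample.mean()

boot_se = float(boot_means.std(ddof=1))
analytic_se = float(sample.std(ddof=1) / np.sqrt(sample.size))
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])  # percentile interval

print(f"bootstrap SE: {boot_se:.4f}, analytic SE: {analytic_se:.4f}")
print(f"95% percentile CI for the mean: ({ci_low:.3f}, {ci_high:.3f})")
```

For the mean, the two standard errors should agree closely. The bootstrap becomes more useful for statistics such as a median or a ratio, where no simple formula is available.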

Chapter 6: Linear Model Selection and Regularization

File: 06_regularization.ipynb (46KB)

Topics:

  • Subset selection (best subset, forward, backward)
  • Ridge regression (L2 penalty)
  • Lasso regression (L1 penalty)
  • Elastic Net
  • Principal Component Regression (PCR)

Key Concepts:

  • Regularization path
  • Cross-validation for λ selection
  • Feature selection vs shrinkage
  • Multicollinearity handling

Formulas:

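Ridge: minimize $\mathrm{RSS} + \lambda \sum_j \beta_j^2$
Lasso: minimize $\mathrm{RSS} + \lambda \sum_j |\beta_j|$
Elastic Net: minimize $\mathrm{RSS} + \lambda_1 \sum_j |\beta_j| + \lambda_2 \sum_j \beta_j^2$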
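
The ridge objective has a closed-form minimizer, which makes the shrinkage effect easy to see. The sketch below is an assumption-laden illustration, not the notebook's code: `ridge_fit` is a hypothetical helper, the data are synthetic with two nearly collinear predictors, and the intercept is left unpenalized by centering.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge on centered data: beta = (Xc'Xc + lam*I)^-1 Xc'yc."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    p = X.shape[1]
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
    intercept = y_mean - x_mean @ beta  # intercept is not penalized
    return intercept, beta

rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)  # nearly collinear with x1
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = 1.0 + x1 + x2 + 0.5 * x3 + rng.normal(scale=0.5, size=n)

norms = {}
for lam in (0.0, 1.0, 10.0, 100.0):
    _, beta = ridge_fit(X, y, lam)
    norms[lam] = float(np.linalg.norm(beta))
    print(f"lambda={lam:>5}: beta={np.round(beta, 3)}, ||beta||={norms[lam]:.3f}")
```

The coefficient norm shrinks as λ grows. In practice predictors are usually standardized first, because the penalty depends on the scale of each column.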

Chapter 7: Moving Beyond Linearity

File: 07_nonlinearity.ipynb (35KB)

Topics:

  • Polynomial regression
  • Step functions
  • Regression splines (B-splines)
  • Smoothing splines
  • Generalized Additive Models (GAMs)

Demonstrations:

  • Polynomial degree selection via CV
  • Spline knot placement
  • GAMs with multiple predictors
  • Method comparison

Use Cases:

  • Non-linear relationships
  • Flexible modeling
  • Interpretable non-linearity
  • Smooth curve fitting
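
To make regression splines concrete, here is a minimal sketch that builds a cubic spline with a truncated power basis and fits it by least squares. Note the swap: the chapter lists B-splines, while this example uses the truncated power basis because it fits in a few lines. The knot locations and data are assumptions chosen for illustration.

```python
import numpy as np

def spline_basis(x, knots):
    """Cubic truncated power basis: 1, x, x^2, x^3, (x - k)_+^3 per knot."""
    cols = [np.ones_like(x), x, x ** 2, x ** 3]
    cols += [np.clip(x - k, 0, None) ** 3 for k in knots]
    return np.column_stack(cols)

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 10, 120))
y = np.sin(x) + 0.1 * x + rng.normal(scale=0.3, size=x.size)  # synthetic data

knots = [2.5, 5.0, 7.5]  # hypothetical knot placement
B = spline_basis(x, knots)
coef, *_ = np.linalg.lstsq(B, y, rcond=None)
rss_spline = float(np.sum((y - B @ coef) ** 2))

# A plain cubic polynomial is the same model without the knot terms
P = B[:, :4]
coef_poly, *_ = np.linalg.lstsq(P, y, rcond=None)
rss_poly = float(np.sum((y - P @ coef_poly) ** 2))

print(f"RSS cubic polynomial: {rss_poly:.2f}")
print(f"RSS cubic spline (3 knots): {rss_spline:.2f}")
```

Each knot adds one parameter, so the spline can never fit the training data worse than the cubic polynomial. Whether the extra flexibility helps on new data is a question for cross-validation.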

Chapter 8: Tree-Based Methods

File: 08_tree_methods.ipynb (40-45KB)

Topics:

  • Decision trees (CART)
  • Bagging (Bootstrap Aggregation)
  • Random Forests
  • Boosting (AdaBoost, Gradient Boosting)
  • Feature importance

Demonstrations:

  • Tree pruning and depth control
  • Out-of-bag (OOB) error
  • Feature importance visualization
  • Ensemble comparison
  • Breast Cancer dataset

Key Algorithms:

  • DecisionTreeClassifier/Regressor
  • BaggingClassifier
  • RandomForestClassifier
  • AdaBoostClassifier
  • GradientBoostingClassifier
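
The ensemble classes above are all built on decision trees, and a CART-style tree grows by repeatedly choosing the split that most reduces impurity. The sketch below shows that single step for one feature using Gini impurity. The helper names and toy data are hypothetical, and real implementations are far more efficient.

```python
def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_split(values, labels):
    """Return (threshold, weighted Gini) of the best split 'x <= threshold'."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, gini(labels))  # no split is the baseline
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical feature values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (threshold, score)
    return best

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [0, 0, 0, 1, 1, 1]
threshold, impurity = best_split(x, y)
print(f"parent Gini: {gini(y):.2f}, best threshold: {threshold}, "
      f"child Gini: {impurity:.2f}")
```

Bagging and random forests repeat this tree-growing procedure on bootstrap resamples and average the resulting trees.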

Chapter 9: Support Vector Machines

File: 09_support_vector_machines.ipynb (40-45KB)

Topics:

  • Maximal margin classifier
  • Support Vector Classifier (soft margin)
  • Kernel methods (Linear, Polynomial, RBF)
  • Multi-class SVMs
  • Support Vector Regression (SVR)

Demonstrations:

  • C parameter tuning (margin control)
  • Kernel comparison
  • Gamma parameter effects (RBF)
  • Hyperparameter grid search
  • Breast Cancer classification

Key Concepts:

  • Margin maximization
  • Support vectors
  • Kernel trick
  • Slack variables (ξ)
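
Slack variables measure how far each point falls on the wrong side of its margin. The sketch below computes them for a fixed, hypothetical hyperplane (w, b), which is not an optimized solution, using labels in {-1, +1}. It then evaluates the soft-margin objective that C trades off against margin width.

```python
import numpy as np

# Hypothetical hyperplane f(x) = w.x + b; not fitted, chosen for illustration
w = np.array([1.0, -1.0])
b = 0.0

X = np.array([
    [2.0, 0.0],   # y=+1, f=2: outside the margin
    [0.5, 0.0],   # y=+1, f=0.5: correct side but inside the margin
    [-1.0, 0.0],  # y=+1, f=-1: misclassified
    [0.0, 2.0],   # y=-1, f=-2: outside the margin
    [0.0, 1.0],   # y=-1, f=-1: exactly on the margin
])
y = np.array([1, 1, 1, -1, -1])

margins = y * (X @ w + b)                # functional margins y_i * f(x_i)
slack = np.maximum(0.0, 1.0 - margins)   # slack_i = max(0, 1 - y_i f(x_i))

C = 1.0
objective = 0.5 * w @ w + C * slack.sum()  # soft-margin primal objective

print("slack:", slack)
print("objective:", objective)
```

A larger C makes slack more expensive, so the fitted classifier tolerates fewer margin violations at the cost of a narrower margin.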

Chapter 10: Deep Learning

File: 10_deep_learning.ipynb (35-40KB)

Topics:

  • Neural network fundamentals
  • Activation functions (ReLU, sigmoid, tanh, softmax)
  • Single-layer vs deep networks
  • Backpropagation
  • Regularization (L2, dropout, early stopping)
  • MLPClassifier and MLPRegressor

Demonstrations:

  • California Housing (regression)
  • MNIST digit classification
  • Hidden layer comparison
  • Learning curves
  • Regularization effects

Architecture:

Input → Hidden₁ → Hidden₂ → ... → Output
Each layer: z = Wx + b, a = σ(z)
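
The layer rule z = Wx + b, a = σ(z) can be written as a short forward pass. In this sketch the layer sizes, random weights, ReLU hidden activations, and softmax output are all assumptions for illustration. The chapter itself works with MLPClassifier and MLPRegressor, which handle this internally.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def softmax(z):
    z = z - z.max()          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
layer_sizes = [4, 8, 8, 3]   # hypothetical: 4 inputs, two hidden layers, 3 classes

# One weight matrix and bias vector per layer transition
weights = [rng.normal(scale=0.5, size=(n_out, n_in))
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n_out) for n_out in layer_sizes[1:]]

def forward(x):
    """Apply z = W a + b then an activation, layer by layer."""
    a = x
    for W, b in zip(weights[:-1], biases[:-1]):
        a = relu(W @ a + b)                       # hidden layers
    return softmax(weights[-1] @ a + biases[-1])  # output probabilities

x = rng.normal(size=4)
probs = forward(x)
print("class probabilities:", np.round(probs, 3))
```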

Chapter 11: Survival Analysis

File: 11_survival_analysis.ipynb (45-50KB)

Topics:

  • Survival functions
  • Censoring (right, left, interval)
  • Kaplan-Meier estimator
  • Log-Rank test
  • Cox Proportional Hazards model
  • Hazard ratios

Demonstrations:

  • Rossi recidivism dataset
  • Survival curves by group
  • Median survival time
  • Hazard ratio interpretation
  • Proportional hazards assumption

Key Library: lifelines

Applications:

  • Time-to-event analysis
  • Medical studies
  • Customer churn
  • Equipment failure
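
The Kaplan-Meier estimator multiplies, at each observed event time, the fraction of at-risk subjects who survive it. The sketch below implements that product in plain Python on a tiny made-up sample with right-censoring. The function name is hypothetical, and the notebooks use lifelines for the real analysis.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival estimates.

    times:  observed durations
    events: 1 if the event was observed, 0 if right-censored
    Returns a list of (time, S(time)) at each distinct event time.
    """
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        at_risk = sum(1 for ti in times if ti >= t)
        deaths = sum(1 for ti, ei in zip(times, events) if ti == t and ei == 1)
        if deaths > 0:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
    return curve

# Made-up data: the subject censored at t=2 still counts as at risk at t=2
times = [1, 2, 2, 3, 4]
events = [1, 1, 0, 1, 0]
curve = kaplan_meier(times, events)

# Median survival time: first time the curve drops to 0.5 or below
median = next((t for t, s in curve if s <= 0.5), None)

for t, s in curve:
    print(f"t={t}: S(t)={s:.2f}")
print("median survival time:", median)
```

Censored subjects contribute to the at-risk count until they drop out, but never trigger a step in the curve. That is how the estimator uses partial information instead of discarding it.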

Chapter 12: Unsupervised Learning

File: 12_unsupervised_learning.ipynb (40-45KB)

Topics:

  • Principal Component Analysis (PCA)
  • K-Means clustering
  • Hierarchical clustering
  • DBSCAN
  • Dimensionality reduction

Demonstrations:

  • Iris dataset (PCA: 4D → 2D)
  • Scree plots and variance explained
  • Elbow method for K selection
  • Silhouette analysis
  • Dendrogram visualization
  • MNIST digits clustering

Linkage Methods:

  • Complete
  • Average
  • Single
  • Ward

Validation:

  • Silhouette score
  • Davies-Bouldin index
  • Inertia (within-cluster SS)
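
PCA and the variance-explained figures behind a scree plot can be computed from a singular value decomposition of the centered data. The sketch below uses synthetic 4-feature data, an assumption standing in for a real dataset, and projects it to 2 dimensions with NumPy only.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data: 4 observed features driven mostly by 2 latent factors
latent = rng.normal(size=(150, 2))
mixing = rng.normal(size=(2, 4))
X = latent @ mixing + 0.1 * rng.normal(size=(150, 4))

# Center each feature, then decompose
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

explained_variance = S ** 2 / (X.shape[0] - 1)
explained_ratio = explained_variance / explained_variance.sum()

# Rows of Vt are the principal directions; keep the first two (4D -> 2D)
scores = Xc @ Vt[:2].T

print("variance explained per component:", np.round(explained_ratio, 3))
print("projected shape:", scores.shape)
```

The data are centered but not scaled here. When features are measured in different units, standardizing them first is the usual choice, since PCA is sensitive to scale.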

Chapter 13: Multiple Testing

File: 13_multiple_testing.ipynb (35-40KB)

Topics:

  • Multiple testing problem
  • Family-Wise Error Rate (FWER)
  • False Discovery Rate (FDR)
  • Bonferroni correction
  • Holm's method
  • Benjamini-Hochberg procedure
  • Benjamini-Yekutieli procedure

Demonstrations:

  • Simulation of Type I error inflation
  • FWER vs FDR comparison
  • Threshold visualization
  • Power analysis
  • Method selection guidelines

Key Library: statsmodels.stats.multitest

Applications:

  • Genomics (thousands of tests)
  • Neuroimaging
  • A/B testing
  • Clinical trials

Decision Guide:

  • Small m (< 20): Bonferroni or Holm
  • Large m (≥ 100): Benjamini-Hochberg (FDR)
  • Confirmatory studies: FWER control
  • Exploratory studies: FDR control
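
The difference between FWER and FDR control is easiest to see by applying the procedures to the same p-values. The sketch below implements Bonferroni, Holm, and Benjamini-Hochberg in plain Python. The p-values are made up for illustration and the function names are hypothetical.

```python
def bonferroni(pvals, alpha=0.05):
    """Reject H_i when p_i <= alpha / m (controls FWER)."""
    m = len(pvals)
    return [p <= alpha / m for p in pvals]

def holm(pvals, alpha=0.05):
    """Step-down Holm procedure (controls FWER, never rejects fewer than Bonferroni)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    reject = [False] * m
    for rank, i in enumerate(order):          # rank = 0 for the smallest p-value
        if pvals[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break                             # stop at the first failure
    return reject

def benjamini_hochberg(pvals, alpha=0.05):
    """Step-up BH procedure (controls FDR)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / m * alpha:
            cutoff = rank                     # largest rank passing its threshold
    reject = [False] * m
    for i in order[:cutoff]:
        reject[i] = True
    return reject

pvals = [0.001, 0.007, 0.012, 0.041, 0.042, 0.06, 0.074, 0.205]  # made up
n_bonf = sum(bonferroni(pvals))
n_holm = sum(holm(pvals))
n_bh = sum(benjamini_hochberg(pvals))
print(f"rejections: Bonferroni={n_bonf}, Holm={n_holm}, BH={n_bh}")
```

On these values the three procedures reject 1, 2, and 3 hypotheses respectively, which matches the usual ordering from most to least conservative. In statsmodels, `multipletests` offers the same procedures through `method="bonferroni"`, `"holm"`, and `"fdr_bh"`.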

🚀 Getting Started

Prerequisites

Python Version: 3.8+

Required Libraries:

pip install numpy pandas matplotlib seaborn scikit-learn scipy statsmodels lifelines

Or use the requirements file:

pip install -r requirements.txt

Installation

  1. Clone the repository:
git clone https://github.com/PavanMudigonda/aiml.git
cd aiml/2-maths/islp-book
  2. Create a virtual environment:
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
  3. Install dependencies:
pip install -r requirements.txt
  4. Launch Jupyter:
jupyter notebook
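
After installing, a quick check from Python can confirm that the libraries are importable in the active environment. This is an optional sketch, not part of the repository. The list mirrors the pip command in the prerequisites, using import names (scikit-learn is imported as `sklearn`).

```python
import importlib.util

# Import names for the packages listed in the pip install command
REQUIRED = [
    "numpy", "pandas", "matplotlib", "seaborn",
    "sklearn", "scipy", "statsmodels", "lifelines",
]

def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```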

📖 Learning Path

Beginner Track (Fundamentals)

  1. Chapter 1: Introduction → Overview and motivation
  2. Chapter 2: Statistical Learning → Understand core concepts
  3. Chapter 3: Linear Regression → First predictive model
  4. Chapter 4: Classification → Categorical outcomes
  5. Chapter 5: Resampling → Model validation

Time: 2-4 weeks

Intermediate Track (Advanced Methods)

  1. Chapter 6: Regularization → Handle complex models
  2. Chapter 7: Beyond Linearity → Non-linear relationships
  3. Chapter 8: Tree Methods → Ensemble learning
  4. Chapter 9: SVMs → Powerful classification

Time: 3-4 weeks

Advanced Track (Specialized Topics)

  1. Chapter 10: Deep Learning → Neural networks
  2. Chapter 11: Survival Analysis → Time-to-event
  3. Chapter 12: Unsupervised → Clustering and PCA
  4. Chapter 13: Multiple Testing → Statistical inference

Time: 3-4 weeks

Total Program: 8-12 weeks for all three tracks


🎯 How to Use These Notebooks

For Self-Study

  1. Read theory sections (markdown cells) carefully
  2. Run code cells sequentially (Shift+Enter)
  3. Modify parameters to see effects
  4. Complete exercises at the end
  5. Compare your solutions with demonstrations

For Teaching

  • Use as lecture supplements
  • Live coding demonstrations
  • Student projects and assignments
  • Flipped classroom materials

For Reference

  • Quick lookup of methods
  • Code snippets for projects
  • Visualization templates
  • Best practices

How To Use This Folder Well

  • Work through the beginner and intermediate tracks before treating this as a reference library.
  • Focus on model assessment, regularization, and method selection because those ideas transfer far beyond classical ML.
  • Pair the notebooks with your own small datasets so the statistical choices feel concrete.

📊 Datasets Used

| Dataset | Used In | Description |
| --- | --- | --- |
| Boston Housing | Ch 3, 6, 7 | House prices with 13 features |
| Advertising | Ch 3 | Sales vs TV/Radio/Newspaper |
| Default | Ch 4 | Credit card default prediction |
| Iris | Ch 4, 12 | 3 species, 4 features |
| Breast Cancer | Ch 8, 9 | Binary classification, 30 features |
| California Housing | Ch 10 | Regression, 8 features |
| MNIST Digits | Ch 10, 12 | 64 features (8×8 pixels) |
| Rossi | Ch 11 | Recidivism survival data |
| Synthetic | Multiple | Generated for demonstrations |

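Most datasets ship with scikit-learn or are easy to obtain. Note that Boston Housing was removed from scikit-learn in version 1.2, so it has to be loaded from another source.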
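
As a quick sanity check on the feature counts in the table, the sketch below loads three datasets that are bundled with scikit-learn and prints their shapes. It assumes the notebooks use these loaders, which the table suggests but does not state. In particular, the 64-feature digits data matches scikit-learn's `load_digits`.

```python
from sklearn.datasets import load_breast_cancer, load_digits, load_iris

# Each loader returns (features, target) when return_X_y=True
loaders = {
    "Iris": load_iris,
    "Breast Cancer": load_breast_cancer,
    "Digits (8x8)": load_digits,
}

shapes = {}
for name, loader in loaders.items():
    X, y = loader(return_X_y=True)
    shapes[name] = X.shape
    print(f"{name}: {X.shape[0]} samples, {X.shape[1]} features, "
          f"{len(set(y))} classes")
```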


🔑 Key Libraries Reference

Core

import numpy as np               # Numerical computing
import pandas as pd              # Data manipulation
import matplotlib.pyplot as plt  # Plotting
import seaborn as sns            # Statistical visualization

Machine Learning

from sklearn.linear_model import LinearRegression, LogisticRegression, Ridge, Lasso
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, BaggingClassifier
from sklearn.svm import SVC, SVR
from sklearn.neural_network import MLPClassifier, MLPRegressor
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.model_selection import cross_val_score, GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, roc_curve, auc

Statistical

from scipy import stats                                # Statistical tests
from statsmodels.stats.multitest import multipletests  # Multiple testing
from lifelines import KaplanMeierFitter, CoxPHFitter   # Survival analysis

🎓 Concepts Covered

Fundamental Concepts

  • Bias-variance trade-off
  • Overfitting and regularization
  • Cross-validation
  • Feature engineering
  • Model selection

Regression Techniques

  • Linear regression
  • Polynomial regression
  • Ridge and Lasso
  • Splines and GAMs
  • SVR

Classification Methods

  • Logistic regression
  • LDA/QDA
  • Decision trees
  • Random forests
  • Boosting
  • SVMs
  • Neural networks

Unsupervised Learning

  • PCA
  • K-Means
  • Hierarchical clustering
  • Dimensionality reduction

Statistical Inference

  • Hypothesis testing
  • Confidence intervals
  • Bootstrap
  • Multiple testing correction
  • Survival analysis

💡 Best Practices

Code Style

  • Set random seeds for reproducibility: np.random.seed(42)
  • Use train-test splits: train_test_split(X, y, test_size=0.2)
  • Scale features when needed: StandardScaler()
  • Validate with cross-validation

Visualization

  • Clear labels and titles
  • Appropriate color schemes
  • Grid lines for readability
  • Legend placement
  • Figure size optimization

Model Development

  1. Explore data (EDA)
  2. Split data (train/test)
  3. Preprocess (scaling, encoding)
  4. Train model
  5. Validate (cross-validation)
  6. Tune hyperparameters
  7. Test (final evaluation)
  8. Interpret results
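
The workflow above can be sketched end to end with scikit-learn. This example is an illustration under assumptions: the Breast Cancer dataset, logistic regression, and the grid of C values are choices made here, not the notebooks' code. The step numbers in the comments refer to the list above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Step 2: split first, so the test set never influences preprocessing or tuning
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Steps 3-4: scaling and the model live in one pipeline, so the scaler is
# refit on the training part of every cross-validation fold
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Steps 5-6: cross-validated hyperparameter search on the training set only
param_grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X_train, y_train)

# Step 7: a single final evaluation on the held-out test set
test_accuracy = search.score(X_test, y_test)

print("best C:", search.best_params_["logisticregression__C"])
print("cross-validated accuracy:", round(search.best_score_, 3))
print("test accuracy:", round(test_accuracy, 3))
```

Keeping preprocessing inside the pipeline is the design choice that matters most here: scaling with statistics computed on the full dataset would leak test information into training.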

🔍 Quick Method Lookup

Choose Regression Method

  • Linear relationships: Linear Regression
  • Multicollinearity: Ridge or Lasso
  • Feature selection: Lasso
  • Non-linear: Polynomial, Splines, GAMs
  • Complex patterns: Random Forest, Gradient Boosting
  • Small sample: Ridge

Choose Classification Method

  • Linear boundary: Logistic Regression, LDA
  • Non-linear boundary: QDA, KNN, SVM (RBF)
  • Interpretability: Logistic Regression, Decision Tree
  • High accuracy: Random Forest, Gradient Boosting, SVM
  • Large dataset: Logistic Regression, Neural Network
  • Small dataset: LDA, Naive Bayes

Choose Unsupervised Method

  • Dimensionality reduction: PCA
  • Clustering (spherical): K-Means
  • Clustering (arbitrary shape): Hierarchical, DBSCAN
  • Visualization: PCA + scatter plot

📝 Practice Exercises

In-Chapter Exercises

Each chapter includes 5-8 practice exercises covering:

  • Conceptual: Understanding theory
  • Applied: Using methods on new datasets
  • Computational: Implementing from scratch
  • Analysis: Interpreting results

Additional Practice

PRACTICE_EXERCISES.ipynb includes 100+ extra problems:

  • 3 additional problems per chapter (39 total)
  • 8 comprehensive projects
  • 5 challenge problems
  • Solutions and hints
  • Progress tracker

Difficulty Levels:

  • Beginner: Chapters 1-5 (25 exercises)
  • Intermediate: Chapters 6-10 (25 exercises)
  • Advanced: Chapters 11-13 + Projects (50+ exercises)

Recommended Approach:

  1. Complete in-chapter exercises first
  2. Attempt additional exercises in PRACTICE_EXERCISES.ipynb
  3. Work on 2-3 projects
  4. Try 1 challenge problem
  5. Participate in a Kaggle competition

🀝 Contributing

Contributions welcome! Areas for improvement:

  • Additional datasets
  • More exercises
  • Alternative implementations
  • Error corrections
  • Clarifications
  • Extended examples

Process:

  1. Fork repository
  2. Create feature branch
  3. Make changes
  4. Test notebooks (run all cells)
  5. Submit pull request

📚 Additional Resources

Books

  • ISLR (original): James, Witten, Hastie, Tibshirani
  • ESL: Elements of Statistical Learning (advanced)
  • Python Data Science Handbook: Jake VanderPlas
  • Hands-On Machine Learning: Aurélien Géron

Online Courses

  • Stanford CS229 (Machine Learning)
  • Fast.ai (Practical Deep Learning)
  • Coursera Machine Learning Specialization

Documentation


⚖️ License

MIT License - feel free to use for learning and teaching.


📧 Contact

Repository: PavanMudigonda/aiml

Issues: Report bugs or suggest improvements via GitHub Issues


🎉 Acknowledgments

  • Based on “An Introduction to Statistical Learning” by James, Witten, Hastie, and Tibshirani
  • scikit-learn team for excellent ML library
  • Python community for data science ecosystem
  • Contributors and users of this repository

📈 Progress Tracker

Track your learning progress:

Core Chapters

  • Chapter 1: Introduction
  • Chapter 2: Statistical Learning
  • Chapter 3: Linear Regression
  • Chapter 4: Classification
  • Chapter 5: Resampling Methods
  • Chapter 6: Regularization
  • Chapter 7: Beyond Linearity
  • Chapter 8: Tree-Based Methods
  • Chapter 9: Support Vector Machines
  • Chapter 10: Deep Learning
  • Chapter 11: Survival Analysis
  • Chapter 12: Unsupervised Learning
  • Chapter 13: Multiple Testing

Core Progress: 0/13 chapters

Additional Practice

  • Complete all in-chapter exercises (60+ problems)
  • Complete additional exercises from PRACTICE_EXERCISES.ipynb (39 problems)
  • Complete 2-3 projects (8 available)
  • Complete 1+ challenge problem (5 available)
  • Participate in a Kaggle competition

Overall Mastery: Track your journey to becoming an ISLP expert!


🏆 Learning Goals

After completing this series, you will be able to:

✅ Understand fundamental statistical learning concepts
✅ Implement regression and classification models
✅ Apply regularization techniques
✅ Use ensemble methods effectively
✅ Work with neural networks
✅ Perform survival analysis
✅ Apply unsupervised learning methods
✅ Handle multiple testing problems
✅ Choose appropriate methods for different problems
✅ Interpret and validate model results
✅ Communicate findings effectively

What Comes Next

Happy Learning! 🚀
