Introduction to Statistical Learning with Python (ISLP)
A comprehensive collection of Jupyter notebooks covering the foundational concepts and advanced techniques in statistical learning and machine learning, based on "An Introduction to Statistical Learning" and adapted for Python.
Use this folder when you want a statistics-first view of machine learning rather than a purely optimization-first or deep-learning-first path. It is especially useful for classical modeling, validation, interpretability, and statistical decision making.
Overview
This series provides hands-on implementations of statistical learning methods with Python, featuring:
- 13 comprehensive chapters covering fundamental to advanced topics
- Theory + Practice: Mathematical formulations with executable code
- Real datasets: Practical examples using classic ML datasets
- Visualizations: Professional plots for understanding concepts
- Exercises: Practice problems for each chapter
- 100+ additional exercises: See PRACTICE_EXERCISES.ipynb
Chapter Guide
Chapter 1: Introduction
File: 01_introduction.ipynb (NEW!)
Topics:
- What is statistical learning?
- Supervised vs unsupervised learning
- Real-world applications
- The ML workflow
- Model assessment metrics
- Overfitting vs underfitting
Demonstrations:
- California Housing (regression example)
- Breast Cancer (classification example)
- Iris Clustering (unsupervised example)
- Train-test split and overfitting visualization
- Comprehensive metrics comparison
Key Concepts:
- The learning framework: Y = f(X) + ε
- Bias-variance trade-off
- Train-test split importance
- Regression vs classification metrics
- Supervised vs unsupervised paradigms
Practice: 8 comprehensive exercises covering all intro concepts
Chapter 2: Statistical Learning
File: 02_statistical_learning.ipynb (25KB)
Topics:
- Supervised vs unsupervised learning
- Regression vs classification
- Bias-variance trade-off
- Training vs test error
- Model assessment and selection
Key Concepts:
- Reducible vs irreducible error
- Overfitting and underfitting
- Cross-validation basics
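The gap between training and test error can be sketched with a small experiment. The example below is illustrative only: the true function sin(2πx), the noise level, the sample sizes, and the polynomial degrees are all assumptions, not taken from the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Draw y = f(x) + eps with f(x) = sin(2*pi*x); f is an assumption for illustration."""
    x = rng.uniform(0.0, 1.0, size=n)
    y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=n)
    return x, y

def mse(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))

x_train, y_train = make_data(40)
x_test, y_test = make_data(1000)

# results[degree] = (training MSE, test MSE)
results = {}
for degree in (1, 3, 9):
    coefs = np.polyfit(x_train, y_train, deg=degree)  # least squares polynomial fit
    results[degree] = (
        mse(y_train, np.polyval(coefs, x_train)),
        mse(y_test, np.polyval(coefs, x_test)),
    )

for degree, (train_mse, test_mse) in results.items():
    print(f"degree={degree}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")
```

Training error can only go down as the degree grows, because each model nests the previous one. Test error has no such guarantee, which is why it is the quantity to use for model selection.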
Chapter 3: Linear Regression
File: 03_linear_regression.ipynb (40KB)
Topics:
- Simple linear regression
- Multiple linear regression
- Least squares estimation
- Hypothesis testing (t-tests, F-tests)
- R² and adjusted R²
- Residual analysis
Demonstrations:
- Boston Housing dataset
- Advertising dataset
- Confidence vs prediction intervals
- Diagnostic plots
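A minimal sketch of least squares estimation, solving the normal equations β̂ = (X'X)⁻¹X'y and then computing RSS and R² = 1 - RSS/TSS. The data are synthetic and the coefficient values are assumptions chosen for illustration; the notebook itself may use a library fit instead.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data (assumed): 100 observations, 2 predictors, known coefficients
n = 100
X_raw = rng.normal(size=(n, 2))
true_beta = np.array([1.0, 2.0, -3.0])        # intercept, slope 1, slope 2
X = np.column_stack([np.ones(n), X_raw])      # prepend the intercept column
y = X @ true_beta + rng.normal(scale=0.5, size=n)

# beta_hat = (X'X)^(-1) X'y, solved as a linear system instead of forming the inverse
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

y_hat = X @ beta_hat
rss = float(np.sum((y - y_hat) ** 2))         # residual sum of squares
tss = float(np.sum((y - y.mean()) ** 2))      # total sum of squares
r2 = 1.0 - rss / tss

print("beta_hat:", np.round(beta_hat, 3))
print("R^2:", round(r2, 3))
```

Solving the system with `np.linalg.solve` avoids an explicit matrix inverse, which is usually the more stable choice numerically.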
Key Formulas:
β̂ = (X'X)⁻¹X'y
RSS = Σ(yi - ŷi)²
R² = 1 - RSS/TSS
Chapter 4: Classification
File: 04_classification.ipynb (32KB)
Topics:
- Logistic regression
- Linear Discriminant Analysis (LDA)
- Quadratic Discriminant Analysis (QDA)
- Naive Bayes
- K-Nearest Neighbors (KNN)
Demonstrations:
- Binary classification (Default dataset)
- Multi-class classification (Iris)
- Decision boundaries
- Confusion matrices
- ROC curves and AUC
Metrics:
- Accuracy, precision, recall, F1-score
- Sensitivity and specificity
- Classification error rate
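The metrics listed above all derive from the four cells of a binary confusion matrix. The sketch below uses made-up labels and predictions (an assumption for illustration) and computes each metric by hand.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels and predictions (1 = positive class)
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

# For binary labels {0, 1}, ravel() yields tn, fp, fn, tp in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
error_rate = 1 - accuracy
precision = tp / (tp + fp)
sensitivity = tp / (tp + fn)      # also called recall or true positive rate
specificity = tn / (tn + fp)      # true negative rate
f1 = 2 * precision * sensitivity / (precision + sensitivity)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"sensitivity={sensitivity:.3f} specificity={specificity:.3f} f1={f1:.3f}")
```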
Chapter 5: Resampling Methods
File: 05_resampling_methods.ipynb (37KB)
Topics:
- Cross-validation (LOOCV, k-fold)
- Bootstrap
- Model selection
- Uncertainty estimation
Demonstrations:
- k-fold CV for polynomial degree selection
- LOOCV vs k-fold comparison
- Bootstrap confidence intervals
- Bootstrap standard errors
Applications:
- Estimating test error
- Model comparison
- Parameter uncertainty
- Sample size effects
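As a rough sketch of bootstrap uncertainty estimation, the example below resamples a hypothetical data set with replacement to estimate the standard error of the sample mean and a percentile confidence interval. The exponential data and the number of resamples B are assumptions. The mean is chosen because its analytic standard error is known, which gives something to compare against.

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.exponential(scale=2.0, size=200)   # hypothetical observed data

B = 2000                                        # number of bootstrap resamples (assumed)
boot_means = np.empty(B)
for b in range(B):
    resample = rng.choice(sample, size=sample.size, replace=True)
    boot_means[b] = resample.mean()

boot_se = boot_means.std(ddof=1)                           # bootstrap standard error
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])   # 95% percentile interval
analytic_se = sample.std(ddof=1) / np.sqrt(sample.size)    # textbook formula for the mean

print(f"bootstrap SE={boot_se:.4f}  analytic SE={analytic_se:.4f}")
print(f"95% percentile CI=({ci_low:.3f}, {ci_high:.3f})")
```

The same loop applies to statistics with no closed-form standard error, such as a median or a regression coefficient: only the line that computes the statistic changes.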
Chapter 6: Linear Model Selection and Regularization
File: 06_regularization.ipynb (46KB)
Topics:
- Subset selection (best subset, forward, backward)
- Ridge regression (L2 penalty)
- Lasso regression (L1 penalty)
- Elastic Net
- Principal Component Regression (PCR)
Key Concepts:
- Regularization path
- Cross-validation for λ selection
- Feature selection vs shrinkage
- Multicollinearity handling
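The contrast between shrinkage and feature selection can be sketched by fitting ridge and lasso with a cross-validated penalty. The synthetic design below (20 predictors, only three with nonzero coefficients) and the penalty grid are assumptions for illustration. Note that scikit-learn names the penalty weight `alpha` rather than λ.

```python
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Synthetic data (assumed): 20 predictors, only the first 3 affect y
n, p = 200, 20
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:3] = [3.0, -2.0, 1.5]
y = X @ beta + rng.normal(scale=1.0, size=n)

alphas = np.logspace(-3, 2, 30)   # candidate penalty weights

# Standardize first so the penalty treats all predictors on the same scale
ridge = make_pipeline(StandardScaler(), RidgeCV(alphas=alphas)).fit(X, y)
lasso = make_pipeline(StandardScaler(), LassoCV(alphas=alphas, cv=5)).fit(X, y)

ridge_coefs = ridge[-1].coef_
lasso_coefs = lasso[-1].coef_
n_zero_ridge = int(np.sum(ridge_coefs == 0))
n_zero_lasso = int(np.sum(lasso_coefs == 0))

print("chosen alpha (ridge, lasso):", ridge[-1].alpha_, lasso[-1].alpha_)
print("exactly-zero coefficients (ridge, lasso):", n_zero_ridge, n_zero_lasso)
```

Ridge shrinks coefficients toward zero without setting them exactly to zero, while lasso can zero some out entirely. How many it removes depends on the data and on the penalty chosen by cross-validation.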
Formulas:
Ridge: minimize RSS + λΣβj²
Lasso: minimize RSS + λΣ|βj|
Elastic Net: minimize RSS + λ₁Σ|βj| + λ₂Σβj²
Chapter 7: Moving Beyond Linearity
File: 07_nonlinearity.ipynb (35KB)
Topics:
- Polynomial regression
- Step functions
- Regression splines (B-splines)
- Smoothing splines
- Generalized Additive Models (GAMs)
Demonstrations:
- Polynomial degree selection via CV
- Spline knot placement
- GAMs with multiple predictors
- Method comparison
Use Cases:
- Non-linear relationships
- Flexible modeling
- Interpretable non-linearity
- Smooth curve fitting
Chapter 8: Tree-Based Methods
File: 08_tree_methods.ipynb (40-45KB)
Topics:
- Decision trees (CART)
- Bagging (Bootstrap Aggregation)
- Random Forests
- Boosting (AdaBoost, Gradient Boosting)
- Feature importance
Demonstrations:
- Tree pruning and depth control
- Out-of-bag (OOB) error
- Feature importance visualization
- Ensemble comparison
- Breast Cancer dataset
Key Algorithms:
- DecisionTreeClassifier/Regressor
- BaggingClassifier
- RandomForestClassifier
- AdaBoostClassifier
- GradientBoostingClassifier
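A short sketch of out-of-bag error and feature importance with a random forest on the Breast Cancer dataset bundled with scikit-learn. The hyperparameters (300 trees, an 80/20 split, seed 42) are assumptions and not necessarily what the notebook uses.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42, stratify=data.target
)

# oob_score=True scores each sample using only the trees that did not see it
forest = RandomForestClassifier(n_estimators=300, oob_score=True, random_state=42)
forest.fit(X_train, y_train)

oob_accuracy = forest.oob_score_
test_accuracy = forest.score(X_test, y_test)

# Rank features by impurity-based importance (importances sum to 1)
ranked = sorted(zip(forest.feature_importances_, data.feature_names), reverse=True)
top5 = [(str(name), round(float(importance), 3)) for importance, name in ranked[:5]]

print(f"OOB accuracy={oob_accuracy:.3f}  test accuracy={test_accuracy:.3f}")
print("top 5 features:", top5)
```

The OOB estimate comes for free from the bootstrap samples, so it can stand in for a separate validation split when data are scarce.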
Chapter 9: Support Vector Machines
File: 09_support_vector_machines.ipynb (40-45KB)
Topics:
- Maximal margin classifier
- Support Vector Classifier (soft margin)
- Kernel methods (Linear, Polynomial, RBF)
- Multi-class SVMs
- Support Vector Regression (SVR)
Demonstrations:
- C parameter tuning (margin control)
- Kernel comparison
- Gamma parameter effects (RBF)
- Hyperparameter grid search
- Breast Cancer classification
Key Concepts:
- Margin maximization
- Support vectors
- Kernel trick
- Slack variables (ξ)
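Tuning C and gamma for an RBF-kernel SVM might look like the sketch below. The grid values are assumptions picked for illustration. Features are standardized inside the pipeline because SVMs are sensitive to feature scale.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

# Small C tolerates more margin violations; large gamma makes the kernel more local
param_grid = {"svc__C": [0.1, 1, 10, 100], "svc__gamma": [0.001, 0.01, 0.1]}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)

best_params = search.best_params_
best_cv_accuracy = search.best_score_
n_support_vectors = int(search.best_estimator_[-1].n_support_.sum())

print("best params:", best_params)
print(f"best CV accuracy={best_cv_accuracy:.3f}  support vectors={n_support_vectors}")
```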
Chapter 10: Deep Learning
File: 10_deep_learning.ipynb (35-40KB)
Topics:
- Neural network fundamentals
- Activation functions (ReLU, sigmoid, tanh, softmax)
- Single-layer vs deep networks
- Backpropagation
- Regularization (L2, dropout, early stopping)
- MLPClassifier and MLPRegressor
Demonstrations:
- California Housing (regression)
- MNIST digit classification
- Hidden layer comparison
- Learning curves
- Regularization effects
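Each layer of a feed-forward network computes z = Wx + b followed by an activation a = σ(z). The NumPy sketch below implements that forward pass for a hypothetical 4-8-8-3 network with random weights. It uses the row-vector convention `a @ W + b`, which is the same computation with W transposed. No training is performed.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def forward(x, layers):
    """Apply z = a @ W + b and then the activation, one layer at a time."""
    a = x
    for W, b, activation in layers:
        a = activation(a @ W + b)
    return a

rng = np.random.default_rng(0)

# Hypothetical architecture: 4 inputs -> 8 hidden -> 8 hidden -> 3 class probabilities
sizes = [4, 8, 8, 3]
activations = [relu, relu, softmax]
layers = [
    (rng.normal(scale=0.5, size=(n_in, n_out)), np.zeros(n_out), act)
    for n_in, n_out, act in zip(sizes[:-1], sizes[1:], activations)
]

batch = rng.normal(size=(5, 4))             # 5 example inputs
probs = forward(batch, layers)
print(probs.round(3))
```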
Architecture:
Input → Hidden₁ → Hidden₂ → ... → Output
Each layer: z = Wx + b, a = σ(z)
Chapter 11: Survival Analysis
File: 11_survival_analysis.ipynb (45-50KB)
Topics:
- Survival functions
- Censoring (right, left, interval)
- Kaplan-Meier estimator
- Log-Rank test
- Cox Proportional Hazards model
- Hazard ratios
Demonstrations:
- Rossi recidivism dataset
- Survival curves by group
- Median survival time
- Hazard ratio interpretation
- Proportional hazards assumption
Key Library: lifelines
Applications:
- Time-to-event analysis
- Medical studies
- Customer churn
- Equipment failure
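To make the Kaplan-Meier estimator concrete, here is a from-scratch NumPy sketch on a tiny made-up data set with right-censoring. It is meant to show the product-limit formula, not to replace `lifelines.KaplanMeierFitter`, which the notebook relies on. The function name and the data are assumptions.

```python
import numpy as np

def kaplan_meier(durations, event_observed):
    """Return (event_times, survival) where S(t) is the product of (1 - d_i / n_i) over event times t_i <= t."""
    durations = np.asarray(durations, dtype=float)
    event_observed = np.asarray(event_observed, dtype=bool)
    event_times = np.unique(durations[event_observed])
    survival = np.empty(event_times.size)
    s = 1.0
    for i, t in enumerate(event_times):
        n_at_risk = np.sum(durations >= t)                     # still under observation just before t
        d_events = np.sum((durations == t) & event_observed)   # events occurring exactly at t
        s *= 1.0 - d_events / n_at_risk
        survival[i] = s
    return event_times, survival

# Hypothetical data: time in weeks; 1 = event observed, 0 = right-censored
durations = [5, 6, 6, 2, 4, 4, 8, 10]
event_observed = [1, 0, 1, 1, 1, 0, 1, 0]

times, surv = kaplan_meier(durations, event_observed)

# Median survival time: first event time at which S(t) drops to 0.5 or below
median_survival = float(times[np.argmax(surv <= 0.5)]) if np.any(surv <= 0.5) else float("inf")

for t, s in zip(times, surv):
    print(f"t={t:4.0f}  S(t)={s:.3f}")
print("median survival:", median_survival)
```

Censored subjects never trigger a drop in the curve, but they do count in the at-risk set until their censoring time. That is how the estimator uses partial information.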
Chapter 12: Unsupervised Learning
File: 12_unsupervised_learning.ipynb (40-45KB)
Topics:
- Principal Component Analysis (PCA)
- K-Means clustering
- Hierarchical clustering
- DBSCAN
- Dimensionality reduction
Demonstrations:
- Iris dataset (PCA: 4D → 2D)
- Scree plots and variance explained
- Elbow method for K selection
- Silhouette analysis
- Dendrogram visualization
- MNIST digits clustering
Linkage Methods:
- Complete
- Average
- Single
- Ward
Validation:
- Silhouette score
- Davies-Bouldin index
- Inertia (within-cluster SS)
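The sketch below projects Iris from four dimensions to two with PCA and then scores K-Means for several values of K using the silhouette score. The range of K, the standardization step, and the seeds are assumptions for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Standardize so each feature contributes equally to distances and variance
X = StandardScaler().fit_transform(load_iris().data)

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)                            # 4D -> 2D projection
variance_explained = float(pca.explained_variance_ratio_.sum())

# Silhouette score (between -1 and 1, higher is better) for each candidate K
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = float(silhouette_score(X, labels))

best_k = max(scores, key=scores.get)

print(f"variance explained by 2 components: {variance_explained:.3f}")
print("silhouette by K:", {k: round(s, 3) for k, s in scores.items()})
print("best K by silhouette:", best_k)
```

The K with the best silhouette need not match the number of known species. Clustering quality metrics describe geometric separation, not label agreement.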
Chapter 13: Multiple Testing
File: 13_multiple_testing.ipynb (35-40KB)
Topics:
- Multiple testing problem
- Family-Wise Error Rate (FWER)
- False Discovery Rate (FDR)
- Bonferroni correction
- Holm's method
- Benjamini-Hochberg procedure
- Benjamini-Yekutieli procedure
Demonstrations:
- Simulation of Type I error inflation
- FWER vs FDR comparison
- Threshold visualization
- Power analysis
- Method selection guidelines
Key Library: statsmodels.stats.multitest
Applications:
- Genomics (thousands of tests)
- Neuroimaging
- A/B testing
- Clinical trials
Decision Guide:
- Small m (< 20): Bonferroni or Holm
- Large m (≥ 100): Benjamini-Hochberg (FDR)
- Confirmatory studies: FWER control
- Exploratory studies: FDR control
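The difference between FWER and FDR control can be seen on a handful of p-values. The sketch below implements Bonferroni and Benjamini-Hochberg from scratch in NumPy so the decision rules are visible. The p-values are made up, and in practice `statsmodels.stats.multitest.multipletests` would do this work.

```python
import numpy as np

def bonferroni(pvals, alpha=0.05):
    """Reject H0_i when p_i <= alpha / m (controls the FWER)."""
    pvals = np.asarray(pvals, dtype=float)
    return pvals <= alpha / pvals.size

def benjamini_hochberg(pvals, alpha=0.05):
    """Reject the k smallest p-values, where k is the largest rank i with p_(i) <= i * alpha / m (controls the FDR)."""
    pvals = np.asarray(pvals, dtype=float)
    m = pvals.size
    order = np.argsort(pvals)
    below = pvals[order] <= alpha * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])     # zero-based index of the largest passing rank
        reject[order[: k + 1]] = True        # step-up: reject everything up to that rank
    return reject

# Hypothetical p-values from m = 10 tests
pvals = [0.001, 0.008, 0.012, 0.019, 0.028, 0.04, 0.20, 0.45, 0.70, 0.95]

rej_bonf = bonferroni(pvals)
rej_bh = benjamini_hochberg(pvals)

print("Bonferroni rejections:", int(rej_bonf.sum()))
print("Benjamini-Hochberg rejections:", int(rej_bh.sum()))
```

On these values Bonferroni rejects one hypothesis and Benjamini-Hochberg rejects four, showing how FDR control trades a looser error guarantee for more power.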
Getting Started
Prerequisites
Python Version: 3.8+
Required Libraries:
pip install numpy pandas matplotlib seaborn scikit-learn scipy statsmodels lifelines
Or use the requirements file:
pip install -r requirements.txt
Installation
- Clone the repository:
git clone https://github.com/PavanMudigonda/aiml.git
cd aiml/2-maths/islp-book
- Create virtual environment:
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
- Install dependencies:
pip install -r requirements.txt
- Launch Jupyter:
jupyter notebook
Learning Path
Beginner Track (Fundamentals)
- Chapter 1: Introduction → Overview and motivation
- Chapter 2: Statistical Learning → Understand core concepts
- Chapter 3: Linear Regression → First predictive model
- Chapter 4: Classification → Categorical outcomes
- Chapter 5: Resampling → Model validation
Time: 2-4 weeks
Intermediate Track (Advanced Methods)
- Chapter 6: Regularization → Handle complex models
- Chapter 7: Beyond Linearity → Non-linear relationships
- Chapter 8: Tree Methods → Ensemble learning
- Chapter 9: SVMs → Powerful classification
Time: 3-4 weeks
Advanced Track (Specialized Topics)
- Chapter 10: Deep Learning → Neural networks
- Chapter 11: Survival Analysis → Time-to-event
- Chapter 12: Unsupervised → Clustering and PCA
- Chapter 13: Multiple Testing → Statistical inference
Time: 3-4 weeks
Total Program: 8-12 weeks for comprehensive mastery
How to Use These Notebooks
For Self-Study
- Read theory sections (markdown cells) carefully
- Run code cells sequentially (Shift+Enter)
- Modify parameters to see effects
- Complete exercises at the end
- Compare your solutions with demonstrations
For Teaching
- Use as lecture supplements
- Live coding demonstrations
- Student projects and assignments
- Flipped classroom materials
For Reference
- Quick lookup of methods
- Code snippets for projects
- Visualization templates
- Best practices
How To Use This Folder Well
- Work through the beginner and intermediate tracks before treating this as a reference library.
- Focus on model assessment, regularization, and method selection because those ideas transfer far beyond classical ML.
- Pair the notebooks with your own small datasets so the statistical choices feel concrete.
Datasets Used
| Dataset | Used In | Description |
|---|---|---|
| Boston Housing | Ch 3, 6, 7 | House prices with 13 features |
| Advertising | Ch 3 | Sales vs TV/Radio/Newspaper |
| Default | Ch 4 | Credit card default prediction |
| Iris | Ch 1, 4, 12 | 3 species, 4 features |
| Breast Cancer | Ch 1, 8, 9 | Binary classification, 30 features |
| California Housing | Ch 1, 10 | Regression, 8 features |
| MNIST Digits | Ch 10, 12 | 64 features (8×8 pixels) |
| Rossi | Ch 11 | Recidivism survival data |
| Synthetic | Multiple | Generated for demonstrations |
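Most datasets are built into scikit-learn or easily accessible. One exception: load_boston was removed in scikit-learn 1.2, so Boston Housing has to be loaded from an external copy on recent versions.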
Key Libraries Reference
Core
import numpy as np # Numerical computing
import pandas as pd # Data manipulation
import matplotlib.pyplot as plt # Plotting
import seaborn as sns # Statistical visualization
Machine Learning
from sklearn.linear_model import LinearRegression, LogisticRegression, Ridge, Lasso
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, BaggingClassifier
from sklearn.svm import SVC, SVR
from sklearn.neural_network import MLPClassifier, MLPRegressor
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.model_selection import cross_val_score, GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, roc_curve, auc
Statistical
from scipy import stats # Statistical tests
from statsmodels.stats.multitest import multipletests # Multiple testing
from lifelines import KaplanMeierFitter, CoxPHFitter # Survival analysis
Concepts Covered
Fundamental Concepts
- Bias-variance trade-off
- Overfitting and regularization
- Cross-validation
- Feature engineering
- Model selection
Regression Techniques
- Linear regression
- Polynomial regression
- Ridge and Lasso
- Splines and GAMs
- SVR
Classification Methods
- Logistic regression
- LDA/QDA
- Decision trees
- Random forests
- Boosting
- SVMs
- Neural networks
Unsupervised Learning
- PCA
- K-Means
- Hierarchical clustering
- Dimensionality reduction
Statistical Inference
- Hypothesis testing
- Confidence intervals
- Bootstrap
- Multiple testing correction
- Survival analysis
Best Practices
Code Style
- Set random seeds for reproducibility: np.random.seed(42)
- Use train-test splits: train_test_split(test_size=0.2)
- Scale features when needed: StandardScaler()
- Validate with cross-validation
Visualization
- Clear labels and titles
- Appropriate color schemes
- Grid lines for readability
- Legend placement
- Figure size optimization
Model Development
- Explore data (EDA)
- Split data (train/test)
- Preprocess (scaling, encoding)
- Train model
- Validate (cross-validation)
- Tune hyperparameters
- Test (final evaluation)
- Interpret results
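The workflow above can be sketched end to end in a few lines. The dataset, the model, and the hyperparameter grid are assumptions chosen to keep the example short. The point is the ordering: split first, tune with cross-validation on the training set only, and touch the test set once.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Split first so the test set plays no part in preprocessing or tuning
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Preprocessing and model in one pipeline, so scaling is re-fit inside every CV fold
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Validate and tune with 5-fold cross-validation on the training set only
search = GridSearchCV(pipe, {"logisticregression__C": [0.01, 0.1, 1, 10]}, cv=5)
search.fit(X_train, y_train)

cv_accuracy = search.best_score_
test_accuracy = search.score(X_test, y_test)   # single final evaluation

print("best C:", search.best_params_["logisticregression__C"])
print(f"CV accuracy={cv_accuracy:.3f}  test accuracy={test_accuracy:.3f}")
```

Keeping the scaler inside the pipeline is a deliberate choice: fitting it on the full data before splitting would leak test-set statistics into training.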
Quick Method Lookup
Choose Regression Method
- Linear relationships: Linear Regression
- Multicollinearity: Ridge or Lasso
- Feature selection: Lasso
- Non-linear: Polynomial, Splines, GAMs
- Complex patterns: Random Forest, Gradient Boosting
- Small sample: Ridge
Choose Classification Method
- Linear boundary: Logistic Regression, LDA
- Non-linear boundary: QDA, KNN, SVM (RBF)
- Interpretability: Logistic Regression, Decision Tree
- High accuracy: Random Forest, Gradient Boosting, SVM
- Large dataset: Logistic Regression, Neural Network
- Small dataset: LDA, Naive Bayes
Choose Unsupervised Method
- Dimensionality reduction: PCA
- Clustering (spherical): K-Means
- Clustering (arbitrary shape): Hierarchical, DBSCAN
- Visualization: PCA + scatter plot
Practice Exercises
In-Chapter Exercises
Each chapter includes 5-8 practice exercises covering:
- Conceptual: Understanding theory
- Applied: Using methods on new datasets
- Computational: Implementing from scratch
- Analysis: Interpreting results
Additional Practice
PRACTICE_EXERCISES.ipynb includes 100+ extra problems:
- 3 additional problems per chapter (39 total)
- 8 comprehensive projects
- 5 challenge problems
- Solutions and hints
- Progress tracker
Difficulty Levels:
- Beginner: Chapters 1-5 (25 exercises)
- Intermediate: Chapters 6-10 (25 exercises)
- Advanced: Chapters 11-13 + Projects (50+ exercises)
Recommended Approach:
- Complete in-chapter exercises first
- Attempt additional exercises in PRACTICE_EXERCISES.ipynb
- Work on 2-3 projects
- Try 1 challenge problem
- Participate in a Kaggle competition
Contributing
Contributions welcome! Areas for improvement:
- Additional datasets
- More exercises
- Alternative implementations
- Error corrections
- Clarifications
- Extended examples
Process:
- Fork repository
- Create feature branch
- Make changes
- Test notebooks (run all cells)
- Submit pull request
Additional Resources
Books
- ISLR (original): James, Witten, Hastie, Tibshirani
- ESL: Elements of Statistical Learning (advanced)
- Python Data Science Handbook: Jake VanderPlas
- Hands-On Machine Learning: Aurélien Géron
Online Courses
- Stanford CS229 (Machine Learning)
- Fast.ai (Practical Deep Learning)
- Coursera Machine Learning Specialization
Documentation
License
MIT License - feel free to use for learning and teaching.
Contact
Repository: PavanMudigonda/aiml
Issues: Report bugs or suggest improvements via GitHub Issues
Acknowledgments
- Based on "An Introduction to Statistical Learning" by James, Witten, Hastie, and Tibshirani
- The scikit-learn team, for an excellent ML library
- The Python community, for its data science ecosystem
- Contributors and users of this repository
Progress Tracker
Track your learning progress:
Core Chapters
- Chapter 1: Introduction
- Chapter 2: Statistical Learning
- Chapter 3: Linear Regression
- Chapter 4: Classification
- Chapter 5: Resampling Methods
- Chapter 6: Regularization
- Chapter 7: Beyond Linearity
- Chapter 8: Tree-Based Methods
- Chapter 9: Support Vector Machines
- Chapter 10: Deep Learning
- Chapter 11: Survival Analysis
- Chapter 12: Unsupervised Learning
- Chapter 13: Multiple Testing
Core Progress: 0/13 chapters
Additional Practice
- Complete all in-chapter exercises (60+ problems)
- Complete additional exercises from PRACTICE_EXERCISES.ipynb (39 problems)
- Complete 2-3 projects (8 available)
- Complete 1+ challenge problem (5 available)
- Participate in a Kaggle competition
Overall Mastery: Track your journey to becoming an ISLP expert!
Learning Goals
After completing this series, you will be able to:
- Understand fundamental statistical learning concepts
- Implement regression and classification models
- Apply regularization techniques
- Use ensemble methods effectively
- Work with neural networks
- Perform survival analysis
- Apply unsupervised learning methods
- Handle multiple testing problems
- Choose appropriate methods for different problems
- Interpret and validate model results
- Communicate findings effectively
What Comes Next
- Continue to ../mlpp-book/README.md if you want a more explicitly probabilistic extension of similar topics.
- Continue to ../advanced/README.md if you want deeper theory after the statistical foundations click.
- Return to ../../28-practical-data-science/README.md or ../../16-model-evaluation/README.md to apply the ideas more directly.
Happy Learning!