Inference Optimization & Model Serving

Status: This phase is intentionally published as a work in progress. Today, the reliable starting material is 00_START_HERE.ipynb and 03_serving_with_vllm.ipynb. The remaining planned notebooks are listed below so learners can see the intended roadmap.

🎯 Learning Objectives

Understand the memory and compute bottlenecks of LLM inference (Memory Wall vs Compute Wall).
Master PagedAttention and KV Cache management.
Apply post-training quantization techniques (AWQ, GPTQ, EXL2).
Deploy models using high-throughput serving engines like vLLM and TensorRT-LLM.
Implement advanced decoding strategies like Speculative Decoding to reduce latency.
Understand prefix caching, chunked prefill, and continuous batching.
Compare local serving engines such as vLLM, TensorRT-LLM, and SGLang.
Measure throughput, time to first token (TTFT), decode speed, and cost-per-token trade-offs.
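To make the memory-bottleneck and KV cache objectives above more concrete, here is a minimal sketch of the usual back-of-the-envelope KV cache estimate: two tensors (keys and values) per layer, per KV head, per token. The function name `kv_cache_bytes` is hypothetical, and the model shape (32 layers, 8 KV heads, head dimension 128, FP16) is an assumed Llama-3-8B-like configuration that you should check against the actual model config.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Estimate KV cache size in bytes (hypothetical helper, not a library API)."""
    # Factor of 2 covers one key tensor and one value tensor per layer.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len * batch_size


# Assumed Llama-3-8B-like shape in FP16 (2 bytes per element).
per_token = kv_cache_bytes(32, 8, 128, seq_len=1)
full = kv_cache_bytes(32, 8, 128, seq_len=8192, batch_size=16)

print(per_token // 1024, "KiB per token")        # 128 KiB per token
print(full / 2**30, "GiB for 16 x 8192 tokens")  # 16.0 GiB for 16 x 8192 tokens
```

Under these assumptions the cache for 16 concurrent 8K-token sequences is about as large as the FP16 weights of an 8B-parameter model, which is one reason paged allocation and cache reuse appear so early in this phase.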

⏱️ Time Estimate

Expected time: 6-8 hours

📚 Prerequisites

Completion of 14-local-llms
Completion of 04-token
Basic understanding of PyTorch devices and CUDA memory.

🛠️ Current and Planned Materials

Available now: 00_START_HERE.ipynb, 03_serving_with_vllm.ipynb
Planned next: KV cache, quantization, speculative decoding, and runtime comparison notebooks
01_kv_cache_paged_attention.ipynb - Visualizing and managing the KV cache.
02_quantization_deep_dive.ipynb - Quantizing a Llama-3 model from FP16 to INT4 using AWQ.
03_serving_with_vllm.ipynb (available now) - Quickstart notebook for vLLM-based serving and batching.
04_speculative_decoding.ipynb - Speeding up inference using a small draft model.
Planned: a TensorRT-LLM / SGLang comparison walkthrough.
Planned: a prefix caching and chunked prefill tuning walkthrough.
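As a rough preview of what the FP16-to-INT4 step in the quantization notebook buys, the sketch below estimates weight memory at different bit widths. It is an approximation under stated assumptions, not the AWQ storage format: `weight_bytes` is a hypothetical helper, the 8-billion parameter count is a round illustrative figure, and the per-group metadata cost (32 bits for a scale and zero-point per group of 128 weights) is an assumed value.

```python
def weight_bytes(num_params, bits_per_weight, group_size=None,
                 meta_bits_per_group=32):
    """Approximate weight memory in bytes (hypothetical helper)."""
    total_bits = num_params * bits_per_weight
    if group_size:
        # Assumed overhead: one scale + zero-point record per weight group.
        total_bits += (num_params // group_size) * meta_bits_per_group
    return total_bits / 8


params = 8_000_000_000  # illustrative 8B-parameter model

fp16 = weight_bytes(params, 16)
int4 = weight_bytes(params, 4, group_size=128)

print(fp16 / 1e9, "GB in FP16")            # 16.0 GB in FP16
print(int4 / 1e9, "GB in grouped INT4")    # 4.25 GB in grouped INT4
```

The group metadata raises the effective width from 4 to about 4.25 bits per weight in this sketch. Real formats differ in how they pack scales and zero-points, so treat the numbers as an order-of-magnitude guide.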

How To Use This Phase Right Now

If you are studying today:

Use this phase as an introduction, not a complete mastery path.
Finish 03_serving_with_vllm.ipynb and pair it with 14-local-llms/ and 09-mlops/.
Return later for deeper optimization once the planned notebooks land.

2026 Topics This Phase Should Cover

PagedAttention and KV cache layout
Prefix caching and reuse across repeated prompts
Chunked prefill, continuous batching, and scheduler behavior
Quantization stacks: AWQ, GPTQ, EXL2, GGUF, FP8 where available
Serving runtimes: vLLM, TensorRT-LLM, SGLang, TGI
Speculative decoding and draft-model assisted generation
Performance metrics: TTFT, tokens/sec, concurrency saturation, memory footprint
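The latency and throughput metrics in the list above can be derived from per-token arrival timestamps. The following is a minimal sketch using synthetic timestamps; `summarize_request` and its output keys are hypothetical names, and in practice the timestamps would come from a streaming client that records when each token arrives.

```python
def summarize_request(request_start, token_times):
    """Derive TTFT and decode speed from token arrival times in seconds."""
    if not token_times:
        raise ValueError("need at least one generated token")
    # TTFT: delay from sending the request to receiving the first token.
    ttft = token_times[0] - request_start
    # Decode speed: tokens after the first, over the time they took to arrive.
    decode_tokens = len(token_times) - 1
    decode_time = token_times[-1] - token_times[0]
    decode_tps = decode_tokens / decode_time if decode_time > 0 else float("nan")
    return {
        "ttft_s": ttft,
        "decode_tok_per_s": decode_tps,
        "e2e_s": token_times[-1] - request_start,
    }


# Synthetic example: first token at 0.5 s, then one token every 0.25 s.
stats = summarize_request(0.0, [0.5, 0.75, 1.0, 1.25, 1.5])
print(stats)  # {'ttft_s': 0.5, 'decode_tok_per_s': 4.0, 'e2e_s': 1.5}
```

Separating TTFT from decode speed matters because they stress different parts of the system: TTFT is dominated by prefill and queueing, while decode speed reflects steady-state generation under the current batch load.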

📖 Resources

What Comes Next

Continue to ../09-mlops/README.md if you want deployment, monitoring, and production rollout discipline around serving systems.
Continue to ../14-local-llms/README.md if you want broader local-model workflows before going deeper on optimization.
Continue to ../29-ai-hardware-llm-validation/README.md if your interest shifts toward systems validation, benchmarking, and infrastructure constraints.

Last updated on May 24, 2026
