
Inference Optimization & Model Serving

Status: This phase is intentionally published as a work in progress. Today, the reliable starting material is 00_START_HERE.ipynb and 03_serving_with_vllm.ipynb. The remaining planned notebooks are listed below so learners can see the intended roadmap.

🎯 Learning Objectives

  • Understand the memory and compute bottlenecks of LLM inference (Memory Wall vs Compute Wall).
  • Master PagedAttention and KV Cache management.
  • Apply post-training quantization techniques (AWQ, GPTQ, EXL2).
  • Deploy models using high-throughput serving engines like vLLM and TensorRT-LLM.
  • Implement advanced decoding strategies like Speculative Decoding to reduce latency.
  • Understand prefix caching, chunked prefill, and continuous batching.
  • Compare local serving engines such as vLLM, TensorRT-LLM, and SGLang.
  • Measure throughput, time to first token (TTFT), decode speed, and cost-per-token trade-offs.
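
The memory side of the first objective can be made concrete with back-of-the-envelope arithmetic. The sketch below is illustrative and is not taken from the phase notebooks. The function name kv_cache_bytes is hypothetical, and the configuration values (32 layers, 8 KV heads, head dimension 128, FP16 storage) are assumptions chosen to approximate a Llama-3-8B-class model.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len, batch_size=1, bytes_per_elem=2):
    """Estimate KV cache size in bytes for a decoder-only transformer.

    The leading factor of 2 accounts for storing both keys and values.
    bytes_per_elem=2 corresponds to FP16/BF16 storage.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Assumed config, roughly a Llama-3-8B-class model (illustrative only).
per_token = kv_cache_bytes(32, 8, 128, seq_len=1)
full_batch = kv_cache_bytes(32, 8, 128, seq_len=8192, batch_size=16)

print(f"{per_token / 1024:.0f} KiB of KV cache per token")
print(f"{full_batch / 2**30:.0f} GiB for 16 sequences of 8192 tokens")
```

Under these assumptions the cache grows linearly with both sequence length and batch size, which is why KV cache management, rather than weight storage alone, often limits how many requests a GPU can serve concurrently.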

⏱️ Time Estimate

  • Expected time: 6-8 hours

📚 Prerequisites

  • Completion of 14-local-llms
  • Completion of 04-token
  • Basic understanding of PyTorch devices and CUDA memory.

🛠️ Current and Planned Materials

  • Available now: 00_START_HERE.ipynb, 03_serving_with_vllm.ipynb

  • Planned next: KV cache, quantization, speculative decoding, and runtime comparison notebooks

  • 01_kv_cache_paged_attention.ipynb (planned) - Visualizing and managing the KV cache.

  • 02_quantization_deep_dive.ipynb (planned) - Quantizing a Llama-3 model from FP16 to INT4 using AWQ.

  • 03_serving_with_vllm.ipynb (available now) - Quickstart notebook for vLLM-based serving and batching.

  • 04_speculative_decoding.ipynb (planned) - Speeding up inference using a small draft model.

  • Planned walkthrough: TensorRT-LLM / SGLang comparison.

  • Planned walkthrough: prefix caching and chunked prefill tuning.
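
Until the KV cache notebook lands, the core bookkeeping idea behind paged KV caching can be previewed with a toy allocator. This is a minimal sketch under stated assumptions, not vLLM's implementation and not the content of any notebook in this phase. The class ToyBlockAllocator and its method names are hypothetical. It tracks only which fixed-size physical blocks belong to which sequence and stores no tensors.

```python
class ToyBlockAllocator:
    """Toy paged KV cache bookkeeping: fixed-size blocks handed out on demand."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # physical block ids not in use
        self.block_tables = {}  # sequence id -> list of physical block ids
        self.seq_lens = {}      # sequence id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve space for one more token; return the physical block it lands in."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        # Allocate only when the sequence has no block yet or its last block is full.
        if length == len(table) * self.block_size:
            if not self.free_blocks:
                raise MemoryError("no free KV blocks (a real scheduler would queue or preempt)")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1
        return table[-1]

    def free(self, seq_id):
        """Return all blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


alloc = ToyBlockAllocator(num_blocks=4, block_size=4)
for _ in range(5):  # 5 tokens need 2 blocks of size 4
    alloc.append_token("req-a")
print(len(alloc.block_tables["req-a"]), "blocks in use,", len(alloc.free_blocks), "free")
alloc.free("req-a")
```

Because blocks are allocated one at a time as a sequence grows, memory is not reserved up front for the maximum possible length, and the blocks of a finished request can be reused immediately by another.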

How To Use This Phase Right Now

If you are studying today:

  1. Use this phase as an introduction, not a complete mastery path.
  2. Finish 03_serving_with_vllm.ipynb and pair it with 14-local-llms/ and 09-mlops/.
  3. Return later for deeper optimization once the planned notebooks land.

2026 Topics This Phase Should Cover

  • PagedAttention and KV cache layout
  • Prefix caching and reuse across repeated prompts
  • Chunked prefill, continuous batching, and scheduler behavior
  • Quantization stacks: AWQ, GPTQ, EXL2, GGUF, FP8 where available
  • Serving runtimes: vLLM, TensorRT-LLM, SGLang, TGI
  • Speculative decoding and draft-model assisted generation
  • Performance metrics: TTFT, tokens/sec, concurrency saturation, memory footprint
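
The metrics in the last item can be computed from simple timestamps. The following is a sketch only: the helper names latency_metrics and cost_per_million_tokens are hypothetical, the timestamps are made-up example values, and the hourly GPU price is an assumed input rather than a quoted rate.

```python
def latency_metrics(request_start, token_times):
    """Compute TTFT and decode speed from per-token arrival timestamps (seconds)."""
    if not token_times:
        raise ValueError("need at least one generated token")
    ttft = token_times[0] - request_start            # time to first token
    decode_window = token_times[-1] - token_times[0]  # time spent after the first token
    if decode_window > 0:
        decode_tps = (len(token_times) - 1) / decode_window
    else:
        decode_tps = float("nan")  # a single token gives no decode rate
    return {"ttft_s": ttft, "decode_tokens_per_s": decode_tps}


def cost_per_million_tokens(gpu_cost_per_hour, tokens_per_second):
    """Convert an hourly hardware cost and sustained throughput to cost per 1M tokens."""
    return gpu_cost_per_hour / (tokens_per_second * 3600) * 1_000_000


# Made-up timestamps: request sent at t=10.0 s, five tokens streamed back.
metrics = latency_metrics(10.0, [10.5, 10.75, 11.0, 11.25, 11.5])
print(metrics)
print(cost_per_million_tokens(gpu_cost_per_hour=2.0, tokens_per_second=1000.0))
```

Measuring TTFT separately from decode speed matters because the two are dominated by different phases (prefill and decode) and respond differently to batching and scheduler settings.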

📖 Resources

What Comes Next
