Inference Optimization & Model Serving
Status: This phase is intentionally published as a work in progress. Today, the reliable starting material is
00_START_HERE.ipynb and 03_serving_with_vllm.ipynb. The remaining planned notebooks are listed below so learners can see the intended roadmap.
🎯 Learning Objectives
- Understand the memory and compute bottlenecks of LLM inference (Memory Wall vs Compute Wall).
- Master PagedAttention and KV Cache management.
- Apply post-training quantization techniques (AWQ, GPTQ, EXL2).
- Deploy models using high-throughput serving engines like vLLM and TensorRT-LLM.
- Implement advanced decoding strategies like Speculative Decoding to reduce latency.
- Understand prefix caching, chunked prefill, and continuous batching.
- Compare local serving engines such as vLLM, TensorRT-LLM, and SGLang.
- Measure throughput, TTFT, decode speed, and cost-per-token trade-offs.
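The last objective can be made concrete with a small amount of arithmetic. The sketch below is illustrative only: the `RequestTiming` class and the helper names are hypothetical and do not belong to vLLM, TensorRT-LLM, or SGLang. It assumes you have recorded three wall-clock timestamps per request plus the number of generated tokens, and that cost is approximated as GPU rental price divided by aggregate throughput.

```python
from dataclasses import dataclass


@dataclass
class RequestTiming:
    """Hypothetical record of one request: timestamps in seconds, plus output length."""
    sent_at: float
    first_token_at: float
    finished_at: float
    output_tokens: int


def ttft(t: RequestTiming) -> float:
    """Time to first token: the delay before the first output token arrives."""
    return t.first_token_at - t.sent_at


def decode_tokens_per_sec(t: RequestTiming) -> float:
    """Decode speed: tokens produced after the first one, per second of decode time."""
    decode_time = t.finished_at - t.first_token_at
    if t.output_tokens <= 1 or decode_time <= 0:
        return 0.0
    return (t.output_tokens - 1) / decode_time


def cost_per_million_tokens(gpu_dollars_per_hour: float,
                            aggregate_tokens_per_sec: float) -> float:
    """Rough cost estimate, assuming one GPU fully busy at the given aggregate throughput."""
    tokens_per_hour = aggregate_tokens_per_sec * 3600
    return gpu_dollars_per_hour / tokens_per_hour * 1_000_000


if __name__ == "__main__":
    timing = RequestTiming(sent_at=0.0, first_token_at=0.25,
                           finished_at=2.25, output_tokens=101)
    print(f"TTFT: {ttft(timing):.2f} s")
    print(f"Decode speed: {decode_tokens_per_sec(timing):.1f} tokens/s")
    print(f"Cost: ${cost_per_million_tokens(3.6, 1000):.2f} per 1M tokens")
```

Separating TTFT from decode speed matters because the two are dominated by different phases: prefill for the first token, and memory-bound decoding for the rest.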
⏱️ Time Estimate
- Expected time: 6-8 hours
📚 Prerequisites
- Completion of 14-local-llms
- Completion of 04-token
- Basic understanding of PyTorch devices and CUDA memory.
🛠️ Current and Planned Materials
- Available now: 00_START_HERE.ipynb, 03_serving_with_vllm.ipynb
- Planned next: KV cache, quantization, speculative decoding, and runtime comparison notebooks
- 01_kv_cache_paged_attention.ipynb - Visualizing and managing the KV cache.
- 02_quantization_deep_dive.ipynb - Quantizing a Llama-3 model from FP16 to INT4 using AWQ.
- 03_serving_with_vllm.ipynb - Quickstart notebook for vLLM-based serving and batching.
- 04_speculative_decoding.ipynb - Speeding up inference using a small draft model.
- Add a TensorRT-LLM / SGLang comparison walkthrough.
- Add a prefix caching and chunked prefill tuning walkthrough.
How To Use This Phase Right Now
If you are studying today:
- Use this phase as an introduction, not a complete mastery path.
- Finish 03_serving_with_vllm.ipynb and pair it with 14-local-llms/ and 09-mlops/.
- Return later for deeper optimization once the planned notebooks land.
2026 Topics This Phase Should Cover
- PagedAttention and KV cache layout
- Prefix caching and reuse across repeated prompts
- Chunked prefill, continuous batching, and scheduler behavior
- Quantization stacks: AWQ, GPTQ, EXL2, GGUF, FP8 where available
- Serving runtimes: vLLM, TensorRT-LLM, SGLang, TGI
- Speculative decoding and draft-model assisted generation
- Throughput metrics: TTFT, tokens/sec, concurrency saturation, memory footprint
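For the KV cache and memory footprint topics, a back-of-the-envelope estimate is a useful starting point. The sketch below uses the standard sizing rule: each token stores one key and one value vector per layer per KV head. The function names are hypothetical, the model dimensions are illustrative values for a roughly 8B-parameter model with grouped-query attention, and the block size of 16 tokens is an assumption about a paged allocator, not a setting read from any particular engine.

```python
import math


def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, bytes_per_element: int = 2) -> int:
    """KV cache bytes for one token; the leading 2 covers the key and the value."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element


def blocks_needed(num_tokens: int, block_size: int = 16) -> int:
    """Fixed-size blocks a paged allocator would need for a sequence (assumed block size)."""
    return math.ceil(num_tokens / block_size)


if __name__ == "__main__":
    # Illustrative dimensions: 32 layers, 8 KV heads, head_dim 128, FP16 (2 bytes).
    per_token = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
    context_len = 4096
    total_mib = per_token * context_len / 2**20
    print(f"Per token: {per_token / 1024:.0f} KiB")
    print(f"{context_len}-token sequence: {total_mib:.0f} MiB")
    print(f"Blocks for a 100-token sequence: {blocks_needed(100)}")
```

With paging, a 100-token sequence in this sketch occupies 7 blocks (112 token slots), so at most one partially filled block is wasted per sequence instead of a full maximum-length reservation.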
📖 Resources
- vLLM Documentation
- HuggingFace Quantization Guide
- PagedAttention Paper
- TensorRT-LLM Documentation
- SGLang Documentation
What Comes Next
- Continue to ../09-mlops/README.md if you want deployment, monitoring, and production rollout discipline around serving systems.
- Continue to ../14-local-llms/README.md if you want broader local-model workflows before going deeper on optimization.
- Continue to ../29-ai-hardware-llm-validation/README.md if your interest shifts toward systems validation, benchmarking, and infrastructure constraints.