Local LLMs

This module should help you answer a practical question: when does running models locally make sense, and what trade-offs do you accept in exchange for privacy, cost control, and deployment flexibility?

Actual Module Contents

Recommended Order

Start with Ollama and the model overview
Then build a local RAG workflow
Then study serving and API patterns
Finish with speculative decoding and performance considerations

What To Learn Here

The difference between hosted APIs and local inference
How quantization and model size affect usability
What Ollama is good at and where it is limiting
How to expose a local model behind an API
Why latency and throughput tuning matter once a prototype works

Current Local LLM Stack To Know In 2026

Ollama for the simplest developer experience
llama.cpp and GGUF for broad hardware compatibility
MLX for Apple Silicon-native training and inference
vLLM and SGLang for higher-throughput serving on stronger local GPUs
OpenAI-compatible local gateways for app portability across hosted and self-hosted backends
AI Toolkit for VS Code for model browsing, local playground, fine-tuning (QLoRA), and evaluation - all inside the editor

Study Advice

Keep the first pass practical: install one tool, run one model, ship one API.
Do not optimize before measuring.
Compare local quality against your hosted baseline before committing to an on-device stack.

Good Follow-On Projects

A private document assistant
A local coding helper with retrieval
A lightweight OpenAI-compatible local serving layer
A Mac-first MLX workflow for Apple Silicon laptops
A benchmark that compares Ollama, llama.cpp, and vLLM on the same model

What Comes Next

Continue to ../30-inference-optimization/README.md for serving and performance tuning concepts.
Continue to ../09-mlops/README.md if you want deployment and monitoring discipline.
Continue to ../15-ai-agents/README.md if you want local tool-using systems on top of open models.

Last updated on May 24, 2026

13 Multimodal 01 Start Here