Tokenization
Goal: Understand how text is converted into numbers that language models can process. This is the foundational step before embeddings and neural networks.
Why Tokenization Matters
Every LLM interaction starts with tokenization. When you send a prompt to GPT-4 or Claude, it is first split into tokens - sub-word units that the model actually processes. Understanding tokenization helps you:
- Write better prompts: Avoid patterns that waste tokens and increase cost
- Estimate costs accurately: APIs charge per token, not per word
- Debug model behavior: Some languages tokenize less efficiently than English, so the same content uses more tokens and more of the context window
- Build production systems: Fast tokenization is critical for throughput
Key fact: “Hello” is 1 token in GPT-4's tokenizer, while a longer word such as “tokenization” is split into multiple sub-word tokens. An average English word is ~1.3 tokens.
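The ~1.3 tokens-per-word rule of thumb can be turned into a quick budget estimate. The following is a minimal sketch, not an exact count: the helper names and the price argument are hypothetical, and real counts should come from the model's own tokenizer.

```python
import math

# Rough average for English text (rule of thumb, not an exact figure).
TOKENS_PER_WORD = 1.3

def estimate_tokens(text: str, tokens_per_word: float = TOKENS_PER_WORD) -> int:
    """Estimate the token count from a whitespace word count, rounded up."""
    words = len(text.split())
    return math.ceil(words * tokens_per_word)

def estimate_cost(n_tokens: int, price_per_million_tokens: float) -> float:
    """Estimate cost for n_tokens; the price is a hypothetical input."""
    return n_tokens / 1_000_000 * price_per_million_tokens

prompt = "Understanding tokenization helps you estimate costs accurately"
print(estimate_tokens(prompt))  # 7 words * 1.3, rounded up -> 10
```

The estimate is only useful for rough planning; code, non-English text, and unusual formatting can deviate from it substantially.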
Notebooks - Work in This Order
| # | Notebook | What You Learn | Time |
|---|---|---|---|
| 1 | 03_tokenizers_quickstart.ipynb | HuggingFace tokenizers API, encode/decode, special tokens | 45 min |
| 2 | 12_tiktoken_example.ipynb | OpenAI’s TikToken library, count tokens in prompts | 30 min |
| 3 | 11_sentencepiece_example.ipynb | Google’s SentencePiece (used in T5, LLaMA 1/2) | 30 min |
| 4 | 04_tokenizers_training.ipynb | Train a BPE tokenizer on your own data | 60 min |
| 5 | 05_advanced_training_methods.ipynb | WordPiece, Unigram, and special handling | 45 min |
| 6 | 09_pipeline_components.ipynb | Tokenization as part of the full NLP pipeline | 45 min |
| 7 | 14_token_exploration.ipynb | Hands-on exploration and visualization | 30 min |
| 8 | 13_token_exercises.ipynb | Practice problems with solutions | 45 min |
Key Concepts
The Three Main Tokenization Algorithms
BPE (Byte Pair Encoding) - Used by: GPT-2, GPT-3, GPT-4, RoBERTa
- Starts with individual bytes/characters
- Iteratively merges the most frequent pairs
- Result: common words are single tokens, rare words are split
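The merge loop described above can be sketched in a few lines. This is a simplified, character-level illustration under stated assumptions: production tokenizers such as GPT-2's operate on bytes and apply pre-tokenization rules, and the function names here are hypothetical.

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    counts = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merges from a whitespace-split corpus."""
    # Each word starts as a tuple of single characters.
    words = {tuple(w): f for w, f in Counter(corpus.split()).items()}
    merges = []
    for _ in range(num_merges):
        counts = get_pair_counts(words)
        if not counts:
            break
        best = max(counts, key=counts.get)  # most frequent pair
        merges.append(best)
        words = merge_pair(words, best)
    return merges, words

corpus = "low low low low low lower lower newest newest newest"
merges, words = train_bpe(corpus, 2)
print(merges)  # [('l', 'o'), ('lo', 'w')]
```

After two merges the frequent word "low" is a single symbol, while the rarer "lower" and "newest" are still split, which matches the behavior described in the bullets.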
WordPiece - Used by: BERT, DistilBERT
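- Similar to BPE, but chooses the merge that most increases the likelihood of the training data instead of the most frequent pair
- Words not in the vocabulary are split into sub-word pieces; pieces that continue a word carry a ## prefix (for example, ##ing)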
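At encoding time, BERT-style WordPiece splits a word by greedy longest-match-first lookup against the vocabulary. The sketch below illustrates that lookup only (not training); the function name and the toy vocabulary are made up for this example.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single word."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word continuation
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matches: the whole word maps to the unknown token.
            return [unk_token]
        tokens.append(piece)
        start = end
    return tokens

# Toy vocabulary, hypothetical and far smaller than BERT's 30,522 entries.
toy_vocab = {"token", "##ization", "##ize", "un", "##related", "[UNK]"}

print(wordpiece_tokenize("tokenization", toy_vocab))  # ['token', '##ization']
print(wordpiece_tokenize("xyz", toy_vocab))           # ['[UNK]']
```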
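SentencePiece / Unigram - Used by: T5, LLaMA 1/2, Mistral, Gemma
- SentencePiece is a library, not a single algorithm: it implements both Unigram (used by T5) and BPE (used by LLaMA 1/2)
- Language-agnostic: trains on raw text as a stream of Unicode characters with no pre-tokenization, encoding spaces as the ▁ symbol
- Works well for multilingual models, including languages that do not separate words with spaces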
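For the Unigram model, encoding amounts to choosing the segmentation whose pieces have the highest total probability. Below is a minimal Viterbi sketch of that idea, assuming a toy vocabulary with made-up probabilities; it does not use the SentencePiece library, which also handles unknown characters that this sketch simply rejects.

```python
import math

SPACE = "\u2581"  # the visible-space marker used by SentencePiece

def unigram_segment(text, log_probs):
    """Return the highest-scoring split of `text` into known pieces."""
    n = len(text)
    best_score = [-math.inf] * (n + 1)  # best log-prob of text[:i]
    best_start = [0] * (n + 1)          # start index of the last piece
    best_score[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in log_probs and best_score[start] > -math.inf:
                score = best_score[start] + log_probs[piece]
                if score > best_score[end]:
                    best_score[end] = score
                    best_start[end] = start
    if best_score[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    # Walk the back-pointers from the end to recover the pieces.
    pieces, end = [], n
    while end > 0:
        start = best_start[end]
        pieces.append(text[start:end])
        end = start
    return pieces[::-1]

# Hypothetical piece probabilities, for illustration only.
toy_probs = {
    SPACE + "hello": 0.05, SPACE + "world": 0.04, SPACE: 0.1,
    "hello": 0.01, "world": 0.01,
    "h": 0.02, "e": 0.03, "l": 0.03, "o": 0.03,
}
toy_log_probs = {piece: math.log(p) for piece, p in toy_probs.items()}

text = SPACE + "hello world".replace(" ", SPACE)
print(unigram_segment(text, toy_log_probs))  # ['▁hello', '▁world']
```

Two whole-word pieces win here because multiplying many small per-character probabilities scores far lower than two moderately likely pieces.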
Token Vocabulary Size
| Model | Vocab Size | Algorithm |
|---|---|---|
| GPT-2 | 50,257 | BPE |
| GPT-4 (tiktoken cl100k_base) | 100,277 | BPE |
| BERT | 30,522 | WordPiece |
| LLaMA 3 | 128,256 | BPE (tiktoken-based) |
| T5 | 32,100 | SentencePiece (Unigram) |
A larger vocabulary means fewer tokens per sentence, so sequences are shorter and inference is faster, but the embedding table grows with it (vocabulary size × hidden size parameters).
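The embedding-table side of this trade-off is simple arithmetic: one vector of hidden-size length per vocabulary entry. The sketch below uses the vocabulary sizes from the table; the hidden sizes (768 for GPT-2 small and BERT-base, 4096 for LLaMA 3 8B) are assumptions added here, since the table does not list them.

```python
def embedding_params(vocab_size: int, hidden_size: int) -> int:
    """Number of parameters in a token embedding table."""
    return vocab_size * hidden_size

# (vocab size from the table, assumed hidden size for a specific variant)
models = {
    "GPT-2 (small)": (50_257, 768),
    "BERT (base)": (30_522, 768),
    "LLaMA 3 (8B)": (128_256, 4096),
}

for name, (vocab_size, hidden_size) in models.items():
    params = embedding_params(vocab_size, hidden_size)
    # 2 bytes per parameter assumes 16-bit weights.
    print(f"{name}: {params:,} parameters, ~{params * 2 / 1e6:.0f} MB at 16-bit")
```

Under these assumptions, GPT-2 small's table holds 38,597,376 parameters and LLaMA 3 8B's holds 525,336,576, which shows how quickly the table grows when both vocabulary and hidden size increase.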
Reference Guides
- 02_intro.md - Conceptual introduction to tokenization
- 10_huggingface_tokenizers_guide.md - Comprehensive HuggingFace guide
- README_TOKENIZERS.md - HuggingFace tokenizers library deep dive
- README_TIKTOKEN.md - OpenAI TikToken library guide
- 06_production_guide.md - Using tokenizers in production
- 07_tokenizer_comparison.md - Comparing different tokenizers
- 08_integration_guide.md - Integrating tokenizers into pipelines
Practice Projects
- Token Counter Tool: Build a CLI tool that counts tokens in any text file for a given model
- Tokenization Visualizer: Color-code tokens in a Streamlit app
- Multilingual Efficiency Analyzer: Compare token efficiency across English, Spanish, Chinese, Arabic
What to Learn Next
After tokenization, move to 05-embeddings/ to learn how tokens become meaningful vectors.
External Resources
| Resource | Type | Link |
|---|---|---|
| Karpathy: Let’s build the GPT Tokenizer | Video (90 min) | https://www.youtube.com/watch?v=zduSFxRajkE |
| HuggingFace Tokenizers Docs | Docs | https://huggingface.co/docs/tokenizers |
| TikToken GitHub | Repo | https://github.com/openai/tiktoken |
| SentencePiece GitHub | Repo | https://github.com/google/sentencepiece |
| Tiktokenizer (web tool) | Tool | https://tiktokenizer.vercel.app |