
Tokenization

Goal: Understand how text is converted into numbers that language models can process. This is the foundational step before embeddings and neural networks.


Why Tokenization Matters

Every LLM interaction starts with tokenization. When you send a prompt to GPT-4 or Claude, it is first split into tokens - sub-word units that the model actually processes. Understanding tokenization helps you:

  • Write better prompts: Avoid patterns that waste tokens and increase cost
  • Estimate costs accurately: APIs charge per token, not per word
  • Debug model behavior: Some languages tokenize less efficiently than English
  • Build production systems: Fast tokenization is critical for throughput

Key fact: GPT-4's tokenizer splits “tokenization” into more than one sub-word token, while “Hello” is a single token. An average English word is ~1.3 tokens.
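
Because APIs charge per token, a rough budget can be sketched from the ~1.3 tokens-per-word rule of thumb above. This is only an approximation: the function names are hypothetical, the price is a made-up placeholder rather than any provider's real rate, and an exact count requires running the model's actual tokenizer.

```python
import math

# Rule of thumb from the text; real ratios vary by tokenizer and language.
TOKENS_PER_WORD = 1.3


def estimate_tokens(text: str, tokens_per_word: float = TOKENS_PER_WORD) -> int:
    """Rough token estimate based on a whitespace word count."""
    words = len(text.split())
    return math.ceil(words * tokens_per_word)


def estimate_cost(text: str, price_per_1k_tokens: float) -> float:
    """Estimated cost; price_per_1k_tokens is a placeholder supplied by the caller."""
    return estimate_tokens(text) / 1000 * price_per_1k_tokens


prompt = "Understanding tokenization helps you estimate costs accurately"
print(estimate_tokens(prompt))
print(estimate_cost(prompt, price_per_1k_tokens=0.01))  # hypothetical price
```

The estimate is useful for back-of-the-envelope planning only; code, non-English text, and unusual formatting can deviate a lot from 1.3 tokens per word.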


Notebooks - Work in This Order

| # | Notebook | What You Learn | Time |
| --- | --- | --- | --- |
| 1 | 03_tokenizers_quickstart.ipynb | HuggingFace tokenizers API, encode/decode, special tokens | 45 min |
| 2 | 12_tiktoken_example.ipynb | OpenAI’s TikToken library, count tokens in prompts | 30 min |
| 3 | 11_sentencepiece_example.ipynb | Google’s SentencePiece (used in T5, LLaMA) | 30 min |
| 4 | 04_tokenizers_training.ipynb | Train a BPE tokenizer on your own data | 60 min |
| 5 | 05_advanced_training_methods.ipynb | WordPiece, Unigram, and special handling | 45 min |
| 6 | 09_pipeline_components.ipynb | Tokenization as part of the full NLP pipeline | 45 min |
| 7 | 14_token_exploration.ipynb | Hands-on exploration and visualization | 30 min |
| 8 | 13_token_exercises.ipynb | Practice problems with solutions | 45 min |

Key Concepts

The Three Main Tokenization Algorithms

BPE (Byte Pair Encoding) - Used by: GPT-2, GPT-3, GPT-4, RoBERTa

  • Starts with individual bytes/characters
  • Iteratively merges the most frequent pairs
  • Result: common words are single tokens, rare words are split
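
The merge loop described above can be sketched in a few lines. This is a toy, character-level version under stated assumptions: the corpus is invented, ties are broken arbitrarily but deterministically, and production tokenizers (GPT-2 and later) work on bytes with extra pre-tokenization rules that are omitted here. The function names are illustrative, not a library API.

```python
from collections import Counter


def merge_pair(symbols: tuple, pair: tuple) -> tuple:
    """Replace every adjacent occurrence of `pair` in `symbols` with the merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(words: list, num_merges: int) -> list:
    """Learn up to `num_merges` merge rules from a list of words."""
    # Each distinct word is a tuple of symbols, stored with its frequency.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair wins; the pair itself is a deterministic tie-break.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return merges


def encode(word: str, merges: list) -> list:
    """Tokenize a word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


toy_corpus = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges = train_bpe(toy_corpus, num_merges=4)
print(merges)
print(encode("lowest", merges))
```

Frequent fragments such as "low" and "est" end up as single symbols, while a rarer word like "newest" stays split into several pieces.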

WordPiece - Used by: BERT, DistilBERT

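  • Similar to BPE, but chooses the merge that most increases the likelihood of the training data rather than the most frequent pair
  • Words missing from the vocabulary are split into sub-words; pieces that continue a word are prefixed with ## (e.g. token + ##ization), and a word that cannot be split at all becomes [UNK]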
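
At inference time, BERT-style WordPiece tokenizes each word with a greedy longest-match-first search. The sketch below shows that lookup step only (not the likelihood-based training), with an invented toy vocabulary and a hypothetical function name.

```python
def wordpiece_tokenize(word: str, vocab: set, unk: str = "[UNK]") -> list:
    """Greedy longest-match-first split of a single word into WordPiece tokens."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink from the right.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens


# Toy vocabulary, invented for illustration.
toy_vocab = {"token", "##ization", "##s", "un", "##happy"}

print(wordpiece_tokenize("tokenization", toy_vocab))
print(wordpiece_tokenize("unhappy", toy_vocab))
print(wordpiece_tokenize("xyz", toy_vocab))
```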

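SentencePiece / Unigram - Used by: T5, LLaMA 1/2, Mistral, Gemma

  • Language-agnostic: trains directly on raw Unicode text and encodes spaces as the ▁ symbol, so no language-specific pre-tokenization is needed
  • SentencePiece is a library that implements both Unigram and BPE: T5 uses its Unigram mode, while LLaMA 1/2 and Mistral use its BPE mode
  • Works well for multilingual models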
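
A minimal sketch of Unigram-style segmentation in the SentencePiece convention: spaces are rewritten as ▁, and the split with the highest total log-probability is found by dynamic programming (Viterbi). The vocabulary and its log-probabilities are invented toy values, and the function name is hypothetical; a real model learns these scores during training.

```python
import math


def unigram_segment(text: str, log_probs: dict) -> list:
    """Return the highest-scoring segmentation of `text` under a unigram model."""
    # SentencePiece convention: spaces become "▁", with one added at the start.
    s = "▁" + text.replace(" ", "▁")
    n = len(s)
    best = [-math.inf] * (n + 1)  # best[i] = best score for the prefix s[:i]
    back = [0] * (n + 1)          # back[i] = start index of the last piece in that split
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = s[start:end]
            if piece in log_probs and best[start] + log_probs[piece] > best[end]:
                best[end] = best[start] + log_probs[piece]
                back[end] = start
    if best[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    # Walk the back-pointers from the end to recover the pieces.
    pieces, end = [], n
    while end > 0:
        start = back[end]
        pieces.append(s[start:end])
        end = start
    return pieces[::-1]


# Toy scores, invented for illustration.
toy_log_probs = {
    "▁hello": -2.0, "▁world": -2.5, "▁hel": -3.0, "lo": -3.0,
    "▁wor": -3.5, "ld": -3.5, "s": -3.0,
}

pieces = unigram_segment("hello worlds", toy_log_probs)
print(pieces)
# Decoding is lossless: join the pieces and turn "▁" back into spaces.
print("".join(pieces).replace("▁", " ")[1:])
```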

Token Vocabulary Size

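| Model | Vocab Size | Algorithm |
| --- | --- | --- |
| GPT-2 | 50,257 | BPE |
| GPT-4 (tiktoken cl100k_base) | 100,277 | BPE |
| BERT | 30,522 | WordPiece |
| LLaMA 3 | 128,256 | BPE (tiktoken-based) |
| T5 | 32,100 | SentencePiece (Unigram) |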

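A larger vocabulary generally means fewer tokens per sentence, and therefore fewer steps at inference time, at the cost of a larger embedding table (vocabulary size × embedding dimension parameters).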
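
To make the embedding-table cost concrete, the arithmetic is simply vocabulary size × embedding dimension. The sketch below applies it to the vocabulary sizes in the table, using a single assumed embedding dimension of 768 for every row purely for comparison; the real models use different dimensions.

```python
# Vocabulary sizes from the table above.
VOCAB_SIZES = {
    "GPT-2": 50_257,
    "GPT-4 (cl100k_base)": 100_277,
    "BERT": 30_522,
    "LLaMA 3": 128_256,
    "T5": 32_100,
}

# Assumed embedding dimension, for illustration only.
D_MODEL = 768


def embedding_params(vocab_size: int, d_model: int) -> int:
    """Number of parameters in a vocab_size x d_model embedding table."""
    return vocab_size * d_model


for name, vocab_size in VOCAB_SIZES.items():
    params = embedding_params(vocab_size, D_MODEL)
    megabytes = params * 2 / 1e6  # 2 bytes per parameter (16-bit weights)
    print(f"{name}: {params:,} parameters (~{megabytes:.0f} MB)")
```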


Practice Projects

  1. Token Counter Tool: Build a CLI tool that counts tokens in any text file for a given model
  2. Tokenization Visualizer: Color-code tokens in a Streamlit app
  3. Multilingual Efficiency Analyzer: Compare token efficiency across English, Spanish, Chinese, Arabic

What to Learn Next

After tokenization, move to 05-embeddings/ to learn how tokens become meaningful vectors.


External Resources

| Resource | Type | Link |
| --- | --- | --- |
| Karpathy: Let’s build the GPT Tokenizer | Video (90 min) | https://www.youtube.com/watch?v=zduSFxRajkE |
| HuggingFace Tokenizers Docs | Docs | https://huggingface.co/docs/tokenizers |
| TikToken GitHub | Repo | https://github.com/openai/tiktoken |
| SentencePiece GitHub | Repo | https://github.com/google/sentencepiece |
| Tiktokenizer (web tool) | Tool | https://tiktokenizer.vercel.app |