Skip to Content
06 Neural Networks09 Attention Explained

Attention Mechanism: The Breakthrough Innovation

Table of Contents

  1. Why Attention Was Invented
  2. The Core Concept
  3. Self-Attention Step by Step
  4. Query, Key, Value (QKV)
  5. Scaled Dot-Product Attention
  6. Multi-Head Attention
  7. Practical Examples
  8. Why It Works So Well

Why Attention Was Invented

The Problem with RNNs

Before attention, sequence models (RNNs, LSTMs) had a fundamental limitation:

Sentence: "The cat that ate the mouse that lived in the barn ran away" Problem: By the time model gets to "ran", information about "cat" has been compressed through many timesteps and starts to fade.

RNN limitations:

  • ❌ Sequential processing (slow)
  • ❌ Information bottleneck (fixed-size hidden state)
  • ❌ Vanishing gradients for long sequences
  • ❌ Cannot look back at earlier words directly

The Solution: Attention

Key Innovation: Let the model look back at ANY previous word when processing current word.

When predicting "ran": - Attention can directly look at "cat" (high attention) - Also look at "that", "ate", etc. (lower attention) - Weights determine how much to focus on each word No information loss!

Attention benefits:

  • ✅ Direct access to any previous position
  • ✅ Parallel processing possible
  • ✅ No vanishing gradients
  • ✅ Model learns what to focus on

The Core Concept

The Human Analogy

When you read this sentence: “The Eiffel Tower is in Paris, which is the capital of France.”

To answer “What city has the Eiffel Tower?”, you:

  1. Scan the sentence
  2. Attend to relevant words: “Eiffel Tower”, “Paris”
  3. Ignore less relevant: “which”, “is”, “the”, “of”
  4. Form answer from attended information

Attention mechanism does the same thing!

The Intuition

Question: "Where is the Eiffel Tower?" Context: "The Eiffel Tower is located in Paris, France." Attention weights: The [0.05] ▁ Eiffel [0.30] ████████ Tower [0.25] ███████ is [0.02] ▁ located [0.05] █ in [0.03] ▁ Paris [0.25] ███████ France [0.05] █ Model focuses heavily on: "Eiffel", "Tower", "Paris"

From Attention to Self-Attention

Regular Attention: Query one sequence, attend to another

  • Used in encoder-decoder models (translation)
  • Example: Query = “Wo ist der Eiffelturm?”, Attend to = “The Eiffel Tower is in Paris”

Self-Attention: Query and attend to the SAME sequence

  • Used in transformers (BERT, GPT)
  • Example: Each word attends to every word in same sentence
  • Helps understand relationships within text

Self-Attention Step by Step

Let’s build intuition with a simple example.

Example Sentence

"The cat sat"

Goal: For each word, compute attention to all words (including itself).

Step 1: Embed Words

Convert words to vectors (from Phase 5: Embeddings):

# Simplified 4-dimensional embeddings embeddings = { "The": [0.2, 0.1, 0.5, 0.3], "cat": [0.5, 0.8, 0.2, 0.1], "sat": [0.1, 0.3, 0.9, 0.4] } # Stack into matrix X = [[0.2, 0.1, 0.5, 0.3], # The [0.5, 0.8, 0.2, 0.1], # cat [0.1, 0.3, 0.9, 0.4]] # sat # Shape: (3, 4) - 3 words, 4 dimensions

Step 2: Create Q, K, V Matrices

We need three transformations of our input:

# Weight matrices (learned during training) W_q = random_matrix(4, 4) # Query weights W_k = random_matrix(4, 4) # Key weights W_v = random_matrix(4, 4) # Value weights # Transform embeddings Q = X @ W_q # Query: What am I looking for? K = X @ W_k # Key: What do I contain? V = X @ W_v # Value: What do I output? # All have shape (3, 4)

Intuition:

  • Query (Q): “What information do I need?”
  • Key (K): “What information do I have?”
  • Value (V): “What information do I output?”

Step 3: Compute Attention Scores

Measure similarity between queries and keys:

scores = Q @ K.T # Matrix multiplication # Shape: (3, 3) # Example result: scores = [[2.1, 1.5, 0.8], # The attends to: The, cat, sat [1.5, 3.2, 1.9], # cat attends to: The, cat, sat [0.8, 1.9, 2.7]] # sat attends to: The, cat, sat

Interpretation:

  • scores[1, 0] = 1.5 means “cat” has score 1.5 when attending to “The”
  • scores[1, 1] = 3.2 means “cat” has highest score when attending to itself
  • Higher score = more relevant

Step 4: Scale Scores

Divide by square root of dimension to stabilize gradients:

d_k = 4 # Dimension of keys scaled_scores = scores / sqrt(d_k)

Why scale?

  • Prevents very large values
  • Keeps softmax gradients well-behaved
  • Becomes more important with larger dimensions

Step 5: Apply Softmax

Convert scores to probability distribution:

attention_weights = softmax(scaled_scores) # Example result: attention_weights = [ [0.55, 0.32, 0.13], # The: 55% to itself, 32% to cat, 13% to sat [0.15, 0.60, 0.25], # cat: 15% to The, 60% to itself, 25% to sat [0.10, 0.30, 0.60] # sat: 10% to The, 30% to cat, 60% to itself ]

Properties:

  • Each row sums to 1.0
  • Represents how much to attend to each position
  • These are the famous “attention weights”

Step 6: Apply Weights to Values

Compute weighted sum of values:

output = attention_weights @ V # Shape: (3, 4) # For word "cat" (row 1): output[1] = 0.15 * V[0] + 0.60 * V[1] + 0.25 * V[2] # ↑ ↑ ↑ # from "The" from "cat" from "sat"

Result: Each word’s output is a weighted combination of all words’ values.

Complete Formula

Putting it all together:

Attention(Q, K, V) = softmax(QK^T / √d_k) V Where: - Q = queries (what I'm looking for) - K = keys (what I have) - V = values (what I output) - d_k = dimension of keys - / √d_k = scaling factor - softmax = converts to probabilities

Query, Key, Value (QKV)

The Database Analogy

Think of attention like a database lookup:

# Database with (key, value) pairs database = { "Paris": "Capital of France, home to Eiffel Tower", "London": "Capital of UK, home to Big Ben", "Tokyo": "Capital of Japan, largest city" } # Query query = "Where is the Eiffel Tower?" # Step 1: Match query to keys scores = { "Paris": 0.85, # High match! "London": 0.15, "Tokyo": 0.10 } # Step 2: Softmax to get attention weights attention = softmax(scores) # [0.70, 0.17, 0.13] # Step 3: Retrieve weighted combination of values result = 0.70 * database["Paris"] + 0.17 * database["London"] + 0.13 * database["Tokyo"] # Mostly Paris information!

In Neural Networks

# Input embeddings X = word_embeddings # Shape: (seq_len, d_model) # Linear transformations (learned) Q = X @ W_q # "What to search for" K = X @ W_k # "How to identify relevant info" V = X @ W_v # "What info to pass forward" # Attention computation scores = Q @ K.T # Similarity weights = softmax(scores) # Probability output = weights @ V # Weighted combination

Why Three Matrices?

Question: Why not just use X directly?

Answer: Flexibility and expressiveness!

  • Different transformations learn different aspects
  • Q, K, V can focus on different features
  • Allows model to learn complex relationships
  • Gives model more parameters to optimize

Example:

  • Q might learn to look for “subjects”
  • K might learn to identify “verbs”
  • V might learn to extract “semantic meaning”

Scaled Dot-Product Attention

The Complete Mechanism

def scaled_dot_product_attention(Q, K, V, mask=None): """ Args: Q: Queries (batch, seq_len, d_k) K: Keys (batch, seq_len, d_k) V: Values (batch, seq_len, d_v) mask: Optional mask (batch, seq_len, seq_len) Returns: output: (batch, seq_len, d_v) attention_weights: (batch, seq_len, seq_len) """ d_k = Q.shape[-1] # Compute attention scores scores = Q @ K.transpose(-2, -1) / sqrt(d_k) # Apply mask (for padding or causality) if mask is not None: scores = scores.masked_fill(mask == 0, -1e9) # Softmax to get attention weights attention_weights = softmax(scores, dim=-1) # Apply weights to values output = attention_weights @ V return output, attention_weights

Masking

Padding Mask: Ignore padded tokens

# Sentence: "The cat <PAD> <PAD>" mask = [[1, 1, 0, 0]] # Only attend to real words

Causal Mask: Prevent looking at future tokens (for GPT)

# When predicting word 2, can only see words 0, 1 mask = [[1, 0, 0, 0], [1, 1, 0, 0], [1, 1, 1, 0], [1, 1, 1, 1]] # Lower triangular matrix

Multi-Head Attention

The Problem with Single Attention

Single attention head can only capture one type of relationship:

"The cat sat on the mat" Single head might learn: Subject-Verb relationships - "cat" → "sat" (subject-verb) But misses: - Spatial relationships: "sat" → "on" - Object relationships: "on" → "mat"

The Solution: Multiple Heads

Run attention multiple times in parallel, each learning different patterns:

Head 1: Subject-Verb relationships "The" → [0.1, 0.1, 0.8, 0.0, 0.0, 0.0] # Focuses on "sat" "cat" → [0.1, 0.7, 0.2, 0.0, 0.0, 0.0] # Focuses on itself and "sat" Head 2: Object-Preposition relationships "sat" → [0.0, 0.0, 0.1, 0.8, 0.1, 0.0] # Focuses on "on" "on" → [0.0, 0.0, 0.0, 0.1, 0.2, 0.7] # Focuses on "mat" Head 3: Positional relationships Each word attends to neighbors ... up to 8 or 12 heads

Implementation

def multi_head_attention(X, num_heads=8): """ Args: X: Input (batch, seq_len, d_model) num_heads: Number of attention heads Returns: output: (batch, seq_len, d_model) """ d_model = X.shape[-1] d_k = d_model // num_heads # Split dimensions across heads # Create Q, K, V for all heads at once Q = X @ W_q # (batch, seq_len, d_model) K = X @ W_k V = X @ W_v # Reshape to separate heads # (batch, seq_len, num_heads, d_k) Q = Q.reshape(batch, seq_len, num_heads, d_k) K = K.reshape(batch, seq_len, num_heads, d_k) V = V.reshape(batch, seq_len, num_heads, d_k) # Transpose to (batch, num_heads, seq_len, d_k) Q = Q.transpose(1, 2) K = K.transpose(1, 2) V = V.transpose(1, 2) # Apply attention for each head in parallel attention_output = scaled_dot_product_attention(Q, K, V) # Shape: (batch, num_heads, seq_len, d_k) # Concatenate heads attention_output = attention_output.transpose(1, 2) # Shape: (batch, seq_len, num_heads, d_k) attention_output = attention_output.reshape(batch, seq_len, d_model) # Shape: (batch, seq_len, d_model) # Final linear transformation output = attention_output @ W_o return output

Why It Works

Ensemble Effect: Multiple heads vote on what’s important

Specialized Roles: Each head can specialize:

  • Syntactic relationships (grammar)
  • Semantic relationships (meaning)
  • Positional relationships (nearby words)
  • Long-range dependencies (distant words)

Empirical Success:

  • GPT-3: 96 attention heads
  • BERT-base: 12 heads
  • BERT-large: 16 heads

Practical Examples

Example 1: Machine Translation

English: "I love machine learning" French: "J'adore l'apprentissage automatique" When generating "apprentissage": Attention weights to English words: I [0.05] love [0.10] machine [0.35] ████████ learning [0.50] ████████████ Model attends heavily to "machine learning" when generating "apprentissage"

Example 2: Question Answering

Question: "Who invented the transformer?" Context: "The transformer architecture was invented by Vaswani et al. in 2017 at Google Brain. The paper 'Attention is All You Need' introduced this revolutionary architecture." Attention when answering: architecture [0.20] ████ invented [0.30] ████████ by [0.05] █ Vaswani [0.35] █████████ et [0.05] █ al [0.05] █ Answer: "Vaswani et al."

Example 3: Sentiment Analysis

Review: "The movie was good but the ending was terrible" When predicting sentiment: Attention weights: The [0.05] movie [0.15] ███ was [0.02] good [0.25] ██████ but [0.08] ██ the [0.02] ending [0.18] ████ was [0.02] terrible [0.23] ██████ Model focuses on: "good" and "terrible" Result: Mixed sentiment (conflicting signals)

Why It Works So Well

1. Parallel Processing

RNN (Sequential):

Time: T₁ → T₂ → T₃ → T₄ → T₅ Must wait for each step to complete

Attention (Parallel):

All positions computed simultaneously Time: T₁ (single forward pass for entire sequence) 100x faster training on GPUs

2. No Information Bottleneck

RNN: Compresses everything into fixed-size hidden state Attention: Direct access to all positions, no compression needed

3. Better Gradients

RNN: Gradients must flow through many timesteps (vanishing/exploding) Attention: Direct paths from output to any input (stable gradients)

4. Interpretability

Can visualize attention weights to see what model focuses on:

import matplotlib.pyplot as plt import seaborn as sns sns.heatmap(attention_weights, xticklabels=words, yticklabels=words) plt.title("Attention Weights") plt.show()

5. Transfer Learning

Pre-trained attention models (BERT, GPT) transfer well to new tasks:

  • Learn general language understanding
  • Fine-tune on specific tasks
  • Requires less task-specific data

Summary

Attention is the key innovation that enabled:

  • Modern language models (GPT, BERT, T5)
  • Vision transformers
  • Multi-modal models
  • State-of-the-art results across domains

Core concepts to remember:

  1. Attention weights determine what to focus on
  2. QKV mechanism allows flexible learning
  3. Multi-head attention captures multiple relationships
  4. Parallel processing makes training fast
  5. Direct connections enable long-range dependencies

Next: See how attention is used in the complete transformer architecture → transformer_architecture.md

Last updated on