Attention Mechanism: The Breakthrough Innovation

Why Attention Was Invented
The Core Concept
Self-Attention Step by Step
Query, Key, Value (QKV)
Scaled Dot-Product Attention
Multi-Head Attention
Practical Examples
Why It Works So Well

Why Attention Was Invented

The Problem with RNNs

Before attention, sequence models (RNNs, LSTMs) had a fundamental limitation:


Sentence: "The cat that ate the mouse that lived in the barn ran away"

Problem: By the time model gets to "ran", information about "cat" 
has been compressed through many timesteps and starts to fade.

RNN limitations:

❌ Sequential processing (slow)
❌ Information bottleneck (fixed-size hidden state)
❌ Vanishing gradients for long sequences
❌ Cannot look back at earlier words directly

The Solution: Attention

Key Innovation: Let the model look back at ANY previous word when processing current word.


When predicting "ran":
- Attention can directly look at "cat" (high attention)
- Also look at "that", "ate", etc. (lower attention)
- Weights determine how much to focus on each word

No information loss!

Attention benefits:

✅ Direct access to any previous position
✅ Parallel processing possible
✅ No vanishing gradients
✅ Model learns what to focus on

The Core Concept

The Human Analogy

When you read this sentence: “The Eiffel Tower is in Paris, which is the capital of France.”

To answer “What city has the Eiffel Tower?”, you:

Scan the sentence
Attend to relevant words: “Eiffel Tower”, “Paris”
Ignore less relevant: “which”, “is”, “the”, “of”
Form answer from attended information

Attention mechanism does the same thing!

The Intuition


Question: "Where is the Eiffel Tower?"
Context:  "The Eiffel Tower is located in Paris, France."

Attention weights:
  The      [0.05]  ▁
  Eiffel   [0.30]  ████████
  Tower    [0.25]  ███████
  is       [0.02]  ▁
  located  [0.05]  █
  in       [0.03]  ▁
  Paris    [0.25]  ███████
  France   [0.05]  █

Model focuses heavily on: "Eiffel", "Tower", "Paris"

From Attention to Self-Attention

Regular Attention: Query one sequence, attend to another

Used in encoder-decoder models (translation)
Example: Query = “Wo ist der Eiffelturm?”, Attend to = “The Eiffel Tower is in Paris”

Self-Attention: Query and attend to the SAME sequence

Used in transformers (BERT, GPT)
Example: Each word attends to every word in same sentence
Helps understand relationships within text

Self-Attention Step by Step

Let’s build intuition with a simple example.

Example Sentence


"The cat sat"

Goal: For each word, compute attention to all words (including itself).

Step 1: Embed Words

Convert words to vectors (from Phase 5: Embeddings):


# Simplified 4-dimensional embeddings
embeddings = {
    "The": [0.2, 0.1, 0.5, 0.3],
    "cat": [0.5, 0.8, 0.2, 0.1],
    "sat": [0.1, 0.3, 0.9, 0.4]
}
 
# Stack into matrix
X = [[0.2, 0.1, 0.5, 0.3],  # The
     [0.5, 0.8, 0.2, 0.1],  # cat
     [0.1, 0.3, 0.9, 0.4]]  # sat
# Shape: (3, 4) - 3 words, 4 dimensions

Step 2: Create Q, K, V Matrices

We need three transformations of our input:


# Weight matrices (learned during training)
W_q = random_matrix(4, 4)  # Query weights
W_k = random_matrix(4, 4)  # Key weights
W_v = random_matrix(4, 4)  # Value weights
 
# Transform embeddings
Q = X @ W_q  # Query: What am I looking for?
K = X @ W_k  # Key: What do I contain?
V = X @ W_v  # Value: What do I output?
 
# All have shape (3, 4)

Intuition:

Query (Q): “What information do I need?”
Key (K): “What information do I have?”
Value (V): “What information do I output?”

Step 3: Compute Attention Scores

Measure similarity between queries and keys:


scores = Q @ K.T  # Matrix multiplication
# Shape: (3, 3)
 
# Example result:
scores = [[2.1, 1.5, 0.8],  # The attends to: The, cat, sat
          [1.5, 3.2, 1.9],  # cat attends to: The, cat, sat
          [0.8, 1.9, 2.7]]  # sat attends to: The, cat, sat

Interpretation:

scores[1, 0] = 1.5 means “cat” has score 1.5 when attending to “The”
scores[1, 1] = 3.2 means “cat” has highest score when attending to itself
Higher score = more relevant

Step 4: Scale Scores

Divide by square root of dimension to stabilize gradients:


d_k = 4  # Dimension of keys
scaled_scores = scores / sqrt(d_k)

Why scale?

Prevents very large values
Keeps softmax gradients well-behaved
Becomes more important with larger dimensions

Step 5: Apply Softmax

Convert scores to probability distribution:


attention_weights = softmax(scaled_scores)
 
# Example result:
attention_weights = [
    [0.55, 0.32, 0.13],  # The: 55% to itself, 32% to cat, 13% to sat
    [0.15, 0.60, 0.25],  # cat: 15% to The, 60% to itself, 25% to sat
    [0.10, 0.30, 0.60]   # sat: 10% to The, 30% to cat, 60% to itself
]

Properties:

Each row sums to 1.0
Represents how much to attend to each position
These are the famous “attention weights”

Step 6: Apply Weights to Values

Compute weighted sum of values:


output = attention_weights @ V
# Shape: (3, 4)
 
# For word "cat" (row 1):
output[1] = 0.15 * V[0] + 0.60 * V[1] + 0.25 * V[2]
#           ↑            ↑             ↑
#        from "The"  from "cat"   from "sat"

Result: Each word’s output is a weighted combination of all words’ values.

Complete Formula

Putting it all together:


Attention(Q, K, V) = softmax(QK^T / √d_k) V

Where:
- Q = queries (what I'm looking for)
- K = keys (what I have)
- V = values (what I output)
- d_k = dimension of keys
- / √d_k = scaling factor
- softmax = converts to probabilities

Query, Key, Value (QKV)

The Database Analogy

Think of attention like a database lookup:


# Database with (key, value) pairs
database = {
    "Paris": "Capital of France, home to Eiffel Tower",
    "London": "Capital of UK, home to Big Ben",
    "Tokyo": "Capital of Japan, largest city"
}
 
# Query
query = "Where is the Eiffel Tower?"
 
# Step 1: Match query to keys
scores = {
    "Paris": 0.85,   # High match!
    "London": 0.15,
    "Tokyo": 0.10
}
 
# Step 2: Softmax to get attention weights
attention = softmax(scores)  # [0.70, 0.17, 0.13]
 
# Step 3: Retrieve weighted combination of values
result = 0.70 * database["Paris"] + 
         0.17 * database["London"] + 
         0.13 * database["Tokyo"]
# Mostly Paris information!

In Neural Networks


# Input embeddings
X = word_embeddings  # Shape: (seq_len, d_model)
 
# Linear transformations (learned)
Q = X @ W_q  # "What to search for"
K = X @ W_k  # "How to identify relevant info"
V = X @ W_v  # "What info to pass forward"
 
# Attention computation
scores = Q @ K.T           # Similarity
weights = softmax(scores)  # Probability
output = weights @ V       # Weighted combination

Why Three Matrices?

Question: Why not just use X directly?

Answer: Flexibility and expressiveness!

Different transformations learn different aspects
Q, K, V can focus on different features
Allows model to learn complex relationships
Gives model more parameters to optimize

Example:

Q might learn to look for “subjects”
K might learn to identify “verbs”
V might learn to extract “semantic meaning”

Scaled Dot-Product Attention

The Complete Mechanism


def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    Args:
        Q: Queries (batch, seq_len, d_k)
        K: Keys (batch, seq_len, d_k)
        V: Values (batch, seq_len, d_v)
        mask: Optional mask (batch, seq_len, seq_len)
    
    Returns:
        output: (batch, seq_len, d_v)
        attention_weights: (batch, seq_len, seq_len)
    """
    d_k = Q.shape[-1]
    
    # Compute attention scores
    scores = Q @ K.transpose(-2, -1) / sqrt(d_k)
    
    # Apply mask (for padding or causality)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)
    
    # Softmax to get attention weights
    attention_weights = softmax(scores, dim=-1)
    
    # Apply weights to values
    output = attention_weights @ V
    
    return output, attention_weights

Masking

Padding Mask: Ignore padded tokens


# Sentence: "The cat <PAD> <PAD>"
mask = [[1, 1, 0, 0]]  # Only attend to real words

Causal Mask: Prevent looking at future tokens (for GPT)


# When predicting word 2, can only see words 0, 1
mask = [[1, 0, 0, 0],
        [1, 1, 0, 0],
        [1, 1, 1, 0],
        [1, 1, 1, 1]]
# Lower triangular matrix

Multi-Head Attention

The Problem with Single Attention

Single attention head can only capture one type of relationship:


"The cat sat on the mat"

Single head might learn: Subject-Verb relationships
- "cat" → "sat" (subject-verb)

But misses:
- Spatial relationships: "sat" → "on"
- Object relationships: "on" → "mat"

The Solution: Multiple Heads

Run attention multiple times in parallel, each learning different patterns:


Head 1: Subject-Verb relationships
  "The" → [0.1, 0.1, 0.8, 0.0, 0.0, 0.0]  # Focuses on "sat"
  "cat" → [0.1, 0.7, 0.2, 0.0, 0.0, 0.0]  # Focuses on itself and "sat"

Head 2: Object-Preposition relationships
  "sat" → [0.0, 0.0, 0.1, 0.8, 0.1, 0.0]  # Focuses on "on"
  "on"  → [0.0, 0.0, 0.0, 0.1, 0.2, 0.7]  # Focuses on "mat"

Head 3: Positional relationships
  Each word attends to neighbors

... up to 8 or 12 heads

Implementation


def multi_head_attention(X, num_heads=8):
    """
    Args:
        X: Input (batch, seq_len, d_model)
        num_heads: Number of attention heads
    
    Returns:
        output: (batch, seq_len, d_model)
    """
    d_model = X.shape[-1]
    d_k = d_model // num_heads  # Split dimensions across heads
    
    # Create Q, K, V for all heads at once
    Q = X @ W_q  # (batch, seq_len, d_model)
    K = X @ W_k
    V = X @ W_v
    
    # Reshape to separate heads
    # (batch, seq_len, num_heads, d_k)
    Q = Q.reshape(batch, seq_len, num_heads, d_k)
    K = K.reshape(batch, seq_len, num_heads, d_k)
    V = V.reshape(batch, seq_len, num_heads, d_k)
    
    # Transpose to (batch, num_heads, seq_len, d_k)
    Q = Q.transpose(1, 2)
    K = K.transpose(1, 2)
    V = V.transpose(1, 2)
    
    # Apply attention for each head in parallel
    attention_output = scaled_dot_product_attention(Q, K, V)
    # Shape: (batch, num_heads, seq_len, d_k)
    
    # Concatenate heads
    attention_output = attention_output.transpose(1, 2)
    # Shape: (batch, seq_len, num_heads, d_k)
    
    attention_output = attention_output.reshape(batch, seq_len, d_model)
    # Shape: (batch, seq_len, d_model)
    
    # Final linear transformation
    output = attention_output @ W_o
    
    return output

Why It Works

Ensemble Effect: Multiple heads vote on what’s important

Specialized Roles: Each head can specialize:

Syntactic relationships (grammar)
Semantic relationships (meaning)
Positional relationships (nearby words)
Long-range dependencies (distant words)

Empirical Success:

GPT-3: 96 attention heads
BERT-base: 12 heads
BERT-large: 16 heads

Practical Examples

Example 1: Machine Translation


English: "I love machine learning"
French:  "J'adore l'apprentissage automatique"

When generating "apprentissage":
Attention weights to English words:
  I          [0.05]
  love       [0.10]
  machine    [0.35] ████████
  learning   [0.50] ████████████

Model attends heavily to "machine learning" when generating "apprentissage"

Example 2: Question Answering


Question: "Who invented the transformer?"
Context:  "The transformer architecture was invented by Vaswani et al. 
           in 2017 at Google Brain. The paper 'Attention is All You Need' 
           introduced this revolutionary architecture."

Attention when answering:
  architecture [0.20] ████
  invented     [0.30] ████████
  by           [0.05] █
  Vaswani      [0.35] █████████
  et           [0.05] █
  al           [0.05] █

Answer: "Vaswani et al."

Example 3: Sentiment Analysis


Review: "The movie was good but the ending was terrible"

When predicting sentiment:
Attention weights:
  The      [0.05]
  movie    [0.15] ███
  was      [0.02]
  good     [0.25] ██████
  but      [0.08] ██
  the      [0.02]
  ending   [0.18] ████
  was      [0.02]
  terrible [0.23] ██████

Model focuses on: "good" and "terrible"
Result: Mixed sentiment (conflicting signals)

Why It Works So Well

1. Parallel Processing

RNN (Sequential):


Time: T₁ → T₂ → T₃ → T₄ → T₅
Must wait for each step to complete

Attention (Parallel):


All positions computed simultaneously
Time: T₁ (single forward pass for entire sequence)
100x faster training on GPUs

2. No Information Bottleneck

RNN: Compresses everything into fixed-size hidden state Attention: Direct access to all positions, no compression needed

3. Better Gradients

RNN: Gradients must flow through many timesteps (vanishing/exploding) Attention: Direct paths from output to any input (stable gradients)

4. Interpretability

Can visualize attention weights to see what model focuses on:


import matplotlib.pyplot as plt
import seaborn as sns
 
sns.heatmap(attention_weights, 
            xticklabels=words, 
            yticklabels=words)
plt.title("Attention Weights")
plt.show()

5. Transfer Learning

Pre-trained attention models (BERT, GPT) transfer well to new tasks:

Learn general language understanding
Fine-tune on specific tasks
Requires less task-specific data

Summary

Attention is the key innovation that enabled:

Modern language models (GPT, BERT, T5)
Vision transformers
Multi-modal models
State-of-the-art results across domains

Core concepts to remember:

Attention weights determine what to focus on
QKV mechanism allows flexible learning
Multi-head attention captures multiple relationships
Parallel processing makes training fast
Direct connections enable long-range dependencies

Next: See how attention is used in the complete transformer architecture → transformer_architecture.md