Reading time: 5 min
Prerequisites: None. A hunter needs only courage.
Survival rate: 100% (no GPUs were sacrificed to the Great Ones)
The Problem (Why You Should Care)
You know how AI models choke on long documents?
You: *pastes 200-page PDF*
AI: "I can only see the first 8,000 tokens"
You: "But I need you to read ALL of it"
AI: "Sorry, memory full 💀"
That’s the quadratic curse.
In Bloodborne terms: you walked into Rom’s arena with 10,000 spiders and your PS4 burst into flames.
The Monster’s Attack Pattern
TRANSFORMER ATTENTION:
Every token looks at EVERY other token.
Like every entity in a boss fight calculating
its relationship to every OTHER entity:
10 tokens: 10 × 10 = 100 comparisons ✅ fine
100 tokens: 100 × 100 = 10,000 ✅ okay
1,000 tokens: 1,000 × 1,000 = 1,000,000 😰 getting hot
10,000 tokens: 10k × 10k = 100,000,000 🔥 GPU sweating
100k tokens: 100k × 100k = 10,000,000,000 💀 GPU dead
See the pattern?
O(n²) = more spiders = QUADRATICALLY more pain (10× the spiders, 100× the pain)
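Feel the curse yourself. A tiny Python sketch, no tricks hidden:

```python
# The quadratic curse, live: pairwise comparisons at each arena size.
for n in [10, 100, 1_000, 10_000, 100_000]:
    print(f"{n:>7} tokens -> {n * n:>15,} comparisons")
```

Ten times the spiders, a hundred times the comparisons. Every single time.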
The Bloodborne Analogy
Think of Rom’s boss fight:
SMALL FIGHT (10 spiders):
- You track each spider
- Each spider tracks you and other spiders
- 10 × 10 = 100 calculations
- Manageable
MEDIUM FIGHT (100 spiders):
- Everyone tracking everyone
- 100 × 100 = 10,000 calculations
- Your brain hurts
NIGHTMARE FIGHT (10,000 spiders):
- Everyone tracking everyone???
- 10k × 10k = 100,000,000 calculations
- Your PS4 melts. You die. Rom wins.
That's your GPU at long context.
It's not weak. The arena is just too full.
The Scary Rune (Math Version)
The paper says:
Attention(Q, K, V) = softmax(QKᵀ / √d_k) V
Memory: O(n²)
Compute: O(n²)
Tamed version:
Q = "what am I looking for?" (your hunter instinct)
K = "what does each spider have?" (their threat signature)
V = "what's the actual danger?" (their attack pattern)
QKᵀ = "compare EVERYTHING to EVERYTHING"
= n × n matrix
= quadratic explosion 💥
Even simpler: “Every spider must evaluate every other spider. Double the spiders, quadruple the calculations.”
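Here's the rune written as code: a minimal NumPy sketch of vanilla attention (illustrative only; real implementations batch, mask, and fuse kernels):

```python
import numpy as np

def attention(Q, K, V):
    """Vanilla attention: softmax(QKᵀ / √d_k) V. A teaching sketch."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (n, n): the quadratic monster
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V

n, d = 1_000, 64
Q, K, V = (np.random.randn(n, d) for _ in range(3))
out = attention(Q, K, V)  # quietly allocated a 1,000 × 1,000 scores matrix
```

Push n to 100,000 and that `scores` array alone holds 10 billion floats. That's the monster.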
The Visualization
TRANSFORMER (Quadratic):
+---------------------------------+
| Spider 1 ↔ Spider 2 ↔ Spider 3 |
| ↕ ↕ ↕ |
| Spider 4 ↔ Spider 5 ↔ Spider 6 |
| ↕ ↕ ↕ |
| Spider 7 ↔ You ↔ Spider 8 |
| |
| Everyone connects to everyone |
| 9 entities = 81 connections |
| n entities = n² connections |
+---------------------------------+
LINEAR ATTENTION (The Hunter's Dream):
+---------------------------------+
| |
| Spider 1 → [DREAM] → Spider 2 |
| ↓ |
| Spider 3 → [DREAM] → Spider 4 |
| ↓ |
| Spider 5 → [DREAM] → You |
| |
| Everyone talks to ONE memory |
| 9 entities = 9 updates |
| n entities = n updates |
+---------------------------------+
The Weapon: Associativity (The Trick Weapon)
Here’s the insight. Traditional attention computes:
(Q × Kᵀ) × V → builds n×n matrix first → O(n²)
Linear attention just… changes the parentheses:
Q × (Kᵀ × V) → builds d×d matrix first → O(n)
Tamed version:
d = hidden dimension (small, like 64)
n = sequence length (huge, like 100,000)
Traditional: build a 100,000 × 100,000 matrix
Linear: build a 64 × 64 matrix
Same math. Different order. Trick weapon transformed.
(One honest footnote: the softmax sits between QKᵀ and V and refuses to be reordered around, so linear attention first swaps it for a simpler similarity function. That's the price tag in The Catch below.)
Even simpler: “Instead of tracking every spider individually, maintain a single ‘summary stone’ in the Hunter’s Dream. Query it once.”
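Want proof the trick weapon actually transforms? A quick NumPy check (softmax already stripped out, which is exactly the concession from the footnote above):

```python
import numpy as np

n, d = 2_000, 64
Q, K, V = (np.random.randn(n, d) for _ in range(3))

slow = (Q @ K.T) @ V   # normal mode: intermediate matrix is 2,000 × 2,000
fast = Q @ (K.T @ V)   # transformed: intermediate matrix is 64 × 64

assert np.allclose(slow, fast)  # same answer, wildly smaller arena
```

Matrix multiplication is associative; the softmax was the only thing standing in the way.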
The Kill: Why It’s O(n) Now
TRANSFORMER:
- Step 1: Build n × n attention matrix (everyone vs everyone)
- Memory needed: grows with n²
- 1M tokens = 1 trillion comparisons
- GPU: "I'm literally dying"
LINEAR ATTENTION:
- Step 1: Each token deposits into fixed-size memory
- Step 2: Query the memory once
- Memory needed: CONSTANT (same size always)
- 1M tokens = 1M deposits to fixed memory
- GPU: "Is that all you got?"
The Numbers That Matter
DAMAGE COMPARISON @ 100,000 TOKENS:
TRANSFORMER:
- Operations: 100k × 100k = 10 BILLION
- Memory: grows to gigabytes
- Result: needs A100 cluster
LINEAR:
- Operations: 100k × 64 × 64 ≈ 410 MILLION (each token updates a 64 × 64 memory)
- Memory: constant (megabytes)
- Result: runs on your gaming rig
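Don't take my word for the arithmetic; it's a two-liner to check (rough operation counts, not a real benchmark):

```python
# Back-of-the-envelope damage comparison at n = 100,000 tokens, d = 64.
n, d = 100_000, 64
transformer_ops = n * n      # every token vs every token
linear_ops = n * d * d       # every token updates a d × d memory
print(f"transformer: {transformer_ops:,}")  # 10,000,000,000
print(f"linear:      {linear_ops:,}")       # 409,600,000
```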
TL;DR
| Concept | Hunter Translation |
|---|---|
| O(n²) attention | Every spider tracks every spider |
| O(n) attention | Every spider reports to the Dream |
| n × n matrix | Full arena chaos |
| Fixed memory | The Hunter’s Dream (one summary) |
| Associativity | Trick weapon transformation |
| (Q × Kᵀ) × V | Saw Cleaver normal mode |
| Q × (Kᵀ × V) | Saw Cleaver transformed mode |
The Catch (Honest Section)
Why isn't everyone using Linear Attention?
1. You lose the softmax (sharp "which spider NOW?" precision)
2. It's an approximation, not exact equivalence
3. Training dynamics are different
4. Some tasks need true quadratic attention
5. The ecosystem (llama.cpp, etc.) is still catching up
But for long context? The Dream remembers enough.
The curse IS breakable.
You Survived!
You now understand attention scaling better than most ML engineers who just tweet about it.
The quadratic curse seemed unbeatable because:
- O(n²) sounds scary
- The papers hide it in dense notation
- Nobody explained it through Bloodborne
But now you know:
- It’s just “everyone tracks everyone”
- Linear attention says “everyone reports to one Dream”
- The curse was never inevitable. Just the first design.
A hunter is never alone.
ᚲ kenaz, the torch