Reading time: 6 min
Prerequisites: None. We got you.
Survival rate: 100% (no GPU was harmed in this explanation)
The Problem (Why You Should Care)
You know how ChatGPT forgets what you said 50 messages ago?
You: "Remember my name is Dave"
... 100 messages later ...
You: "What's my name?"
AI: "I don't have access to previous conversations,
but I'd be happy to help!"
You: *dies inside*
That’s because of the quadratic curse.
The more you write, the more it costs to remember.
And the cost doesn’t just grow… it EXPLODES.
The Monster’s Attack Pattern
TRANSFORMER ATTENTION:
Every token looks at EVERY other token.
10 tokens: 10 × 10 = 100 comparisons ✅ fine
100 tokens: 100 × 100 = 10,000 ✅ okay
1,000 tokens: 1,000 × 1,000 = 1,000,000 😰 getting hot
10,000 tokens: 10k × 10k = 100,000,000 🔥 GPU sweating
100k tokens: 100k × 100k = 10,000,000,000 💀 GPU dead
See the pattern?
O(n²) = the more you write, the FASTER the pain grows
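Want to print that damage table yourself? It's a few lines of Python:
```python
# Comparisons grow with the SQUARE of the token count.
for n in (10, 100, 1_000, 10_000, 100_000):
    print(f"{n:>9,} tokens -> {n * n:>18,} comparisons")
```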
The Everyday Analogy
Think of a party:
SMALL PARTY (10 people):
- Everyone can talk to everyone
- 10 × 10 = 100 possible conversations
- Manageable
MEDIUM PARTY (100 people):
- Everyone tries to talk to everyone
- 100 × 100 = 10,000 conversations
- Getting loud
MEGA PARTY (10,000 people):
- Everyone talking to everyone???
- 10k × 10k = 100,000,000 conversations
- CHAOS. Nobody can think.
That's your GPU at long context.
It's not weak. The party is just too big.
The Scary Rune (Math Version)
The paper says:
Attention(Q,K,V) = softmax(QK^T / √d_k)V
Memory: O(n²)
Compute: O(n²)
Tamed version:
Q = "what am I looking for?"
K = "what does each token have?"
V = "what's the actual content?"
QK^T = "compare EVERYTHING to EVERYTHING"
= n × n matrix
= quadratic explosion 💥
Even simpler: “Everyone has to shake hands with everyone else. More people = quadratically more handshakes.”
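If you want to poke the monster directly, here's a minimal numpy sketch of that formula (a toy, not any real library's kernel). Watch the `scores` line: that's the n × n matrix being born.
```python
# Toy numpy version of Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V.
import numpy as np

def naive_attention(Q, K, V):
    """Q, K, V: (n, d_k) arrays. Returns (n, d_k)."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (n, n): the quadratic explosion
    scores -= scores.max(axis=-1, keepdims=True)     # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V

n, d_k = 1_000, 64
Q, K, V = (np.random.randn(n, d_k) for _ in range(3))
out = naive_attention(Q, K, V)   # materializes 1,000,000 scores for just 1,000 tokens
```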
The Visualization
TRANSFORMER (Quadratic):
+---------------------------------+
| Token 1 ↔ Token 2 ↔ Token 3     |
|    ↕         ↕         ↕        |
| Token 4 ↔ Token 5 ↔ Token 6    |
|    ↕         ↕         ↕        |
| Token 7 ↔ Token 8 ↔ Token 9    |
|                                 |
| Everyone connects to everyone   |
| 9 tokens = 81 connections       |
| n tokens = n² connections       |
+---------------------------------+

LINEAR ATTENTION (What we want):
+---------------------------------+
| Token 1 → [MEMORY] → Token 2    |
|               ↓                 |
| Token 3 → [MEMORY] → Token 4    |
|               ↓                 |
| Token 5 → [MEMORY] → Token 6    |
|                                 |
| Everyone talks to ONE memory    |
| 9 tokens = 9 updates            |
| n tokens = n updates            |
+---------------------------------+
The Weapon: Delta Attention
Kimi said: “What if we just… didn’t compare everything to everything?”
DELTA ATTENTION FORMULA:
S_t = S_{t-1} + β_t ⊗ (v_t k_t^T - S_{t-1})
Tamed version:
S_t = new memory state
S_{t-1} = old memory state (what we knew)
β_t = "how much should I update?" (0 to 1)
v_t k_t^T = "what's the new info?"
- S_{t-1} = "minus what I already knew" (the DELTA!)
Even simpler: “Take old memory. Figure out what’s NEW. Update a little bit.”
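Here's that update as a toy loop, following the simplified formula above exactly as written (the real kernels are chunked, parallelized CUDA; this is just the shape of the trick):
```python
# A toy loop over the simplified update rule above, as written.
# The memory S stays (d, d) no matter how long the sequence gets.
import numpy as np

def delta_step(S, k, v, beta):
    """Nudge the fixed-size memory toward the new token's info."""
    new_info = np.outer(v, k)           # v_t k_t^T  -> (d, d)
    return S + beta * (new_info - S)    # old memory + β · (what's new - what I knew)

d, n = 64, 10_000
S = np.zeros((d, d))                    # the memory: same size at token 1 and token 10,000
rng = np.random.default_rng(0)
for _ in range(n):                      # n tokens -> n constant-cost updates: O(n)
    k, v = rng.standard_normal(d), rng.standard_normal(d)
    S = delta_step(S, k, v, beta=0.1)   # in the real model, β_t is learned per token
y = S @ rng.standard_normal(d)          # reading out a query is one matvec, not n comparisons
```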
The β Gate (The Bouncer)
β = sigmoid(learned_weights · input)
β ≈ 0: "Nah, this input is noise. Keep old memory."
β ≈ 1: "YO this is important! UPDATE EVERYTHING!"
It’s literally a bouncer at a club deciding what memories get in.
BOUNCER LOGIC:
- "Dave said his name" → β = 0.9 → IMPORTANT, remember!
- "um, like, you know" → β = 0.1 → noise, ignore
- Model LEARNS what matters
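In code, the bouncer is tiny. A minimal sketch with hypothetical names (`w_beta` stands in for the learned weights, `x_t` for the current token's features):
```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

d = 64
w_beta = np.random.randn(d)      # hypothetical stand-in for "learned_weights"
x_t = np.random.randn(d)         # hypothetical stand-in for the current input
beta_t = sigmoid(w_beta @ x_t)   # scalar in (0, 1): the bouncer's verdict for this token
```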
The Kill: Why It’s O(n) Now
TRANSFORMER:
- Step 1: Build n × n attention matrix
- Memory needed: grows with n²
- 1M tokens = 1 trillion comparisons
- GPU: "I'm literally dying"
DELTA ATTENTION:
- Step 1: Look at memory state (fixed size!)
- Step 2: Update memory state (constant cost!)
- Memory needed: CONSTANT (same size always)
- 1M tokens = 1M updates to fixed memory
- GPU: "Is that all you got?"
The Numbers That Matter
DAMAGE COMPARISON @ 1 MILLION TOKENS:
TRANSFORMER:
- Operations: 1,000,000 × 1,000,000 = 1 TRILLION
- Memory: grows to terabytes
- Result: impossible without datacenter
DELTA:
- Operations: 1,000,000 × (fixed memory)
- Memory: constant (megabytes)
- Result: runs on your laptop
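Back-of-the-envelope, so you can check the claim yourself (the head dimension of 64 is an illustrative assumption):
```python
n = 1_000_000             # tokens
d = 64                    # head dimension (illustrative assumption)
attn_matrix = n * n       # 1,000,000,000,000 scores: the transformer's bill
delta_state = d * d       # 4,096 numbers, no matter how long the sequence
print(f"working set ratio: {attn_matrix // delta_state:,}x")  # 244,140,625x
```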
TL;DR
| Concept | Tamed Version |
|---|---|
| O(n²) attention | Pain grows FAST (quadratic) |
| O(n) attention | Pain grows SLOW (linear) |
| n × n matrix | Everyone shaking hands with everyone |
| Fixed memory | One shared notebook everyone updates |
| Delta | “What’s different from last time” |
| β gate | Bouncer deciding what’s important |
| KV cache | The notebook getting bigger and bigger |
| State space | The notebook staying same size |
The Catch (Honest Section)
Why isn't everyone using Delta Attention?
1. It's NEW (2024-2025)
2. Needs custom CUDA kernels
3. llama.cpp doesn't support it yet (Issue #16930)
4. Training works differently from standard transformers
5. The ecosystem is still catching up
But the curse IS breakable.
The math proves it.
The beasts exist.
You Survived!
You now understand attention scaling better than most ML engineers on Twitter.
The quadratic curse seemed unbeatable because:
- O(n²) sounds scary
- The papers use dense notation
- Nobody explained WHY it explodes
But now you know:
- It’s just “everyone talks to everyone”
- Linear attention says “everyone talks to ONE memory”
- The curse was never inevitable. Just the first design.
rune.みんな ᚲ kenaz - the torch