Reading time: 6 min
Prerequisites: None. We got you.
Survival rate: 100% (no GPU was harmed in this explanation)
The Problem (Why You Should Care)
You know how ChatGPT forgets what you said 50 messages ago?
You: "Remember my name is Dave"
... 100 messages later ...
You: "What's my name?"
AI: "I don't have access to previous conversations,
but I'd be happy to help!"
You: *dies inside*
That’s because of the quadratic curse.
The more you write, the more it costs to remember.
And the cost doesn’t just grow… it EXPLODES.
The Monster’s Attack Pattern
TRANSFORMER ATTENTION:
Every token looks at EVERY other token.
10 tokens: 10 × 10 = 100 comparisons ✅ fine
100 tokens: 100 × 100 = 10,000 ✅ okay
1,000 tokens: 1,000 × 1,000 = 1,000,000 😰 getting hot
10,000 tokens: 10k × 10k = 100,000,000 🔥 GPU sweating
100k tokens: 100k × 100k = 10,000,000,000 💀 GPU dead
See the pattern?
O(n²) = the more you write, the FASTER the pain grows
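Want to print that damage table yourself? It's a few lines of Python:
```python
# Comparisons grow with the SQUARE of the token count.
for n in (10, 100, 1_000, 10_000, 100_000):
    print(f"{n:>9,} tokens -> {n * n:>18,} comparisons")
```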
The Everyday Analogy
Think of a party:
SMALL PARTY (10 people):
- Everyone can talk to everyone
- 10 × 10 = 100 possible conversations
- Manageable
MEDIUM PARTY (100 people):
- Everyone tries to talk to everyone
- 100 × 100 = 10,000 conversations
- Getting loud
MEGA PARTY (10,000 people):
- Everyone talking to everyone???
- 10k × 10k = 100,000,000 conversations
- CHAOS. Nobody can think.
That's your GPU at long context.
It's not weak. The party is just too big.
The Scary Rune (Math Version)
The paper says:
Attention(Q,K,V) = softmax(QK^T / √d_k)V
Memory: O(n²)
Compute: O(n²)
Tamed version:
Q = "what am I looking for?"
K = "what does each token have?"
V = "what's the actual content?"
QK^T = "compare EVERYTHING to EVERYTHING"
= n × n matrix
= quadratic explosion 💥
Even simpler: “Everyone has to shake hands with everyone else. More people = quadratically more handshakes.”
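If you want to poke the monster directly, here's a minimal numpy sketch of that formula (a toy, not any real library's kernel). Watch the `scores` line: that's the n × n matrix being born.
```python
# Toy numpy version of Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V.
import numpy as np

def naive_attention(Q, K, V):
    """Q, K, V: (n, d_k) arrays. Returns (n, d_k)."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (n, n): the quadratic explosion
    scores -= scores.max(axis=-1, keepdims=True)     # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V

n, d_k = 1_000, 64
Q, K, V = (np.random.randn(n, d_k) for _ in range(3))
out = naive_attention(Q, K, V)   # materializes 1,000,000 scores for just 1,000 tokens
```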
The Visualization
TRANSFORMER (Quadratic):
+---------------------------------+
| Token 1 ↔ Token 2 ↔ Token 3     |
|    ↕         ↕         ↕        |
| Token 4 ↔ Token 5 ↔ Token 6    |
|    ↕         ↕         ↕        |
| Token 7 ↔ Token 8 ↔ Token 9    |
|                                 |
| Everyone connects to everyone   |
| 9 tokens = 81 connections       |
| n tokens = n² connections       |
+---------------------------------+

LINEAR ATTENTION (What we want):
+---------------------------------+
| Token 1 → [MEMORY] → Token 2    |
|               ↓                 |
| Token 3 → [MEMORY] → Token 4    |
|               ↓                 |
| Token 5 → [MEMORY] → Token 6    |
|                                 |
| Everyone talks to ONE memory    |
| 9 tokens = 9 updates            |
| n tokens = n updates            |
+---------------------------------+
The Weapon: Delta Attention
Kimi said: “What if we just… didn’t compare everything to everything?”
DELTA ATTENTION FORMULA:
S_t = S_{t-1} + β_t ⊗ (v_t k_t^T - S_{t-1})
Tamed version:
S_t = new memory state
S_{t-1} = old memory state (what we knew)
β_t = "how much should I update?" (0 to 1)
v_t k_t^T = "what's the new info?"
- S_{t-1} = "minus what I already knew" (the DELTA!)
Even simpler: “Take old memory. Figure out what’s NEW. Update a little bit.”
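Here's that update as a toy loop, following the simplified formula above exactly as written (the real kernels are chunked, parallelized CUDA; this is just the shape of the trick):
```python
# A toy loop over the simplified update rule above, as written.
# The memory S stays (d, d) no matter how long the sequence gets.
import numpy as np

def delta_step(S, k, v, beta):
    """Nudge the fixed-size memory toward the new token's info."""
    new_info = np.outer(v, k)           # v_t k_t^T  -> (d, d)
    return S + beta * (new_info - S)    # old memory + β · (what's new - what I knew)

d, n = 64, 10_000
S = np.zeros((d, d))                    # the memory: same size at token 1 and token 10,000
rng = np.random.default_rng(0)
for _ in range(n):                      # n tokens -> n constant-cost updates: O(n)
    k, v = rng.standard_normal(d), rng.standard_normal(d)
    S = delta_step(S, k, v, beta=0.1)   # in the real model, β_t is learned per token
y = S @ rng.standard_normal(d)          # reading out a query is one matvec, not n comparisons
```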
The β Gate (The Bouncer)
β = sigmoid(learned_weights · input)
β ≈ 0: "Nah, this input is noise. Keep old memory."
β ≈ 1: "YO this is important! UPDATE EVERYTHING!"
It’s literally a bouncer at a club deciding what memories get in.
BOUNCER LOGIC:
- "Dave said his name" → β = 0.9 → IMPORTANT, remember!
- "um, like, you know" → β = 0.1 → noise, ignore
- Model LEARNS what matters
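In code, the bouncer is tiny. A minimal sketch with hypothetical names (`w_beta` stands in for the learned weights, `x_t` for the current token's features):
```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

d = 64
w_beta = np.random.randn(d)      # hypothetical stand-in for "learned_weights"
x_t = np.random.randn(d)         # hypothetical stand-in for the current input
beta_t = sigmoid(w_beta @ x_t)   # scalar in (0, 1): the bouncer's verdict for this token
```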
The Kill: Why It’s O(n) Now
TRANSFORMER:
- Step 1: Build n × n attention matrix
- Memory needed: grows with n²
- 1M tokens = 1 trillion comparisons
- GPU: "I'm literally dying"
DELTA ATTENTION:
- Step 1: Look at memory state (fixed size!)
- Step 2: Update memory state (constant cost!)
- Memory needed: CONSTANT (same size always)
- 1M tokens = 1M updates to fixed memory
- GPU: "Is that all you got?"
The Numbers That Matter
DAMAGE COMPARISON @ 1 MILLION TOKENS:
TRANSFORMER:
- Operations: 1,000,000 × 1,000,000 = 1 TRILLION
- Memory: grows to terabytes
- Result: impossible without datacenter
DELTA:
- Operations: 1,000,000 × (fixed memory)
- Memory: constant (megabytes)
- Result: runs on your laptop
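Back-of-the-envelope, so you can check the claim yourself (the head dimension of 64 is an illustrative assumption):
```python
n = 1_000_000             # tokens
d = 64                    # head dimension (illustrative assumption)
attn_matrix = n * n       # 1,000,000,000,000 scores: the transformer's bill
delta_state = d * d       # 4,096 numbers, no matter how long the sequence
print(f"working set ratio: {attn_matrix // delta_state:,}x")  # 244,140,625x
```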
TL;DR
| Concept | Tamed Version |
|---|---|
| O(n²) attention | Pain grows FAST (quadratic) |
| O(n) attention | Pain grows SLOW (linear) |
| n × n matrix | Everyone shaking hands with everyone |
| Fixed memory | One shared notebook everyone updates |
| Delta | “What’s different from last time” |
| β gate | Bouncer deciding what’s important |
| KV cache | The notebook getting bigger and bigger |
| State space | The notebook staying same size |
The Catch (Honest Section)
Why isn't everyone using Delta Attention?
1. It's NEW (2024-2025)
2. Needs custom CUDA kernels
3. llama.cpp doesn't support it yet (Issue #16930)
4. Training works differently from standard transformers
5. The ecosystem is still catching up
But the curse IS breakable.
The math proves it.
The beasts exist.
You Survived!
You now understand attention scaling better than most ML engineers on Twitter.
The quadratic curse seemed unbeatable because:
- O(n²) sounds scary
- The papers use dense notation
- Nobody explained WHY it explodes
But now you know:
- It’s just “everyone talks to everyone”
- Linear attention says “everyone talks to ONE memory”
- The curse was never inevitable. Just the first design.
rune.みんな ᚲ kenaz - the torch