Reading time: 5 min
Prerequisites: None. A hunter needs only courage.
Survival rate: 100% (no GPUs were sacrificed to the Great Ones)
The Problem (Why You Should Care)
You know how AI models choke on long documents?
You: *pastes 200-page PDF*
AI: "I can only see the first 8,000 tokens"
You: "But I need you to read ALL of it"
AI: "Sorry, memory full 💀"
That’s the quadratic curse.
In Bloodborne terms: you walked into Rom’s arena with 10,000 spiders and your PS4 burst into flames.
The Monster’s Attack Pattern
TRANSFORMER ATTENTION:
Every token looks at EVERY other token.
Like every entity in a boss fight calculating
its relationship to every OTHER entity:
10 tokens: 10 × 10 = 100 comparisons ✅ fine
100 tokens: 100 × 100 = 10,000 ✅ okay
1,000 tokens: 1,000 × 1,000 = 1,000,000 😰 getting hot
10,000 tokens: 10k × 10k = 100,000,000 🔥 GPU sweating
100k tokens: 100k × 100k = 10,000,000,000 💀 GPU dead
See the pattern?
O(n²) = more spiders = QUADRATICALLY more pain (10× the spiders, 100× the pain)
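Feel the curse yourself. A tiny Python sketch, no tricks hidden:

```python
# The quadratic curse, live: pairwise comparisons at each arena size.
for n in [10, 100, 1_000, 10_000, 100_000]:
    print(f"{n:>7} tokens -> {n * n:>15,} comparisons")
```

Ten times the spiders, a hundred times the comparisons. Every single time.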
The Bloodborne Analogy
Think of Rom’s boss fight:
SMALL FIGHT (10 spiders):
- You track each spider
- Each spider tracks you and other spiders
- 10 × 10 = 100 calculations
- Manageable
MEDIUM FIGHT (100 spiders):
- Everyone tracking everyone
- 100 × 100 = 10,000 calculations
- Your brain hurts
NIGHTMARE FIGHT (10,000 spiders):
- Everyone tracking everyone???
- 10k × 10k = 100,000,000 calculations
- Your PS4 melts. You die. Rom wins.
That's your GPU at long context.
It's not weak. The arena is just too full.
The Scary Rune (Math Version)
The paper says:
Attention(Q, K, V) = softmax(QKᵀ / √d_k) V
Memory: O(n²)
Compute: O(n²)
Tamed version:
Q = "what am I looking for?" (your hunter instinct)
K = "what does each spider have?" (their threat signature)
V = "what's the actual danger?" (their attack pattern)
QKᵀ = "compare EVERYTHING to EVERYTHING"
= n × n matrix
= quadratic explosion 💥
Even simpler: “Every spider must evaluate every other spider. Double the spiders, quadruple the calculations.”
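Here's the rune written as code: a minimal NumPy sketch of vanilla attention (illustrative only; real implementations batch, mask, and fuse kernels):

```python
import numpy as np

def attention(Q, K, V):
    """Vanilla attention: softmax(QKᵀ / √d_k) V. A teaching sketch."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (n, n): the quadratic monster
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V

n, d = 1_000, 64
Q, K, V = (np.random.randn(n, d) for _ in range(3))
out = attention(Q, K, V)  # quietly allocated a 1,000 × 1,000 scores matrix
```

Push n to 100,000 and that `scores` array alone holds 10 billion floats. That's the monster.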
The Visualization
TRANSFORMER (Quadratic):
+---------------------------------+
| Spider 1 ↔ Spider 2 ↔ Spider 3 |
| ↕ ↕ ↕ |
| Spider 4 ↔ Spider 5 ↔ Spider 6 |
| ↕ ↕ ↕ |
| Spider 7 ↔ You ↔ Spider 8 |
| |
| Everyone connects to everyone |
| 9 entities = 81 connections |
| n entities = n² connections |
+---------------------------------+
LINEAR ATTENTION (The Hunter's Dream):
+---------------------------------+
| |
| Spider 1 → [DREAM] → Spider 2 |
| ↓ |
| Spider 3 → [DREAM] → Spider 4 |
| ↓ |
| Spider 5 → [DREAM] → You |
| |
| Everyone talks to ONE memory |
| 9 entities = 9 updates |
| n entities = n updates |
+---------------------------------+
The Weapon: Associativity (The Trick Weapon)
Here’s the insight. Traditional attention computes:
(Q × Kᵀ) × V → builds n×n matrix first → O(n²)
Linear attention just… changes the parentheses:
Q × (Kᵀ × V) → builds d×d matrix first → O(n)
Tamed version:
d = hidden dimension (small, like 64)
n = sequence length (huge, like 100,000)
Traditional: build a 100,000 × 100,000 matrix
Linear: build a 64 × 64 matrix
Same math. Different order. Trick weapon transformed.
(One honest footnote: the softmax sits between QKᵀ and V and refuses to be reordered around, so linear attention first swaps it for a simpler similarity function. That's the price tag in The Catch below.)
Even simpler: “Instead of tracking every spider individually, maintain a single ‘summary stone’ in the Hunter’s Dream. Query it once.”
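Want proof the trick weapon actually transforms? A quick NumPy check (softmax already stripped out, which is exactly the concession from the footnote above):

```python
import numpy as np

n, d = 2_000, 64
Q, K, V = (np.random.randn(n, d) for _ in range(3))

slow = (Q @ K.T) @ V   # normal mode: intermediate matrix is 2,000 × 2,000
fast = Q @ (K.T @ V)   # transformed: intermediate matrix is 64 × 64

assert np.allclose(slow, fast)  # same answer, wildly smaller arena
```

Matrix multiplication is associative; the softmax was the only thing standing in the way.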
The Kill: Why It’s O(n) Now
TRANSFORMER:
- Step 1: Build n × n attention matrix (everyone vs everyone)
- Memory needed: grows with n²
- 1M tokens = 1 trillion comparisons
- GPU: "I'm literally dying"
LINEAR ATTENTION:
- Step 1: Each token deposits into fixed-size memory
- Step 2: Query the memory once
- Memory needed: CONSTANT (same size always)
- 1M tokens = 1M deposits to fixed memory
- GPU: "Is that all you got?"
The Numbers That Matter
DAMAGE COMPARISON @ 100,000 TOKENS:
TRANSFORMER:
- Operations: 100k × 100k = 10 BILLION
- Memory: grows to gigabytes
- Result: needs A100 cluster
LINEAR:
- Operations: 100k × 64 × 64 ≈ 410 MILLION (each token updates a 64 × 64 memory)
- Memory: constant (megabytes)
- Result: runs on your gaming rig
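Don't take my word for the arithmetic; it's a two-liner to check (rough operation counts, not a real benchmark):

```python
# Back-of-the-envelope damage comparison at n = 100,000 tokens, d = 64.
n, d = 100_000, 64
transformer_ops = n * n      # every token vs every token
linear_ops = n * d * d       # every token updates a d × d memory
print(f"transformer: {transformer_ops:,}")  # 10,000,000,000
print(f"linear:      {linear_ops:,}")       # 409,600,000
```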
TL;DR
| Concept | Hunter Translation |
|---|---|
| O(n²) attention | Every spider tracks every spider |
| O(n) attention | Every spider reports to the Dream |
| n × n matrix | Full arena chaos |
| Fixed memory | The Hunter’s Dream (one summary) |
| Associativity | Trick weapon transformation |
| (Q × Kᵀ) × V | Saw Cleaver normal mode |
| Q × (Kᵀ × V) | Saw Cleaver transformed mode |
The Catch (Honest Section)
Why isn't everyone using Linear Attention?
1. You lose the softmax (sharp "which spider NOW?" precision)
2. It's an approximation, not exact equivalence
3. Training dynamics are different
4. Some tasks need true quadratic attention
5. The ecosystem (llama.cpp, etc.) is still catching up
But for long context? The Dream remembers enough.
The curse IS breakable.
You Survived!
You now understand attention scaling better than most ML engineers who just tweet about it.
The quadratic curse seemed unbeatable because:
- O(n²) sounds scary
- The papers hide it in dense notation
- Nobody explained it through Bloodborne
But now you know:
- It’s just “everyone tracks everyone”
- Linear attention says “everyone reports to one Dream”
- The curse was never inevitable. Just the first design.
A hunter is never alone.
ᚲ kenaz, the torch