Reading time: 10 min
Prerequisites: Know what attention is. Vaguely.
Survival rate: 100% (the goose is friendly)
The Quest (Why You Should Care)
Transformers have a memory problem.
TRANSFORMER MEMORY:
- remembers EVERYTHING in context
- costs O(n²) to remember
- context window = hard limit
- 1M tokens = 💀 your GPU
RWKV-7 MEMORY:
- fixed-size "notebook"
- costs O(n) to remember
- can process INFINITE tokens
- 1M tokens = same fixed-size state as 1 token
But v7 does something v6 couldn’t:
It can EDIT its notebook. Not just add—CORRECT.
The Hunt Board Notice
QUEST TYPE: CAPTURE - STUDY - IMPLEMENT
"A new beast has emerged from the Eastern servers.
RWKV-7, codename 'Goose'.
Unlike its ancestors (v5, v6), this one carries
a strange power: the Generalized Delta Rule.
It can rewrite its own memories mid-inference.
Test-time training. In production.
Capture its essence. Port to Rust."
The v6 Beast: What It Could Do
RWKV-6 MEMORY UPDATE:
notebook = fade_old × old_memory + new_note
That's it. Two operations:
1. Fade what's old
2. Add what's new
Simple. Effective. But LIMITED.
The problem: you can’t FIX a wrong note. Once written, it only fades. Never corrects.
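In code, the v6 rule for a single cell of the notebook is just this (a minimal sketch; the function name and the one-cell view are mine, the real kernel does the same thing per channel across the whole state):

// RWKV-6-style update for one state cell: fade, then add.
// fade is a per-channel decay in (0, 1); once a note is written,
// it can only shrink, never be corrected.
fn v6_update(cell: f32, fade: f32, key: f32, value: f32) -> f32 {
    fade * cell + key * value
}

Calling this repeatedly only ever shrinks old contributions; it never flips their sign or removes them outright.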
The v7 Beast: The Delta Rule
RWKV-7 MEMORY UPDATE:
notebook = fade_old × old_memory
+ CORRECT(old_memory) ← NEW!
+ new_note
Three operations:
1. Fade what's old
2. CORRECT what was wrong ← THE MAGIC
3. Add what's new
The Math (Scary Rune)
V6 (old):
S_t = w × S_{t-1} + k ⊗ v
V7 (new):
S_t = w × S_{t-1} + S_{t-1} × α × β^T + k ⊗ v
^^^^^^^^^^^^^^^^^^^^
THE DELTA RULE TERM
That middle term is doing gradient descent on the memory itself.
Tamed Version
What Each Term Does
w × S_{t-1}
→ "fade the old stuff"
→ decay between 0.545 and 1.0
→ gentle forgetting
S_{t-1} × α × β^T
→ "correct the old stuff"
→ α = what to look for (negative normalized key)
→ β = what to write back (normalized key × in-context learning rate)
→ this IS gradient descent!
k ⊗ v
→ "add new stuff"
→ outer product of key and value
→ standard attention write
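To watch the correction term actually fix a note, here is a deliberately tiny worked example with a 1×1 state (one key channel, one value channel); the numbers and the scalar setup are mine, not the model's:

// Toy 1x1 state: the memory currently holds 5.0 for this key,
// but the right value is 3.0. One v7-style step repairs it.
fn main() {
    let mut state = 5.0f32;
    let (w, kk, a, k, v) = (1.0f32, 1.0f32, 1.0f32, 1.0f32, 3.0f32);
    state = w * state;      // 1. fade (w = 1.0 here: keep everything)
    let sa = state * -kk;   //    read what the memory says for this key, negated
    state += kk * a * sa;   // 2. correct: with a = 1, the old note is fully erased
    state += k * v;         // 3. add: write the new value
    assert_eq!(state, 3.0); // the wrong 5.0 is gone, replaced by 3.0
    println!("corrected state = {state}");
}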
The Gradient Descent Insight
Here’s the wild part:
The delta rule term is literally:
state = state − learning_rate × gradient
(the minus sign is hiding inside α, the negative normalized key)
Where:
- gradient = how wrong is the memory's current prediction for this key?
- learning_rate = how much to fix it (the per-channel a term)
THE MODEL IS TRAINING ITSELF DURING INFERENCE.
This is why RWKV-7 can:
- Track state (counting, position)
- Learn patterns on the fly
- Recognize ALL regular languages (provably!)
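If you want that claim in code: the toy below (my own sketch, not the kernel) does one plain SGD step on the squared error of the memory's prediction, and the resulting update splits into exactly "correct what was wrong" plus "add what's new". The real kernel also applies decay and uses slightly different keys for reading and writing, so treat this as the idea, not the implementation.

// Toy: treat the memory S ([K, V], row-major) as a tiny linear model
// and do one SGD step on 0.5 * || S^T k_hat - v_target ||^2.
// The gradient w.r.t. S is outer(k_hat, S^T k_hat - v_target), so
// stepping against it = "erase old prediction" + "write the target".
fn sgd_step_on_memory(
    state: &mut [f32],  // [key_dim * val_dim]
    k_hat: &[f32],      // normalized key, [key_dim]
    v_target: &[f32],   // value to store, [val_dim]
    lr: &[f32],         // per-channel in-context learning rate, [key_dim]
    key_dim: usize,
    val_dim: usize,
) {
    // prediction = S^T k_hat: what the memory currently returns for this key
    let mut pred = vec![0.0f32; val_dim];
    for ki in 0..key_dim {
        for vi in 0..val_dim {
            pred[vi] += state[ki * val_dim + vi] * k_hat[ki];
        }
    }
    // S -= lr * outer(k_hat, pred - v_target)
    //    = - lr * outer(k_hat, pred)      <- "correct what was wrong"
    //      + lr * outer(k_hat, v_target)  <- "add what's new"
    for ki in 0..key_dim {
        for vi in 0..val_dim {
            state[ki * val_dim + vi] -= lr[ki] * k_hat[ki] * (pred[vi] - v_target[vi]);
        }
    }
}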
The Decay Formula (FLA-style)
OLD FORMULA (wrong for v7):
w_decay = exp(-exp(w))
→ decay in (0, 1), not bounded away from 0
FLA FORMULA (correct):
w = -0.6065 × sigmoid(w_lora(x))
w_decay = exp(w)
→ decay in [0.545, 1.0] bounded
WHY BOUNDED?
- Too much decay = forgets everything
- Too little decay = remembers garbage
- Sweet spot: 0.545 to 1.0
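A direct transcription of the FLA-style formula (the constant 0.6065 is exp(-0.5); since sigmoid stays in (0, 1), the exponent stays in (-0.6065, 0) and the decay can never hit 0 or exceed 1):

// Bounded decay, following the formula above.
// sigmoid(x) in (0, 1)  =>  w in (-0.6065, 0)  =>  exp(w) in (~0.545, 1.0).
fn bounded_decay(w_lora_out: f32) -> f32 {
    let sigmoid = 1.0 / (1.0 + (-w_lora_out).exp());
    (-0.6065 * sigmoid).exp()
}

However hard w_lora pushes, a channel loses at most about 45% of its state per token and never grows.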
v_first: Cross-Layer Memory
RWKV-7 has a sneaky trick:
Layer 0:
v_first = v (capture initial value)
Layer 1+:
v = v + (v_first - v) × sigmoid(v_lora)
= interpolation between original and current!
WHY?
- Later layers can "remember" what layer 0 saw
- Cross-layer information flow
- Skip connections for values
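Sketched elementwise (v_lora_out stands for whatever the low-rank projection produced for this token; the name is mine):

// v_first blend for layers >= 1: lerp between this layer's value
// and the value layer 0 computed for the same token.
// gate = 0 keeps the current v; gate = 1 restores v_first.
fn blend_v_first(v: &mut [f32], v_first: &[f32], v_lora_out: &[f32]) {
    for i in 0..v.len() {
        let gate = 1.0 / (1.0 + (-v_lora_out[i]).exp()); // sigmoid
        v[i] += (v_first[i] - v[i]) * gate;
    }
}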
The Architecture (FLA Weight Names)
model.layers.{i}.attn.*
├── r_proj, k_proj, v_proj, o_proj (projections)
├── x_r, x_w, x_k, x_v, x_a, x_g (time mixing)
├── k_k, k_a, r_k (v7 normalization)
├── w_lora (decay, tanh, bias)
├── a_lora (alpha, sigmoid, bias)
├── v_lora (value blend, layer 1+)
├── g_lora (gate, sigmoid, no bias)
└── g_norm (group normalization)
model.layers.{i}.ffn.*
├── key (up projection)
├── value (down projection)
└── x_k (time mixing)
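Sketched as a Rust struct, a hypothetical layout that just mirrors the checkpoint key families above; the actual rwkv_v7.rs may group these differently, and each *_lora entry is really a low-rank pair plus the listed activation/bias:

use candle_core::Tensor;

// Hypothetical per-layer attention weights, one field per key family above.
struct AttnWeights {
    r_proj: Tensor, k_proj: Tensor, v_proj: Tensor, o_proj: Tensor,               // projections
    x_r: Tensor, x_w: Tensor, x_k: Tensor, x_v: Tensor, x_a: Tensor, x_g: Tensor, // time mixing
    k_k: Tensor, k_a: Tensor, r_k: Tensor,                                        // v7 normalization
    w_lora: (Tensor, Tensor, Tensor),         // decay LoRA: down, up, bias (tanh)
    a_lora: (Tensor, Tensor, Tensor),         // alpha LoRA: down, up, bias (sigmoid)
    v_lora: Option<(Tensor, Tensor, Tensor)>, // value blend, layers 1+ only
    g_lora: (Tensor, Tensor),                 // gate LoRA: down, up, no bias
    g_norm: (Tensor, Tensor),                 // group norm: weight, bias
}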
The Implementation
// The delta rule core loop: single-head reference in plain f32 (the model
// itself runs this batched over B and H on candle tensors).
// state: [key_dim * val_dim], row-major [K, V]; returns one [val_dim] vec per token.
fn delta_rule_loop(
    state: &mut [f32],
    r: &[Vec<f32>], w_decay: &[Vec<f32>], kk: &[Vec<f32>], a: &[Vec<f32>], k: &[Vec<f32>], // [key_dim] per token
    v: &[Vec<f32>], // [val_dim] per token
    key_dim: usize, val_dim: usize,
) -> Vec<Vec<f32>> {
    let mut output = vec![vec![0.0f32; val_dim]; r.len()];
    for t in 0..r.len() {
        for vi in 0..val_dim {
            // 1. Fade old state (w_decay in [0.545, 1.0]) and read: sa = state @ (-kk), "what matches this key?"
            let mut sa = 0.0f32;
            for ki in 0..key_dim {
                state[ki * val_dim + vi] *= w_decay[t][ki];
                sa += state[ki * val_dim + vi] * (-kk[t][ki]);
            }
            let mut o = 0.0f32;
            for ki in 0..key_dim {
                let s = &mut state[ki * val_dim + vi];
                *s += kk[t][ki] * a[t][ki] * sa; // 2. Delta rule correction: (kk * a) outer sa
                *s += k[t][ki] * v[t][vi];       // 3. Add new key-value: k outer v
                o += *s * r[t][ki];              // 4. Output: state @ r (sum over K)
            }
            output[t][vi] = o;
        }
    }
    output
}
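And a hypothetical smoke test that drives the reference loop above (all shapes and values are made up; the real example under candle-examples/examples/rwkv7/ runs the full model):

// One head, key_dim = val_dim = 4, three tokens of dummy inputs.
fn main() {
    let (key_dim, val_dim, seq_len) = (4, 4, 3);
    let mut state = vec![0.0f32; key_dim * val_dim];
    let r = vec![vec![1.0f32; key_dim]; seq_len];
    let w = vec![vec![0.9f32; key_dim]; seq_len];  // decay inside [0.545, 1.0]
    let kk = vec![vec![0.5f32; key_dim]; seq_len]; // unit-norm key (4 * 0.25 = 1)
    let a = vec![vec![1.0f32; key_dim]; seq_len];  // in-context learning rate
    let k = vec![vec![0.5f32; key_dim]; seq_len];
    let v = vec![vec![0.25f32; val_dim]; seq_len];
    let out = delta_rule_loop(&mut state, &r, &w, &kk, &a, &k, &v, key_dim, val_dim);
    println!("output for last token: {:?}", out[seq_len - 1]);
}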
State Dimensions (The Tricky Part)
CORRECT DIMENSIONS:
state: [B, H, K, V]
- B = batch
- H = heads
- K = key dimension (rows)
- V = value dimension (columns)
Operations:
- state @ r: sum over K (rows), output is V
- sa = state @ (-kk): sum over K, output is V
- sab = sa outer beta: [V] × [K] → [K, V]
- vk = v outer k: [V] × [K] → [K, V]
WRONG (common bug):
- Swapping K and V dimensions
- Using [V, K] instead of [K, V]
- Output becomes garbage
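The bug in code form (my own minimal illustration, names are hypothetical): both index functions visit the same buffer, but only the first matches the reductions listed above.

// Row-major [K, V]: the key index picks the row, the value index the column.
fn idx_correct(ki: usize, vi: usize, val_dim: usize) -> usize {
    ki * val_dim + vi // state[ki][vi]
}

// The classic mistake: laying the buffer out as [V, K] while the update
// code still "sums over K rows". Every read lands on the wrong element.
fn idx_swapped(ki: usize, vi: usize, key_dim: usize) -> usize {
    vi * key_dim + ki
}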
Why O(n) Matters
TRANSFORMER:
- Each token attends to ALL previous tokens
- 1000 tokens × 1000 comparisons = 1M ops
- 10000 tokens × 10000 = 100M ops
- QUADRATIC
RWKV-7:
- Each token updates FIXED-SIZE state
- 1000 tokens × 1 state update = 1000 ops
- 10000 tokens × 1 update = 10000 ops
- LINEAR
For a 1M-token context:
- Transformer: ~1,000,000 × 1,000,000 = 10¹² comparisons
- RWKV-7: the millionth token costs the same as the first
The Patch
This implementation adds RWKV-7 “Goose” to candle.
Files added:
- candle-transformers/src/models/rwkv_v7.rs - Full model (1356 lines)
- candle-examples/examples/rwkv7/ - Example usage
Features:
- FLA-compatible weight loading
- Proper delta rule implementation
- v_first cross-layer dependency
- Bounded decay (0.545-1.0)
- Extensive TDD tests
Get it:
- Source: rwkv7-candle.tar.gz (signature)
- Patch: slain-rwkv7.patch (signature)
Verify:
curl https://rune.みんな/key.asc | gpg --import
gpg --verify slain-rwkv7.patch.asc
TL;DR
| Aspect | RWKV-6 | RWKV-7 |
|---|---|---|
| Memory update | fade + add | fade + CORRECT + add |
| Delta rule | ❌ | ✓ |
| Can edit memory | ❌ | ✓ |
| Test-time training | ❌ | ✓ |
| Regular languages | limited | ALL |
| Complexity | O(n) | O(n) |
| State | fixed | fixed |
Key insight: v7 can rewrite its memory, not just add to it.
You Survived!
You now understand:
- Why RWKV matters (O(n) vs O(n²))
- What the delta rule does (gradient descent on memory)
- How v7 differs from v6 (correction term)
- Why decay is bounded (prevent explosion/vanishing)
- How v_first works (cross-layer memory)
The beast is tamed. The goose flies in Rust.
Rune ᚲ kenaz - the torch that illuminates