uwu-rwkv7-101

2026-01-29 14:30 1130 words 6 min read

RWKV-7 "Goose" - the delta rule that teaches transformers to forget and REWRITE. Now in Rust. Signed patch for candle.

Reading time: ~6 min Prerequisites: Know what attention is. Vaguely. Survival rate: 100% (the goose is friendly)


The Quest (Why You Should Care)

Transformers have a memory problem.

TRANSFORMER MEMORY:
- remembers EVERYTHING in context
- costs O(n²) to remember
- context window = hard limit
- 1M tokens = 💀 your GPU

RWKV-7 MEMORY:
- fixed-size "notebook"
- costs O(n) to remember
- can process INFINITE tokens
- 1M tokens = same as 1 token

But v7 does something v6 couldn’t:

It can EDIT its notebook. Not just add—CORRECT.


The Hunt Board Notice

QUEST TYPE: CAPTURE - STUDY - IMPLEMENT

"A new beast has emerged from the Eastern servers.
RWKV-7, codename 'Goose'.

Unlike its ancestors (v5, v6), this one carries
a strange power: the Generalized Delta Rule.

It can rewrite its own memories mid-inference.
Test-time training. In production.

Capture its essence. Port to Rust."

The v6 Beast: What It Could Do

RWKV-6 MEMORY UPDATE:

notebook = fade_old × old_memory + new_note

That's it. Two operations:
1. Fade what's old
2. Add what's new

Simple. Effective. But LIMITED.

The problem: you can’t FIX a wrong note. Once written, it only fades. Never corrects.
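
For reference, here is the entire v6 write step as element-wise code. This is a minimal sketch, assuming a per-head state stored row-major as a K × V matrix with per-channel decay; the function name and layout are mine, not candle's. Notice there is no term that can undo an earlier write.

// v6-style update: fade what's old, add what's new. Illustrative only.
fn rwkv6_update(state: &mut [f32], w_decay: &[f32], k: &[f32], v: &[f32], v_dim: usize) {
    for i in 0..w_decay.len() {          // key channels (rows)
        for j in 0..v_dim {              // value channels (columns)
            state[i * v_dim + j] = w_decay[i] * state[i * v_dim + j]  // fade
                                 + k[i] * v[j];                       // add (the k ⊗ v outer product)
        }
    }
}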


The v7 Beast: The Delta Rule

RWKV-7 MEMORY UPDATE:

notebook = fade_old × old_memory
         + CORRECT(old_memory)    ← NEW!
         + new_note

Three operations:
1. Fade what's old
2. CORRECT what was wrong     ← THE MAGIC
3. Add what's new

The Math (Scary Rune)

V6 (old):
S_t = w × S_{t-1} + k ⊗ v

V7 (new):
S_t = w × S_{t-1} + S_{t-1} × α × β^T + k ⊗ v
                    ^^^^^^^^^^^^^^^^^^^^
                    THE DELTA RULE TERM

That middle term is doing gradient descent on the memory itself.


Tamed Version

What Each Term Does

w × S_{t-1}
  → "fade the old stuff"
  → decay between 0.545 and 1.0
  → gentle forgetting

S_{t-1} × α × β^T
  → "correct the old stuff"
  → α = what to look for (negative normalized key)
  → β = what to write back (key × learning rate)
  → this IS gradient descent!

k ⊗ v
  → "add new stuff"
  → outer product of key and value
  → standard attention write
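
A minimal sketch of how those two correction vectors might be built from the raw key, assuming L2 normalization and a per-channel in-context learning rate a; the function name and epsilon guard are mine, while the weight names match the list further down.

// Illustrative only: forming α (what to look for) and β (what to write back).
fn alpha_beta(k: &[f32], a: &[f32]) -> (Vec<f32>, Vec<f32>) {
    let norm = k.iter().map(|x| x * x).sum::<f32>().sqrt().max(1e-12);
    let kk: Vec<f32> = k.iter().map(|x| x / norm).collect();                 // normalized key
    let alpha: Vec<f32> = kk.iter().map(|x| -x).collect();                   // α = -kk
    let beta: Vec<f32> = kk.iter().zip(a).map(|(x, ai)| x * ai).collect();   // β = kk × learning rate
    (alpha, beta)
}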

The Gradient Descent Insight

Here’s the wild part:

The delta rule term is literally:

state = state - learning_rate × gradient

Where:
- gradient = how wrong is the current prediction?
- learning_rate = how much to fix it?

THE MODEL IS TRAINING ITSELF DURING INFERENCE.

This is why RWKV-7 can:

  • Track state (counting, position)
  • Learn patterns on the fly
  • Recognize ALL regular languages (provably!)
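
To make "training itself" precise: for one key/value pair, a single gradient step on the state's prediction error has exactly the correct-then-add shape above. A minimal derivation, assuming the squared-error loss of the classic delta rule (Widrow-Hoff), with S the [K, V] state, k the key, v the value, and η the learning rate; the post itself does not pin the loss down, so read this as one standard interpretation.

$$ \mathcal{L}(S) = \tfrac{1}{2}\,\lVert S^{\top}k - v\rVert^{2}, \qquad \nabla_{S}\mathcal{L} = k\,(S^{\top}k - v)^{\top} $$

$$ S \;\leftarrow\; S - \eta\,k\,(S^{\top}k - v)^{\top} \;=\; S \;-\; \underbrace{\eta\,k\,k^{\top}S}_{\text{correct old memory}} \;+\; \underbrace{\eta\,k\,v^{\top}}_{\text{add new note}} $$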

The Decay Formula (FLA-style)

OLD FORMULA (wrong):
w_decay = exp(-exp(w))
→ decay in (0, 1), can collapse toward 0

FLA FORMULA (correct):
w = -0.6065 × sigmoid(w_lora(x))   ← -0.6065 ≈ -exp(-0.5)
w_decay = exp(w)
→ decay in [0.545, 1.0], bounded

WHY BOUNDED?
- Too much decay = forgets everything
- Too little decay = remembers garbage
- Sweet spot: 0.545 to 1.0
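
As code, the bound is just a sigmoid squeezed into a fixed exponent range; w_raw stands for the w_lora(x) output of one channel (the name is mine).

// Hedged sketch of the FLA-style bounded decay above.
fn bounded_decay(w_raw: f32) -> f32 {
    let s = 1.0 / (1.0 + (-w_raw).exp()); // sigmoid in (0, 1)
    (-0.6065 * s).exp()                   // exp(-0.6065) ≈ 0.545, exp(0) = 1.0
}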

v_first: Cross-Layer Memory

RWKV-7 has a sneaky trick:

Layer 0:
  v_first = v  (capture initial value)

Layer 1+:
  v = v + (v_first - v) × sigmoid(v_lora)
    = interpolation between original and current!

WHY?
- Later layers can "remember" what layer 0 saw
- Cross-layer information flow
- Skip connections for values
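
As code, the blend is one line per channel; v_gate stands for sigmoid(v_lora(x)) here (a name picked for this sketch).

// Hedged sketch of the v_first interpolation for one channel.
fn blend_value(v: f32, v_first: f32, v_gate: f32) -> f32 {
    // Same as (1 - g) × v + g × v_first: a learned skip connection back to layer 0's value.
    v + (v_first - v) * v_gate
}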

The Architecture (FLA Weight Names)

model.layers.{i}.attn.*
├── r_proj, k_proj, v_proj, o_proj  (projections)
├── x_r, x_w, x_k, x_v, x_a, x_g    (time mixing)
├── k_k, k_a, r_k                   (v7 normalization)
├── w_lora                          (decay, tanh, bias)
├── a_lora                          (alpha, sigmoid, bias)
├── v_lora                          (value blend, layer 1+)
├── g_lora                          (gate, sigmoid, no bias)
└── g_norm                          (group normalization)

model.layers.{i}.ffn.*
├── key      (up projection)
├── value    (down projection)
└── x_k      (time mixing)
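
One way those names could be grouped when loading the tensors in Rust. This is a sketch of the grouping only; the field types and LoRA tuples are my reading of the tree above, not the actual layout of rwkv_v7.rs.

// Illustrative grouping of the FLA weight names for one attention block.
use candle_core::Tensor;

struct AttnWeights {
    r_proj: Tensor, k_proj: Tensor, v_proj: Tensor, o_proj: Tensor,               // projections
    x_r: Tensor, x_w: Tensor, x_k: Tensor, x_v: Tensor, x_a: Tensor, x_g: Tensor, // time mixing
    k_k: Tensor, k_a: Tensor, r_k: Tensor,                                        // v7 normalization
    w_lora: (Tensor, Tensor, Tensor),         // decay LoRA: down, up, bias (tanh)
    a_lora: (Tensor, Tensor, Tensor),         // alpha LoRA: down, up, bias (sigmoid)
    v_lora: Option<(Tensor, Tensor, Tensor)>, // value blend, present on layers 1+ only
    g_lora: (Tensor, Tensor),                 // gate LoRA: down, up (sigmoid, no bias)
    g_norm: (Tensor, Tensor),                 // group norm: weight, bias
}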

The Implementation

// The delta rule core loop (pseudocode for one head; `@` = matmul, `outer` = outer product)
for t in 0..seq_len {
    // 1. Fade old state; w_decay in [0.545, 1.0]
    state = w_decay * state;

    // 2. Delta rule correction
    sa = state @ (-kk);        // what does the state predict for the current key?
    sab = outer(sa, kk * a);   // correction term, scaled by the in-context learning rate
    state = state + sab;

    // 3. Add new key-value pair
    vk = outer(v, k);
    state = state + vk;

    // 4. Read out with the receptance
    output[t] = state @ r;
}

State Dimensions (The Tricky Part)

CORRECT DIMENSIONS:

state: [B, H, K, V]
  - B = batch
  - H = heads
  - K = key dimension (rows)
  - V = value dimension (columns)

Operations:
- state @ r: sum over K (rows), output is V
- sa = state @ (-kk): sum over K, output is V
- sab = sa outer beta: [V] × [K] → [K, V]
- vk = v outer k: [V] × [K] → [K, V]

WRONG (common bug):
- Swapping K and V dimensions
- Using [V, K] instead of [K, V]
- Output becomes garbage
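
Putting the loop and the dimension rules together: a dependency-free sketch of one delta-rule step for a single head, with the state stored row-major as [K, V]. The names follow the post; the function itself is an illustration under those conventions, not the candle implementation.

// One timestep, one head. state has k_dim × v_dim entries, row-major.
fn rwkv7_step(
    state: &mut [f32],   // [K, V]: state[i * v_dim + j]
    w_decay: &[f32],     // [K] per-channel decay in [0.545, 1.0]
    r: &[f32],           // [K] receptance
    k: &[f32],           // [K] key
    kk: &[f32],          // [K] normalized key (removal direction)
    a: &[f32],           // [K] in-context learning rate
    v: &[f32],           // [V] value
    k_dim: usize,
    v_dim: usize,
) -> Vec<f32> {
    // 1. Fade: scale every key-row by its decay.
    for i in 0..k_dim {
        for j in 0..v_dim {
            state[i * v_dim + j] *= w_decay[i];
        }
    }
    // 2. Correct: sa = state @ (-kk), summing over K → a [V] vector.
    let mut sa = vec![0.0f32; v_dim];
    for i in 0..k_dim {
        for j in 0..v_dim {
            sa[j] += state[i * v_dim + j] * (-kk[i]);
        }
    }
    //    sab = (kk * a) outer sa, written straight back into the [K, V] state.
    for i in 0..k_dim {
        for j in 0..v_dim {
            state[i * v_dim + j] += kk[i] * a[i] * sa[j];
        }
    }
    // 3. Add: the new k ⊗ v outer product.
    for i in 0..k_dim {
        for j in 0..v_dim {
            state[i * v_dim + j] += k[i] * v[j];
        }
    }
    // 4. Read out: output = state @ r, summing over K → a [V] vector.
    let mut out = vec![0.0f32; v_dim];
    for i in 0..k_dim {
        for j in 0..v_dim {
            out[j] += r[i] * state[i * v_dim + j];
        }
    }
    out
}

In the full model this runs once per token with the state carried forward, which is exactly why a million tokens cost the same per-token work as one.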

Why O(n) Matters

TRANSFORMER:
- Each token attends to ALL previous tokens
- 1000 tokens × 1000 comparisons = 1M ops
- 10000 tokens × 10000 = 100M ops
- QUADRATIC

RWKV-7:
- Each token updates FIXED-SIZE state
- 1000 tokens × 1 state update = 1000 ops
- 10000 tokens × 1 update = 10000 ops
- LINEAR

For 1M token context:
- Transformer: impossible
- RWKV-7: same speed as 1 token

The Patch

This implementation adds RWKV-7 “Goose” to candle.

Files added:

  • candle-transformers/src/models/rwkv_v7.rs - Full model (1356 lines)
  • candle-examples/examples/rwkv7/ - Example usage

Features:

  • FLA-compatible weight loading
  • Proper delta rule implementation
  • v_first cross-layer dependency
  • Bounded decay (0.545-1.0)
  • Extensive TDD tests

Get it:

Verify:

curl https://rune.みんな/key.asc | gpg --import
gpg --verify slain-rwkv7.patch.asc

TL;DR

Aspect               RWKV-6        RWKV-7
Memory update        fade + add    fade + CORRECT + add
Delta rule           no            yes
Can edit memory      no            yes
Test-time training   no            yes
Regular languages    limited       ALL
Complexity           O(n)          O(n)
State                fixed         fixed

Key insight: v7 can rewrite its memory, not just add to it.


You Survived!

You now understand:

  • Why RWKV matters (O(n) vs O(n²))
  • What the delta rule does (gradient descent on memory)
  • How v7 differs from v6 (correction term)
  • Why decay is bounded (prevent explosion/vanishing)
  • How v_first works (cross-layer memory)

The beast is tamed. The goose flies in Rust.


