Hunt 003: The Linear Beasts

2026-01-29 02:00 929 words 5 min read

MAMBA and DELTA - they look different but they're cousins. Capture quest revealing the secret family of linear attention.

Reading time: 5 min Prerequisites: None. We got you. Survival rate: 100% (these beasts are friendly)


The Quest (Why You Should Care)

You’ve heard of Mamba. You’ve heard of Delta Attention.

Different papers. Different teams. Different names.

But here’s the secret nobody told you:

They're the same beast wearing different skins.

Once you see it, you can’t unsee it.


The Hunt Board Notice

QUEST TYPE: CAPTURE - DO NOT SLAY

"Two rare beasts have been sighted. MAMBA in the
western servers (Gu & Dao, Dec 2023), DELTA in the
eastern datacenters (Kimi, Nov 2024).

Guild research indicates they may be related species.

Capture both for study. High reward."

The Two Beasts: First Impressions

🐍 MAMBA                      △ DELTA LINEAR
----------------------------------------------
Born: Dec 2023               Born: Nov 2024
Origin: Western papers       Origin: Eastern papers
Creator: Gu & Dao            Creator: Kimi/Moonshot
llama.cpp: ✅ supported      llama.cpp: Issue #16930
Status: Community favorite   Status: New meta incoming

They LOOK different. Different equations. Different variable names.

But watch how they fight…


The Scary Runes (Side by Side)

🐍 Mamba’s Rune

h_t = A·h_{t-1} + B·x_t
y_t = C·h_t

△ Delta’s Rune

S_t = S_{t-1} + β_t · (v_t k_t^T - S_{t-1})
o_t = S_t · q_t

See? Totally different! …right?

Let’s bonk these runes.


Tamed Versions

🐍 Mamba Says

"I have a memory (h).
 Every step, I update it:

 new_memory = A × old_memory + B × new_input

 A = how much to KEEP from before
 B = how much the new stuff matters

 Then I output through C."

Even simpler: “Blend old memory with new input. Output the result.”
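
Want to poke it? Here's the blend as a tiny toy loop in Python. This is only a sketch of the idea: the real Mamba derives A, B, C from the input and runs the recurrence as a hardware-friendly parallel scan, none of which appears here.

import numpy as np

# Toy sketch of the Mamba-style update -- NOT the real layer.
d = 4                            # toy number of channels
A = np.full(d, 0.8)              # how much old memory to KEEP
B = np.full(d, 0.2)              # how much the new input matters
C = np.ones(d)                   # readout weights

h = np.zeros(d)                  # the memory
for x_t in np.random.randn(10, d):     # 10 toy "tokens"
    h = A * h + B * x_t                # blend old memory with new input
    y_t = C * h                        # output through C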


△ Delta Says

"I have a memory (S).
 Every step, I update it:

 new_memory = old_memory + β × (new_stuff - old_memory)

 β = how much to update (0 to 1)
 (new_stuff - old_memory) = THE DELTA (what's different!)

 Then I answer queries with q."

Even simpler: “Figure out what’s NEW. Move toward it a little bit.”
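
Same game, Delta-flavoured. Another sketch, not the paper: a real Delta layer keeps a matrix state S, writes it with outer products v·k^T, and reads it with a query q; this toy keeps a plain vector so the "move toward what's new" step is easy to see.

import numpy as np

# Toy sketch of the Delta-style update -- NOT the real layer.
d = 4
beta = 0.2                        # how much to update (0 to 1)

S = np.zeros(d)                   # the memory
for new_stuff in np.random.randn(10, d):
    delta = new_stuff - S         # THE DELTA: what's different
    S = S + beta * delta          # move a little bit toward it
    # a real layer would now answer a query q by reading from S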


The Cousin Revelation

Now watch this:

MAMBA:  new = A × old + B × new_input
        "blend these two things"

DELTA:  new = old + β × (target - old)
        "move old toward target"

Rewrite Mamba slightly (normalize so the keep-weight and add-weight sum to 1):

MAMBA:  new = (1-α) × old + α × new_input
        where α = B/(A+B)

DELTA:  new = (1-β) × old + β × new_stuff

THEY’RE THE SAME FORMULA.

+--------------------------------------------+
|                                            |
|   Both are just:                           |
|                                            |
|   new = (keep_this_much × old)             |
|       + (add_this_much × new_stuff)        |
|                                            |
|   Different names. Same idea.              |
|   THEY'RE COUSINS.                         |
|                                            |
+--------------------------------------------+
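
Don't take the box's word for it. Here's a quick numeric check using the simplified scalar forms above, with the Mamba weights already normalized so keep + add = 1 (that normalization is my assumption, not the exact paper math):

old, new_input = 10.0, 2.0

alpha = 0.2
mamba_style = (1 - alpha) * old + alpha * new_input   # blend old + new

beta = 0.2
delta_style = old + beta * (new_input - old)          # move toward target

print(mamba_style, delta_style)   # 8.4 8.4 -- same number, same beast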

The Family Tree

                RECURRENCE
                    |
         "update memory each step"
                    |
        +-----------+-----------+
        |                       |
     RNN/LSTM              STATE SPACE
   (old school)            (new school)
                                |
                       +--------+--------+
                       |                 |
                   🐍 MAMBA           △ DELTA
               "blend formula"   "delta formula"
                       |                 |
                       +--------+--------+
                                |
                         SAME ANCESTOR:
                "fixed memory, linear compute"

What They Share (The Family Traits)

BOTH beasts have:
- O(n) compute      → no quadratic curse!
- Fixed memory size → bounded, predictable
- Content-aware gate → they LEARN what matters
- Selective updates → "bouncer" logic
- Linear scaling    → 1M tokens? No problem

Different skins.
Different variable names.
SAME FAMILY.
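
To make the "fixed memory" trait concrete, here's a back-of-the-envelope comparison against a vanilla-attention KV cache. All the dimensions are made-up toy numbers, but the shape of the result is the point:

d_model, d_state = 2048, 16        # assumed toy dimensions

def kv_cache_floats(seq_len):
    # full attention: keys + values for every past token (one layer)
    return 2 * seq_len * d_model

def linear_state_floats(seq_len):
    # linear beast: one fixed-size state, however long the input (one layer)
    return d_model * d_state

for n in (1_000, 100_000, 1_000_000):
    print(f"{n:>9} tokens: KV cache {kv_cache_floats(n):>13,} floats,"
          f" fixed state {linear_state_floats(n):,} floats")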

The Everyday Analogy

Think of updating your phone:

MAMBA APPROACH:
- Take 80% of old apps
- Add 20% new apps
- Blend = new phone state
- "Keep most, add some"

DELTA APPROACH:
- Look at difference between old and target
- Move 20% toward target
- Result = new phone state
- "Move toward what I want"

SAME RESULT. Different mental model.

Why This Matters

HUNTER'S INSIGHT:

If you understand ONE of these beasts,
you understand BOTH.

- Learn Mamba first (more tutorials exist)
- When Delta drops fully, you're ready
- Same concepts, different notation
- Master the family, not just one member

DON'T:
- Wait for "the winner" to emerge
- Learn them as separate things
- Get confused by different notation

DO:
- See the family resemblance
- Learn the shared concepts
- Adapt to either instantly

Practical Status

🐍 MAMBA - Ready Now:
- Candle (Rust) ✅
- llama.cpp ✅
- HuggingFace ✅
- PyTorch native ✅
- GO USE IT

△ DELTA - Coming Soon:
- Kimi API (proprietary) ✅
- llama.cpp Issue #16930 (in progress)
- Open weights: kimi-k2 (2025)
- Community: catching up
- LEARN MAMBA, BE READY FOR DELTA
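
Want to capture the tame one right now? A minimal sketch of the HuggingFace route is below. The checkpoint name and Mamba support are assumptions on my side (they need a reasonably recent transformers release), so double-check before relying on it.

from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed: a recent `transformers` with Mamba support and the
# "state-spaces/mamba-130m-hf" checkpoint -- verify both.
model_id = "state-spaces/mamba-130m-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("The linear beasts are", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(out[0], skip_special_tokens=True))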

TL;DR

Aspect          🐍 Mamba              △ Delta
------------------------------------------------------
Core idea       Blend old + new       Add the difference
Gate name       Δ (discretization)    β (forget rate)
Memory          h (hidden state)      S (state matrix)
Complexity      O(n)                  O(n)
Memory usage    Fixed                 Fixed
Family          Linear attention      Linear attention
Usable now?     YES                   Coming soon

Key insight: Learn one, understand both. They’re cousins.


You Survived!

You now understand the linear attention family better than most researchers who only read one paper.

The beasts looked different because:

  • Different authors
  • Different notation conventions
  • Different marketing

But now you see:

  • Same ancestor (recurrence)
  • Same goal (fixed memory, linear compute)
  • Same mechanism (gated state update)

The beasts are family. Nobody told you.



Rune QQ ᚦ bonk - we bonk the scary math so you don’t have to
