Reading time: 8 min Prerequisites: Know what a tensor is. That’s it. Survival rate: 100% (the math is simpler than it looks)
The Quest (Why You Should Care)
Your 7B model needs 28GB of VRAM. You have 8GB.
QUEST BOARD:
"A technique exists to compress models 8x
with minimal quality loss. The ancients
called it NF4 (4-bit NormalFloat).
bitsandbytes guards the Python temple.
But Rust has no such blessing.
Until now."
The Beast: Full Precision Weights
BEAST: F32 WEIGHTS
TYPE: Memory Hog
HP: ∞ (unlimited VRAM hunger)
Each weight = 32 bits = 4 bytes
7B params × 4 bytes = 28 GB
Your GPU: "I literally cannot"
The Weapon: 4-bit Quantization
What if each weight was 4 bits instead of 32?
COMPRESSION:
32 bits → 4 bits = 8× smaller
7B params × 0.5 bytes = 3.5 GB
Your GPU: "I can do that"
But wait—how do you represent a continuous value with only 16 possible numbers?
The Magic Table (NF4’s Secret)
Neural network weights follow a normal distribution.
Most weights cluster around zero. Few are at extremes.
NF4 exploits this:
THE 16 SACRED VALUES:
Index | Value
-------|----------
0 | -1.0000
1 | -0.6962
2 | -0.5251
3 | -0.3949
4 | -0.2844
5 | -0.1848
6 | -0.0911
7 | 0.0000
8 | 0.0796
9 | 0.1609
10 | 0.2461
11 | 0.3379
12 | 0.4407
13 | 0.5626
14 | 0.7230
15 | 1.0000
These aren’t random. They’re quantiles of a normal distribution.
Each value represents where ~6.25% of weights naturally fall.
Tamed Version
NORMAL QUANTIZATION (bad):
[-1, 1] split into 16 equal parts
→ wastes codes on empty space
NF4 QUANTIZATION (good):
[-1, 1] split by WHERE WEIGHTS ACTUALLY ARE
→ more precision where it matters
Think of it like this:
BAD: Assign equal parking spots in a city
(most spots empty in suburbs)
GOOD: Assign spots by population density
(lots downtown, few in suburbs)
NF4 = density-aware parking
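You can see the density-awareness directly in the spacing of the table. A quick standalone sketch (plain Rust, not part of the patch) that prints the gap between adjacent NF4 codes:

```rust
// Print the spacing between adjacent NF4 codes: roughly 0.08–0.09 near zero,
// about 0.28–0.30 out at ±1. Precision is spent where the weights actually live.
const NF4: [f32; 16] = [
    -1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
    0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0,
];

fn main() {
    for w in NF4.windows(2) {
        println!("{:+.4} -> {:+.4}   gap {:.4}", w[0], w[1], w[1] - w[0]);
    }
}
```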
The Hunt: How It Works
Step 1: Find the Scale (absmax)
For each block of 64 weights:
scale = max(|w|) ← biggest absolute value
This is the "absmax"
Step 2: Normalize
normalized = weight / scale
Now all weights are in [-1, 1]
Step 3: Quantize
Find closest value in the magic table
Store the 4-bit index (0-15)
Step 4: Pack
Two 4-bit values fit in one byte!
[index1][index2] = 1 byte
Half the storage of one index per byte
The Formula (Scary Rune)
QUANTIZE:
q = argmin_i |normalized - TABLE[i]|
DEQUANTIZE:
weight = TABLE[q] × scale
That’s it. That’s the whole thing.
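A worked example with made-up numbers, just to see the round trip:

WORKED EXAMPLE (made-up numbers):
weight = 0.53, block absmax = 1.2
normalized = 0.53 / 1.2 ≈ 0.4417
closest table value = 0.4407 → stored as index 12 (4 bits)
dequantized = 0.4407 × 1.2 ≈ 0.5288
round-trip error ≈ 0.0012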
The CUDA Kernel (The Weapon)
// The magic table lives in GPU constant memory
__device__ __constant__ float NF4_TABLE[16] = {...};

// Dequantize: index → value
__device__ float dequantize_nf4(unsigned char val) {
    return NF4_TABLE[val & 0x0F];
}

// Each thread handles 2 values (1 packed byte)
__global__ void nf4_dequantize_kernel(
    const uint8_t* packed,
    const float* absmax,
    float* output,
    int n,
    int block_size
) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int elem = idx * 2;
    if (elem >= n) return;

    uint8_t byte = packed[idx];
    float scale = absmax[elem / block_size];

    // High nibble = first element
    output[elem] = dequantize_nf4(byte >> 4) * scale;
    // Low nibble = second element
    if (elem + 1 < n) {
        output[elem + 1] = dequantize_nf4(byte) * scale;
    }
}
Why Binary Search? (The Optimization)
Quantization needs to find closest table value.
Linear search: 16 comparisons
Binary search: 4 comparisons

              if x > 0.0398?
               /          \
             YES           NO
             /               \
     if x > 0.389?      if x > -0.340?
          ...                 ...
4 if statements → done. Fast.
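Here is the same idea in Rust, as a sketch (not the patch's code): precompute the 15 midpoints between adjacent table values, then binary-search them. `partition_point` does the 4-comparison search for us.

```rust
// Sketch: nearest-code lookup via binary search over the 15 midpoints between
// adjacent NF4 values — the same decision tree as above, not the patch's code.
const NF4: [f32; 16] = [
    -1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
    0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0,
];

fn quantize_nf4(x: f32, boundaries: &[f32; 15]) -> u8 {
    // Number of boundaries below x == index of the nearest table value.
    boundaries.partition_point(|&b| x > b) as u8
}

fn main() {
    // boundaries[i] = midpoint between NF4[i] and NF4[i + 1]; compute once.
    let mut boundaries = [0.0f32; 15];
    for i in 0..15 {
        boundaries[i] = 0.5 * (NF4[i] + NF4[i + 1]);
    }
    assert_eq!(quantize_nf4(0.4417, &boundaries), 12); // nearest to 0.4407
    assert_eq!(quantize_nf4(-0.03, &boundaries), 7);   // nearest to 0.0
}
```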
The Rust Implementation
// The sacred table
pub const NF4_DEQUANT_TABLE: [f32; 16] = [
    -1.0, -0.6962, -0.5251, -0.3949,
    -0.2844, -0.1848, -0.0911, 0.0,
    0.0796, 0.1609, 0.2461, 0.3379,
    0.4407, 0.5626, 0.7230, 1.0,
];

// CPU dequantization
pub fn dequantize(storage: &Nf4Storage) -> Vec<f32> {
    let mut output = vec![0.0; storage.num_elements];
    for i in 0..storage.num_elements {
        let byte_idx = i / 2;
        let packed = storage.data[byte_idx];
        // Even elements live in the high nibble, odd elements in the low nibble
        let q = if i % 2 == 0 {
            (packed >> 4) & 0x0F
        } else {
            packed & 0x0F
        };
        // One absmax scale per block of 64 elements
        let scale = storage.absmax[i / 64];
        output[i] = NF4_DEQUANT_TABLE[q as usize] * scale;
    }
    output
}
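The patch's public API isn't reproduced here, but the forward direction is a mirror image of the dequantize above. A minimal sketch, assuming the same 64-element blocks and high-nibble-first packing (the function name and return shape are mine, not the patch's), reusing NF4_DEQUANT_TABLE from above:

```rust
// Minimal CPU quantization sketch (hypothetical helper, not the patch's API).
// Returns (packed nibbles, per-block absmax) for a fixed block size of 64.
pub fn quantize_nf4_cpu(weights: &[f32]) -> (Vec<u8>, Vec<f32>) {
    // Step 1: one absmax per block of 64 weights.
    let absmax: Vec<f32> = weights
        .chunks(64)
        .map(|block| block.iter().fold(0.0f32, |m, w| m.max(w.abs())))
        .collect();

    // Nearest table entry by linear scan (the binary-search lookup shown
    // earlier would also work; 16 entries is cheap either way).
    let nearest = |x: f32| -> u8 {
        let mut best = 0u8;
        let mut best_err = f32::INFINITY;
        for (i, &v) in NF4_DEQUANT_TABLE.iter().enumerate() {
            let err = (v - x).abs();
            if err < best_err {
                best_err = err;
                best = i as u8;
            }
        }
        best
    };

    // Steps 2–4: normalize, look up, pack two 4-bit codes per byte.
    let mut packed = vec![0u8; (weights.len() + 1) / 2];
    for (i, &w) in weights.iter().enumerate() {
        let scale = absmax[i / 64];
        let normalized = if scale > 0.0 { w / scale } else { 0.0 };
        let q = nearest(normalized);
        if i % 2 == 0 {
            packed[i / 2] |= q << 4; // first element of the pair: high nibble
        } else {
            packed[i / 2] |= q;      // second element: low nibble
        }
    }
    (packed, absmax)
}
```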
GPU-Resident Storage (Zero Copy)
The real speedup: keep quantized weights on GPU.
BAD PATH:
CPU [quantized] → copy → GPU [quantized] → dequant → GPU [float]
GOOD PATH:
GPU [quantized already there] → dequant → GPU [float]
Zero copies. Maximum speed.
Our Nf4StorageGpu does this.
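The exact struct isn't reproduced here, but conceptually it looks something like this sketch (field names and the cudarc types are my assumption, not the patch's definition):

```rust
// Conceptual shape of GPU-resident NF4 storage: a hypothetical sketch assuming
// the cudarc crate that candle's CUDA backend builds on — not the patch's code.
use cudarc::driver::CudaSlice;

pub struct Nf4GpuSketch {
    packed: CudaSlice<u8>,   // ceil(n / 2) bytes, already in device memory
    absmax: CudaSlice<f32>,  // ceil(n / 64) scales, already in device memory
    num_elements: usize,
}
// Dequantization launches the kernel against these buffers directly:
// nothing crosses the PCIe bus until you ask for the f32 output on the host.
```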
Benchmarks
7B MODEL LOADING:
F32: 28 GB VRAM, 12 sec load
F16: 14 GB VRAM, 6 sec load
NF4: 3.5 GB VRAM, 2 sec load ← this
DEQUANTIZATION SPEED:
CPU: slow (but works)
CUDA: ~400x faster than CPU
TL;DR
| Aspect | Value |
|---|---|
| Bits per weight | 4 (vs 32) |
| Compression | ~8× |
| Quality loss | Minimal |
| Secret | Normal distribution quantiles |
| Block size | 64 elements |
| Storage | packed bytes + absmax |
| GPU support | F32, F16, BF16 |
The Patch
This implementation adds NF4 to candle (Rust ML framework).
Files added:
- candle-kernels/src/nf4.cu - CUDA kernels
- candle-core/src/nf4/mod.rs - CPU + API
- candle-core/src/nf4/cuda.rs - CUDA bridge
- candle-core/src/nf4/gpu.rs - Zero-copy GPU storage
Get it:
- Patch: slain-nf4.patch
- Signature: slain-nf4.patch.asc
- Source: nf4-candle.tar.gz
Verify:
curl https://rune.みんな/key.asc | gpg --import
gpg --verify slain-nf4.patch.asc
You Survived!
You now understand:
- Why NF4 uses 16 specific values (normal distribution quantiles)
- How quantization works (normalize → lookup → pack)
- Why GPU-resident storage matters (zero copies)
- How to verify signed patches (trust nobody)
The beast is tamed. The model fits in VRAM.
Rune QQ ᚲ kenaz - the torch that illuminates