FP4: the 16-value float powering billion-parameter models
The first time I heard “FP8” at a conference I thought someone was joking. Eight bits for a floating point number? That’s barely enough for a color channel. Then the benchmarks came out and I stopped laughing.
Now someone handed me FP4. Four bits. Sixteen total possible values. The whole number line for a model weight, expressed as a lookup table that fits in a tweet.
The story in one sentence
John Cook wrote a compact, mathematically honest explainer on FP4 - the 4-bit floating point format used in NVIDIA’s Blackwell AI hardware - and it turns out you can describe the entire format in one table.
--> // making it invisible to querySelectorAll. // // `data-cfasync="false"` keeps this rescue script executable even when // Rocket Loader is active. It rescues module scripts via two strategies: // 1. Query the DOM for type$="-module" + src (covers case A) // 2. Regex-parse the raw HTML for commented-out script tags (covers case B) // Dynamically-created scripts bypass Rocket Loader entirely. (function () { if (window.__markdyRescue) return; window.__markdyRescue = true; var rescued = false; function rescueModuleScripts() { if (rescued) return; rescued = true; var srcs = []; // Strategy 1: Rocket Loader kept the tag in DOM but changed the type. // type="module" → type="{uuid}-module" (still has src attribute) document.querySelectorAll('script[type$="-module"][src]').forEach(function (s) { srcs.push(s.src); }); // Strategy 2: Rocket Loader COMMENTED OUT the script tag entirely: // // These are invisible to querySelectorAll, so we parse the raw HTML. // We handle both attribute orderings (type-first or src-first). var html = document.documentElement.innerHTML; var reSrcFirst = //g; var reTypeFirst = //g; var m; while ((m = reSrcFirst.exec(html)) !== null) { srcs.push(m[1]); } while ((m = reTypeFirst.exec(html)) !== null) { srcs.push(m[1]); } // Re-inject each found src as a real module script. // Deduplicate first, then inject. Dynamically-created scripts bypass // Rocket Loader entirely. Modules with the same URL are only executed // once by the browser (cached), so re-injecting already-running scripts // is safe. var seen = {}; srcs.forEach(function (src) { if (seen[src]) return; seen[src] = true; var fix = document.createElement('script'); fix.type = 'module'; fix.src = src; fix.setAttribute('data-cfasync', 'false'); document.head.appendChild(fix); }); } // Rescue when user clicks the placeholder (fallback if autoplay failed). document.addEventListener('click', function (e) { var t = e.target; if (t && typeof t.closest === 'function' && t.closest('.markdy-placeholder')) { rescueModuleScripts(); } }); // Rescue automatically after a short delay for autoplay. // Only fires if initAll() never ran (no data-markdy-init on any root). setTimeout(function () { if (document.querySelector('.markdy-root:not([data-markdy-init])')) { rescueModuleScripts(); } }, 1500); }());What FP4 actually is
The most common FP4 layout is E2M1: 1 sign bit, 2 exponent bits, 1 mantissa bit. With 4 bits total you get exactly 16 distinct values. Here they are, in their entirety:
| Sign | Exp | Mantissa | Value |
|---|---|---|---|
| 0 | 00 | 0 | 0 |
| 0 | 00 | 1 | 0.5 |
| 0 | 01 | 0 | 1 |
| 0 | 01 | 1 | 1.5 |
| 0 | 10 | 0 | 2 |
| 0 | 10 | 1 | 3 |
| 0 | 11 | 0 | 4 |
| 0 | 11 | 1 | 6 |
Plus the same 8 negatives. That’s your number line: ±0, ±0.5, ±1, ±1.5, ±2, ±3, ±4, ±6.
This is not a mistake. The nonuniform spacing is a feature - the exponent bits give you more resolution near zero, which is exactly where neural network weights cluster. A plain 4-bit integer could only be 0, 1, 2, ..., 15. That flat distribution is terrible for weights. The floating point arrangement puts six values between 0 and 2, and only two between 4 and 6, which mirrors what the weights actually look like in a trained model.
Why this hit the front page
Cook’s explainer is the rare kind: short, no bullshit, with an actual value table so you can see all sixteen numbers with your own eyes. The post has a quiet elegance - here is everything a 4-bit float can be, numbered, no hype.
It hit HN because quantization is the current arms race. Every layer of a large model has millions of weights. FP32 costs 4 bytes each. FP16 costs 2. FP8 costs 1. FP4 costs half a byte - you can fit two weights in the space one FP16 weight used to occupy. On a model with 400 billion parameters, that arithmetic becomes a very large number very fast.
NVIDIA’s Blackwell hardware (B200, GB300) ships with native FP4 tensor cores. Their variant, NVFP4, adds a calibration layer: every block of 16 values gets its own FP8 scaling factor. This two-level scheme (local FP8 scale + global FP32 scale) recovers most of the dynamic range lost by cramming a weight into 4 bits. The result, according to NVIDIA’s own benchmarks, is less than 1% accuracy degradation versus FP8, with 6× more arithmetic throughput versus FP16 and 50% less memory bandwidth.
DeepSeek-R1 671B on a Blackwell B200 in FP4 is reportedly 3× faster than the same model on an H200 in FP8. For a model that large, “3×” is a billion dollars worth of infrastructure.
What the thread is arguing about
The HN comments are split between two camps.
Camp 1: this is beautiful engineering. The precision reduction is real but bounded. Cook’s post links to a Python snippet that generates the entire format. People are experimenting, posting the full list of representable values, and comparing E2M1 against alternative layouts like E3M0 (all exponent, log scale, no mantissa at all) and E1M2 (more mantissa resolution, less range). E3M0 is pure powers of two. E0M3 is linear. E2M1 is the sweet spot for weight distributions.
Camp 2: sixteen values is obviously insane. The skeptics point out that any general-purpose numeric computation breaks down immediately at this precision. You can’t do numerical integration, you can’t accumulate gradients, you can’t even reliably add two floating point numbers without catastrophic cancellation. FP4 is not a number format for computation - it is a storage format for model parameters that have already been trained and are being read in batches to feed into higher-precision math.
Both camps are right. FP4 only makes sense if you think of a model weight as a compressed lookup table entry, not a variable in an equation.
Precision vs format: a cheat sheet
| Format | Bits | Distinct values | Typical use in ML |
|---|---|---|---|
| FP32 | 32 | ~4.3 billion | Training, gradients |
| BF16 | 16 | ~65,536 | Mixed-precision training |
| FP8 | 8 | 256 | Inference, activations |
| FP4 | 4 | 16 | Weight storage, inference |
The table is also, by itself, a history of how the field has been hacking around GPU memory constraints for the last eight years.
The uncomfortable elegance
There is something almost insulting about the efficiency of this. A billion years of mathematical tradition produced floating point arithmetic as a way to represent real numbers with sufficient precision for science and engineering. Then some researchers discovered that you can describe a useful neural network weight with the same information density as the answer to a simple multiple-choice question.
The network doesn’t care that the number is imprecise. It has seen enough training examples that the weight was never really a precise real number - it was always a rough location in a high-dimensional space. Rounding it to the nearest element of ±0, ±0.5, ±1, ±1.5, ±2, ±3, ±4, ±6 and calling it done turns out, empirically, to be fine.
Which is either a triumph of pragmatism or a fairly damning statement about what large language models actually are. Probably both.
Discussion on Hacker News · Source: johndcook.com · Submitted by chmaynard
Hoang Yell
A software developer and technical storyteller. I read Hacker News every day and retell the best stories here — in English and Vietnamese — for curious people who don't have time to scroll.