FP4: the 16-value float powering billion-parameter models

The first time I heard “FP8” at a conference I thought someone was joking. Eight bits for a floating point number? That’s barely enough for a color channel. Then the benchmarks came out and I stopped laughing.

Now someone handed me FP4. Four bits. Sixteen total possible values. The whole number line for a model weight, expressed as a lookup table that fits in a tweet.

The story in one sentence

John Cook wrote a compact, mathematically honest explainer on FP4 - the 4-bit floating point format used in NVIDIA’s Blackwell AI hardware - and it turns out you can describe the entire format in one table.

The float ladder from FP32 down to FP4 collapses billions of values to just 16, and NVIDIA reports that a Blackwell GPU runs FP4 roughly 6 times faster than FP16 with under 1% accuracy lost relative to FP8.

What FP4 actually is

The most common FP4 layout is E2M1: 1 sign bit, 2 exponent bits, 1 mantissa bit. With 4 bits total you get exactly 16 bit patterns (15 distinct numbers, since +0 and -0 compare equal). Here are the eight with the sign bit clear, in their entirety:

Exp	Mantissa	Value
00	0	0
00	1	0.5
01	0	1
01	1	1.5
10	0	2
10	1	3
11	0	4
11	1	6

Plus the same 8 negatives. That’s your number line: ±0, ±0.5, ±1, ±1.5, ±2, ±3, ±4, ±6.
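The table can be regenerated from the bit fields. Here is a minimal Python sketch, assuming the bits are ordered sign, exponent, mantissa (most significant first), an exponent bias of 1, and no codes reserved for infinity or NaN. The name decode_e2m1 is made up for this example and is not taken from Cook's post.

```python
def decode_e2m1(code):
    """Decode a 4-bit E2M1 code (0..15) into a Python float."""
    sign = -1.0 if (code >> 3) & 1 else 1.0
    exponent = (code >> 1) & 0b11
    mantissa = code & 1
    if exponent == 0:
        # Subnormal row: no implicit leading 1, so the values are 0 and 0.5.
        magnitude = mantissa * 0.5
    else:
        # Normal rows: implicit leading 1, exponent bias of 1 (assumed).
        magnitude = (1 + mantissa / 2) * 2.0 ** (exponent - 1)
    return sign * magnitude


ALL_VALUES = [decode_e2m1(code) for code in range(16)]
print(ALL_VALUES[:8])  # the eight codes with the sign bit clear
print(ALL_VALUES[8:])  # the same magnitudes with the sign bit set
```

Codes 0 through 7 reproduce the rows of the table in order; codes 8 through 15 are their negatives, starting with -0.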

This is not a mistake. The nonuniform spacing is a feature - the exponent bits give you more resolution near zero, which is exactly where neural network weights cluster. A plain unsigned 4-bit integer could only be 0, 1, 2, ..., 15 (or -8 through 7 if signed), evenly spaced. That flat distribution is terrible for weights. The floating point arrangement puts five of its eight non-negative values between 0 and 2, and only two between 4 and 6, which mirrors what the weights actually look like in a trained model.
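To see the spacing rather than take it on faith, a few lines of Python list the gap between each value and the next one up. The list below is copied from the table above; the variable names are just for this sketch.

```python
NONNEGATIVE_FP4 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

# Gap between each value and the next one up.
gaps = [b - a for a, b in zip(NONNEGATIVE_FP4, NONNEGATIVE_FP4[1:])]

for value, gap in zip(NONNEGATIVE_FP4, gaps):
    print(f"{value:>4} -> next value is {gap} away")
```

The step is 0.5 up to 2, then 1, then 2: the grid gets coarser as the magnitude grows, whereas an integer grid would use a step of 1 everywhere.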

Why this hit the front page

Cook’s explainer is the rare kind: short, no bullshit, with an actual value table so you can see all sixteen numbers with your own eyes. The post has a quiet elegance - here is everything a 4-bit float can be, numbered, no hype.

It hit HN because quantization is the current arms race. Every layer of a large model has millions of weights. FP32 costs 4 bytes each. FP16 costs 2. FP8 costs 1. FP4 costs half a byte - you can fit four weights in the space one FP16 weight used to occupy. On a model with 400 billion parameters, that is the difference between 800 GB of weights in FP16 and 200 GB in FP4.
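The storage arithmetic is simple enough to check by hand, but here it is as a sketch. It counts raw weight bytes only and ignores scaling factors and other metadata. The helper names are hypothetical, and packing two codes per byte with the first code in the low nibble is an assumption, not a documented hardware layout.

```python
BITS_PER_WEIGHT = {"FP32": 32, "FP16": 16, "FP8": 8, "FP4": 4}


def weight_gigabytes(num_params, bits):
    """Raw weight storage in GB (10**9 bytes), ignoring scales and metadata."""
    return num_params * bits / 8 / 1e9


def pack_pair(low_code, high_code):
    """Pack two 4-bit codes (0..15 each) into a single byte."""
    return (high_code << 4) | low_code


def unpack_pair(byte):
    """Inverse of pack_pair: returns (low_code, high_code)."""
    return byte & 0x0F, byte >> 4


for name, bits in BITS_PER_WEIGHT.items():
    print(name, weight_gigabytes(400e9, bits), "GB")

print(unpack_pair(pack_pair(0b0011, 0b1111)))
```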

NVIDIA’s Blackwell hardware (B200, GB300) ships with native FP4 tensor cores. Their variant, NVFP4, adds a calibration layer: every block of 16 values gets its own FP8 scaling factor. This two-level scheme (local FP8 scale + global FP32 scale) recovers most of the dynamic range lost by cramming a weight into 4 bits. The result, according to NVIDIA’s own benchmarks, is less than 1% accuracy degradation versus FP8, with 6× more arithmetic throughput versus FP16 and 50% less memory bandwidth.
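The per-block scaling idea can be sketched in plain Python. This is a simplified illustration, not NVIDIA's implementation: the scale is kept as an ordinary Python float rather than FP8, the global FP32 scale is left out, ties in rounding go to the smaller magnitude, and every function name here is made up for the example.

```python
FP4_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 16  # block length stated in the text


def round_to_fp4(x):
    """Nearest E2M1 value to x; ties go to the smaller magnitude (a simplification)."""
    magnitude = min(FP4_MAGNITUDES, key=lambda v: abs(abs(x) - v))
    return -magnitude if x < 0 else magnitude


def quantize_block(block):
    """Return (scale, fp4_values) for one block of weights."""
    largest = max(abs(w) for w in block)
    # Choose the scale so the largest weight lands on 6, the top FP4 magnitude.
    scale = largest / 6.0 if largest > 0 else 1.0
    return scale, [round_to_fp4(w / scale) for w in block]


def dequantize_block(scale, fp4_values):
    """Multiply the stored FP4 values back by the block's scale."""
    return [scale * v for v in fp4_values]


weights = [0.02 * i - 0.15 for i in range(BLOCK_SIZE)]
scale, fp4_values = quantize_block(weights)
restored = dequantize_block(scale, fp4_values)
worst = max(abs(a - b) for a, b in zip(weights, restored))
print("scale:", scale, "worst absolute error:", worst)
```

Because each block picks its own scale, a block of tiny weights and a block of large ones both use the full ±6 range. The worst-case error within a block is one scale unit, half the widest gap in the grid (the one between 4 and 6).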

DeepSeek-R1 671B on a Blackwell B200 in FP4 is reportedly 3× faster than the same model on an H200 in FP8. For a model that large, “3×” is a billion dollars’ worth of infrastructure.

What the thread is arguing about

The HN comments are split between two camps.

Camp 1: this is beautiful engineering. The precision reduction is real but bounded. Cook’s post links to a Python snippet that generates the entire format. People are experimenting, posting the full list of representable values, and comparing E2M1 against alternative layouts like E3M0 (all exponent, log scale, no mantissa at all), E1M2 (more mantissa resolution, less range), and E0M3 (no exponent bits at all). E3M0 is pure powers of two. E0M3 is linear. E2M1 is the sweet spot for weight distributions.
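For anyone who wants to run the same comparison, here is a rough sketch of a generic 4-bit decoder. This is not the snippet from Cook's post. The exponent biases in LAYOUTS are assumptions chosen for illustration, an all-zero (or absent) exponent field is treated as subnormal, and no codes are reserved for infinity or NaN.

```python
def decode(code, exp_bits, man_bits, bias):
    """Decode a 4-bit code laid out as sign | exponent | mantissa."""
    sign = -1.0 if code >> (exp_bits + man_bits) else 1.0
    exponent = (code >> man_bits) & ((1 << exp_bits) - 1)
    mantissa = code & ((1 << man_bits) - 1)
    fraction = mantissa / (1 << man_bits)
    if exponent == 0:
        # Exponent field all zeros (or absent): no implicit leading 1.
        return sign * fraction * 2.0 ** (1 - bias)
    return sign * (1 + fraction) * 2.0 ** (exponent - bias)


# name -> (exponent bits, mantissa bits, bias); the biases are assumed.
LAYOUTS = {
    "E3M0": (3, 0, 3),
    "E2M1": (2, 1, 1),
    "E1M2": (1, 2, 1),
    "E0M3": (0, 3, 1),
}


def nonnegative_values(name):
    """Sorted values of the eight codes whose sign bit is clear."""
    exp_bits, man_bits, bias = LAYOUTS[name]
    return sorted(decode(code, exp_bits, man_bits, bias) for code in range(8))


for name in LAYOUTS:
    print(name, nonnegative_values(name))
```

Under these assumed biases, every nonzero E3M0 value is a power of two, E0M3 steps evenly in increments of 1/8, and E2M1 reproduces the table from earlier.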

Camp 2: sixteen values is obviously insane. The skeptics point out that any general-purpose numeric computation breaks down immediately at this precision. You can’t do numerical integration, you can’t accumulate gradients, you can’t even reliably add two of these numbers and get an answer that is on the list: 1.5 + 1 is 2.5, which has to round to 2 or 3, and 6 + 6 is out of range entirely. FP4 is not a number format for computation - it is a storage format for model parameters that have already been trained and are being read in batches to feed into higher-precision math.
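The skeptics' point is easy to reproduce. The sketch below simulates an accumulator that is itself stored in FP4: each addition is done exactly, then rounded back to the nearest representable value. The rounding rule (ties go to the smaller magnitude, out-of-range results saturate at 6) is an assumption made for this illustration, and the function names are made up.

```python
FP4_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def round_to_fp4(x):
    """Nearest E2M1 value; ties go to the smaller magnitude, anything past 6 saturates."""
    magnitude = min(FP4_MAGNITUDES, key=lambda v: abs(abs(x) - v))
    return -magnitude if x < 0 else magnitude


def fp4_add(a, b):
    """Add exactly, then round the result back into FP4."""
    return round_to_fp4(a + b)


# Accumulate sixteen copies of 0.5 in an FP4 accumulator; the exact sum is 8.
acc = 0.0
for _ in range(16):
    acc = fp4_add(acc, 0.5)

print(acc)                # 2.0: the sum stalls because 2.5 rounds back to 2
print(fp4_add(6.0, 6.0))  # 6.0: the exact answer, 12, is out of range
```

This is why the sums have to be carried in a wider format: an FP4 accumulator stops moving as soon as the increment is no bigger than half the local gap.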

Both camps are right. FP4 only makes sense if you think of a model weight as a compressed lookup table entry, not a variable in an equation.

Precision vs format: a cheat sheet

Format	Bits	Bit patterns	Typical use in ML
FP32	32	~4.3 billion	Training, gradients
BF16	16	65,536	Mixed-precision training
FP8	8	256	Inference, activations
FP4	4	16	Weight storage, inference

The table is also, by itself, a history of how the field has been hacking around GPU memory constraints for the last eight years.

The uncomfortable elegance

There is something almost insulting about the efficiency of this. Centuries of mathematical tradition produced floating point arithmetic as a way to represent real numbers with sufficient precision for science and engineering. Then some researchers discovered that you can describe a useful neural network weight with the same information density as the answer to a sixteen-option multiple-choice question.

The network doesn’t care that the number is imprecise. It has seen enough training examples that the weight was never really a precise real number - it was always a rough location in a high-dimensional space. Rounding it to the nearest element of ±0, ±0.5, ±1, ±1.5, ±2, ±3, ±4, ±6 and calling it done turns out, empirically, to be fine.

Which is either a triumph of pragmatism or a fairly damning statement about what large language models actually are. Probably both.

Discussion on Hacker News · Source: johndcook.com · Submitted by chmaynard