Source: arXiv
Published: 19 June 2026
Snippets: 4

Authors

Qian Zhao, Kunlong Chen, Changxin Tian, Zhonghui Jiang, Haitao Zhang, Chaofan Yu, Peijie Jiang, Mingliang Gong, Jia Liu, Ziqi Liu, Zhiqiang Zhang, Jun Zhou

Rethinking Shrinkage Bias in LLM FP4 Pretraining: Geometric Origin, Systemic Impact, and UFP4 Recipe

Source · arxiv.org/watch?v=2606.20381 ↗

§02

Snippets

№01

FP4 training with non-uniform E2M1 format suffers from Shrinkage Bias, a systematic negative rounding error from geometric asymmetry that accumulates multiplicatively across layers.

Explains why current FP4 hardware recipes are unstable and points toward a concrete fix.

№02

Uniform grids like E1M2 and INT4 eliminate grid-geometry error and better leverage Random Hadamard Transform, outperforming E2M1 across 1.5B to 124B model pretraining.

Demonstrates a simple format change delivers measurable training improvements without algorithmic overhead.

№03

UFP4 applies Random Hadamard Transform to all three training GEMMs and restricts stochastic rounding to gradient computation alone.

Provides a practical recipe that practitioners can implement on current hardware to improve 4-bit training.

№04

Shrinkage Bias accumulation is amplified by Random Hadamard Transform, unifying explanations for training instability in existing E2M1 recipes.

Reveals why a common quantization technique backfires with non-uniform formats, informing future algorithm design.

§03

Synthesis

## The Problem: Why FP4 Training Breaks Down

Training large language models in 4-bit floating point (FP4) could sharply cut memory and compute costs. Current hardware such as NVIDIA's Blackwell and AMD's MI350 GPUs standardizes on the E2M1 format, and training recipes built on it are unstable. The authors identify the culprit: **Shrinkage Bias**, a systematic negative rounding error rooted in the geometry of the E2M1 grid.

E2M1 uses a non-uniform grid: its representable magnitudes are 0, 0.5, 1, 1.5, 2, 3, 4, and 6, so the spacing between neighboring values widens as magnitude grows. When numbers are rounded to the nearest representable value, this asymmetry makes negative rounding errors more likely than positive ones, which pulls quantized values toward smaller magnitudes on average. During pretraining, the bias compounds multiplicatively across layers, so small errors in early layers become large errors downstream. The Random Hadamard Transform (RHT), a technique used to improve training stability, amplifies the accumulated error further when it is combined with E2M1.

The key insight is geometric: the representable bins in E2M1 create an inherent directional bias toward smaller values that standard rounding cannot overcome.
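To make the geometric claim concrete, the following is a minimal sketch (not the paper's code) that rounds Gaussian samples to the E2M1 grid and measures the average signed change in magnitude. The helper names `round_to_grid` and `mean_magnitude_error`, the per-tensor absmax scaling, and the toy compounding model `(1 + delta) ** depth` are all assumptions made for illustration; the paper's own analysis may differ.

```python
import numpy as np

# Non-negative magnitudes representable in E2M1 (the sign bit is handled separately).
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def round_to_grid(x, grid=E2M1_GRID):
    """Round each element to the nearest grid magnitude, keeping its sign.

    Magnitudes beyond the largest grid value saturate to that value.
    """
    x = np.asarray(x, dtype=np.float64)
    mag = np.abs(x)
    idx = np.abs(mag[..., None] - grid).argmin(axis=-1)
    return np.sign(x) * grid[idx]

def mean_magnitude_error(x, grid=E2M1_GRID):
    """Mean of |Q(x)| - |x| after absmax scaling; a negative value means shrinkage."""
    scale = np.max(np.abs(x)) / grid[-1]  # assumption: one scale per tensor
    q = round_to_grid(x / scale, grid) * scale
    return float(np.mean(np.abs(q) - np.abs(x)))

rng = np.random.default_rng(0)
x = rng.standard_normal(1 << 16)

# Signed error relative to the mean magnitude of the input.
delta = mean_magnitude_error(x) / float(np.mean(np.abs(x)))
print(f"relative magnitude error per quantization: {delta:+.4f}")

# Toy compounding model (an assumption): each layer rescales magnitudes by (1 + delta).
for depth in (8, 32, 64):
    print(f"depth {depth:3d}: magnitude factor {(1.0 + delta) ** depth:.4f}")
```

The per-step error is small, but under the toy model it enters as a repeated multiplicative factor, which is the sense in which a bias with a consistent sign is more damaging than zero-mean noise.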

## The Solution: Uniform Grids Instead

The authors propose **UFP4**, which swaps E2M1 for uniform-grid formats (E1M2 or INT4). Uniform grids space representable values evenly, eliminating the geometric asymmetry that causes Shrinkage Bias. INT4 and E1M2 formats maintain equal spacing, so rounding errors balance out rather than accumulate in one direction.
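The difference in grid geometry can be checked directly by enumerating each format's magnitudes from its bit layout. This is an illustrative sketch: `fp_magnitudes` and `spacings` are hypothetical helper names, the exponent bias of 1 and the absence of inf/NaN encodings are assumptions, and INT4 is shown in a symmetric sign-magnitude view (0 to 7).

```python
def fp_magnitudes(exp_bits, man_bits, bias=1):
    """Enumerate non-negative magnitudes of a tiny float format.

    Assumes subnormals at exponent code 0 and no inf/NaN encodings.
    """
    values = []
    for e in range(2 ** exp_bits):
        for m in range(2 ** man_bits):
            frac = m / 2 ** man_bits
            if e == 0:
                values.append(frac * 2.0 ** (1 - bias))          # subnormal
            else:
                values.append((1.0 + frac) * 2.0 ** (e - bias))  # normal
    return sorted(values)

def spacings(grid):
    """Gaps between consecutive grid points."""
    return [b - a for a, b in zip(grid, grid[1:])]

e2m1 = fp_magnitudes(2, 1)           # 0, 0.5, 1, 1.5, 2, 3, 4, 6
e1m2 = fp_magnitudes(1, 2)           # 0, 0.25, 0.5, ..., 1.75
int4 = [float(k) for k in range(8)]  # 0, 1, ..., 7

for name, grid in (("E2M1", e2m1), ("E1M2", e1m2), ("INT4", int4)):
    gaps = spacings(grid)
    print(name, grid, "uniform" if len(set(gaps)) == 1 else "non-uniform")
```

With a single exponent bit, E1M2 has one subnormal band and one normal band that share the same step, so up to a scale factor it coincides with the symmetric INT4 grid.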

UFP4's recipe:

- Apply RHT to all three training GEMMs (the forward matrix multiplication and the two backward ones that produce the activation gradient and the weight gradient)
- Use stochastic rounding (randomized rounding that is unbiased in expectation) only when quantizing the gradient tensor (dY)
- Keep everything else at 4-bit precision during forward and backward passes

This is deliberately minimalist: the authors avoid overcomplicating the approach and let the uniform grid geometry do the work.
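The recipe above can be sketched in NumPy for a single linear layer. This is a simulation under stated assumptions, not the authors' implementation: the function names (`hadamard`, `random_hadamard`, `quantize_uniform`, `rht_gemm`) are hypothetical, the grid is a symmetric INT4-like grid with one absmax scale per tensor (real recipes typically use finer-grained scaling), and the rotation is applied along each GEMM's contraction dimension, which must be a power of two here.

```python
import numpy as np

def hadamard(n):
    """Orthonormal n x n Hadamard matrix via the Sylvester construction."""
    assert n > 0 and n & (n - 1) == 0, "n must be a power of two"
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def random_hadamard(n, rng):
    """Random sign flips composed with a Hadamard matrix; the result is orthogonal."""
    signs = rng.choice([-1.0, 1.0], size=n)
    return signs[:, None] * hadamard(n)

def quantize_uniform(x, levels=7, rng=None):
    """Fake-quantize onto a uniform grid of 2 * levels + 1 points.

    rng=None uses round-to-nearest; passing a generator uses stochastic rounding.
    """
    scale = np.max(np.abs(x)) / levels
    if scale == 0.0:
        return x.copy()
    y = x / scale
    if rng is None:
        q = np.round(y)
    else:
        low = np.floor(y)
        q = low + (rng.random(y.shape) < (y - low))  # round up with prob. = fraction
    return np.clip(q, -levels, levels) * scale

def rht_gemm(a, b, rng, sr_a=False, sr_b=False):
    """Approximate a @ b with both operands rotated and quantized.

    (a @ R) @ (R.T @ b) equals a @ b exactly before quantization, since R is orthogonal.
    """
    R = random_hadamard(a.shape[1], rng)
    aq = quantize_uniform(a @ R, rng=rng if sr_a else None)
    bq = quantize_uniform(R.T @ b, rng=rng if sr_b else None)
    return aq @ bq

def rel_err(approx, exact):
    return float(np.linalg.norm(approx - exact) / np.linalg.norm(exact))

rng = np.random.default_rng(0)
X = rng.standard_normal((64, 32))   # activations
W = rng.standard_normal((32, 16))   # weights
dY = rng.standard_normal((64, 16))  # gradient arriving from the next layer

Y = rht_gemm(X, W, rng)                  # forward GEMM, round-to-nearest
dX = rht_gemm(dY, W.T, rng, sr_a=True)   # activation gradient, SR on dY only
dW = rht_gemm(X.T, dY, rng, sr_b=True)   # weight gradient, SR on dY only

print("forward rel. error:", rel_err(Y, X @ W))
print("dX rel. error:", rel_err(dX, dY @ W.T))
print("dW rel. error:", rel_err(dW, X.T @ dY))
```

Swapping `quantize_uniform` for a rounding function over a non-uniform grid would turn the same harness into a comparison between formats.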

## Why It Matters: Real Gains at Scale

The authors tested UFP4 on three models: Dense 1.5B parameters, MoE (mixture-of-experts) 7.9B, and MoE 124B, running full pretraining from scratch. Compared to E2M1 baselines, UFP4 consistently achieves lower loss degradation relative to full BF16 (16-bit brain floating point) training. Scaling-law analysis confirms the advantage persists and may grow with model size.

This matters because training LLMs is astronomically expensive. A 1–2% reduction in loss degradation at 124B scale translates to real savings in compute and memory, multiplied across the industry. Unlike ad-hoc tuning tricks, UFP4's advantage flows from solving a root-cause geometric problem, which suggests the gains are robust and portable across architectures.

The results also carry an implicit recommendation to hardware vendors: future accelerators should offer uniform 4-bit formats as first-class primitives, not edge cases. Current hardware choices (E2M1) are efficient to implement but mathematically suboptimal for training. The paper argues that modest hardware changes could unlock better scaling curves.
