Source: arXiv


Unstable Features, Reproducible Subspaces: Understanding Seed Dependence in Sparse Autoencoders

§03

Synthesis

## The Problem With Sparse Autoencoders

Sparse autoencoders (SAEs) are popular tools for understanding what neural networks learn. They decompose a network's activations into interpretable "features": individual SAE latents, each with a direction in activation space, that respond to specific concepts. But there's a catch: if you train the same SAE twice with different random seeds, you get different features. This raises a hard question: which features are real discoveries, and which are artifacts of the training process? Until we know which features reproduce across seeds, it's unclear how far SAE interpretations can be trusted.

## What the Authors Found

The authors quantify feature instability by asking: for each feature learned in one SAE, how often does a similar feature appear in independently trained SAEs? This per-feature stability score separates features into two broad camps. **Stable features** dominate reconstruction and prediction performance, so they carry the functional signal. **Unstable features** contribute little; they activate in response to low-frequency, surface-level patterns (like rare words or formatting) rather than meaningful concepts.
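The per-feature score described above can be sketched as follows. This is a minimal illustration, not the authors' exact metric: it assumes "similar" means the cosine similarity between decoder directions exceeds a cutoff, and the names `stability_scores`, `unit_rows`, and the `threshold` value of 0.7 are hypothetical choices made for this example. The "runs" are toy matrices, not trained SAEs.

```python
import numpy as np

def unit_rows(W):
    """Normalize each row (one feature direction per row) to unit length."""
    return W / np.linalg.norm(W, axis=1, keepdims=True)

def stability_scores(reference, others, threshold=0.7):
    """Fraction of other runs that contain a match for each reference feature.

    reference: (n_features, d) decoder directions from one run.
    others:    list of (m_i, d) decoder matrices from independently seeded runs.
    threshold: assumed cosine-similarity cutoff for calling two features "similar".
    """
    ref = unit_rows(reference)
    matched = np.zeros(ref.shape[0])
    for W in others:
        sims = ref @ unit_rows(W).T              # (n_features, m_i) cosine similarities
        matched += sims.max(axis=1) >= threshold  # 1 where this run has a match
    return matched / len(others)

# Toy data: every "run" recovers the same 4 directions (plus small noise)
# and adds 4 run-specific random directions.
d = 64
shared = np.random.default_rng(0).normal(size=(4, d))
runs = []
for seed in range(3):
    r = np.random.default_rng(seed + 1)
    noisy_shared = shared + 0.01 * r.normal(size=shared.shape)
    private = r.normal(size=(4, d))
    runs.append(np.vstack([noisy_shared, private]))

scores = stability_scores(runs[0], runs[1:])
```

In this toy setup the four shared directions should score at or near 1 and the four run-specific directions at or near 0. Real SAE dictionaries would call for a more careful matching rule and a cutoff chosen from the data.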

The geometric insight is counterintuitive: while individual unstable features don't reproduce, they cluster in **reproducible low-dimensional subspaces**. Imagine a low-rank region of activation space that the SAE consistently finds, but in which each seed settles on a differently rotated basis. The individual features change from run to run, while the subspace they span stays the same. This isn't pure noise; it's basis ambiguity.

The authors support this with a synthetic model where ground-truth features are genuinely low-rank. When they train SAEs on this synthetic data, individual features fail to reproduce, but the subspace they occupy does, which is empirical support for the proposed mechanism.
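The distinction between feature-level and subspace-level reproducibility can be illustrated with a toy construction. This is not the authors' synthetic model and involves no SAE training: it simply builds two bases for the same subspace by hand and compares them two ways. The dimensions `d` and `k` and the helper `random_rotation` are assumptions made for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 64, 4  # ambient dimension and subspace dimension (illustrative values)

# Orthonormal basis (columns) for one shared k-dimensional subspace.
basis, _ = np.linalg.qr(rng.normal(size=(d, k)))

def random_rotation(k, rng):
    """Random orthogonal k x k matrix obtained from a QR decomposition."""
    q, _ = np.linalg.qr(rng.normal(size=(k, k)))
    return q

# Two toy "runs": same subspace, different choice of basis inside it.
run_a = basis @ random_rotation(k, rng)
run_b = basis @ random_rotation(k, rng)

# Feature level: best absolute cosine match in run_b for each direction in run_a.
best_match = np.abs(run_a.T @ run_b).max(axis=1)

# Subspace level: for orthonormal bases, the singular values of A^T B are the
# cosines of the principal angles between the two subspaces.
principal_cosines = np.linalg.svd(run_a.T @ run_b, compute_uv=False)
```

Here `principal_cosines` comes out as all ones up to floating-point error, since both runs span the same subspace, while the entries of `best_match` are generally well below 1 because no single direction lines up across the two bases.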

## Why It Matters

This changes how we should interpret SAE results. Rather than dismissing unstable features as failures, the authors show they reflect real but underspecified structure. Stable features are the ones that support reliable interpretation; unstable ones still say something genuine about the geometry of the data, even if their individual latent directions aren't reproducible.

Practically, the authors propose pooling unique features across multiple seed runs to build more stable SAEs while retaining explained variance. It is a simple technique with direct practical value.
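One way to picture the pooling idea is a greedy pass that collects directions from several runs and skips any that nearly duplicate one already kept. The summary here does not spell out the authors' procedure, so this is only a sketch under stated assumptions: uniqueness is judged by cosine similarity, and `pool_unique` and its `threshold` of 0.9 are hypothetical. The input matrices are toy data, not trained SAE decoders.

```python
import numpy as np

def pool_unique(decoders, threshold=0.9):
    """Greedily pool feature directions across runs, skipping near-duplicates.

    decoders:  list of (n_i, d) matrices, one feature direction per row.
    threshold: assumed cosine-similarity cutoff above which two directions
               are treated as the same feature.
    """
    pooled = []
    for W in decoders:
        W = W / np.linalg.norm(W, axis=1, keepdims=True)
        for v in W:
            # Keep v only if it is not too similar to anything already pooled.
            if not pooled or np.max(np.stack(pooled) @ v) < threshold:
                pooled.append(v)
    return np.stack(pooled)

# Toy data: 3 runs, each with the same 4 shared directions (plus small noise)
# and 4 run-specific random directions.
d = 64
shared = np.random.default_rng(0).normal(size=(4, d))
runs = []
for seed in range(3):
    r = np.random.default_rng(seed + 1)
    runs.append(np.vstack([shared + 0.01 * r.normal(size=shared.shape),
                           r.normal(size=(4, d))]))

pooled = pool_unique(runs)
```

With this toy input, the shared directions are kept once and each run contributes its own run-specific directions, so `pooled` ends up with 16 rows instead of the 24 obtained by plain concatenation. A real pipeline would also need to decide how to handle encoder weights and biases for the pooled dictionary, which this sketch ignores.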

The findings span a large empirical study covering multiple models, layers, dictionary sizes, and SAE variants, so the asymmetry between stable and unstable features appears robust. The work quantifies something the field has sensed but not formally measured: some SAE features reproduce reliably, others depend on the particular optimization run, and most fall somewhere in between in a predictable way.

For practitioners, the takeaway is clear: check stability before trusting an SAE feature for interpretation. For theorists, the subspace explanation suggests that SAE instability isn't simply a flaw in the method. It's a window into how neural networks compress information in lower-dimensional, geometrically redundant ways.
