You Don't Need Strong Assumptions: Visual Representation Learning via Temporal Differences

Synthesis

## The Trend: Fewer Assumptions, Better Results

The history of visual learning shows a clear pattern: each generation of methods succeeds by relaxing assumptions rather than tightening them. Supervised learning assumed fully labeled data. Weakly supervised learning relaxed that requirement. Self-supervised learning eliminated human labels entirely. Yet modern self-supervised approaches still enforce strong inductive biases: they augment images, mask patches, or crop regions to force the model to learn. The authors argue that these remaining biases become bottlenecks as data scales up, and they report experiments in which the optimal strength of these biases decreases with more data. This suggests the next frontier: learning without any of these artificial constraints.

## The Method: Physics Over Priors

The authors introduce Temporal Difference in Vision (TDV), which learns representations from video using only one assumption: causality. The idea is simple. In a video, the current frame plus the motion between frames should predict the next frame. TDV builds this into a loss function by jointly training two neural networks, an image encoder and a motion encoder, such that:

**current frame representation + motion representation = next frame representation**

No augmentations. No masking. No cropping. Just raw video frames and the principle that the past determines the future. The method treats visual representation learning as a temporal prediction problem grounded in a physical assumption rather than a hand-crafted algorithmic one.
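
The additive relation above can be written as a small loss function. The following is a minimal sketch, not the authors' implementation: the function names (`make_linear_encoder`, `tdv_loss`, `ideal_motion_encoder`, `zero_motion_encoder`), the use of mean squared error, the choice to feed both frames to the motion encoder, and the toy linear encoders are all assumptions made for illustration. Frames are flattened into plain lists of floats so the example runs with the standard library only.

```python
def make_linear_encoder(weights):
    """Return a toy image encoder: a fixed linear map (hypothetical stand-in
    for a learned network). `weights` is a list of rows."""
    def encode(frame):
        return [sum(w * x for w, x in zip(row, frame)) for row in weights]
    return encode


def tdv_loss(image_encoder, motion_encoder, frame_t, frame_next):
    """Mean squared error between
    image_encoder(frame_t) + motion_encoder(frame_t, frame_next)
    and image_encoder(frame_next).
    The squared-error form is an assumption; the source only states the
    additive relation between the three representations."""
    z_t = image_encoder(frame_t)
    z_next = image_encoder(frame_next)
    motion = motion_encoder(frame_t, frame_next)
    predicted = [z + m for z, m in zip(z_t, motion)]
    return sum((p - q) ** 2 for p, q in zip(predicted, z_next)) / len(z_next)


# Toy setup: 3-pixel "frames" mapped to 2-dimensional representations.
W = [[0.5, -1.0, 2.0],
     [1.5, 0.25, -0.75]]
image_encoder = make_linear_encoder(W)


def ideal_motion_encoder(frame_t, frame_next):
    """For a linear image encoder, encoding the frame difference
    satisfies the additive relation exactly."""
    diff = [b - a for a, b in zip(frame_t, frame_next)]
    return image_encoder(diff)


def zero_motion_encoder(frame_t, frame_next):
    """A motion encoder that ignores motion entirely."""
    return [0.0, 0.0]


frame_t = [1.0, 2.0, 3.0]
frame_next = [1.5, 2.0, 2.0]

loss_ideal = tdv_loss(image_encoder, ideal_motion_encoder, frame_t, frame_next)
loss_zero = tdv_loss(image_encoder, zero_motion_encoder, frame_t, frame_next)
print("loss with ideal motion encoder:", loss_ideal)
print("loss with zero motion encoder:", loss_zero)
```

With a linear image encoder, encoding the frame difference satisfies the relation exactly, so the first loss is zero, while ignoring motion leaves a positive residual. A real training setup would replace both encoders with learned networks and would also need some way to avoid degenerate solutions (for example, both encoders outputting constants), which the summary above does not describe.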

## Why It Matters

The practical implication is significant: TDV achieves state-of-the-art performance on dense spatial tasks (those requiring fine-grained spatial understanding, such as segmentation or depth estimation) despite abandoning techniques that have become standard in self-supervised learning. This supports the authors' hypothesis that strong inductive biases, while useful at smaller scales, become unnecessary and potentially harmful as data grows.

The conceptual contribution runs deeper. Self-supervised learning has relied on negative assumptions (what *not* to do: don't mask too much, don't crop too aggressively) rather than positive principles. TDV inverts this: it makes one clear causal assumption and lets the data do the work. This aligns with broader trends in deep learning, where scaling (more compute, more data) tends to reward simpler, less biased approaches over hand-tuned ones.

For practitioners, the message is that video is a richer signal than static images precisely because it embeds temporal structure. Rather than forcing models to invent invariances through augmentation, TDV exploits the causal structure that already exists in video. As datasets grow, this principled approach should outperform methods built on assumptions that made sense at smaller scales.

The work also opens a research direction: what other weak, physically grounded assumptions could replace the brittle inductive biases currently hardcoded into vision models? If the trend holds, fewer assumptions will be needed than we think.
