Track2View: 4D-Consistent Camera-Controlled Video Generation via Paired 3D Point Tracks

## The Problem: Controlling Camera Movement in AI-Generated Videos

Imagine you have a video and want to refilm it from a completely different camera angle—say, moving the viewpoint left instead of right. Current AI methods struggle here. They either treat each frame independently (losing temporal consistency), rely on noisy 3D reconstructions, or learn hidden correspondences that don't generalize beyond their training data. The result: flickering, ghosting, and scenes that don't feel spatially coherent.

Track2View addresses this by making the camera control problem explicit: instead of guessing where pixels should go, the method uses 3D point tracks, sparse trajectories that record where specific scene points appear in both the original and target camera views over time.

## How It Works: Point Tracks as a Bridge

The core insight is that a 3D point in space projects to different locations in different camera views. If you know where point A appears in the source video (frame-by-frame) and where it should appear in the target view (given the new camera path), you have a concrete map: "this content at this time in the source belongs there at that time in the target."
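The idea of a paired track can be made concrete with a minimal sketch. It assumes a standard pinhole camera model with shared intrinsics `K` and per-frame world-to-camera poses; the function names `project_track` and `paired_track`, and all numbers in the toy example, are hypothetical illustrations and not taken from the Track2View code.

```python
import numpy as np

def project_track(points_world, K, R, t):
    """Project one 3D track into one camera, frame by frame.

    points_world: (T, 3) world-space position of the point at each frame.
    K: (3, 3) pinhole intrinsics (assumed shared across frames).
    R: (T, 3, 3) and t: (T, 3) world-to-camera rotation and translation.
    Returns (T, 2) pixel coordinates and (T,) camera-space depth.
    """
    points_cam = np.einsum("tij,tj->ti", R, points_world) + t
    pixels_h = points_cam @ K.T                    # homogeneous pixel coordinates
    uv = pixels_h[:, :2] / pixels_h[:, 2:3]        # assumes nonzero depth
    return uv, points_cam[:, 2]

def paired_track(points_world, K, src_pose, tgt_pose):
    """Return where the same 3D point lands in the source and target views."""
    src_uv, src_depth = project_track(points_world, K, *src_pose)
    tgt_uv, tgt_depth = project_track(points_world, K, *tgt_pose)
    visible = (src_depth > 0) & (tgt_depth > 0)    # in front of both cameras
    return src_uv, tgt_uv, visible

# Toy example: a static point 2 units in front of a fixed source camera,
# while the target camera's translation changes a little every frame.
T = 3
K = np.array([[100.0, 0.0, 50.0],
              [0.0, 100.0, 50.0],
              [0.0, 0.0, 1.0]])
point = np.tile(np.array([0.0, 0.0, 2.0]), (T, 1))
identity = np.tile(np.eye(3), (T, 1, 1))
src_pose = (identity, np.zeros((T, 3)))
tgt_t = np.zeros((T, 3))
tgt_t[:, 0] = 0.2 * np.arange(T)                   # world-to-camera x offset per frame
tgt_pose = (identity, tgt_t)

src_uv, tgt_uv, visible = paired_track(point, K, src_pose, tgt_pose)
print(src_uv)   # the point stays at the image center in the source view
print(tgt_uv)   # the same point drifts horizontally in the target view
```

Each row of `src_uv` and `tgt_uv` is one entry of the "this content at this time belongs there at that time" map for a single tracked point.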

Track2View extracts these correspondences by running a 3D point tracker on concatenated multi-camera footage—a practical pipeline that finds one-to-one matches between source and target viewpoints. These sparse tracks are temporally continuous by construction, unlike per-frame methods that treat each moment in isolation.

The dual-view track conditioner is the central component. It feeds the paired 3D tracks into a video diffusion transformer (a neural network that generates video by iteratively denoising it). The conditioner first uses purely geometric operations, with no learnable parameters, to transfer visual context from source to target; a learned temporal aggregation module then fuses that information across frames. Because the transfer step is geometric rather than learned, the design is intended to generalize to camera trajectories not seen during training instead of memorizing specific motions.
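One plausible reading of the parameter-free transfer step is a gather-and-scatter: sample source features where each track lands in the source view, then write them where the same track lands in the target view. The sketch below assumes nearest-pixel sampling and averaging on collisions; the function name `transfer_features` and the array layout are hypothetical, and the learned temporal aggregation module is deliberately left out. It is an illustration of the idea, not the authors' implementation.

```python
import numpy as np

def transfer_features(src_feats, src_uv, tgt_uv, visible):
    """Move source features to target positions along paired tracks.

    src_feats: (T, H, W, C) per-frame source feature maps.
    src_uv, tgt_uv: (T, N, 2) pixel coordinates (x, y) of N tracks per frame.
    visible: (T, N) bool, True where a track is usable in both views.
    Returns (T, H, W, C) target-aligned features and a (T, H, W) hit mask.
    """
    T, H, W, C = src_feats.shape
    out = np.zeros((T, H, W, C), dtype=np.float64)
    count = np.zeros((T, H, W), dtype=np.int64)

    # Nearest-pixel rounding (an assumption; bilinear sampling is another option).
    sx = np.rint(src_uv[..., 0]).astype(int)
    sy = np.rint(src_uv[..., 1]).astype(int)
    tx = np.rint(tgt_uv[..., 0]).astype(int)
    ty = np.rint(tgt_uv[..., 1]).astype(int)

    # Keep only tracks that fall inside both images.
    inside = (visible
              & (sx >= 0) & (sx < W) & (sy >= 0) & (sy < H)
              & (tx >= 0) & (tx < W) & (ty >= 0) & (ty < H))
    t_idx, n_idx = np.nonzero(inside)

    gathered = src_feats[t_idx, sy[t_idx, n_idx], sx[t_idx, n_idx]]      # (K, C)
    target_index = (t_idx, ty[t_idx, n_idx], tx[t_idx, n_idx])
    np.add.at(out, target_index, gathered)   # unbuffered add handles collisions
    np.add.at(count, target_index, 1)

    hit = count > 0
    out[hit] /= count[hit][:, None]          # average tracks sharing a target pixel
    return out, hit

# Toy example: one frame, one track carrying a 2-channel feature.
feats = np.zeros((1, 4, 4, 2))
feats[0, 1, 1] = [3.0, 5.0]                          # feature at source pixel (x=1, y=1)
src_uv = np.array([[[1.0, 1.0]]])
tgt_uv = np.array([[[2.0, 3.0]]])                    # target pixel (x=2, y=3)
visible = np.array([[True]])
tgt_feats, hit = transfer_features(feats, src_uv, tgt_uv, visible)
```

Nothing in this step is trained, which is the property the text credits for generalization: a new camera path only changes `tgt_uv`, not any weights.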

## Why This Matters

On a 400-video benchmark covering static scenes (like rotating cameras around buildings) and dynamic ones (people moving, objects changing), Track2View outperforms existing methods by large margins: **30–65% lower rotation error and 61–72% lower translation error**. These aren't marginal improvements—they're the difference between a usable output and visibly broken results.
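The summary does not say how rotation and translation error are defined, so the sketch below uses common conventions as an assumption: the geodesic angle between estimated and reference rotations, the Euclidean distance between camera translations, and a relative reduction expressed as a percentage. All function names and input values are illustrative and are not results from the benchmark.

```python
import numpy as np

def rotation_error_deg(R_est, R_gt):
    """Geodesic angle in degrees between two 3x3 rotation matrices."""
    cos_angle = (np.trace(R_est.T @ R_gt) - 1.0) / 2.0
    return float(np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0))))

def translation_error(t_est, t_gt):
    """Euclidean distance between estimated and reference camera translations."""
    return float(np.linalg.norm(np.asarray(t_est) - np.asarray(t_gt)))

def relative_reduction(baseline_err, new_err):
    """Percentage by which new_err is lower than baseline_err."""
    return 100.0 * (baseline_err - new_err) / baseline_err

# Illustrative inputs only: a camera that is off by 10 degrees about the z-axis.
angle = np.radians(10.0)
R_gt = np.eye(3)
R_est = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                  [np.sin(angle),  np.cos(angle), 0.0],
                  [0.0,            0.0,           1.0]])
rot_err = rotation_error_deg(R_est, R_gt)
trans_err = translation_error([1.0, 0.0, 0.0], [0.0, 0.0, 0.0])
reduction = relative_reduction(2.0, 1.0)   # halving an error is a 50% reduction
```

Under this reading, "61–72% lower translation error" means the new method's error is between roughly 28% and 39% of a baseline's error.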

The practical payoff: filmmakers and VFX artists can now prescribe exact camera paths and trust that the AI will follow them while keeping the scene looking natural and temporally consistent. The explicit track-based conditioning also makes the method interpretable—you can see exactly which correspondences are driving the generation, unlike black-box learned embeddings.

The data pipeline is equally valuable. By automating track extraction from multi-view footage, the authors sidestep the need for expensive manual annotation, making the approach scalable.
