Source: arXiv
Published: 19 June 2026
Snippets: 5

A conversation between

Yuanxin Liu, Ruida Zhou, Xinyan Zhao, Amr Sharaf, Hongzhou Lin, Arijit Biswas, Mohammad Ghavamzadeh, Zhaoran Wang, Mingyi Hong

REVES: REvision and VErification-Augmented Training for Test-Time Scaling

Source · arxiv.org/watch?v=2606.18910 ↗


§02

Snippets

№01

A two-stage framework alternates between online data augmentation and policy optimization, converting intermediate mistakes into decoupled revision and verification prompts.

Existing post-training methods ignore the informative near-miss errors inside trajectories that eventually succeed; REVES systematically exploits them for more efficient learning.

№02

On LiveCodeBench, the method gains +6.5 points over the RL baseline and +4.0 points over standard multi-turn training.

Shows gains over recent multi-turn RL approaches on a practical coding benchmark, with reduced sampling overhead.

№03

The approach matches the previously reported state of the art on circle packing using a 4B base model with far fewer rollouts than larger evolutionary search systems.

Achieves competitive results on geometric optimization with a much smaller model and far fewer rollouts than existing methods.

№04

Off-policy data generation and decoupled training reduce the computational overhead of long-horizon sampling compared to standard multi-turn RL.

Makes test-time scaling more practical by addressing the computational bottleneck of sampling long reasoning trajectories.

№05

The method generalizes to out-of-distribution constraint-satisfaction puzzles like n-queens and mini-sudoku without task-specific tuning.

Shows the framework learns generalizable error-correction abilities rather than memorizing domain-specific solutions.

§03

Synthesis

## The Core Problem: Training Doesn't Match How LLMs Actually Reason at Test Time

Standard training of large language models optimizes for getting the right answer on the first try. Modern reasoning systems work differently: they revise and verify their answers across multiple steps. This creates a mismatch, because the model is trained for single-shot success but deployed to iterate. Recent work frames the problem as reinforcement learning over multi-step trajectories, but that approach overlooks something useful: when an LLM fails and then recovers, the intermediate "near-miss" steps carry learning signal that goes unexploited.

The authors' key claim is that explicitly converting these intermediate failures into separate training objectives, one for revision (fixing the answer) and one for verification (spotting errors), yields more efficient training and better test-time scaling than standard multi-turn RL.

## How REVES Works

The framework alternates between two stages:

**Stage 1: Data Augmentation.** When the model attempts to solve a problem and initially fails but eventually recovers (e.g., code that crashes, then gets debugged successfully), the method extracts the intermediate steps. It converts the "near-miss" answer into a revision prompt ("fix this answer") and a verification prompt ("is this answer correct?"). This creates multiple decoupled training instances from a single successful recovery trajectory, generating more learning material without more expensive rollouts.
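The conversion described in Stage 1 can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Attempt` and `Example` types, the `augment` function, the prompt wording, and the `"incorrect"` verdict label are all assumptions introduced here for clarity.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Attempt:
    answer: str      # the model's answer at this turn
    feedback: str    # environment feedback, e.g. a failing test message
    passed: bool     # whether this answer was judged correct


@dataclass
class Example:
    kind: str        # "revision" or "verification"
    prompt: str
    reference: str   # hypothetical target: the recovered answer, or a verdict


def augment(problem: str, trajectory: List[Attempt]) -> List[Example]:
    """Build decoupled examples from the failed turns of a recovered trajectory."""
    # Only trajectories that end in success contain a recovery to mine.
    if not trajectory or not trajectory[-1].passed:
        return []
    recovered = trajectory[-1].answer
    examples: List[Example] = []
    for attempt in trajectory[:-1]:
        if attempt.passed:
            continue  # not a near-miss, nothing to revise
        context = (
            f"Problem:\n{problem}\n\n"
            f"Candidate answer:\n{attempt.answer}\n\n"
        )
        # Revision prompt: the flawed answer plus its feedback (assumed format).
        examples.append(Example(
            "revision",
            context + f"Feedback:\n{attempt.feedback}\n\nFix this answer.",
            recovered,
        ))
        # Verification prompt: judge the same flawed answer (assumed format).
        examples.append(Example(
            "verification",
            context + "Is this answer correct?",
            "incorrect",
        ))
    return examples


# One failed attempt followed by a successful fix yields two examples.
trajectory = [
    Attempt("def add(a, b): return a - b", "add(1, 2) returned -1", False),
    Attempt("def add(a, b): return a + b", "all tests passed", True),
]
print(len(augment("Write add(a, b).", trajectory)))  # 2
```

Each failed turn produces one revision example and one verification example, so a single recovered trajectory fans out into several independent, single-turn training instances.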

**Stage 2: Policy Optimization.** Rather than training on raw multi-step sequences, the model learns two specific capabilities in isolation: transforming flawed answers into correct ones, and identifying when answers are wrong. This decomposition concentrates gradient updates on the skills that matter for iterative refinement.

The approach avoids repeatedly sampling long trajectories from scratch, which is a computational bottleneck in standard multi-turn RL. Instead, it mines existing successful trajectories for the failed intermediate attempts they contain.
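The alternation between the two stages might be organized along these lines. This is only a structural sketch under stated assumptions: `alternate_training` and its `rollout`, `augment`, and `optimize` callables are hypothetical placeholders, and the toy stand-ins at the bottom exist purely to make the skeleton runnable. The actual sampling and RL update used in the paper are not specified in this summary.

```python
from typing import Any, Callable, List, Sequence


def alternate_training(
    policy: Any,
    problems: Sequence[Any],
    rounds: int,
    rollout: Callable[[Any, Any], Any],        # (policy, problem) -> trajectory
    augment: Callable[[Any, Any], List[Any]],  # (problem, trajectory) -> examples
    optimize: Callable[[Any, List[Any]], Any], # (policy, examples) -> new policy
) -> Any:
    """Alternate between data augmentation and policy optimization."""
    for _ in range(rounds):
        # Stage 1: sample trajectories and mine them for decoupled
        # revision and verification examples.
        buffer: List[Any] = []
        for problem in problems:
            trajectory = rollout(policy, problem)
            buffer.extend(augment(problem, trajectory))
        # Stage 2: update the policy on the short, single-turn examples
        # instead of on the long multi-turn trajectories themselves.
        if buffer:
            policy = optimize(policy, buffer)
    return policy


# Toy stand-ins (assumptions): the "policy" is just a counter of examples seen.
def toy_rollout(policy, problem):
    return [problem]

def toy_augment(problem, trajectory):
    return [("revision", problem), ("verification", problem)]

def toy_optimize(policy, examples):
    return policy + len(examples)

final_policy = alternate_training(
    0, ["p1", "p2", "p3"], 2, toy_rollout, toy_augment, toy_optimize
)
print(final_policy)  # 12: three problems, two examples each, two rounds
```

The point of the structure is that the optimizer only ever sees the mined examples, so the cost of long-horizon sampling is paid once per round in Stage 1 rather than inside every policy update.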

## Results and Scope

On LiveCodeBench (a coding benchmark with public test cases as feedback), REVES gains +6.5 points over RL baselines and +4.0 points over standard multi-turn training. It matches previously reported state-of-the-art on circle packing (a geometry optimization problem) while using a much smaller 4B-parameter base model and far fewer rollouts than existing evolutionary search systems.

The approach also generalizes beyond its training domain. On constraint-satisfaction puzzles (n-queens, mini-sudoku) where correctness is defined purely by problem constraints rather than gold solutions, the model shows improved correction ability.
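To make "correctness defined purely by problem constraints" concrete, here is a small n-queens checker. It is an illustrative sketch rather than the paper's evaluation code; the function name `is_valid_n_queens` and the one-column-index-per-row encoding are assumptions chosen for this example.

```python
from typing import Sequence


def is_valid_n_queens(cols: Sequence[int]) -> bool:
    """Check an n-queens placement where cols[r] is the queen's column in row r.

    No gold solution is consulted: the placement is accepted if and only if
    it satisfies the column and diagonal constraints.
    """
    n = len(cols)
    # Every queen must sit on the board.
    if any(not 0 <= c < n for c in cols):
        return False
    # One queen per column (one per row is implied by the encoding).
    if len(set(cols)) != n:
        return False
    # Queens on the same "\" diagonal share r - c; on the same "/" diagonal, r + c.
    if len({r - c for r, c in enumerate(cols)}) != n:
        return False
    if len({r + c for r, c in enumerate(cols)}) != n:
        return False
    return True


print(is_valid_n_queens([1, 3, 0, 2]))  # True
print(is_valid_n_queens([0, 1, 2, 3]))  # False: all queens share one diagonal
```

Because any placement that passes the check counts as correct, a verifier of this kind can score revised answers without reference solutions.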

## Why This Matters

Test-time scaling, meaning spending more compute during inference to improve quality, is becoming central to LLM deployment. But scaling only works if training actually prepares the model for the multi-step, error-aware reasoning that happens at test time. REVES addresses this gap by treating revision and verification as learnable primitives rather than emergent byproducts of trajectory optimization. The reported result is both more sample-efficient training and better generalization to new problem types.
