Think Again or Think Longer? Selective Verification for Budget-Aware Reasoning

Source · arxiv.org/watch?v=2606.19808 ↗


§02

Snippets

№01

Selective verification reaches 76.3% accuracy on MATH-500 while reducing post-generation tokens by 26.8% compared to always verifying.

Shows that intelligently choosing when to verify can improve efficiency without sacrificing accuracy, relevant for resource-constrained deployments.

№02

A longer initial solve often matches selective verification's accuracy with fewer total model tokens, suggesting tuning initial budget first is more cost-effective.

Challenges the assumption that verification is always the best use of compute: simple allocation strategies may be superior.

№03

Recoverability-aware gates trained from frozen attempt state decide whether to invoke verification, reducing harmful answer changes from 2.2% to 1.0%.

Demonstrates that selective verification can mitigate regression risk, a practical concern for production systems where wrong answers are costly.

№04

On GSM8K, the selective policy verifies only 3% of examples while improving accuracy from 93.4% to 94.5%, reducing verification tokens by 91.2%.

Shows that selective verification enables dramatic token savings when deployed across different tasks without retraining.

№05

Extra reasoning is not uniformly valuable: it can repair failed attempts, waste compute on correct answers, or introduce harmful changes.

Reframes test-time reasoning as a deployment allocation problem rather than purely a verification problem, opening new optimization angles.

§03

Synthesis

## The Problem: Not All Extra Thinking Helps

When language models get a chance to reconsider their answers at test time, the results are mixed. Extra reasoning can fix mistakes, but it also wastes compute on already-correct answers and sometimes introduces new errors. The authors reframe this as a deployment problem: given a frozen solver and a reasoning budget, when should you actually invoke verification, and when should you trust the first answer?
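The outcomes described above can be made concrete by comparing correctness before and after verification. This is a minimal sketch, not the authors' code: the function names, the `(first_correct, verified_correct)` log format, and the fourth `unrecovered` bucket (wrong both before and after) are assumptions added for illustration.

```python
from collections import Counter

OUTCOMES = ("repaired", "wasted", "harmful", "unrecovered")


def classify_outcome(first_correct: bool, verified_correct: bool) -> str:
    """Label what verification did to a single example."""
    if not first_correct and verified_correct:
        return "repaired"      # verification fixed a failed attempt
    if first_correct and verified_correct:
        return "wasted"        # compute spent on an already-correct answer
    if first_correct and not verified_correct:
        return "harmful"       # verification flipped a correct answer to a wrong one
    return "unrecovered"       # assumed extra bucket: wrong before and after


def outcome_rates(log):
    """Return the fraction of examples in each outcome bucket.

    `log` is a list of (first_correct, verified_correct) pairs.
    """
    counts = Counter(classify_outcome(first, verified) for first, verified in log)
    total = len(log)
    return {name: counts[name] / total for name in OUTCOMES}
```

Under this framing, a selective policy tries to keep the `repaired` cases while skipping the `wasted` and `harmful` ones.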

## How SEVRA Works

SEVRA is a serving-layer gate trained to decide, after the model produces an initial answer, whether verification is worth invoking. The key insight is that this decision should be based on *recoverability*: whether verification is likely to improve the answer rather than harm it.

The authors use a frozen Qwen3-4B model to generate initial solutions, then log what happens when verification is applied. They extract signals visible to the serving layer (attempt state) and train lightweight gates that predict whether verification will help. The gate learns patterns of when answers are fragile or incorrect enough to merit a second look, and when the initial response is likely solid.
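A gate of this kind could look roughly like the sketch below. The summary does not say which attempt-state signals or which model class the authors use, so everything here is an assumption: the two features (`hit_token_limit`, `answer_parsed`), the toy training rows, and the choice of a small logistic regression trained by plain gradient descent.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train_gate(rows, labels, lr=0.5, epochs=500):
    """Fit a tiny logistic-regression gate with batch gradient descent.

    rows:   feature vectors taken from the frozen attempt state
    labels: 1 if logged verification repaired the answer, else 0
    """
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(rows, labels):
            p = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
            err = p - y
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
            grad_b += err
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(rows)
    return weights, bias


def should_verify(x, weights, bias, threshold=0.5) -> bool:
    """Invoke verification only when predicted recoverability is high enough."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x))) >= threshold


# Hypothetical features: [hit_token_limit, answer_parsed]. Toy data, not from the paper.
ROWS = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
LABELS = [1, 1, 0, 0]
WEIGHTS, BIAS = train_gate(ROWS, LABELS)
```

With these toy rows, `should_verify([1.0, 0.0], WEIGHTS, BIAS)` fires for a truncated attempt and `should_verify([0.0, 1.0], WEIGHTS, BIAS)` does not. Raising `threshold` trades repaired answers for fewer verification calls.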

## What the Numbers Show

On MATH-500, SEVRA reaches 76.3% accuracy while reducing post-generation tokens by 26.8% relative to always verifying. It beats always-verify (75.5% accuracy) and cuts harmful answer flips from 2.2% to 1.0%. But there is a catch: simply giving the solver 8,192 tokens for the initial attempt achieves 76.0% accuracy with 28% fewer total tokens. Selective verification helps, but it does not sit on the best cost-accuracy frontier.

Transfer to GSM8K shows different trade-offs: the selective policy verifies only 3% of examples, improves accuracy from 93.4% to 94.5%, and cuts verification tokens by 91.2%. Here too, a longer initial solve matches that accuracy more efficiently.

On CommonsenseQA, always-on verification actually hurts. Self-Consistency@5 (running 5 independent solutions) improves accuracy but costs about five times as many tokens.
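For reference, Self-Consistency@k is usually implemented as a majority vote over k independently sampled answers, which is why its cost scales roughly linearly with k. The sketch below assumes a hypothetical `sample_answer` callable that runs the solver once and returns its final answer; it is an illustration of the general technique, not the paper's evaluation code.

```python
from collections import Counter


def self_consistency(sample_answer, k=5):
    """Sample k answers and return the most common one plus all samples.

    sample_answer: zero-argument callable (hypothetical) that runs the
    solver once and returns its final answer as a hashable value.
    Each call is a full solve, so token cost is about k times one attempt.
    """
    answers = [sample_answer() for _ in range(k)]
    winner, _count = Counter(answers).most_common(1)[0]
    return winner, answers
```

On a multiple-choice task such as CommonsenseQA, the answers would be option labels, so voting needs no answer normalization.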

## The Takeaway

SEVRA works: it recovers some failed attempts while avoiding wasted compute and harmful corrections. However, it's not universally superior to simply increasing the budget of the initial solve. The authors' bottom-line recommendation: **tune the initial budget first**. Use selective verification only when you need explicit auditability, bounded retries, regression-risk control, or explicit checks that justify the extra complexity. In many practical scenarios, a more generous initial reasoning budget is simpler and cheaper.
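One way to act on this recommendation is to measure each candidate setup on a held-out set and pick the cheapest one that meets an accuracy target. The sketch below is an assumption about how such a comparison might be coded; the `Config` fields and function name are hypothetical, and the measurements must come from your own runs.

```python
from typing import NamedTuple, Optional, Sequence


class Config(NamedTuple):
    name: str            # e.g. an initial-solve budget or a verification policy
    accuracy: float      # measured on a held-out set
    total_tokens: float  # mean total model tokens per example


def cheapest_meeting_target(
    configs: Sequence[Config], target_accuracy: float
) -> Optional[Config]:
    """Return the lowest-token config whose accuracy meets the target."""
    eligible = [c for c in configs if c.accuracy >= target_accuracy]
    if not eligible:
        return None  # no measured setup reaches the target
    return min(eligible, key=lambda c: c.total_tokens)
```

Including both longer initial budgets and selective verification in `configs` makes the comparison explicit, so verification is adopted only when it wins on cost or is required for other reasons such as auditability.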
