When Does Trajectory-Level Supervision Permit Efficient Offline Reinforcement Learning?

Source · arxiv.org/watch?v=2606.18531 ↗

0:00 0:00

§02

Snippets

№01

Trajectory-level supervision costs Õ(H²C_sa(π*)/n) samples—matching lower bounds show this is the fundamental price of missing step-wise rewards.

Quantifies the exact statistical penalty of learning from only outcome labels instead of per-step rewards in offline RL.
№02

OPAC, a pessimistic actor-critic algorithm, jointly learns a latent reward model and policy from scalar trajectory labels with theoretical guarantees.

Provides a practical algorithm with provable efficiency for the common case where datasets record only final outcomes, not step rewards.
№03

Generalized outcome-based RL with nonlinear reward aggregation requires exponential (2^H) samples in the worst case, even with deterministic transitions.

Reveals a fundamental computational barrier: not all trajectory-level supervision problems are learnable in polynomial time.
№04

Two structural coefficients (κ_μ(σ) and χ_μ(σ)) identify tractable regimes for generalized outcome-based RL where polynomial sample complexity is achievable.

Characterizes when the exponential barrier can be overcome, enabling efficient learning under restrictive structural assumptions.
№05

Preference-based feedback can replace trajectory labels while preserving leading horizon and concentrability dependence up to model-dependent constants.

Shows the theoretical results extend to practical pairwise comparison feedback, a common form of human supervision.

§03

Synthesis

## The Core Finding

Offline reinforcement learning—training policies from fixed datasets without interaction—normally assumes access to reward signals at every step. But real sequential datasets often come with only trajectory-level labels: a single number per trajectory (e.g., "mission succeeded or failed"). The central question is stark: **when does this coarse supervision allow efficient learning, and when does it create an insurmountable statistical wall?**

The authors show the answer depends critically on the problem structure. For standard cumulative reward objectives, outcome-level supervision is learnable but expensive—sample complexity scales with H² (horizon squared) compared to the baseline H under step-level rewards. For more complex objectives (like success-only or nonlinear aggregations), learning becomes impossible without additional structural assumptions.

## Method: OPAC and Its Guarantees

The authors propose OPAC, a pessimistic actor-critic algorithm that jointly learns two things: a latent reward model (inferring per-step rewards from trajectory labels) and a policy that exploits this model conservatively. The pessimism principle—favoring actions backed by lower-confidence estimates—protects against overfitting to a small, fixed dataset.

For the standard case (expected cumulative reward with one label per trajectory), OPAC achieves sample complexity of Õ(H²C_{sa(π*)}/n), where C_{sa(π*)} is a concentrability coefficient measuring how well the data distribution covers the optimal policy's state-action space. Crucially, the authors prove a matching lower bound, showing this rate is tight: **you cannot do better than H² in the worst case**. The extra H factor is the fundamental cost of trajectory-level supervision.

The intuition is simple: with only a scalar label per trajectory, the algorithm must infer rewards for all H steps from one noisy outcome. This amplifies estimation error across the horizon, creating an extra H multiplicative penalty.

## Extensions and Phase Transitions

The analysis extends to **preference-based feedback**, where trajectories are compared pairwise ("trajectory A is better than B"). The rate preserves the leading H² dependence, though with additional constants from modeling preferences.

The picture darkens for **generalized outcome-level objectives**—where both supervision and the goal are trajectory-level quantities defined by nonlinear aggregation of latent rewards (e.g., "maximize probability of success"). Here, learnability hits a phase transition:

- **All-success objectives** (binary outcomes) require Ω(2^H) samples in the worst case—exponential in horizon. This is a fundamental barrier, not a limitation of the algorithm. - **Tractable regime**: The authors identify two structural coefficients, κ_μ(σ) and χ_μ(σ), that quantify information loss during outcome aggregation and generalized Bellman backups. When these coefficients remain bounded, generalized OPAC recovers polynomial sample complexity.

## Why It Matters

This work translates a practical constraint—many real datasets lack step-level rewards—into formal statistical theory. It explains *when* trajectory labels suffice and *when* they force exponential sample complexity. The tight upper and lower bounds establish that practitioners face an inherent H² penalty for using outcome-level supervision on cumulative objectives, while the phase transition for complex objectives shows that some problem structures are fundamentally unlearnable from trajectory labels alone. These insights guide when to collect richer annotations and when outcome-level supervision may still work.

Mine your own.

Lode is a workbench, not a feed. Paste a YouTube URL. The model proposes a transcript, a set of quote-grounded snippets, a synthesis essay, and the fan-out. You decide what stays.

Open the curator