LedgerAgent: Structured State for Policy-Adherent Tool-Calling Agents

Source · arxiv.org/watch?v=2606.20529 ↗

§02

Snippets

№01

LedgerAgent maintains task states in a separate ledger and renders them into prompts, preventing agents from grounding decisions in stale or incorrect information.

Explicit state management addresses a fundamental failure mode where agents lose track of facts across conversation turns.

№02

The ledger checks state-dependent policy constraints before tool execution, blocking policy violations at inference time rather than relying on the agent's reasoning alone.

Policy enforcement becomes verifiable and deterministic rather than probabilistic, which is critical for customer-service domains with strict compliance requirements.

№03

LedgerAgent improves pass@k across four customer-service domains with both open- and closed-weight models, with largest gains under stricter multi-trial consistency metrics.

The method carries over across open- and closed-weight model families, and it helps most when reliability across repeated attempts is what counts.

№04

Standard agents reconstruct task states implicitly from prompts each turn, while LedgerAgent separates state representation into an explicit data structure.

This separation spares the model from re-deriving state each turn and makes debugging easier than with purely prompt-based state tracking.

§03

Synthesis

## The Problem: Agents Lose Track

Standard tool-calling agents, those designed to interact with APIs and databases in customer-service scenarios, struggle with two interconnected failures. First, they reconstruct task state (customer account details, order constraints, eligibility conditions) from the full prompt each turn, which leaves them prone to acting on stale or contradictory information. Second, they generate syntactically correct tool calls that nonetheless violate domain policies, because the policy check happens after the fact, if at all. A support agent might, for instance, issue a refund that the customer's account tier does not permit.

The core insight: task state should not be implicit in the prompt. It should be explicit, tracked separately, and actively enforced.

## LedgerAgent: Structured State + Proactive Enforcement

The authors introduce LedgerAgent, an inference-time method that decouples state management from standard prompting. The approach works in three parts.

**The ledger itself** is a structured record of observed task states—relevant facts, identifiers, constraints, and conditions learned from user messages and tool responses. Rather than burying these in the conversation history, the ledger maintains them as a canonical reference.
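As a rough illustration of what such a record could look like, here is a minimal sketch. The `Ledger` class, its `facts` and `sources` fields, and the `record` method are hypothetical names chosen for this example; the summary does not describe the authors' actual data structure.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Ledger:
    """Hypothetical ledger: one canonical value per observed task-state key."""

    facts: dict[str, Any] = field(default_factory=dict)
    # Where each fact came from ("user" or "tool:<name>"), kept for debugging.
    sources: dict[str, str] = field(default_factory=dict)

    def record(self, key: str, value: Any, source: str) -> None:
        # A newer observation overwrites the older one, so the ledger never
        # holds two conflicting values for the same key.
        self.facts[key] = value
        self.sources[key] = source


ledger = Ledger()
ledger.record("order_id", "A-1001", source="user")
ledger.record("order_status", "processing", source="tool:get_order")
ledger.record("order_status", "shipped", source="tool:get_order")  # replaces the stale value
```

Keeping a single value per key is one simple way to avoid the stale-state problem: whatever the conversation history says, the ledger only exposes the most recent observation.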

**State rendering** injects the ledger contents back into the prompt before each decision step, so the agent grounds its next action in current, explicit state rather than re-parsing the full chat history.
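One way to picture state rendering is a function that turns the recorded facts into a short text block placed ahead of the latest user message. The sketch below assumes the ledger is a plain dictionary; `render_ledger`, `build_prompt`, and the `CURRENT TASK STATE` header are illustrative names, not the authors' prompt format.

```python
def render_ledger(facts: dict[str, object]) -> str:
    """Render ledger facts as a compact, deterministic text block."""
    if not facts:
        return "CURRENT TASK STATE:\n(none recorded yet)"
    # Sorting the keys keeps the rendering stable from turn to turn.
    lines = [f"- {key}: {facts[key]}" for key in sorted(facts)]
    return "CURRENT TASK STATE:\n" + "\n".join(lines)


def build_prompt(system_prompt: str, facts: dict[str, object], user_message: str) -> str:
    # The rendered state sits between the instructions and the latest message,
    # so the model sees current facts without re-parsing earlier turns.
    return "\n\n".join([system_prompt, render_ledger(facts), f"USER: {user_message}"])


prompt = build_prompt(
    "You are a support agent. Follow the refund policy.",
    {"order_status": "shipped", "account_age_days": 12},
    "Can I get a refund?",
)
print(prompt)
```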

**Policy verification** uses the ledger to enforce state-dependent constraints *before* tool calls execute. When a policy rule (e.g., "refunds only allowed for accounts over 30 days old") depends on ledger state, the system checks the rule against the ledger and blocks any call that would violate it, rather than letting the agent issue the violating call and catching it downstream.
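The pre-execution check can be sketched as a small gate in front of the tool executor. Everything named here (`PolicyRule`, `verify`, `guarded_call`, the `issue_refund` tool, and the `account_age_days` key) is a hypothetical stand-in; only the 30-day refund rule comes from the text above, and the choice to block when a fact is missing is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class PolicyRule:
    tool: str                                # tool name the rule guards
    check: Callable[[dict[str, Any]], bool]  # returns True when the call is allowed
    message: str                             # feedback returned when the call is blocked


# Example rule from the text: refunds only for accounts over 30 days old.
# A missing fact defaults to 0, so the check fails closed.
RULES = [
    PolicyRule(
        tool="issue_refund",
        check=lambda facts: facts.get("account_age_days", 0) > 30,
        message="Refunds are only allowed for accounts over 30 days old.",
    )
]


def verify(tool: str, facts: dict[str, Any]) -> list[str]:
    """Return the messages of every rule the proposed call would violate."""
    return [r.message for r in RULES if r.tool == tool and not r.check(facts)]


def guarded_call(tool: str, args: dict[str, Any], facts: dict[str, Any],
                 execute: Callable[[str, dict[str, Any]], Any]) -> dict[str, Any]:
    violations = verify(tool, facts)
    if violations:
        # Blocked before execution: the tool itself is never invoked.
        return {"status": "blocked", "violations": violations}
    return {"status": "ok", "result": execute(tool, args)}


executed: list[str] = []


def fake_execute(tool: str, args: dict[str, Any]) -> str:
    executed.append(tool)
    return "done"


blocked = guarded_call("issue_refund", {"amount": 20}, {"account_age_days": 12}, fake_execute)
allowed = guarded_call("issue_refund", {"amount": 20}, {"account_age_days": 45}, fake_execute)
```

Returning the violation messages, rather than only refusing, gives the agent something concrete to act on in its next turn. Whether LedgerAgent feeds violations back this way is not stated in this summary.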

The method is applied at inference time—no model retraining required—making it broadly compatible with both open-weight and closed-weight models.
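To make the "inference time only" point concrete, the sketch below treats the model as an opaque callable and wraps one agent step around it. The `step` function, the `Model` and `Tool` aliases, the prompt layout, and the stub model are all assumptions made for illustration; a real deployment would call an actual model API where `stub_model` appears.

```python
from typing import Any, Callable

# Hypothetical types: a model is any callable from prompt text to a proposed
# tool call. It is used as a black box and never fine-tuned.
Model = Callable[[str], dict[str, Any]]
Tool = Callable[..., dict[str, Any]]


def step(model: Model, tools: dict[str, Tool], facts: dict[str, Any],
         is_allowed: Callable[[str, dict[str, Any]], bool],
         user_message: str) -> dict[str, Any]:
    # 1. Render the current state into the prompt.
    state = "\n".join(f"- {k}: {v}" for k, v in sorted(facts.items()))
    prompt = f"STATE:\n{state}\n\nUSER: {user_message}"
    # 2. Ask the unchanged model for its next action.
    call = model(prompt)
    # 3. Check policy against the recorded state before anything executes.
    if not is_allowed(call["tool"], facts):
        return {"status": "blocked", "tool": call["tool"]}
    # 4. Execute, then fold the tool's response back into the recorded state.
    result = tools[call["tool"]](**call["args"])
    facts.update(result)
    return {"status": "ok", "tool": call["tool"]}


def stub_model(prompt: str) -> dict[str, Any]:
    # Stand-in for a real model call; always proposes the same lookup.
    return {"tool": "get_order", "args": {"order_id": "A-1001"}}


tools = {"get_order": lambda order_id: {"order_status": "shipped"}}
facts = {"order_id": "A-1001"}
outcome = step(stub_model, tools, facts, lambda tool, f: True, "Where is my order?")
```

Because the wrapper only touches prompts and tool calls, swapping `stub_model` for an open-weight or closed-weight model would not change the surrounding code.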

## Results Across Four Domains

The authors evaluate LedgerAgent on four customer-service domains using a mix of open- and closed-weight models. They report pass@k, and the results that matter most here are those under stricter multi-trial consistency metrics, which reflect a real-world requirement: an agent must succeed repeatedly and reliably, not just once.
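This summary does not spell out how the metrics are computed. As background, the sketch below contrasts two multi-trial metrics commonly used for agent benchmarks, estimated from `n` recorded trials of a task with `c` successes: the chance that at least one of `k` sampled trials succeeds, and the stricter chance that all `k` succeed. The function names are illustrative, and whether the paper uses exactly these estimators is an assumption.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Chance that at least one of k trials, drawn without replacement from n, succeeded."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_all_k(n: int, c: int, k: int) -> float:
    """Stricter variant: chance that all k drawn trials succeeded."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    return comb(c, k) / comb(n, k)


# A task solved in 6 of 8 trials looks perfect under the lenient metric
# and much weaker under the strict one.
lenient = pass_at_k(n=8, c=6, k=4)   # 1 - C(2,4)/C(8,4) = 1.0
strict = pass_all_k(n=8, c=6, k=4)   # C(6,4)/C(8,4) = 15/70
```

The gap between the two numbers shows why a consistency-oriented metric is the harder target: occasional failures barely move the lenient score but sharply reduce the strict one.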

LedgerAgent improves average pass@k over a standard prompt-based baseline across all four domains. The largest gains come when the evaluation metric is strictest, i.e., when the agent must succeed consistently across multiple trials. That pattern is consistent with the method targeting the two failure modes the authors identified: stale state and policy violations.

## Why It Matters

Customer-service automation is high-stakes. A misapplied refund policy or a forgotten account constraint costs money and erodes trust. Deployed systems tend to be either rigid rule engines (low flexibility) or free-form LLM agents (high flexibility, low reliability). LedgerAgent offers a middle path: the agent retains generative flexibility but gains explicit state tracking and automatic policy guardrails. Because the method works at inference time, it can be retrofitted onto existing models and deployed without retraining, which makes it practical for teams already running customer-service agents in production.
