- Source
- arXiv
The Price of Anarchy in Disaggregated Inference
§03
Synthesis
## The Hidden Cost of Splitting Inference Work
When a large language model serves a request, it works in two phases: *prefill* reads the input prompt (compute-heavy), and *decode* generates output tokens one at a time (memory-bound). Disaggregated inference architectures, notably NVIDIA Dynamo, run the two phases on separate GPU pools so that each can be optimized independently. But this creates a less obvious problem: the two pools compete for the same fixed hardware budget, and when each pool optimizes selfishly, the system as a whole performs far worse than it would under coordination. This paper quantifies that inefficiency using game theory and proposes a fix.
## The Game-Theoretic Model
The authors model disaggregated serving as three coupled games. First, the prefill and decode pools act as two strategic agents dividing a fixed GPU budget: whatever one pool claims is unavailable to the other. Second, both pools share a hierarchical KV cache (the attention key and value tensors computed during prefill and reused during decode); selfish caching decisions can pollute the cache for other requests. Third, incoming requests must be routed to available workers. This is a congestion game: each request's latency depends on how loaded its chosen worker is, although a routing choice that spreads load can sometimes help other requests too.
The key insight is that system behavior changes qualitatively depending on whether the GPUs are saturated. Below saturation there is slack, so selfish choices have limited downside: there is enough hardware for everyone. The Price of Anarchy (PoA), a standard game-theory metric defined as the ratio of the system cost at the worst selfish equilibrium to the cost under optimal coordination, stays bounded. At saturation, latency rises superlinearly with load, the externalities from caching and routing dominate, and PoA grows sharply.
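To make the PoA definition concrete, here is a minimal sketch of a toy routing game. It is not the paper's model: the two workers, the latency function `(load / n) ** p`, and all names are assumptions chosen for illustration. `n` identical requests each pick either a cache-warm worker whose latency grows with its load or a cold worker with constant latency 1. The code enumerates every split, finds the selfish equilibria, and divides the worst equilibrium cost by the best achievable total cost.

```python
from fractions import Fraction

COLD_LATENCY = Fraction(1)  # assumed constant latency on the cold worker


def warm_latency(load, n, p):
    """Assumed latency on the cache-warm worker; p > 1 makes it superlinear."""
    return Fraction(load, n) ** p


def social_cost(k, n, p):
    """Total latency when k of n requests use the warm worker."""
    return k * warm_latency(k, n, p) + (n - k) * COLD_LATENCY


def is_equilibrium(k, n, p):
    """True if no single request can strictly lower its latency by switching."""
    # A request on the warm worker would move if the cold worker were faster.
    if k > 0 and warm_latency(k, n, p) > COLD_LATENCY:
        return False
    # A request on the cold worker would move if joining the warm one were faster.
    if k < n and warm_latency(k + 1, n, p) < COLD_LATENCY:
        return False
    return True


def price_of_anarchy(n, p):
    """Worst equilibrium cost divided by the optimal coordinated cost."""
    costs = {k: social_cost(k, n, p) for k in range(n + 1)}
    worst_equilibrium = max(c for k, c in costs.items() if is_equilibrium(k, n, p))
    optimum = min(costs.values())
    return worst_equilibrium / optimum


if __name__ == "__main__":
    for p in (1, 2, 4):
        print(p, float(price_of_anarchy(4, p)))
```

In this toy, every request piling onto the warm worker is an equilibrium, because no one gains by leaving, yet a coordinator would keep some requests on the cold worker. The gap widens as `p` increases, which loosely mirrors the observation that a steeper latency curve makes selfish routing more costly.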
## Real-World Validation
The authors tested this on a 3-node cluster running Dynamo with two large models: a 340B-parameter Nemotron and a 70B Llama. Both showed the three-regime PoA structure that the theory predicts. Remarkably, both hit a saturation knee at exactly C=128 (prefill batch size), the same critical point despite very different model sizes and configurations.
The strongest result came from a 70B setup with one prefill pool and five decode pools. The adaptive controller switches from cache-aware routing to load-balanced routing when it detects saturation. At saturation, it reduced the empirical PoA estimate from 66.4 to 21.5 (a 3.1× improvement) at a modest 13% throughput cost. On a smaller 1P/2D topology, PoA dropped 2.2× and tail latency (P99 time-to-first-token) fell 7.6×.
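The switching behavior described above can be sketched roughly as follows. This is an illustration only, not Dynamo's API or the authors' implementation: the `Worker` class, the mean-queue-depth saturation test, and the `threshold` parameter are all hypothetical, since the summary does not say how saturation is detected.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    """Hypothetical worker record; the fields are assumptions for this sketch."""
    name: str
    queue_depth: int = 0
    cached_prefixes: set = field(default_factory=set)


def is_saturated(workers, threshold):
    """Assumed saturation signal: mean queue depth at or above a threshold."""
    mean_depth = sum(w.queue_depth for w in workers) / len(workers)
    return mean_depth >= threshold


def route(prefix, workers, threshold):
    """Pick a worker for a request identified by its prompt prefix."""
    if is_saturated(workers, threshold):
        # Load-balanced mode: ignore cache locality, take the shortest queue.
        return min(workers, key=lambda w: w.queue_depth)
    # Cache-aware mode: prefer workers holding the prefix, then shorter queues.
    return min(
        workers,
        key=lambda w: (prefix not in w.cached_prefixes, w.queue_depth),
    )


if __name__ == "__main__":
    pool = [
        Worker("a", queue_depth=3, cached_prefixes={"sys-prompt"}),
        Worker("b", queue_depth=1),
    ]
    print(route("sys-prompt", pool, threshold=4).name)  # below threshold
    print(route("sys-prompt", pool, threshold=2).name)  # at threshold
```

With the sample pool, the request goes to the cache-holding worker while the pool has slack and to the least-loaded worker once the assumed threshold is reached. A production controller would likely need hysteresis so it does not flap between modes near the knee.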
## Why It Matters
Disaggregated inference is becoming standard because it improves per-phase efficiency, but operators have largely ignored the global cost of competition between pools. This work makes that cost visible and actionable. By detecting when a system moves from slack to saturation, a shift that can happen suddenly as load increases, the adaptive controller steers the system to a better operating point. The result is lower latency and fairer resource use without expensive hardware upgrades or complex centralized coordination. For production systems running at capacity, the reported 2.2× to 3.1× reduction in PoA comes with much lower tail latency (P99 time-to-first-token fell 7.6× on the 1P/2D topology) at a modest throughput cost (13% in the one-prefill, five-decode setup).