- Source: arXiv
Duration Aware Scheduling for ASR Serving Under Workload Drift
§02
Snippets
- Audio duration is an accurate proxy for job processing time in ASR models, enabling duration-aware scheduling to reduce head-of-line blocking.
  Current ASR serving relies on FCFS, which wastes resources when requests have vastly different processing times.
- HRRN scheduling reduces median E2E latency by up to 28% while limiting tail-latency increase to 24%, avoiding starvation issues of pure shortest-job-first.
  Balances latency improvements with fairness, a critical tradeoff for production systems serving heterogeneous workloads.
- Duration-aware scheduling maintains gains under workload drift with no throughput penalty and minimal scheduling overhead (<0.1 ms per request).
  Demonstrates that the approach is robust to real-world changes in request patterns and practical to deploy at scale.
- SJF achieves 73% median latency reduction but causes up to 97% tail-latency degradation due to starvation of long audio requests.
  Reveals the fundamental fairness-latency tradeoff that simpler scheduling policies cannot resolve for ASR serving.
§03
Synthesis
## The Problem: Why Your ASR Serving Pipeline Is Slow
Automatic Speech Recognition (ASR) systems like Whisper handle requests of very different sizes: a 10-second clip processes much faster than a 5-minute audio file. Yet widely used serving engines, including vLLM, rely on first-come-first-served (FCFS) scheduling, which orders requests by arrival time and ignores how long each one will take. The result is head-of-line blocking: if a long audio file arrives first, every subsequent request waits behind it, even if it is tiny. The authors show that this problem gets worse under workload drift, when the mix of short and long requests shifts unpredictably over time.
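A toy calculation makes the head-of-line effect concrete. The sketch below assumes a single worker with no batching and all requests already queued at time zero. Both are simplifications that real serving engines do not share, and the service times are made-up illustrative values, not measurements from the paper.

```python
def fcfs_latencies(service_times):
    """Completion time of each request when one worker serves them
    strictly in the order given (all requests queued at t=0)."""
    latencies = []
    clock = 0.0
    for service in service_times:
        clock += service          # the worker is busy for `service` seconds
        latencies.append(clock)   # this request finishes at the current clock
    return latencies

# Illustrative values only: one long job (30 s) and four short jobs (1 s each).
long_first = fcfs_latencies([30.0, 1.0, 1.0, 1.0, 1.0])
short_first = fcfs_latencies([1.0, 1.0, 1.0, 1.0, 30.0])

print(long_first)   # [30.0, 31.0, 32.0, 33.0, 34.0]
print(short_first)  # [1.0, 2.0, 3.0, 4.0, 34.0]
```

The total work is identical in both orders, and the long job finishes at 34 s either way. Only the short requests change: they wait behind 30 s of work when the long job happens to arrive first.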
## The Key Insight: Duration Predicts Job Time
The central claim is simple: in ASR models, the duration of the input audio is a strong predictor of how long processing will take. The authors verify this empirically on standard benchmarks and use it for *duration-aware scheduling*, which assigns priorities based on expected job length rather than arrival order.
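One way to turn that observation into a usable estimate is a simple linear fit from audio duration to measured processing time. The sketch below is an assumption for illustration, not the authors' method: the summary does not say whether they fit a model or use raw duration directly, and the calibration numbers are hypothetical and deliberately lie on a straight line.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("need at least two distinct x values")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# HYPOTHETICAL calibration data, not from the paper:
# audio length in seconds vs. observed processing time in seconds.
audio_seconds = [5.0, 10.0, 30.0, 60.0]
processing_seconds = [0.35, 0.60, 1.60, 3.10]

slope, intercept = fit_line(audio_seconds, processing_seconds)

def estimate_job_time(duration_s):
    """Predicted processing time for a clip of the given length."""
    return intercept + slope * duration_s

print(estimate_job_time(120.0))
```

For ordering alone, a monotonic relationship is enough: if longer audio always means longer processing, sorting by raw duration gives the same order as sorting by the fitted estimate.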
## The Scheduling Approaches
The authors integrate two classical algorithms into vLLM:
**Shortest Job First (SJF)** is the aggressive option: always process the shortest pending request next. The gains are large: median end-to-end (E2E) latency drops by up to 73% at high load compared to FCFS. The penalty is equally large: 90th-percentile tail latency *increases* by up to 97%, because long requests starve while shorter ones keep skipping ahead.
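A minimal sketch of SJF ordering, using audio duration as the job-size key, might look like the following. The `SJFQueue` class and its method names are hypothetical and are not part of vLLM's scheduler API; how the authors wire the policy into vLLM is not described here.

```python
import heapq
import itertools

class SJFQueue:
    """Pending-request queue that always yields the shortest audio first."""

    def __init__(self):
        self._heap = []
        # Monotonic counter breaks ties in arrival order and keeps
        # heap comparisons from ever reaching the request payload.
        self._counter = itertools.count()

    def push(self, request_id, audio_seconds):
        heapq.heappush(self._heap, (audio_seconds, next(self._counter), request_id))

    def pop(self):
        _, _, request_id = heapq.heappop(self._heap)
        return request_id

    def __len__(self):
        return len(self._heap)

q = SJFQueue()
q.push("podcast", 300.0)
q.push("voice-note", 10.0)
q.push("meeting", 120.0)

order = [q.pop() for _ in range(len(q))]
print(order)  # ['voice-note', 'meeting', 'podcast']
```

The starvation risk is visible in the structure itself: as long as shorter requests keep arriving, `"podcast"` never reaches the top of the heap.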
**Highest Response Ratio Next (HRRN)** balances this tradeoff. Instead of ranking by duration alone, it weighs job size against time already spent waiting, so a long request that has lingered gains priority even if it is not the smallest. The result: median latency improves by up to 28%, while tail latency degrades by *at most* 24%, roughly a quarter of SJF's worst-case tail penalty.
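In its textbook form, HRRN picks the pending request with the highest response ratio, defined as (waiting time + service time) / service time. The sketch below uses audio duration as the service-time estimate, which is an assumption; the `Request` type and `pick_next` helper are hypothetical names for illustration, not the authors' implementation.

```python
from dataclasses import dataclass

@dataclass
class Request:
    request_id: str
    arrival_time: float   # seconds since some common start
    audio_seconds: float  # assumed stand-in for expected service time

def response_ratio(req, now):
    """Classic HRRN ratio: grows with waiting time, faster for short jobs."""
    wait = now - req.arrival_time
    return (wait + req.audio_seconds) / req.audio_seconds

def pick_next(pending, now):
    """Choose the request with the highest response ratio at time `now`."""
    return max(pending, key=lambda r: response_ratio(r, now))

long_req = Request("long", arrival_time=0.0, audio_seconds=300.0)

# Early on, a short request that has waited 10 s outranks the long one.
early = pick_next([long_req, Request("short", 90.0, 10.0)], now=100.0)
print(early.request_id)  # short

# After 700 s of waiting, the long request outranks a freshly arrived short one.
late = pick_next([long_req, Request("short-new", 695.0, 10.0)], now=700.0)
print(late.request_id)  # long
```

Because the ratio depends on the current time, priorities have to be recomputed at each scheduling decision rather than fixed at arrival. That aging effect is what keeps long requests from starving.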
## Why This Matters
These gains hold under workload drift, so the scheduler keeps working when request patterns change unexpectedly. Scheduling overhead is negligible (<0.1 ms per request), and throughput is unaffected: the latency distribution improves without giving up total capacity.
For production ASR systems, a 28% median latency cut with bounded tail-latency impact is significant. HRRN offers a practical middle ground: most users see faster responses, and users waiting on long transcriptions are not starved. Because audio duration is known at request submission, before processing starts, the approach is practical to deploy in real pipelines.
The work is a reminder that classical scheduling algorithms from operating systems, often considered solved problems, remain relevant when applied to modern serving workloads with constraints such as workload drift and strict latency SLAs.