- Source: arXiv
iOSWorld: A Benchmark for Personally Intelligent Phone Agents
§02
Snippets
- iOSWorld is the first iOS simulator benchmark where agents must reason over persistent user identity, history, and preferences embedded across 26 connected apps.
  Existing mobile benchmarks treat tasks in isolation; iOSWorld tests whether agents can personalize their behavior the way a real assistant must.
- Multi-app tasks, spanning 2 to 8 apps, prove dramatically harder than single-app tasks: the best configuration reaches only 37% accuracy on them, versus 52% across all tasks.
  This reveals a critical gap: agents struggle when they must coordinate across app boundaries, a core real-world challenge.
- Privileged vision+XML access improves frontier models by up to 26 percentage points, but gives smaller open-source models little or no gain.
  Accessibility-tree information therefore helps mainly the larger models, which suggests that smaller agents may lack the capacity to exploit it.
- Memory and personalization tasks (46 total) require agents to infer user patterns from personal data such as transactions and social relationships.
  These tasks test whether agents can learn and apply implicit user preferences, a key measure of a genuinely useful personal assistant.
§03
Synthesis
## The Problem: Phone Agents Don't Know You
Current mobile agent benchmarks treat each task in isolation, like handing a stranger your phone to complete a single instruction. A real personal assistant should know your transaction history, understand your preferences, remember your contacts, and infer patterns from your data. Existing benchmarks test none of this. The authors built iOSWorld to fill that gap: it is the first iOS benchmark in which agents operate within a persistent user identity, with interconnected personal data spanning 26 purpose-built apps.
## What iOSWorld Tests
The benchmark contains 133 tasks stratified by complexity. Single-app tasks (27) ask agents to complete an action within one app and serve as a baseline. Multi-app tasks (60) force agents to work across 2 to 8 apps, for example booking a flight and then messaging a friend about the trip. The third category, memory and personalization tasks (46), requires agents to extract patterns from personal data and use them to make decisions. An example: "Based on your restaurant preferences, suggest a place for dinner." These tasks demand that the agent understand accumulated user behavior.
The underlying apps contain realistic connected data: transaction histories, messages, travel records, social relationships, and financial activity. This architecture ensures that solving a task genuinely requires reasoning about personal context, not just executing interface commands.
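The task breakdown above can be sketched as a small data model. This is only an illustration: the counts (27, 60, 46) and the 2-to-8-app range come from the text, while the `Task` fields, the `validate` helper, and the app names in the usage note are hypothetical and not taken from the benchmark's actual schema.

```python
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    SINGLE_APP = "single-app"
    MULTI_APP = "multi-app"
    MEMORY_PERSONALIZATION = "memory-and-personalization"


# Task counts as reported in the text.
TASK_COUNTS = {
    Category.SINGLE_APP: 27,
    Category.MULTI_APP: 60,
    Category.MEMORY_PERSONALIZATION: 46,
}


@dataclass(frozen=True)
class Task:
    # Hypothetical fields; the benchmark's real task format may differ.
    task_id: str
    instruction: str
    category: Category
    apps: tuple


def validate(task: Task) -> None:
    """Check the app-count constraints stated in the text."""
    n = len(task.apps)
    if task.category is Category.SINGLE_APP and n != 1:
        raise ValueError(f"{task.task_id}: single-app task lists {n} apps")
    if task.category is Category.MULTI_APP and not 2 <= n <= 8:
        raise ValueError(f"{task.task_id}: multi-app task lists {n} apps")


def category_shares() -> dict:
    """Fraction of the benchmark that each category accounts for."""
    total = sum(TASK_COUNTS.values())
    return {category: count / total for category, count in TASK_COUNTS.items()}
```

For instance, `validate(Task("t1", "Book a flight, then message a friend", Category.MULTI_APP, ("Flights", "Messages")))` passes, and the three counts sum to the stated total of 133.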
## How Well Do Current Models Perform?
The authors evaluated frontier models (GPT-4V and similar) and open-source computer-use models under two observation settings. Vision-only agents see only screenshots. Vision+XML agents also receive the app's accessibility tree, a structured representation of UI elements and their properties, which simulates privileged system access.
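To make the two settings concrete, here is a minimal sketch of what an agent's observation might look like in each. The `Observation` class, the `visible_labels` helper, and the sample XML (element names and the `label` attribute) are all assumptions made for illustration; the benchmark's real accessibility-tree format is not described in the text.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Observation:
    screenshot_png: bytes                    # available in both settings
    accessibility_xml: Optional[str] = None  # None in the vision-only setting


def visible_labels(obs: Observation) -> List[str]:
    """Return element labels from the accessibility tree, in document order.

    A vision-only observation has no tree, so the result is empty and the
    agent must rely on the screenshot alone.
    """
    if obs.accessibility_xml is None:
        return []
    root = ET.fromstring(obs.accessibility_xml)
    return [el.attrib["label"] for el in root.iter() if "label" in el.attrib]


# Hypothetical tree for a messaging screen.
SAMPLE_XML = (
    "<Window>"
    '<Button label="Send" enabled="true"/>'
    '<TextField label="Message" value=""/>'
    "</Window>"
)

vision_only = Observation(screenshot_png=b"")
vision_xml = Observation(screenshot_png=b"", accessibility_xml=SAMPLE_XML)
```

With these definitions, `visible_labels(vision_xml)` yields `["Send", "Message"]`, while `visible_labels(vision_only)` yields an empty list.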
The results reveal sharp limitations. The best configuration achieved 52% overall accuracy but only 37% on multi-app tasks, which shows that agents struggle with cross-app reasoning. Surprisingly, adding XML accessibility information improved frontier models by up to 26 percentage points, while smaller open-source models showed little or no gain. This asymmetry suggests that frontier models can make effective use of structured information, whereas smaller models may lack the capability to exploit it.
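Because the figures above are reported in percentage points, it helps to keep absolute gaps and relative changes apart. The sketch below uses only the two accuracies quoted in the text (52% overall, 37% multi-app); the helper names are made up for this example.

```python
def point_gap(a_pct: float, b_pct: float) -> float:
    """Absolute difference between two accuracies, in percentage points."""
    return a_pct - b_pct


def relative_change(new_pct: float, base_pct: float) -> float:
    """Change of new_pct relative to base_pct, as a fraction of base_pct."""
    if base_pct == 0:
        raise ValueError("relative change is undefined for a zero baseline")
    return (new_pct - base_pct) / base_pct


OVERALL_PCT = 52.0    # best configuration, all tasks (from the text)
MULTI_APP_PCT = 37.0  # same configuration, multi-app tasks (from the text)

gap = point_gap(OVERALL_PCT, MULTI_APP_PCT)
rel = relative_change(MULTI_APP_PCT, OVERALL_PCT)
print(f"gap: {gap:.0f} points, relative: {rel:.1%}")
# prints "gap: 15 points, relative: -28.8%"
```

The same distinction applies to the XML result: "up to 26 percentage points" is an absolute gain, so its relative size depends on each model's vision-only baseline, which the text does not give.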
## Why This Matters
iOSWorld exposes a critical gap: state-of-the-art agents fail at tasks requiring personalization and multi-app coordination. These are exactly the skills that make a phone assistant useful in practice. The tasks are not contrived puzzles; they directly measure the attributes users care about.
The open-source release of iOSWorld, complete with apps, seeded data, tasks, rubrics, and evaluation code, lowers the barrier for researchers to benchmark their own models. The 26 newly built iOS apps provide enough diversity to test generalization without making evaluation intractable. For developers and researchers building mobile agents, the benchmark supplies a clear, quantified target: reaching human-level performance on personal, multi-app reasoning tasks.