- Source: arXiv
- Snippets: 4
§02
Snippets
-
Robots learn reusable skills through self-directed play before receiving task instructions, building a persistent code library for later use.
Moves robots from purely task-driven learning to continual skill acquisition, mimicking how animals and humans develop competence.
-
RATs agents autonomously propose exploratory tasks, execute policies, verify progress, diagnose failures, and distill successes into code skills during play.
Shows that agentic systems can drive their own learning curriculum without external guidance, improving sample efficiency and generalization.
-
Play-learned skills improve downstream task performance by 20.6 points on LIBERO-PRO and 17.0 points on MolmoSpaces versus baselines.
Demonstrates measurable benefit of pre-play conditioning in realistic robotic benchmarks, not just toy domains.
-
Learned skills transfer to other Code-as-Policy agents via in-context retrieval without model finetuning, gaining 8.9 points on RoboSuite and 8.8 points on real robots.
Skills become modular and reusable across different agents and domains, increasing practical applicability of the approach.
§03
Synthesis
## The Core Innovation
Robots today learn tasks reactively: they need explicit instructions to acquire skills. This paper flips that model. Robots should *play first*, building a reusable skill library through self-directed exploration, then draw on that library when real tasks arrive. The reported gain is substantial: a 20.6 percentage-point improvement over the baseline agent on LIBERO-PRO, and the learned skills transfer to other agents without retraining.
## How Playful Learning Works
The system, called RATs (Robotics Agent Teams), operates in two phases.
**Play phase:** The agent doesn't wait for task instructions. Instead, it proposes exploratory tasks to itself, such as "move the gripper to the table corner" or "push the object forward." For each self-generated task, the agent writes robot-control code (Code-as-Policy, meaning the policy is executable program text), executes it on the robot, observes the outcome, and judges whether the attempt succeeded. Crucially, when an attempt fails, the system diagnoses *why*, using error messages and visual feedback, and retries with fine-grained hints. Successful executions are saved as reusable code snippets in a persistent skill library.
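The play loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: `Skill`, `SkillLibrary`, `play_episode`, and the four callables (`write_policy`, `execute`, `verify`, `diagnose`) are hypothetical names introduced here, and the toy stand-ins replace the language model and robot so the sketch runs on its own.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Skill:
    task: str  # natural-language description of what the skill does
    code: str  # policy source text that passed verification


@dataclass
class SkillLibrary:
    skills: list[Skill] = field(default_factory=list)

    def add(self, skill: Skill) -> None:
        self.skills.append(skill)


def play_episode(
    task: str,
    write_policy: Callable[[str, Optional[str]], str],
    execute: Callable[[str], str],
    verify: Callable[[str, str], bool],
    diagnose: Callable[[str, str], str],
    library: SkillLibrary,
    max_attempts: int = 3,
) -> bool:
    """Try one self-proposed task; store the policy code if it verifies."""
    hint: Optional[str] = None
    for _ in range(max_attempts):
        code = write_policy(task, hint)     # assumed: a language-model call
        observation = execute(code)         # assumed: a robot or simulator call
        if verify(task, observation):
            library.add(Skill(task, code))  # keep only verified behavior
            return True
        hint = diagnose(task, observation)  # feedback passed to the next attempt
    return False


# Toy stand-ins (assumptions) so the sketch runs without a robot or a model.
def toy_write_policy(task: str, hint: Optional[str]) -> str:
    return "move_gripper(0.1)" if hint is None else "move_gripper(0.3)"


def toy_execute(code: str) -> str:
    return "reached" if "0.3" in code else "error: target out of reach"


def toy_verify(task: str, observation: str) -> bool:
    return observation == "reached"


def toy_diagnose(task: str, observation: str) -> str:
    return f"previous attempt failed with: {observation}"


library = SkillLibrary()
succeeded = play_episode(
    "move the gripper to the table corner",
    toy_write_policy, toy_execute, toy_verify, toy_diagnose,
    library,
)
```

In this toy run the first attempt fails, the diagnosis becomes a hint, and the second attempt is the one stored in the library. Failed attempts are never stored, which matches the idea that the library holds only verified behaviors.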
**Test phase:** When a downstream task arrives (e.g., "arrange objects on a shelf"), the agent retrieves relevant skills from its play-learned library, incorporates them into the context, and uses them to accelerate solving the new task. Importantly, the underlying language model stays frozen; skills are just added to the prompt.
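The retrieval step can be illustrated as follows. The source does not say how skills are matched to a task, so this sketch swaps in plain word-overlap (Jaccard) scoring as a stand-in; `retrieve`, `build_prompt`, and the example skill entries are hypothetical. The point it shows is that skills enter only through the prompt text, so no model weights change.

```python
def tokenize(text: str) -> set[str]:
    """Lowercase whitespace tokenization; a deliberately crude stand-in."""
    return set(text.lower().split())


def jaccard(a: set[str], b: set[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve(task: str, skills: list[tuple[str, str]], k: int = 2) -> list[tuple[str, str]]:
    """Return up to k (description, code) pairs that share words with the task."""
    query = tokenize(task)
    scored = [(jaccard(query, tokenize(desc)), (desc, code)) for desc, code in skills]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [skill for score, skill in scored[:k] if score > 0]


def build_prompt(task: str, retrieved: list[tuple[str, str]]) -> str:
    """Prepend retrieved skill code to the task; the model itself stays frozen."""
    parts = ["# Reusable skills learned during play:"]
    for desc, code in retrieved:
        parts.append(f"# {desc}\n{code}")
    parts.append(f"# New task: {task}")
    return "\n\n".join(parts)


# Hypothetical library contents; the code strings are placeholders.
skills = [
    ("push the object forward", "def push_forward(robot): ..."),
    ("place object on shelf", "def place_on_shelf(robot): ..."),
    ("open the drawer", "def open_drawer(robot): ..."),
]

task = "arrange objects on a shelf"
retrieved = retrieve(task, skills)
prompt = build_prompt(task, retrieved)
```

A real system would more likely use embedding similarity than word overlap, but the structure is the same: rank stored skills against the task, keep the top few, and place them in the context.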
The key insight is that play generates diverse, low-stakes experience. The agent discovers what works and what doesn't before facing real objectives, building a foundation of verified behaviors.
## Why This Matters
**Practical transfer:** The learned skills aren't locked to one model or one task. Testing shows they improve performance on held-out tasks (+20.6 points on LIBERO-PRO), and they also boost other Code-as-Policy agents by 8.9 points on RoboSuite and 8.8 points on real robots, all without any finetuning. This modularity is rare in robot learning.
**Efficiency at scale:** Robots accumulate skills during idle periods (play time) rather than requiring explicit task definitions upfront. As the skill library grows, downstream task solving becomes faster and more reliable.
**Biological plausibility:** The approach mirrors how animals and children learn: play is a natural, low-pressure way to explore and acquire competence before facing real-world demands.
The experiments span simulated benchmarks (LIBERO-PRO, MolmoSpaces, RoboSuite) and real robot setups, consistently showing that play-learned skills outperform both no-play and random-play baselines. The gains of 20.6 points on LIBERO-PRO and 17.0 points on MolmoSpaces suggest that the self-directed exploration strategy is substantially more effective than passive or undirected learning.