Reinforcement Learning-Guided Retrieval with Soft Fusion for Robust Multimodal Imitation Learning under Missing Modalities

Source · arxiv.org/watch?v=2606.15514 ↗


§02

Snippets

№01

RL4IL uses reinforcement learning to rank relevant expert demonstrations and soft fusion to aggregate their actions for robust robotic imitation learning.

Eliminates the need for policy network retraining, reducing computational overhead while handling sensor failures in real-world deployment.

№02

When a modality drops out at inference, dedicated per-modality RL policies retrieve donor demonstrations and reconstruct missing embeddings via soft imputation without system retraining.

Enables graceful degradation under sensor failures, a critical requirement for reliable robot operation in uncontrolled environments.

№03

Experiments on LIBERO benchmarks show RL4IL substantially outperforms state-of-the-art imitation learning methods under sensor dropout conditions.

Demonstrates practical advantage on established robotic manipulation tasks, validating the approach for real-world relevance.

№04

The method uses Proximal Policy Optimisation over Breadth-First Search candidate sets to train retrieval policies that rank demonstrations.

Combines established RL algorithms with structured search, making the approach interpretable and sample-efficient.
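
As a rough illustration of the training signal behind №04, here is a minimal sketch of the standard PPO clipped surrogate objective. It is the textbook form of the objective, not the paper's training code. The function name `ppo_clip_objective` and the default `clip_eps=0.2` are assumptions for illustration, and how the Breadth-First Search candidate sets or the rewards are built is not specified in this summary, so they are left out.

```python
import math


def ppo_clip_objective(new_logps, old_logps, advantages, clip_eps=0.2):
    """Mean PPO clipped surrogate objective (to be maximised).

    new_logps / old_logps: log-probabilities of the chosen ranking actions
    under the current and the data-collecting policy (hypothetical inputs).
    advantages: one advantage estimate per action.
    """
    total = 0.0
    for new_lp, old_lp, adv in zip(new_logps, old_logps, advantages):
        # Probability ratio between the new and old policy.
        ratio = math.exp(new_lp - old_lp)
        # Keep the ratio inside [1 - eps, 1 + eps].
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Take the pessimistic (smaller) of the two surrogate terms.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

The clipping removes any incentive to move the ranking policy far from the one that collected the data in a single update, which is the usual reason PPO is chosen for stability.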

§03

Synthesis

## The Core Problem and Solution

Real robots fail when sensors break. A camera goes dark, a microphone cuts out, or a depth sensor gets occluded, and today's imitation learning systems, trained on complete multimodal data, tend to collapse under these conditions. RL4IL tackles this without training a traditional policy network. Instead, it retrieves relevant expert demonstrations at test time, ranks them with retrieval policies trained by reinforcement learning, and fuses their actions. Because a missing modality can be reconstructed from retrieved donor demonstrations, the system stays robust to sensor dropout without retraining.

## How RL4IL Works

The method operates in two modes: action selection when all modalities are present, and modality reconstruction when a sensor fails.

**Normal inference:** Given a robot observation with all modalities present (camera, language, etc.), RL4IL doesn't predict actions directly. Instead, it searches a library of training demonstrations to find the most relevant ones. An RL policy, trained using Proximal Policy Optimisation, ranks candidate demonstrations by how well they match the current observation. A soft cross-attention fusion head then aggregates the actions from the top-ranked demonstrations into a final action prediction. This retrieve-and-fuse approach means the system never needs to learn a generalizable policy network; it finds the demonstrations that fit and blends their actions.
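
The retrieve-and-fuse step described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: `soft_fuse_actions` is a hypothetical name, `rank_scores` stands in for the output of the PPO-trained retrieval policy, and plain scaled dot-product attention replaces the learned cross-attention head (no learned projections).

```python
import math


def _softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def soft_fuse_actions(query, demos, rank_scores, k=2):
    """Blend the actions of the k top-ranked demonstrations.

    query: embedding of the current observation (list of floats).
    demos: list of (embedding, action) pairs from the demonstration library.
    rank_scores: one score per demo; a stand-in for the retrieval policy.
    """
    # Keep the indices of the k highest-scoring demonstrations.
    top = sorted(range(len(demos)), key=lambda i: rank_scores[i], reverse=True)[:k]
    # Scaled dot-product attention of the query over the retrieved embeddings.
    scale = math.sqrt(len(query))
    weights = _softmax([_dot(query, demos[i][0]) / scale for i in top])
    # The fused action is the attention-weighted sum of the retrieved actions.
    action_dim = len(demos[top[0]][1])
    fused = [
        sum(w * demos[i][1][d] for w, i in zip(weights, top))
        for d in range(action_dim)
    ]
    return fused, weights
```

Because the weights are a softmax rather than a hard argmax, two equally relevant demonstrations contribute equally to the output instead of one being picked arbitrarily.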

**Missing modality handling:** When a sensor fails at deployment, a dedicated per-modality RL retrieval policy takes over: it finds training demonstrations that have complete information and ranks them by how similar their *available* modalities are to the current observation. A soft imputation head then reconstructs the missing embedding using cross-attention over these top-ranked donors. The reconstructed embedding fills the gap, and normal inference proceeds. Critically, this requires no retraining of any component, because the per-modality policies were trained once during setup.
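
A minimal sketch of the soft imputation idea, under loudly stated assumptions: `impute_missing` is a hypothetical name, observations are plain dictionaries mapping a modality name to an embedding (or `None` when missing), and a dot-product similarity over the available modalities stands in for the per-modality RL retrieval policy, whose actual form is not given in this summary.

```python
import math


def _softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def impute_missing(obs, donors, missing, k=2):
    """Reconstruct obs[missing] from the k most similar complete donors.

    obs: dict modality -> embedding, with obs[missing] set to None.
    donors: list of dicts holding an embedding for every modality.
    """
    available = [m for m, emb in obs.items() if m != missing and emb is not None]

    def similarity(donor):
        # Compare only the modalities the current observation still has.
        return sum(
            _dot(obs[m], donor[m]) / math.sqrt(len(obs[m])) for m in available
        )

    # Stand-in for the learned ranking: keep the k most similar donors.
    top = sorted(donors, key=similarity, reverse=True)[:k]
    weights = _softmax([similarity(d) for d in top])
    # Attention-weighted sum of the donors' embeddings for the missing modality.
    dim = len(top[0][missing])
    recon = [sum(w * d[missing][j] for w, d in zip(weights, top)) for j in range(dim)]
    filled = dict(obs)  # shallow copy so the caller's observation is untouched
    filled[missing] = recon
    return filled, weights
```

The filled observation can then be passed to the normal inference path unchanged, which is what lets the rest of the pipeline stay frozen.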

## Why This Matters

The experiments on LIBERO benchmark suites (three suites covering diverse manipulation tasks) show substantial gains over state-of-the-art imitation learning baselines when sensors drop out. The method accepts a higher computational cost (retrieval at every step) in exchange for robustness and simplicity. It sidesteps the brittleness of learned policies by building on a library of demonstrations: if a new scenario resembles something in the training set, the system can handle it.

The practical upside is significant: deploying robotic systems in the real world demands fault tolerance. Cameras occlude, network connections fail, hardware degrades. RL4IL addresses this head-on with an architecture that doesn't require policy retraining for every new sensor configuration. The soft fusion and imputation mechanisms, both built on cross-attention, are what allow graceful degradation: a missing embedding is reconstructed as an attention-weighted blend of donor embeddings rather than naively dropped or zero-filled.

The lack of policy network training is a double-edged sword: it simplifies deployment and avoids the catastrophic forgetting problem, but it requires maintaining a large demonstration library and performing retrieval at inference time, which may be slow when the library is large.
