Source: arXiv
Published: 18 June 2026

Authors: Jingru Guo, Xiangyuan Xue, Lian Zhang, Wanghan Xu, Siki Chen, Philip Torr, Wanli Ouyang, Lei Bai, Zhenfei Yin

SciOrch: Learning to Orchestrate Expert LLMs for Solving Frontier Multimodal Scientific Reasoning Tasks

Synthesis

## The Core Finding

Different frontier LLMs have complementary strengths: each excels on different scientific questions. Rather than pick one model, SciOrch trains a lightweight 8B-parameter orchestrator that decides which expert model to call for each sub-problem, reaching 56.66% average accuracy on a 240-question frontier science benchmark. That is 3.74 percentage points above the best single commercial model, at less than half the API cost of typical multi-agent approaches.

## Why This Matters

Frontier scientific reasoning (olympiad-level math, chemistry, and biology problems) pushes current LLMs to their limits. The key insight is that no single model dominates across all question types. For illustration, OpenAI's o1 might excel at physics problems but stumble on biology, while Claude is stronger elsewhere. Naively running every query through every model wastes money and adds latency. The authors' strategy is to train a small, efficient router that diagnoses each problem and delegates it to the expert most likely to solve it.

## The Method's Core Constraint

Standard reinforcement learning for agents relies on cheap rollouts: try many action sequences and learn from the rewards. Here, each action is an API call that costs real money and takes seconds to return, so running thousands of trajectories to train a policy is prohibitive. The authors needed a training approach that extracts as much learning signal as possible from a small number of expensive rollouts.

Their solution uses Monte Carlo Tree Search (MCTS) to generate diverse orchestration trajectories offline. MCTS explores which expert to call next by building a search tree, balancing exploration and exploitation. Rather than requiring full trajectories, the authors extract single-turn training samples from each node in the tree, which squeezes more signal out of fewer API calls. They then optimize the orchestrator with GRPO-style training (Group Relative Policy Optimization), a policy-gradient reinforcement learning method that scores each sampled response relative to the others in its group.
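The three ingredients named above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the standard UCT rule for tree selection, a hypothetical `Node` structure, a `(state, action, value)` sample format, and the usual group-normalized advantage associated with GRPO. The paper's actual tree policy, reward, and sample format are not specified in this summary.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    """One state in the orchestration search tree (hypothetical structure)."""
    state: str                        # question plus expert replies so far
    action: Optional[str] = None      # expert call that led here; None at root
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)
    visits: int = 0
    value_sum: float = 0.0

    @property
    def mean_value(self) -> float:
        return self.value_sum / self.visits if self.visits else 0.0


def uct_select(node: Node, c: float = 1.4) -> Node:
    """Pick the child with the highest UCT score; unvisited children go first."""
    def score(child: Node) -> float:
        if child.visits == 0:
            return math.inf
        explore = c * math.sqrt(math.log(node.visits) / child.visits)
        return child.mean_value + explore
    return max(node.children, key=score)


def backpropagate(leaf: Node, reward: float) -> None:
    """Add one rollout's reward to every node on the path back to the root."""
    node: Optional[Node] = leaf
    while node is not None:
        node.visits += 1
        node.value_sum += reward
        node = node.parent


def single_turn_samples(root: Node) -> List[Tuple[str, str, float]]:
    """Turn every visited edge into one (state, action, value) training sample."""
    samples, stack = [], [root]
    while stack:
        node = stack.pop()
        for child in node.children:
            if child.visits > 0:
                samples.append((node.state, child.action, child.mean_value))
            stack.append(child)
    return samples


def group_relative_advantages(rewards: List[float]) -> List[float]:
    """Normalize rewards within one group: (r - mean) / (std + eps)."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + 1e-8) for r in rewards]


# Toy tree: two candidate expert calls from the same question state.
root = Node(state="Q")
a = Node(state="Q|A", action="expert_a", parent=root)
b = Node(state="Q|B", action="expert_b", parent=root)
root.children = [a, b]
backpropagate(a, 1.0)   # calling expert_a led to a correct answer
backpropagate(b, 0.0)   # calling expert_b did not

samples = single_turn_samples(root)
advantages = group_relative_advantages([v for _, _, v in samples])
```

Because every visited edge yields its own sample, a single tree built from a handful of paid rollouts produces many training examples, which is the point of the single-turn extraction.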

The orchestrator itself decomposes multi-step problems. It breaks a chemistry question into sub-parts (identify the compound, compute its properties, synthesize the answer), routes each sub-part to the most capable expert, and merges the results into a final response.
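The decompose, route, and merge loop can be sketched as follows. Everything here is a stand-in: the experts are local stub functions rather than API calls, `route` is a keyword rule in place of the learned 8B policy, and the decomposition is passed in by hand, whereas the real orchestrator produces it.

```python
from typing import Callable, Dict, List, Tuple

Expert = Callable[[str], str]


def expert_chem(prompt: str) -> str:
    """Stub for a chemistry-strong model; a real system would call an API."""
    return f"[chem] {prompt}"


def expert_math(prompt: str) -> str:
    """Stub for a math-strong model; a real system would call an API."""
    return f"[math] {prompt}"


EXPERTS: Dict[str, Expert] = {"chem": expert_chem, "math": expert_math}


def route(sub_problem: str) -> str:
    """Hypothetical keyword router standing in for the learned policy."""
    return "chem" if "compound" in sub_problem.lower() else "math"


def orchestrate(
    sub_problems: List[str],
    experts: Dict[str, Expert],
    router: Callable[[str], str],
) -> Tuple[str, List[Tuple[str, str]]]:
    """Send each sub-problem to one expert, then merge the replies."""
    trace = []
    for sub in sub_problems:
        name = router(sub)            # one routing decision per sub-problem
        reply = experts[name](sub)    # exactly one expert call per sub-problem
        trace.append((name, reply))
    answer = " | ".join(reply for _, reply in trace)
    return answer, trace


steps = ["Identify the compound", "Compute its molar mass"]
answer, trace = orchestrate(steps, EXPERTS, route)
```

Each sub-problem triggers a single expert call, so the number of paid calls grows with the number of sub-problems rather than with the number of experts available.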

## Validation and Trade-offs

On the SGI-Reasoning and Scientists' First Exam (SFE) datasets, SciOrch reaches 56.66% average accuracy. It beats Claude 3.5 Sonnet (the strongest single-model baseline) by 3.74 points, implying about 52.92% for that model, and outperforms a multi-agent baseline by 3.33 points. Critically, it achieves these gains while using less than half the API budget of typical multi-agent setups, which matters when each query costs money and time.

The trade-off is acceptable: an 8B model running locally can coordinate calls to frontier models (Claude, GPT-4, etc.) without sending every question through every model. Decomposition also lets the orchestrator match sub-problems to specialists, directing geometry problems to one expert and organic synthesis to another.

## Limitations and Scope

The method assumes access to diverse frontier LLM APIs and a curated science benchmark for training. It does not address whether the orchestrator learns generalizable routing strategies or overfits to the benchmark's question types. The 240-question test set is modest for drawing broad conclusions, though the gains are consistent across both the SGI-Reasoning and SFE subsets.
