Source: arXiv
Published: 16 June 2026

Authors:

Adnan El Assadi, Roman Solomatin, Isaac Chung, Chenghao Xiao, Deep Shah, Manan Dey, Shriya Sudhakar, Zacharie Bugaud, Wissam Siblini, Ayush Sunil Munot, Yashwanth Devavarapu, Rakshitha Ireddi, Michelle Yang, Márton Kardos, Niklas Muennighoff, Kenneth Enevoldsen

MVEB: Massive Video Embedding Benchmark

Source · arxiv.org/watch?v=2606.14958 ↗

§03

Synthesis

## The Gap in Video Understanding Evaluation

Researchers have built dozens of models that turn videos into numerical representations (embeddings)—useful for searching videos, categorizing them, or comparing them. Yet no standard benchmark existed to compare these models fairly across different tasks. The authors created MVEB, a 23-task benchmark covering six major use cases, and discovered something striking: the best model depends entirely on the job. No single approach wins everywhere.

## What They Tested

The benchmark spans diverse tasks: classification (assign a video to a category), zero-shot classification (recognize categories the model never trained on), clustering (group similar videos together), pair classification (decide if two videos are related), retrieval (find videos matching a query), and video question answering (answer questions about video content). They evaluated 33 existing models—from multimodal large language models (MLLMs, like vision-language systems that process both images and text) to specialized retrieval-focused architectures.
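As a rough illustration of how one embedding space can serve several of these tasks, the sketch below ranks videos against a query (retrieval) and assigns a video to its nearest label (zero-shot classification) using cosine similarity. The 3-dimensional vectors, the clip names, and the helper `rank_by_similarity` are invented for illustration; they are not taken from MVEB or from any evaluated model.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_by_similarity(query, candidates):
    """Return candidate keys sorted from most to least similar to `query`."""
    return sorted(candidates, key=lambda k: cosine(query, candidates[k]), reverse=True)

# Toy embeddings (made up). A real model would produce hundreds of dimensions.
video_embeddings = {
    "cooking_clip": [0.9, 0.1, 0.0],
    "soccer_clip": [0.1, 0.9, 0.1],
    "guitar_clip": [0.0, 0.2, 0.9],
}
label_embeddings = {
    "cooking": [1.0, 0.0, 0.0],
    "sports": [0.0, 1.0, 0.0],
    "music": [0.0, 0.0, 1.0],
}

# Retrieval: embed a text query, then rank videos by similarity to it.
query_embedding = [0.2, 0.8, 0.0]  # stands in for an embedded text query
ranked_videos = rank_by_similarity(query_embedding, video_embeddings)

# Zero-shot classification: pick the label whose embedding is closest to the video.
predicted_label = rank_by_similarity(video_embeddings["guitar_clip"], label_embeddings)[0]

print(ranked_videos)
print(predicted_label)
```

Both tasks reduce to the same nearest-neighbour comparison; what differs is which side of the comparison is text and which is video. That is one reason cross-modal alignment matters for retrieval and zero-shot tasks in particular.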

The results split cleanly by task type rather than producing a single ranking. MLLM-based embedding models, which combine visual and text reasoning, did best on classification, clustering, pair classification, and QA. A different category, multimodal binding models, dominated retrieval and zero-shot classification. Generative MLLMs without explicit contrastive training (the kind designed to generate captions rather than embeddings) failed badly on cross-modal tasks, which suggests that both the architecture and the training objective strongly affect embedding quality.
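One way to see this kind of split in a results table is to average scores per task category and pick the top model in each. The sketch below does that. The model names and scores are placeholders chosen only to demonstrate the aggregation; they are not numbers from the paper, and `best_model_per_category` is a hypothetical helper.

```python
from collections import defaultdict

def best_model_per_category(scores):
    """scores: iterable of (model, category, score) tuples.

    Returns {category: (best_model, mean_score)} using the mean over tasks.
    """
    buckets = defaultdict(list)
    for model, category, score in scores:
        buckets[(category, model)].append(score)

    best = {}
    for (category, model), values in buckets.items():
        mean = sum(values) / len(values)
        if category not in best or mean > best[category][1]:
            best[category] = (model, mean)
    return best

# Placeholder scores (NOT results from the paper).
placeholder_scores = [
    ("mllm_embedder", "classification", 0.70),
    ("mllm_embedder", "classification", 0.80),
    ("binding_model", "classification", 0.60),
    ("binding_model", "classification", 0.70),
    ("mllm_embedder", "retrieval", 0.40),
    ("binding_model", "retrieval", 0.55),
]

winners = best_model_per_category(placeholder_scores)
for category, (model, mean) in sorted(winners.items()):
    print(f"{category}: {model} (mean {mean:.2f})")
```

Reporting a winner per category, instead of one overall average, keeps a model that is strong on retrieval from being hidden by weaker classification scores, and vice versa.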

## The Audio Surprise

One unexpected finding emerged from comparing models with and without audio input: audio's value flipped based on *how the original dataset was labeled*. When humans labeled videos using both sight and sound, adding audio to the model improved performance by about six percentage points. When humans labeled based only on visuals, audio actually *hurt* performance by a similar margin, and the pattern held across different model families. This suggests that audio-visual models pick up real signals, but those signals help only when the ground-truth labels reflect them.
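The comparison behind this finding comes down to a per-dataset difference (score with audio minus score without), averaged separately for each labeling protocol. A minimal sketch of that arithmetic follows. The accuracy values are illustrative placeholders picked to mirror the roughly six-point effect described above, not figures from the paper, and the field names are assumptions.

```python
def mean_audio_delta(results):
    """Average (audio_visual - visual_only) score, grouped by how labels were made.

    results: list of dicts with keys "label_source", "visual_only", "audio_visual",
    where scores are accuracies in percentage points.
    """
    sums, counts = {}, {}
    for r in results:
        delta = r["audio_visual"] - r["visual_only"]
        source = r["label_source"]
        sums[source] = sums.get(source, 0.0) + delta
        counts[source] = counts.get(source, 0) + 1
    return {source: sums[source] / counts[source] for source in sums}

# Illustrative placeholder accuracies (NOT results from the paper).
placeholder_results = [
    {"label_source": "audio_visual", "visual_only": 60, "audio_visual": 67},
    {"label_source": "audio_visual", "visual_only": 50, "audio_visual": 55},
    {"label_source": "visual_only", "visual_only": 70, "audio_visual": 63},
    {"label_source": "visual_only", "visual_only": 80, "audio_visual": 75},
]

deltas = mean_audio_delta(placeholder_results)
print(deltas)
```

Grouping by labeling protocol before averaging is the key step. Pooling all datasets together would let the positive and negative effects cancel and make audio look irrelevant.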

## Design and Impact

The authors curated MVEB from a larger pool of 184 tasks, intentionally reducing the benchmark to 23 while preserving diversity—a practical choice balancing evaluation cost against comprehensive coverage. Rather than creating an isolated benchmark, they integrated MVEB into MTEB, an existing unified ecosystem for evaluating embeddings across text, images, audio, and video. This positioning matters: unified evaluation frameworks let researchers compare trade-offs across modalities and task types systematically.

The release includes all 184 tasks, code, and a leaderboard, making it a resource for the research community rather than a one-time analysis.

## Why It Matters

Video embeddings power real applications such as YouTube recommendations, video search, and content moderation, yet researchers had no common yardstick. MVEB fills that gap and immediately reveals an uncomfortable truth: specialization wins. A team building a video search system needs a different model than one building a video classifier. The benchmark also flags a practical concern: naively adding audio can backfire when the labels were assigned from visuals alone. These insights should guide future model design and dataset curation in video understanding.
