Source: arXiv
Published: 18 June 2026
Snippets: 4

Authors

Lichen Bai, Tianhao Zhang, Shitong Shao, Dingwei Tan, Qiyu Zhong, Zhengpeng Xie, Haopeng Li, Qinghao Huang, Dandan Shen, Tengjiao Ji, Wei Wang, Peicheng Wu, Yuxuan Zhao, Xiangyu Zhu, Welly Luo, Shurui Yang, Zeke Xie

MaineCoon: Pursuing A Real-Time Audio-Visual Social World Model

Source · arxiv.org/watch?v=2606.17800 ↗

§02

Snippets

№01

MaineCoon is the first real-time audio-visual autoregressive model optimized for social-interactive applications, achieving 47.5 FPS on a single GPU.

Real-time social interaction requires sub-second latency; previous world models ignored human-centric social dynamics entirely.

№02

A 22B-parameter model enables streaming generation with sub-second interaction latency, supporting thousand-second-scale or longer continuous output.

Long-horizon generation without drift is critical for immersive social experiences; most prior models fail at scale.

№03

Self-resampling, cross-modal alignment, domain-aware preference optimization, and reinforced online-policy distillation accelerate training while maintaining inference performance.

These techniques address efficiency bottlenecks that have prevented social world models from scaling to real-time speeds.

№04

An agentic streaming inference framework with cache management and prompt planning mitigates drift over thousand-second generations.

Controlling degradation in long-form generation is unsolved; this framework enables truly interactive social simulations.

§03

Synthesis

## Why This Matters

Most video generation models target passive viewing, such as simulating game worlds or physical environments. MaineCoon shifts the focus to *social interaction*: real-time video with audio, where a person can talk to the model and see an on-screen character respond with sub-second latency. On a single GPU it generates video at 47.5 frames per second, which the authors present as a first for a real-time audio-visual autoregressive model aimed at social-interactive applications.

## The Core Challenge

Building a model that handles both audio and video in real time is difficult. It has to:

- Generate high-quality frames *and* synchronized audio simultaneously
- Keep latency low enough that interaction feels natural (the paper reports sub-second interaction latency)
- Run efficiently on a single GPU
- Prevent quality degradation, or "drift", over long generation sequences (on the order of a thousand seconds or more)
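To make these constraints concrete, the short calculation below turns the reported 47.5 FPS into a per-frame time budget and counts the frames in a thousand-second stream. It is a back-of-the-envelope sketch: it assumes 47.5 FPS is sustained throughput and uses 1,000 seconds as a round stand-in for "thousand-second-scale"; the function names are illustrative only.

```python
# Figures taken from the text: 47.5 FPS and thousand-second-scale output.
FPS = 47.5
DURATION_S = 1_000  # assumption: a round value standing in for "thousand-second-scale"


def frame_budget_ms(fps: float) -> float:
    """Average wall-clock time available per frame at a given frame rate."""
    return 1000.0 / fps


def total_frames(fps: float, duration_s: float) -> int:
    """Number of frames needed to fill duration_s seconds at the given rate."""
    return round(fps * duration_s)


if __name__ == "__main__":
    print(f"per-frame budget: {frame_budget_ms(FPS):.2f} ms")    # 21.05 ms
    print(f"frames in {DURATION_S} s: {total_frames(FPS, DURATION_S)}")  # 47500
```

Under these assumptions, every frame and its matching audio must be produced in roughly 21 ms on average, and any small per-frame error has about 47,500 chances to compound, which is why drift is listed as a separate problem.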

Previous autoregressive models (which generate content one token at a time, like language models) either prioritized quality at the cost of speed or ran fast at the cost of quality.
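The "one token at a time" loop can be sketched in a few lines. This is a toy illustration, not MaineCoon's decoder: `toy_next_token` is a hypothetical deterministic stand-in for a learned model, and the integer tokens carry no meaning.

```python
from typing import Callable, List


def toy_next_token(context: List[int], vocab_size: int = 10) -> int:
    """Hypothetical stand-in for a learned model: a deterministic function of the context."""
    return (sum(context) + len(context)) % vocab_size


def generate(
    prompt: List[int],
    steps: int,
    next_token: Callable[[List[int]], int] = toy_next_token,
) -> List[int]:
    """Autoregressive decoding: predict one token from everything so far, append, repeat."""
    sequence = list(prompt)
    for _ in range(steps):
        sequence.append(next_token(sequence))
    return sequence


if __name__ == "__main__":
    print(generate([1, 2], steps=3))  # [1, 2, 5, 1, 3]
```

Because each step depends on the output of the previous one, the steps cannot simply run in parallel, so total latency grows with the number of tokens generated. That sequential dependency is the source of the speed-versus-quality trade-off described above.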

## How MaineCoon Works

The authors built a 22-billion-parameter autoregressive model that predicts video frames and audio tokens jointly, much as a large language model predicts text. The key innovations fall into three buckets:

**Training efficiency**: Self-resampling adaptively adjusts token resolution during training to focus on harder samples. Cross-modal representation alignment ensures audio and video "understand" each other in the same embedding space. Domain-aware preference optimization and reinforced online-policy distillation (ROPD) steer the model toward outputs that feel natural for social contexts without slowing it down.
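As a rough illustration of what aligning audio and video in one embedding space can mean, the sketch below computes a generic cosine-distance loss over time-aligned embedding pairs. The summary does not specify the paper's actual alignment objective, so this is an assumed, commonly used form; the function names and toy vectors are hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


def alignment_loss(
    audio_embs: Sequence[Sequence[float]],
    video_embs: Sequence[Sequence[float]],
) -> float:
    """Mean (1 - cosine similarity) over pairs taken from the same time step."""
    if not audio_embs or len(audio_embs) != len(video_embs):
        raise ValueError("need the same non-zero number of audio and video embeddings")
    distances = [1.0 - cosine_similarity(a, v) for a, v in zip(audio_embs, video_embs)]
    return sum(distances) / len(distances)


if __name__ == "__main__":
    aligned = alignment_loss([[1.0, 0.0]], [[2.0, 0.0]])      # same direction
    misaligned = alignment_loss([[1.0, 0.0]], [[0.0, 1.0]])   # orthogonal
    print(aligned, misaligned)
```

A loss of this shape is near zero when the audio and video embeddings for a time step point the same way and grows as they diverge, so minimizing it pulls the two modalities toward a shared representation.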

**Inference speed**: Rather than generate greedily frame-by-frame, they deploy an "agentic streaming inference framework" that uses cached states and intelligent prompt planning to batch computation. This lets the model predict multiple timesteps ahead while managing memory footprint on a single GPU.
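The idea of producing several timesteps per call while carrying state forward can be sketched as a chunked generator. This is a minimal sketch under stated assumptions, not the authors' framework: the "frame" is just an integer, the cached state is a single number, and `stream_chunks` is a hypothetical name.

```python
from typing import Iterator, List


def stream_chunks(num_frames: int, chunk_size: int, state: int = 0) -> Iterator[List[int]]:
    """Yield frames in chunks, conditioning each chunk on cached state only.

    Earlier frames are never re-read: the running `state` summarizes them.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    produced = 0
    while produced < num_frames:
        count = min(chunk_size, num_frames - produced)
        chunk = []
        for _ in range(count):
            state += 1           # toy update: the next frame depends only on the cached state
            chunk.append(state)
        produced += count
        yield chunk              # a consumer can start playback before later chunks exist


if __name__ == "__main__":
    print(list(stream_chunks(num_frames=5, chunk_size=2)))  # [[1, 2], [3, 4], [5]]
```

The design point is that the time to first output is one chunk rather than the whole sequence, and the work per chunk stays constant because it depends on the cached state instead of the full history.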

**Long-form stability**: A thousand-second video would normally accumulate errors. Their agentic cache management—essentially letting the model "reflect" on what it generated and adjust course—mitigates drift over extended sequences.
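One generic way to keep a cache bounded over a very long stream is to pin a few early entries and keep a sliding window of recent ones. The sketch below shows that policy only; it is an assumption chosen for illustration, not the paper's cache manager, and it does not model the "reflection" or prompt planning described in the text. `BoundedCache` and its parameters are hypothetical.

```python
from collections import deque
from typing import Deque, List


class BoundedCache:
    """Pins the first `num_anchor` entries and keeps only the latest `window` others."""

    def __init__(self, num_anchor: int, window: int) -> None:
        if num_anchor < 0 or window <= 0:
            raise ValueError("num_anchor must be >= 0 and window must be > 0")
        self.num_anchor = num_anchor
        self.anchors: List[int] = []
        self.recent: Deque[int] = deque(maxlen=window)

    def append(self, entry: int) -> None:
        """Store an entry; once the window is full, the oldest recent entry is evicted."""
        if len(self.anchors) < self.num_anchor:
            self.anchors.append(entry)
        else:
            self.recent.append(entry)

    def contents(self) -> List[int]:
        """Entries currently visible to the model, oldest first."""
        return self.anchors + list(self.recent)


if __name__ == "__main__":
    cache = BoundedCache(num_anchor=2, window=3)
    for frame_id in range(10):
        cache.append(frame_id)
    print(cache.contents())  # [0, 1, 7, 8, 9]
```

With a policy like this, memory use stays fixed no matter how long the stream runs, while the pinned entries give later frames a stable reference to the start of the session.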

## The Results

MaineCoon achieves 47.5 FPS with sub-second latency on a single GPU. That is a step-change from prior work, which offered either real-time inference at lower quality or high quality at latency measured in seconds. The authors position this as the first "social world model": one optimized not for passive viewing but for interactive, human-centric scenarios such as virtual beings responding to speech in real time.

The claim is architectural and practical: social platforms deserve models built with their constraints in mind. A user talking to an avatar needs synchronous audio-visual output and immediate response, not batch processing.
