Text-Vision Co-Instructed Image Editing

## The Core Problem

Image editing tools today face a fundamental trade-off. Text prompts like "make the sky orange" are semantically clear but give you no control over *where* the edit happens—the model might change the wrong sky, or blur boundaries badly. Visual prompts like dragging pixels give precise spatial control but are ambiguous: dragging a person's arm could mean "move," "stretch," "deform," or something else entirely. This paper argues you shouldn't have to choose.

## The Solution

The authors propose **TV-Edit**, a framework that combines a text instruction with visual guidance in a single request. A user might say "make this person taller" *and* draw a bounding box or drag vector showing where the edit should apply. The key idea is to treat the text as semantic intent and the visual markers as spatial constraints, and to interpret the two jointly rather than in isolation.
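To make the idea of a co-instruction concrete, here is a minimal sketch of how such a request could be represented. The names `CoInstruction`, `BoundingBox`, and `DragVector` are hypothetical and chosen only for illustration; the source does not describe the actual input format used by TV-Edit.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Point = Tuple[float, float]  # (x, y) in pixel coordinates


@dataclass(frozen=True)
class BoundingBox:
    """Hypothetical axis-aligned region marking where the edit applies."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    def contains(self, p: Point) -> bool:
        x, y = p
        return self.x_min <= x <= self.x_max and self.y_min <= y <= self.y_max


@dataclass(frozen=True)
class DragVector:
    """Hypothetical drag gesture: from a start point to an end point."""
    start: Point
    end: Point

    def displacement(self) -> Point:
        return (self.end[0] - self.start[0], self.end[1] - self.start[1])


@dataclass(frozen=True)
class CoInstruction:
    """Text carries the semantic intent; box/drag carry the spatial constraint."""
    text: str
    box: Optional[BoundingBox] = None
    drag: Optional[DragVector] = None

    def __post_init__(self) -> None:
        # Assumption: a co-instruction needs both modalities to be meaningful.
        if not self.text.strip():
            raise ValueError("text instruction must not be empty")
        if self.box is None and self.drag is None:
            raise ValueError("at least one visual marker is required")


request = CoInstruction(
    text="make this person taller",
    box=BoundingBox(120, 40, 260, 400),
    drag=DragVector(start=(190, 60), end=(190, 20)),
)
```

Keeping the text and the visual markers in one object reflects the pairing described above: neither modality is optional decoration for the other.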

To train this approach, the authors built a dataset of over 23,000 paired textual-visual instructions extracted from videos. Videos are valuable here because they provide natural temporal sequences showing how objects move and change—a person walking gives you aligned examples of the same person in different poses and positions, all with consistent semantics. This "dynamic video" source grounds the alignment between what you're saying (the text) and where you're pointing (the visual).
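As a rough illustration of why video is a convenient source, the sketch below turns a tracked object's positions across frames into candidate (source frame, target frame, drag) pairs. This is an assumption about how such pairs might be mined, not the authors' pipeline; `TrackPoint`, `EditPair`, `pairs_from_track`, and the `gap` and `min_shift` parameters are all hypothetical, and the text half of each pair would have to come from a separate captioning or annotation step.

```python
from typing import List, NamedTuple, Tuple

Point = Tuple[float, float]


class TrackPoint(NamedTuple):
    frame: int      # frame index in the video
    center: Point   # tracked object center (x, y) in that frame


class EditPair(NamedTuple):
    source_frame: int
    target_frame: int
    drag_start: Point
    drag_end: Point


def pairs_from_track(
    track: List[TrackPoint], gap: int = 1, min_shift: float = 5.0
) -> List[EditPair]:
    """Pair each tracked position with the one `gap` steps later.

    Pairs whose displacement is below `min_shift` pixels are skipped,
    since they carry almost no spatial signal.
    """
    pairs: List[EditPair] = []
    for i in range(len(track) - gap):
        src, dst = track[i], track[i + gap]
        dx = dst.center[0] - src.center[0]
        dy = dst.center[1] - src.center[1]
        if (dx * dx + dy * dy) ** 0.5 >= min_shift:
            pairs.append(EditPair(src.frame, dst.frame, src.center, dst.center))
    return pairs


# Hypothetical track of one object over three sampled frames.
track = [
    TrackPoint(0, (10.0, 10.0)),
    TrackPoint(10, (12.0, 10.0)),
    TrackPoint(20, (30.0, 10.0)),
]
adjacent = pairs_from_track(track, gap=1)
```

Because both frames show the same object, the source and target share semantics while differing in position, which is the alignment property the paragraph above attributes to video.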

The TV-Edit framework itself takes those paired instructions and lifts them into "semantic-aware control representations"—a middle-ground language the model uses internally. Rather than directly feeding drag vectors or points to a pretrained image editor, the system first contextualizes them using image-text semantics, enriching sparse spatial signals with semantic meaning. This lets existing editing backbones (the underlying generative models) understand both *what* and *where* in a unified way.
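One way to picture "enriching sparse spatial signals with semantic meaning" is to attach local image features and a text embedding to the raw drag coordinates before handing them to the editor. The sketch below does this with plain lists. It is a simplified stand-in and not the representation used in the paper: `control_token`, `sample_feature`, the feature-map layout, and the concatenation scheme are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def normalize(p: Point, width: int, height: int) -> Point:
    """Map pixel coordinates to the [0, 1] range."""
    return (p[0] / width, p[1] / height)


def sample_feature(
    feature_map: Sequence[Sequence[Sequence[float]]], p_norm: Point
) -> Sequence[float]:
    """Nearest-cell lookup in a (rows x cols x channels) feature grid."""
    rows, cols = len(feature_map), len(feature_map[0])
    r = min(int(p_norm[1] * rows), rows - 1)
    c = min(int(p_norm[0] * cols), cols - 1)
    return feature_map[r][c]


def control_token(
    drag_start: Point,
    drag_end: Point,
    image_size: Tuple[int, int],
    feature_map: Sequence[Sequence[Sequence[float]]],
    text_embedding: Sequence[float],
) -> List[float]:
    """Concatenate drag geometry, local image features, and text features.

    Assumption: simple concatenation stands in for whatever fusion the
    real model performs.
    """
    width, height = image_size
    s = normalize(drag_start, width, height)
    e = normalize(drag_end, width, height)
    local = sample_feature(feature_map, s)
    return [s[0], s[1], e[0], e[1], *local, *text_embedding]


# Toy inputs: a 2x2 grid of 2-channel features and a 3-dim text embedding.
features = [
    [[0.1, 0.2], [0.3, 0.4]],
    [[0.5, 0.6], [0.7, 0.8]],
]
token = control_token(
    drag_start=(75, 25),
    drag_end=(75, 75),
    image_size=(100, 100),
    feature_map=features,
    text_embedding=[1.0, 0.0, -1.0],
)
```

The point of the sketch is only that the drag no longer arrives as four bare numbers: it carries information about what lies under the cursor and what the user asked for.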

## Why It Matters

The results show consistent improvements over text-only and drag-only baselines across multiple editing backbones. The authors also built **TV-Edit-Bench**, a benchmark with ground-truth references and controlled variations, which addresses a real gap: prior evaluation frameworks didn't properly measure whether edits were both semantically faithful (did you actually do what the user said?) and spatially accurate (did you edit the right region?).
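The spatial half of that question ("did you edit the right region?") can be illustrated with a simple localization score: the fraction of changed pixels that fall inside the region the user indicated. This is a toy measure written for this summary, not the metric defined by TV-Edit-Bench; `changed_pixels` and `edit_localization` are hypothetical names, and images are modeled as nested lists of grayscale values.

```python
from typing import List, Set, Tuple

Image = List[List[int]]            # rows of grayscale values
Box = Tuple[int, int, int, int]    # (row_min, col_min, row_max, col_max), inclusive


def changed_pixels(before: Image, after: Image, threshold: int = 0) -> Set[Tuple[int, int]]:
    """Coordinates whose value changed by more than `threshold`."""
    return {
        (r, c)
        for r, row in enumerate(before)
        for c, value in enumerate(row)
        if abs(after[r][c] - value) > threshold
    }


def edit_localization(before: Image, after: Image, box: Box) -> float:
    """Fraction of changed pixels that lie inside the target box.

    Returns 0.0 when nothing changed, since no edit was localized at all.
    """
    r0, c0, r1, c1 = box
    changed = changed_pixels(before, after)
    if not changed:
        return 0.0
    inside = sum(1 for (r, c) in changed if r0 <= r <= r1 and c0 <= c <= c1)
    return inside / len(changed)


before = [[0, 0, 0], [0, 0, 0], [0, 0, 0]]
after = [[9, 0, 0], [0, 9, 0], [0, 0, 9]]
score = edit_localization(before, after, box=(0, 0, 1, 1))
```

A score near 1 means the edit stayed where the user pointed. Semantic faithfulness would need a separate check against the text instruction, which this sketch does not attempt.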

The practical upshot: co-instruction editing lets users state what they want in words and point to where it should happen, instead of choosing between the semantic clarity of text and the spatial control of dragging. For applications requiring careful spatial manipulation (portrait editing, object repositioning, localized style transfer), this removes a major usability bottleneck. The 23K-pair dataset and the benchmark are also artifacts the community can reuse, lowering the barrier for future work in this direction.

The insight is simple: spatial and semantic constraints are complementary, not competing. By modeling both explicitly, the authors sidestep a limitation that has long affected single-modality approaches.
