Source: arXiv
Published: 18 June 2026
Snippets: 4

A conversation between

Haoran You, Yotam Nitzan, Lingzhi Zhang, Yifan Gong, Mang-Tik Chiu, Connelly Barnes, Yan Kang, Yuqian Zhou, Eli Shechtman, Sohrab Amirghodsi

HiLo-Token: Input-Adaptive High-Low Frequency Token Compression for Efficient Image Editing

Source · arxiv.org/watch?v=2606.13898 ↗


§02

Snippets

№01

HiLo-Token achieves 3.13x speedup on Diffusion Transformers for image editing by adaptively compressing tokens based on spatial frequency.

DiT latency dominates generative image editing pipelines; practical speedups enable real-time creative tools in production systems.

№02

The method allocates full token budget to user-specified editing regions while using high-frequency selection outside and 16x downsampled tokens for low-frequency components.

Input-adaptive allocation balances locality preservation with global context, avoiding one-size-fits-all compression trade-offs.

№03

Speedups scale with mask ratio: 3.13x for small masks, 2.59x for medium, 1.67x for large, maintaining generation quality across all categories.

Performance scales gracefully with task complexity, showing the method's robustness across diverse editing scenarios.

№04

DiT modules account for 73% of latency even after distillation from 50 to 8 timesteps, making token efficiency critical for editing tools.

Identifies the bottleneck in modern generative editing pipelines and motivates why compression targets the right component.

§03

Synthesis

## The Problem: Diffusion Transformers Are Bottlenecks in Image Editing

Generative image editing tools such as Photoshop's Generative Fill are computationally expensive. Moving from convolution-based U-Nets to Diffusion Transformers (DiTs), neural networks that process images as sequences of tokens, concentrates most of that cost in the transformer: even after distillation from 50 to 8 timesteps, the DiT modules account for 73% of total model latency. For a production service handling a high volume of edit requests, that share limits how responsive the tool can be.

The core bottleneck: a standard DiT spends the same computation on every token, treating a detailed foreground the same as a blurry background. When a user masks a small area for editing, the model still processes the rest of the image at full token resolution, including regions that contribute little to the edit.

## The Solution: Adaptive Token Allocation by Frequency

HiLo-Token compresses tokens intelligently by recognizing that images contain two types of information with different computational needs.

**High-frequency tokens** capture local details—edges, texture, fine structure. These matter everywhere, but especially in the user-specified editing region.

**Low-frequency tokens** encode global structure—overall color, blur, shape. These can be heavily compressed outside the editing region without losing context.
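
The high/low split can be illustrated with a simple filter. The sketch below is not the paper's procedure: it assumes a block-average low-pass filter and treats the residual as the high-frequency part. The helper name `split_frequencies` and its `factor` parameter are hypothetical.

```python
import numpy as np

def split_frequencies(image: np.ndarray, factor: int = 4):
    """Split a 2-D grayscale image into low- and high-frequency parts.

    Assumption for illustration: 'low frequency' is a block average over
    factor x factor tiles, and 'high frequency' is whatever is left over.
    """
    h, w = image.shape
    assert h % factor == 0 and w % factor == 0, "sides must be divisible by factor"
    # Average each factor x factor tile: a crude low-pass filter.
    coarse = image.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))
    # Nearest-neighbour upsample back to the original size.
    low = np.repeat(np.repeat(coarse, factor, axis=0), factor, axis=1)
    # The residual carries edges, texture, and other fine detail.
    high = image - low
    return coarse, low, high

# Toy input: a smooth horizontal gradient plus one sharp spot.
img = np.tile(np.linspace(0.0, 1.0, 16), (16, 1))
img[5, 5] += 1.0
coarse, low, high = split_frequencies(img)
```

On this toy input the smooth gradient is captured almost entirely by the 4×4 `coarse` grid, while the sharp spot shows up as the largest entry of `high`. That is the intuition behind compressing one component heavily and keeping the other selectively.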

The method works in three steps:

1. **Preserve the editing region**: For pixels within the user's mask (dilated slightly for context), keep all tokens. This ensures strong locality around where changes actually happen.

2. **Downsample low-frequency areas**: Outside the mask, extract tokens from a 16× downsampled copy of the image. This drastically cuts tokens while preserving the blurry global structure needed for coherence.

3. **Select high-frequency details outside the mask**: Use spatial frequency analysis to identify and retain high-frequency tokens in the background. This captures important local details (a tree, a face edge) without processing the entire image at full resolution.

The result is an adaptive token budget that scales with the edit region's size and complexity—exactly what production systems need.
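
The three steps above can be sketched on a token grid. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `dilate`, `allocate_tokens`, `pool`, `keep_frac`, and `radius` are hypothetical, "16×" is read as 4× per side (16× fewer tokens), and the per-token `detail` score stands in for whatever spatial frequency analysis the method actually uses.

```python
import numpy as np

def dilate(mask: np.ndarray, radius: int = 1) -> np.ndarray:
    """Binary dilation with a square window, using only NumPy."""
    h, w = mask.shape
    padded = np.pad(mask, radius, mode="constant", constant_values=False)
    out = np.zeros_like(mask)
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            out |= padded[dy:dy + h, dx:dx + w]
    return out

def allocate_tokens(mask, detail, pool=4, keep_frac=0.1, radius=1):
    """Return (full_keep, coarse_keep) boolean maps for a token grid.

    mask   : bool array, True where the user wants to edit
    detail : float array, assumed high-frequency score per token
    """
    h, w = mask.shape
    assert h % pool == 0 and w % pool == 0, "grid must be divisible by pool"
    # Step 1: keep every token inside the (slightly dilated) edit region.
    edit = dilate(mask, radius)
    outside = ~edit
    # Step 2: one coarse token per pool x pool tile that touches the outside.
    coarse_keep = outside.reshape(h // pool, pool, w // pool, pool).any(axis=(1, 3))
    # Step 3: keep the top-scoring fraction of outside tokens at full resolution.
    k = int(round(keep_frac * int(outside.sum())))
    scores = np.where(outside, detail, -np.inf)
    top = np.argsort(scores, axis=None)[::-1][:k]
    hf = np.zeros(h * w, dtype=bool)
    hf[top] = True
    full_keep = edit | hf.reshape(h, w)
    return full_keep, coarse_keep

# Toy example: 32 x 32 token grid with a small 4 x 4 edit mask.
rng = np.random.default_rng(0)
mask = np.zeros((32, 32), dtype=bool)
mask[8:12, 8:12] = True
detail = rng.random((32, 32))  # placeholder scores, not real image data
full_keep, coarse_keep = allocate_tokens(mask, detail)
total = int(full_keep.sum()) + int(coarse_keep.sum())
print(f"{total} tokens kept out of {mask.size}")
```

In this sketch the token count grows with the dilated mask area, which is consistent with the reported speedups shrinking for larger masks. The token reduction in a toy grid should not be read as a latency prediction.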

## Why This Matters

The reported speedups are substantial: **3.13×, 2.59×, and 1.67× DiT speedups** on NVIDIA A100 GPUs for small, medium, and large editing masks respectively. The gain shrinks as the mask grows because more of the image is kept at full token resolution. Generation quality is reported as maintained across all three mask categories.

This matters because latency directly drives user satisfaction in interactive tools. The figures above are DiT speedups, not end-to-end ones: with the DiT at 73% of total latency, a 3.13× DiT speedup works out to roughly a 2× end-to-end speedup if the rest of the pipeline is unchanged, so a 2-second edit would drop to about 1 second. At scale, higher throughput per GPU also means fewer GPUs are needed for the same load, cutting infrastructure costs.
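
Because the reported numbers are DiT speedups and the DiT accounts for 73% of latency, the end-to-end effect can be estimated with Amdahl's law. The sketch below assumes the remaining 27% of the pipeline is unaffected and that the 73% share applies to the uncompressed baseline. The function name `end_to_end_speedup` is illustrative.

```python
def end_to_end_speedup(dit_fraction: float, dit_speedup: float) -> float:
    """Amdahl's law: only the DiT share of latency is accelerated."""
    return 1.0 / ((1.0 - dit_fraction) + dit_fraction / dit_speedup)

# DiT speedups reported for small, medium, and large masks.
for label, s in [("small", 3.13), ("medium", 2.59), ("large", 1.67)]:
    print(f"{label}: {end_to_end_speedup(0.73, s):.2f}x")
# small: 1.99x
# medium: 1.81x
# large: 1.41x
```

Under these assumptions, the non-DiT 27% caps the end-to-end gain at about 3.7× no matter how aggressively tokens are compressed.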

The approach is elegantly simple—no retraining required, no complex learned routing networks. It's a careful exploitation of how humans perceive images: we tolerate quality loss in boring regions as long as the edited area is perfect. By allocating tokens adaptively, HiLo-Token makes DiTs practical for production image editing without sacrificing the quality that creative professionals demand.
