RefGC-SR²: Reference-guided Generated Content Super-Resolution and Refinement


## The Core Problem

Reference-guided generation tasks, such as compositing an object into a new scene or customizing its appearance, face a hidden bottleneck: the high-resolution reference image that users provide is downsampled before the generative model ever sees it, so fine details are lost before generation begins. The generator then adds artifacts of its own, such as identity distortion and blurring. Existing fixes either refine artifacts at low resolution or recover resolution under the assumption of natural image degradation, which misses the specific artifacts that generative models produce. The authors identify this gap and propose a solution: reuse the original high-resolution reference at the end of the pipeline to simultaneously upscale the output, recover lost details, and fix generative artifacts.
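The information loss described above can be seen on a toy example. The following is a minimal sketch, not taken from the paper: it assumes a one-dimensional "image row", box-filter downsampling, and nearest-neighbour upsampling, and the function names `downsample` and `upsample_nearest` are hypothetical.

```python
def downsample(signal, factor):
    """Average each block of `factor` samples into one (box downsampling)."""
    return [sum(signal[i:i + factor]) / factor
            for i in range(0, len(signal) - factor + 1, factor)]


def upsample_nearest(signal, factor):
    """Repeat every sample `factor` times (nearest-neighbour upsampling)."""
    return [value for value in signal for _ in range(factor)]


# Toy "reference" row: a flat base of 0.5 plus a fine alternating texture of +/-0.2.
reference = [0.5 + (0.2 if i % 2 == 0 else -0.2) for i in range(16)]

low_res = downsample(reference, 4)        # what the generator would see
restored = upsample_nearest(low_res, 4)   # naive attempt to get back to full size

# Largest per-sample difference between the original and the round-tripped row.
detail_error = max(abs(a - b) for a, b in zip(reference, restored))
print(f"max detail error after round trip: {detail_error:.3f}")
```

In this sketch the alternating texture averages out completely during downsampling, so no upsampler that looks only at `low_res` can bring it back. That is the motivation for returning to the original high-resolution reference.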

## How RefGC-SR² Works

The approach has two main pieces. First, the authors built a real-world dataset for this task, because no such benchmark existed. They created a "diptych-conditioned generator" that synthesizes paired low-quality examples that standard pretrained models can't naturally produce, giving the method something to learn from.

Second, they trained a frequency-aware diffusion transformer model. Diffusion transformers are neural networks that iteratively denoise images; "frequency-aware" means the model handles high-frequency information (fine details) differently from low-frequency information (broad structure). The key insight is that the method selectively pulls fine details from the original high-resolution reference while actively removing the generative artifacts. This differs from naive upsampling or artifact removal alone: it does both in tandem, treating them as a unified problem.
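To make the low-frequency versus high-frequency distinction concrete, here is a minimal sketch of a hand-written band split on a one-dimensional signal. It is an illustration under stated assumptions, not the authors' model: the paper's method is a learned diffusion transformer, whereas this sketch uses a fixed moving-average filter, and the names `low_pass`, `split_bands`, `fuse`, and `detail_weight` are hypothetical. It also does nothing to remove generative artifacts.

```python
def low_pass(signal, radius=1):
    """Moving-average low-pass filter; indices past the ends are clamped."""
    n = len(signal)
    smoothed = []
    for i in range(n):
        window = [signal[min(max(j, 0), n - 1)]
                  for j in range(i - radius, i + radius + 1)]
        smoothed.append(sum(window) / len(window))
    return smoothed


def split_bands(signal, radius=1):
    """Split a signal into (low, high) bands such that low + high == signal."""
    low = low_pass(signal, radius)
    high = [s - l for s, l in zip(signal, low)]
    return low, high


def fuse(generated_up, reference_hr, radius=1, detail_weight=1.0):
    """Keep broad structure from the generated row, add fine detail from the reference."""
    low_generated, _ = split_bands(generated_up, radius)
    _, high_reference = split_bands(reference_hr, radius)
    return [l + detail_weight * h
            for l, h in zip(low_generated, high_reference)]


# Toy rows of equal length: a blurry generated row and a textured reference row.
generated_up = [0.4, 0.4, 0.5, 0.5, 0.6, 0.6]
reference_hr = [0.7, 0.3, 0.7, 0.3, 0.7, 0.3]
refined = fuse(generated_up, reference_hr)
```

Setting `detail_weight` to 0 returns only the smoothed generated row, while 1 transfers the reference's full high band. A learned model can make that choice per region and per frequency, which a fixed filter cannot.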

The model takes three inputs: the low-resolution generated output (which has artifacts), the low-resolution downsampled reference (which preserves structure), and the original high-resolution reference (which has the fine details). It then produces a high-resolution, artifact-free result.
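The three inputs can be pictured as a small container with a consistency check. This is a hypothetical sketch of an interface, not the paper's code: the class `RefinementInputs`, the nested-list `Image` type, the integer scale factor, and the idea that the output is the generated image enlarged by that same factor are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical image type: a grayscale image stored as rows of pixel values.
Image = List[List[float]]


def size(image: Image) -> Tuple[int, int]:
    """Return (height, width) of a non-empty image."""
    return len(image), len(image[0])


@dataclass
class RefinementInputs:
    generated_lr: Image   # low-resolution generated output, with artifacts
    reference_lr: Image   # downsampled reference the generator was given
    reference_hr: Image   # original high-resolution reference

    def scale(self) -> int:
        """Integer factor between the two references (assumed to be uniform)."""
        lr_h, lr_w = size(self.reference_lr)
        hr_h, hr_w = size(self.reference_hr)
        if hr_h % lr_h or hr_w % lr_w or hr_h // lr_h != hr_w // lr_w:
            raise ValueError("reference_hr is not an integer multiple of reference_lr")
        return hr_h // lr_h

    def output_size(self) -> Tuple[int, int]:
        """Size of the refined result, assuming the same scale is applied."""
        gen_h, gen_w = size(self.generated_lr)
        factor = self.scale()
        return gen_h * factor, gen_w * factor


# Usage: a 2x3 generated image refined with a 2x2 -> 8x8 reference pair.
example = RefinementInputs(
    generated_lr=[[0.0] * 3 for _ in range(2)],
    reference_lr=[[0.0] * 2 for _ in range(2)],
    reference_hr=[[0.0] * 8 for _ in range(8)],
)
print(example.scale(), example.output_size())
```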

## Why This Matters

Current pipelines waste information. Users provide crisp reference images, but that quality is thrown away before generation even starts. Closing this gap means better object identity preservation (the generated object looks more like the reference) and sharper, more detailed outputs. The authors show their method beats both reference-guided refinement baselines (which stay at low resolution) and reference-guided super-resolution baselines (which ignore generative artifacts). For practical applications such as e-commerce product customization, design tools, and content creation, this means fewer manual touchups and more usable outputs. The work also establishes the first benchmark for this specific problem, giving future research a shared basis for comparison.
