VARestorer: One-Step VAR Distillation
for Real-World Image Super-Resolution

Tsinghua University · ICLR 2026
Interactive before/after comparisons for three scenes (street scene, landscape, corgi portrait). Left: real degraded input; right: VARestorer one-step output.

VARestorer restores severely degraded real-world images into sharp, photorealistic outputs in a single forward step, by distilling a pre-trained text-to-image Visual Autoregressive model.

Highlights: 1 step (one-pass inference) · 10× faster than the VAR baseline · 27.3M trainable params (1.2%) · 0.23 s per 512×512 image

Abstract

Recent advances in visual autoregressive (VAR) models have demonstrated their effectiveness in image generation, highlighting their potential for real-world image super-resolution (Real-ISR). However, adapting VAR to ISR presents critical challenges. The next-scale prediction mechanism, constrained by causal attention, fails to fully exploit global low-quality (LQ) context, resulting in blurry and inconsistent high-quality (HQ) outputs. Additionally, error accumulation during iterative prediction severely degrades coherence on the ISR task. To address these issues, we propose VARestorer, a simple yet effective distillation framework that transforms a pre-trained text-to-image VAR model into a one-step ISR model.

By leveraging distribution matching, our method eliminates the need for iterative refinement, significantly reducing error propagation and inference time. Furthermore, we introduce pyramid image conditioning with cross-scale attention, which enables bidirectional scale-wise interactions and fully exploits the input image information while remaining compatible with the autoregressive mechanism. This prevents later LQ tokens from being overlooked by the transformer. By fine-tuning only 1.2% of the model parameters through parameter-efficient adapters, our method preserves the expressive power of the original VAR model while significantly enhancing efficiency. Extensive experiments show that VARestorer achieves state-of-the-art performance with 72.32 MUSIQ and 0.7669 CLIPIQA on DIV2K-Val, while accelerating inference by 10× compared to conventional VAR inference.

Framework

VARestorer framework

Overview of VARestorer. (a) VARestorer employs a VAR distillation framework for real-ISR. During training, we use a pre-trained text-to-image VAR as the teacher to predict high-quality tokens and minimize a token-level KL divergence against the one-step student model. (b) To fully exploit the LQ input, we introduce cross-scale pyramid conditioning that lets the student apply full attention across features produced by the multi-scale VAE. At inference, VARestorer generates high-quality results in a single step.

Key Ideas

1. One-step VAR Distillation via Distribution Matching

Multi-step VAR inference suffers from severe error accumulation on ISR, where the output must precisely align with the input content. We distill the autoregressive teacher \(F_{\mathcal{T}}\) into a one-step student \(F_{\mathcal{S}}\) by aligning their token-level distributions across all scales with a KL objective \(\mathcal{L}_{\text{KL}}=\sum_k \mathrm{KL}\bigl(p_{\mathcal{T}}(r_k\mid r_{\mathrm{HQ},<k})\,\|\,p_{\mathcal{S}}(\hat r_k\mid r_{\mathrm{LQ}})\bigr)\), combined with perceptual and MSE losses. The student learns a direct one-pass mapping from LQ input to all HQ tokens, eliminating iterative error propagation.
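The scale-wise KL objective above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the paper's actual implementation: the list-of-tensors interface, tensor shapes, and function name are our own assumptions.

```python
import torch
import torch.nn.functional as F

def scale_wise_kl(teacher_logits, student_logits):
    """Token-level KL divergence summed over all scales of the token pyramid.

    teacher_logits / student_logits: lists of tensors, one per scale k,
    each of shape (batch, num_tokens_at_scale_k, vocab_size).
    (Illustrative shapes; the real model conditions the teacher on HQ
    tokens r_{HQ,<k} and the student on the LQ input r_LQ.)
    """
    loss = torch.zeros(())
    for t_k, s_k in zip(teacher_logits, student_logits):
        p_teacher = F.softmax(t_k, dim=-1)          # target distribution p_T
        log_p_student = F.log_softmax(s_k, dim=-1)  # student log-probs p_S
        # KL(p_T || p_S) per token, summed and normalized by batch size
        loss = loss + F.kl_div(log_p_student, p_teacher, reduction="batchmean")
    return loss
```

In the full training objective this term would be combined with perceptual and MSE losses, as described above.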

2. Cross-scale Pyramid Conditioning

A fine-tuned VAE encoder produces a multi-scale token pyramid of the LQ image. We replace VAR's block-wise causal attention with full cross-scale attention, letting coarse structures and fine textures reinforce each other bidirectionally. This preserves the pre-trained VAR's architecture while fully exploiting LQ conditioning signals at every scale.
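The change in attention pattern can be illustrated with boolean masks over the concatenated token pyramid. The sketch below assumes a toy pyramid of 1×1, 2×2, and 3×3 token maps; the function names and mask convention (True = may attend) are our own.

```python
import torch

def blockwise_causal_mask(scale_sizes):
    """Standard VAR attention: tokens at scale k attend only to scales <= k
    (full attention within a scale, causal across scales)."""
    total = sum(scale_sizes)
    mask = torch.zeros(total, total, dtype=torch.bool)
    start = 0
    for n in scale_sizes:
        end = start + n
        mask[start:end, :end] = True  # see own scale and all coarser scales
        start = end
    return mask

def cross_scale_full_mask(scale_sizes):
    """VARestorer's conditioning: full bidirectional attention across every
    scale, so coarse structure and fine texture can inform each other."""
    total = sum(scale_sizes)
    return torch.ones(total, total, dtype=torch.bool)
```

In the block-wise causal mask, a coarse-scale token never sees finer scales, which is exactly the limitation the cross-scale variant removes.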

3. Parameter-efficient Adaptation

We inject LoRA adapters (rank 32) into the cross-/self-attention modules and freeze the rest of the VAR backbone. Only 27.3M parameters (1.2% of the Infinity-2B transformer) are trainable, retaining the generative prior while enabling fast adaptation to the ISR task.
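A hedged sketch of this style of LoRA adaptation is shown below. The `LoRALinear` wrapper, initialization, and scaling are generic LoRA conventions, not the paper's exact adapter code; only the frozen-backbone, rank-32 setup mirrors the description above.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update:
    y = W x + (alpha / rank) * B (A x), with B zero-initialized."""
    def __init__(self, base: nn.Linear, rank: int = 32, alpha: float = 32.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():       # freeze the pre-trained weights
            p.requires_grad_(False)
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.lora_a.T) @ self.lora_b.T

def trainable_fraction(model: nn.Module) -> float:
    """Fraction of parameters that receive gradients."""
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    return trainable / total
```

Because `lora_b` starts at zero, the wrapped layer initially reproduces the frozen backbone exactly, so training starts from the pre-trained generative prior.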

Quantitative Results

Comparison on the synthetic DIV2K-Val and the real-world DrealSR / RealSR benchmarks. The number after each method name denotes its inference steps.

| Dataset | Method | PSNR↑ | SSIM↑ | LPIPS↓ | MANIQA↑ | MUSIQ↑ | NIQE↓ | CLIPIQA↑ | LIQE↑ | QALIGN↑ | FID↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| DIV2K-Val | DiffBIR-50 | 21.48 | 0.5050 | 0.3670 | 0.5664 | 69.87 | 5.003 | 0.7303 | 4.346 | 4.070 | 32.75 |
| | SeeSR-50 | 21.97 | 0.5673 | 0.3193 | 0.5036 | 68.67 | 4.808 | 0.6936 | 4.274 | 4.035 | 25.90 |
| | PASD-20 | 22.31 | 0.5675 | 0.3296 | 0.4371 | 67.78 | 4.581 | 0.6459 | 3.947 | 3.895 | 35.47 |
| | ResShift-15 | 22.66 | 0.5888 | 0.3077 | 0.3693 | 58.90 | 6.916 | 0.5715 | 3.082 | 3.309 | 30.81 |
| | OSEDiff-1 | 22.06 | 0.5735 | 0.2942 | 0.4410 | 67.96 | 4.711 | 0.6680 | 4.117 | 3.926 | 26.34 |
| | SinSR-1 | 22.52 | 0.5680 | 0.3240 | 0.4216 | 62.77 | 6.005 | 0.6483 | 3.493 | 3.553 | 35.45 |
| | VARSR-10 | 22.41 | 0.5724 | 0.3177 | 0.5173 | 71.48 | 5.977 | 0.7330 | 4.282 | 3.853 | 33.86 |
| | VARestorer-1 | 21.08 | 0.5355 | 0.3131 | 0.5590 | 72.32 | 4.410 | 0.7669 | 4.664 | 4.363 | 31.11 |
| DrealSR | DiffBIR-50 | 24.05 | 0.5831 | 0.4669 | 0.5543 | 66.14 | 6.329 | 0.7072 | 4.101 | 3.734 | 180.4 |
| | SeeSR-50 | 25.82 | 0.7405 | 0.3174 | 0.5128 | 65.09 | 6.407 | 0.6905 | 4.126 | 3.754 | 147.3 |
| | PASD-20 | 26.14 | 0.7466 | 0.3081 | 0.4404 | 62.34 | 6.126 | 0.6293 | 3.603 | 3.572 | 164.1 |
| | ResShift-15 | 24.48 | 0.6803 | 0.4169 | 0.3232 | 50.77 | 8.941 | 0.5371 | 2.629 | 2.877 | 159.7 |
| | OSEDiff-1 | 25.85 | 0.7548 | 0.2966 | 0.4657 | 64.69 | 6.464 | 0.6962 | 3.939 | 3.746 | 135.4 |
| | SinSR-1 | 25.83 | 0.7157 | 0.3655 | 0.3901 | 55.64 | 6.953 | 0.6447 | 3.131 | 3.135 | 172.7 |
| | VARSR-10 | 26.05 | 0.7353 | 0.3536 | 0.5361 | 68.14 | 6.971 | 0.7215 | 4.137 | 3.480 | 156.5 |
| | VARestorer-1 | 24.31 | 0.6894 | 0.3584 | 0.5638 | 69.49 | 5.494 | 0.7810 | 4.582 | 4.188 | 149.7 |
| RealSR | DiffBIR-50 | 23.33 | 0.6180 | 0.3650 | 0.5583 | 69.28 | 5.839 | 0.7054 | 4.101 | 3.760 | 130.8 |
| | SeeSR-50 | 23.60 | 0.6947 | 0.3007 | 0.5437 | 69.82 | 5.396 | 0.6696 | 4.136 | 3.789 | 125.4 |
| | PASD-20 | 24.83 | 0.7247 | 0.2709 | 0.4423 | 66.93 | 5.349 | 0.5815 | 3.575 | 3.705 | 131.9 |
| | ResShift-15 | 23.67 | 0.6931 | 0.3451 | 0.3538 | 56.90 | 8.331 | 0.5350 | 2.891 | 3.111 | 129.5 |
| | OSEDiff-1 | 23.59 | 0.7074 | 0.2920 | 0.4716 | 69.08 | 5.652 | 0.6685 | 4.070 | 3.801 | 123.5 |
| | SinSR-1 | 24.50 | 0.7076 | 0.3219 | 0.4045 | 61.07 | 6.319 | 0.6178 | 3.200 | 3.299 | 140.8 |
| | VARSR-10 | 24.12 | 0.7001 | 0.3216 | 0.5465 | 71.16 | 6.063 | 0.7004 | 4.148 | 3.551 | 130.6 |
| | VARestorer-1 | 22.78 | 0.6453 | 0.3249 | 0.5655 | 71.37 | 4.763 | 0.7423 | 4.601 | 4.180 | 117.2 |

VARestorer delivers the strongest perceptual metrics (MANIQA / MUSIQ / NIQE / CLIPIQA / LIQE / QALIGN) across all three datasets with a single inference step, and achieves the best RealSR FID as well.

Parameter & Computation Analysis

| Method | Trainable Params | Inference Time (s) | GFLOPs | MANIQA↑ | MUSIQ↑ |
|---|---|---|---|---|---|
| DiffBIR | 380.0M | 10.27 | 12,117 | 0.5664 | 69.87 |
| SeeSR | 749.9M | 7.18 | 32,928 | 0.5036 | 68.67 |
| PASD | 625.0M | 4.58 | 14,562 | 0.4371 | 67.78 |
| ResShift | 118.6M | 1.13 | 2,745 | 0.3693 | 58.90 |
| OSEDiff | 8.5M | 0.18 | 1,133 | 0.4410 | 67.96 |
| VARSR | 1,101.9M | 0.63 | – | 0.5173 | 71.48 |
| VARestorer | 27.3M | 0.23 | 1,536 | 0.5590 | 72.32 |

With only 27.3M trainable parameters and 0.23 s per 512×512 image, VARestorer runs roughly 3× faster than VARSR (and 10× faster than conventional multi-step VAR inference) and over an order of magnitude cheaper than diffusion baselines, while attaining the best perceptual quality.

Qualitative Comparisons

Qualitative comparisons
Real-world cases. VARestorer recovers sharp petal structures, natural rock and water textures, and clean boundaries with fewer artifacts than diffusion- and GAN-based baselines, despite using only 1 inference step.
More qualitative comparisons
More qualitative comparisons. Across metal grates, animal fur, wet petal textures and stone carvings, VARestorer produces consistently sharper structures and richer high-frequency detail.

Ablation Studies

Ablation visual
Removing distillation, cross-scale attention, or the KL distribution matching loss each visibly degrades texture fidelity and structural consistency.
| Method | LPIPS↓ | MUSIQ↑ | NIQE↓ | CLIPIQA↑ |
|---|---|---|---|---|
| w/o distillation | 0.3723 | 62.22 | 6.283 | 0.4794 |
| w/o cross-scale attention | 0.4224 | 63.72 | 6.029 | 0.3910 |
| w/o ℒKL | 0.3214 | 69.73 | 4.372 | 0.6682 |
| VARestorer (full) | 0.3131 | 72.32 | 4.410 | 0.7669 |

Each of the three components—one-step distillation, cross-scale attention and KL distribution matching—contributes meaningfully to the final performance on DIV2K-Val.

Balanced Reconstruction Across Frequencies

Frequency analysis
VARestorer faithfully reconstructs both high-frequency details (top: flower petals) and low-frequency structures (bottom: desert dunes and sea). Each pair shows the low-quality input on the left and our restored result on the right.

Generalization to More Tasks

Deraining and low-light enhancement
Deraining & low-light enhancement. With light fine-tuning, VARestorer transfers its generative prior to tasks beyond blind super-resolution.
JPEG and heavy compression
Heavy compression, social media and motion blur. VARestorer removes JPEG artifacts and recovers clean textures under diverse in-the-wild degradations.
Salt & Pepper noise removal
Unseen degradations. Though salt-and-pepper noise is absent from our training pipeline, the distilled generative prior still produces clean outputs.
High-resolution restoration
High-resolution restoration. VARestorer scales to 1024×1024 via fine-tuning and to 1800×1200 via tiled inference, preserving fine textures at high resolution.
Real-world restoration showcase
In-the-wild restoration showcase. Real low-quality inputs (top row) and VARestorer outputs (bottom row): cars behind blur, landscapes through fog, and heavily compressed pet photos.

BibTeX

@inproceedings{zhu2026varestorer,
  title     = {VARestorer: One-Step VAR Distillation for Real-World Image Super-Resolution},
  author    = {Zhu, Yixuan and Ma, Shilin and Wang, Haolin and Li, Ao and
               Jing, Yanzhe and Tang, Yansong and Chen, Lei and Lu, Jiwen and Zhou, Jie},
  booktitle = {International Conference on Learning Representations (ICLR)},
  year      = {2026},
  url       = {https://openreview.net/forum?id=T2Oihh7zN8}
}