The new-generation visual generation paradigm VAR (Visual AutoRegressive modeling) has arrived! It enables GPT-style autoregressive models to outperform diffusion models in image generation for the first time, and demonstrates Scaling Laws and zero-shot task generalization similar to those of large language models:
Paper Title: Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction
This new work, named VAR, was proposed by researchers from Peking University and ByteDance; it has topped the trending lists on GitHub and Papers with Code and drawn widespread attention from peers:
Currently, the demo website, paper, code, and model have been released:
- Demo Website: https://var.vision/
- Paper Link: https://arxiv.org/abs/2404.02905
- Open-Source Code: https://github.com/FoundationVision/VAR
- Open-Source Model: https://huggingface.co/FoundationVision/var
Background Introduction
In natural language processing, autoregressive models represented by large language models such as the GPT and LLaMA series have achieved great success. In particular, their Scaling Laws and zero-shot task generalization are outstanding, initially demonstrating the potential to lead toward Artificial General Intelligence (AGI).
However, in image generation, autoregressive models have lagged far behind diffusion models. Recent widely discussed models such as DALL-E 3, Stable Diffusion 3, and Sora all belong to the diffusion family. Moreover, whether Scaling Laws exist in visual generation remains unknown: that is, it has yet to be explored whether the test-set cross-entropy loss shows a predictable power-law decrease as model scale or training compute increases.
The powerful capabilities and Scaling Laws of GPT-style autoregressive models seem to be "locked" in the field of image generation:
Autoregressive models lag behind a host of Diffusion models on generation performance leaderboards
Aiming to "unlock" the capabilities of autoregressive models and their Scaling Laws, the research team started from the intrinsic nature of the image modality, imitated the logical sequence of human image processing, and proposed a brand-new "visual autoregressive" generation paradigm: VAR, Visual AutoRegressive Modeling. This is the first time that GPT-style autoregressive visual generation has outperformed Diffusion models in terms of effect, speed, and scaling capabilities, and has established the Scaling Laws in the visual generation field:
Core of VAR Method: Imitate Human Vision, Redefine Image Autoregressive Order
When humans perceive an image or create a painting, they usually take in the global structure first and then dive into the details. This coarse-to-fine idea of grasping the whole before refining the parts is very natural:
Coarse-to-fine logical sequence of human image perception (left) and painting creation (right)
However, traditional image autoregressive (AR) models use an order that does not align with human intuition (though it suits computer processing): a top-to-bottom, row-by-row raster-scan order that predicts image tokens one by one:
VAR, on the other hand, is "people-oriented", imitating the logical sequence of human perception or image creation, and uses a multi-scale order from whole to details to gradually generate token maps:
Besides being more natural and aligned with human intuition, another major advantage of VAR is a substantial speedup in generation: within each autoregressive step (i.e., within each scale), all image tokens are generated in parallel at once, while generation across scales remains autoregressive. This makes VAR dozens of times faster than traditional AR at comparable parameter counts and image sizes. The authors also observed in experiments that VAR shows stronger performance and scaling behavior than AR.
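The speed argument can be made concrete with a step-count comparison. The sketch below contrasts raster-scan AR with next-scale prediction; the coarse-to-fine scale schedule is a hypothetical example chosen only to match the 10-step count mentioned later, not the released model's configuration:

```python
def raster_ar_steps(height: int, width: int) -> int:
    """Traditional AR: one sequential step per token, row by row."""
    return height * width


def var_steps(scales: list) -> int:
    """VAR: one sequential step per scale; all tokens within a scale
    are generated in parallel in that step."""
    return len(scales)


def var_tokens(scales: list) -> int:
    """Total tokens produced across all s x s token maps."""
    return sum(s * s for s in scales)


if __name__ == "__main__":
    # Hypothetical coarse-to-fine schedule ending at a 16x16 token map.
    scales = [1, 2, 3, 4, 5, 6, 8, 10, 13, 16]
    print(raster_ar_steps(16, 16))  # 256 sequential steps for raster AR
    print(var_steps(scales))        # 10 sequential steps for VAR
```

Even though VAR emits more tokens in total (every scale's map), the number of sequential forward passes, which dominates wall-clock latency, collapses from hundreds to around ten.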
VAR Method Details: Two-Stage Training
VAR trains a multi-scale quantized autoencoder (Multi-scale VQVAE) in the first stage, and trains an autoregressive Transformer consistent with the GPT-2 architecture (combined with AdaLN) in the second stage.
As shown in the left image, the forward pass details of VQVAE training are as follows:
- Discrete Encoding: The encoder converts the image into K discrete token maps R = (r1, r2, ..., rK), ordered from the smallest to the largest resolution
- Continuization: r1 through rK are converted into continuous feature maps via an embedding layer, each interpolated up to the resolution of rK, and then summed
- Continuous Decoding: The summed feature map is passed through the decoder to reconstruct the image, trained with a mixture of three losses: reconstruction, perceptual, and adversarial
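The interpolate-and-sum step above can be sketched in a few lines of numpy. Nearest-neighbor upsampling, the channel count, and the function names here are illustrative simplifications, not the released VQVAE implementation:

```python
import numpy as np


def upsample_nearest(fmap: np.ndarray, size: int) -> np.ndarray:
    """Nearest-neighbor interpolation of a (C, h, w) feature map
    up to (C, size, size)."""
    c, h, w = fmap.shape
    rows = np.arange(size) * h // size  # source row index per target row
    cols = np.arange(size) * w // size  # source col index per target col
    return fmap[:, rows][:, :, cols]


def fuse_scales(feature_maps: list) -> np.ndarray:
    """Interpolate every scale's (embedded) feature map to the largest
    resolution and sum them, as in the VQVAE's continuization step."""
    target = feature_maps[-1].shape[-1]
    out = np.zeros_like(feature_maps[-1], dtype=float)
    for m in feature_maps:
        out += upsample_nearest(m, target)
    return out
```

The summed map is what the decoder consumes; each scale contributes a progressively finer correction on top of the coarser ones.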
As shown in the right image, after the VQVAE training is completed, the second-stage autoregressive Transformer training will be carried out:
- The first autoregressive step predicts the initial 1x1 token map from the start token [S]
- In each subsequent step, VAR predicts the next, larger-scale token map conditioned on all previously generated token maps
- During training, VAR supervises the probability predictions of these token maps with the standard cross-entropy loss
- During inference, the sampled token maps are converted to continuous form, interpolated and summed, and passed through the VQVAE decoder to obtain the final generated image
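The inference rules above can be sketched as a loop. The transformer below is a random stand-in, and the vocabulary size and greedy decoding are illustrative assumptions (real VAR samples from the predicted distribution):

```python
import numpy as np

VOCAB = 4096  # hypothetical codebook size, not the paper's actual value
rng = np.random.default_rng(0)


def dummy_transformer(history: list, num_positions: int) -> np.ndarray:
    """Stand-in for the VAR Transformer: a single forward pass returns
    logits for every position of the next scale at once (intra-scale
    parallelism). `history` would condition a real model; it is unused here."""
    return rng.standard_normal((num_positions, VOCAB))


def generate(scales: list) -> list:
    """One autoregressive step per scale, conditioned on all earlier maps."""
    history, token_maps = [], []
    for s in scales:
        logits = dummy_transformer(history, s * s)
        tokens = logits.argmax(axis=-1).reshape(s, s)  # greedy for simplicity
        token_maps.append(tokens)
        history.extend(tokens.ravel().tolist())  # feed back as context
    return token_maps
```

The returned token maps would then go through the VQVAE's embed-interpolate-sum-decode path to produce the image.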
The authors note that while VAR's autoregressive framework is brand new, its specific techniques absorb the strengths of a series of classic works: the residual quantization design of RQ-VAE, AdaLN from StyleGAN and DiT, and the progressive training of PGGAN. VAR thus stands on the shoulders of giants and focuses its innovation on the autoregressive algorithm itself.
Experimental Performance Comparison
VAR conducted experiments on class-conditional ImageNet 256x256 and 512x512 generation:
- VAR has greatly improved the performance of AR, reversing the situation where AR lagged behind Diffusion models
- VAR only requires 10 autoregressive steps, and its generation speed far exceeds that of AR and Diffusion models, even approaching the high efficiency of GANs
- By scaling up VAR to 2B/3B parameters, VAR has reached SOTA level, demonstrating a brand-new and promising generative model family.
Notably, when compared against the Diffusion Transformer (DiT), the architecture underlying Sora and Stable Diffusion 3, VAR demonstrates:
- Better Performance: After scaling up, VAR finally reaches FID=1.80, approaching the theoretical FID lower bound of 1.78 (ImageNet validation set), which is significantly better than DiT's optimal score of 2.10
- Faster Speed: VAR only takes less than 0.3 seconds to generate a 256x256 image, which is 45 times faster than DiT; on 512x512 images, it is even 81 times faster than DiT
- Better Scaling Capability: As shown in the left image, large DiT models show saturation after scaling to 3B and 7B parameters, and cannot get closer to the FID lower bound; while VAR's performance continues to improve as it scales to 2 billion parameters, finally reaching the FID lower bound
- More Efficient Data Utilization: VAR only needs 350 training epochs to outperform DiT trained for 1400 epochs
These evidences of being more efficient, faster, and more scalable than DiT bring more possibilities for the foundational architecture path of the new generation of visual generation.
Scaling Law Experiments
Scaling laws can be called the "crown jewel" of large language models. Research has confirmed that as autoregressive large language models are scaled up, the test-set cross-entropy loss L decreases predictably as the model parameter count N, training token count T, and computational cost Cmin increase, following a power law.
Scaling laws not only make it possible to predict a large model's performance from small models, saving compute and guiding resource allocation, but also reflect the strong learning ability of autoregressive models: test-set performance keeps improving as N, T, and Cmin grow.
Through experiments, the researchers observed that VAR exhibits power-law Scaling Laws almost identical to those of LLMs. They trained 12 models of different sizes, scaling parameter counts from 18 million to 2 billion, with total compute spanning 6 orders of magnitude and a maximum total token count of 305 billion. A smooth power-law relationship holds between the test-set loss L (or error rate) and N, and between L and Cmin, with good fits:
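The power-law form of such a fit can be illustrated in a few lines. The sketch below generates noise-free losses from a made-up law L = a * N^b (the constants are illustrative, not the paper's fitted values) and recovers the exponent by linear regression in log-log space, which is how scaling-law curves are typically fitted:

```python
import numpy as np

# Illustrative parameter counts (18M to 2B, roughly matching the range
# reported in the article) and a made-up power law L = a * N^b.
N = np.array([18e6, 50e6, 150e6, 450e6, 1.0e9, 2.0e9])
a_true, b_true = 15.0, -0.08
L = a_true * N ** b_true

# log L = b * log N + log a, so an ordinary least-squares line in
# log-log space recovers the exponent b and prefactor a.
b_fit, log_a_fit = np.polyfit(np.log(N), np.log(L), 1)
a_fit = np.exp(log_a_fit)
```

With real training runs the points carry noise, but a straight line in log-log coordinates with a stable slope is exactly the "smooth power-law relationship" the experiments report.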
As model parameters and compute are scaled up, the model's generation quality visibly improves step by step (as the sample strips below show):
Zero-shot Experiments
Thanks to a useful property of autoregressive models, namely that teacher forcing can keep some tokens fixed while the model predicts the rest, VAR also exhibits a degree of zero-shot task generalization. The VAR Transformer trained on class-conditional generation can generalize, without any fine-tuning, to tasks such as image inpainting, image outpainting, and class-conditional editing, achieving reasonable results:
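Under this teacher-forcing view, inpainting reduces to masking: known tokens are forced to their ground-truth values at every step, and only the unknown positions take the model's samples. A minimal numpy sketch, with an illustrative function name and shapes not taken from the released code:

```python
import numpy as np


def inpaint_tokens(ground_truth: np.ndarray,
                   keep_mask: np.ndarray,
                   sampled: np.ndarray) -> np.ndarray:
    """Zero-shot inpainting sketch: positions where keep_mask is True are
    teacher-forced to the ground-truth tokens; the model's sampled tokens
    fill the remaining (masked-out) positions."""
    return np.where(keep_mask, ground_truth, sampled)
```

Applied per scale during generation, this keeps the visible region of the image pinned while the model freely synthesizes the hole, with no fine-tuning required.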
Conclusion
VAR provides a brand-new perspective on how to define the autoregressive order of images, that is, a coarse-to-fine order from global outlines to local fine-tuning. While aligning with human intuition, this autoregressive algorithm brings excellent results: VAR significantly improves the speed and generation quality of autoregressive models, and for the first time, autoregressive models have outperformed diffusion models in multiple aspects. At the same time, VAR exhibits Scaling Laws and Zero-shot Generalizability similar to LLMs. The authors hope that VAR's ideas, experimental conclusions, and open-source release can contribute to the community's exploration of the autoregressive paradigm in image generation, and promote the development of future unified multimodal algorithms based on autoregression.
Job Openings
The ByteDance Commercialization-GenAI team focuses on developing advanced generative AI technologies and building industry-leading solutions for text, images, and video. By leveraging generative AI to automate creative workflows, the team improves creative efficiency and delivers value for advertisers, institutions, and creators.
More positions in the visual generation and LLM directions are open on the team; you are welcome to follow ByteDance's recruitment channels.
This is a discussion topic separated from the original thread at https://juejin.cn/post/7358289528551391268