CTB-2: Do we really need DD (discrete tokens and diffusion)?

The first D: Discrete visual tokens

Many of the multimodal large models (MLLMs) we see today use discrete tokens to represent multimodal data.

  • Text is naturally discrete (see Yann’s claim here).

  • For visual data like images and videos, people initially used simple pooling-based or cross-attention-based visual projectors (like Flamingo, one of the first MLLMs back in 2022) to connect continuous latent visual features with the LLM/decoder. Later, a trend emerged of adding an extra quantization step such as VQ or FSQ to convert the continuous latent visual features into discrete tokens, then feeding those discrete tokens directly to the LLM/decoder.

[Figure: VQ-VAE architecture]
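As a rough illustration of the quantization step, the sketch below maps each continuous latent vector to the index of its nearest codebook entry. The codebook values, the 2-D latents, and the `quantize` helper are made up for illustration; a real VQ-VAE learns the codebook jointly with the encoder and decoder.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(latents, codebook):
    """Map each latent vector to its nearest codebook entry.

    Returns (token_ids, quantized_vectors). Hypothetical helper, not any
    library's API.
    """
    token_ids = [
        min(range(len(codebook)), key=lambda k: squared_distance(z, codebook[k]))
        for z in latents
    ]
    quantized = [codebook[k] for k in token_ids]
    return token_ids, quantized


# Toy 4-entry codebook and three continuous latents (illustrative values).
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
latents = [(0.1, -0.2), (0.9, 0.8), (0.2, 0.7)]

token_ids, quantized = quantize(latents, codebook)
print(token_ids)  # [0, 3, 2]
```

The discrete `token_ids` are what would be fed to the LLM/decoder; the gap between `latents` and `quantized` is the information lost by rounding to the codebook.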

Benefits of using discrete visual tokens include:

  • They share the same discrete token format as text, which makes it easier to apply optimization methods developed for LLMs.

  • Discrete visual tokens can be directly used for next-token prediction tasks, such as autoregressive-style image/video generation.

    • More technically, generation tasks require a probability distribution p(x_t | x_1, x_2, …, x_{t-1}) that is normalized to 1. This is extremely hard to model over continuous values, but much easier over discrete values, where a simple softmax layer suffices.

  • Reducing the dimension of the visual features. Methods like FSQ can even reduce the feature dimension to single digits. This makes computations at the latent level (transformer, diffusion, etc.) much faster and more memory-efficient.
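To make the FSQ idea a little more concrete, here is a simplified sketch: each latent channel is squashed into a bounded range and rounded to one of a few integer levels, so the codebook is implicit and its size is the product of the level counts. This version only handles odd level counts (even counts need an extra offset that is omitted here), and the function names and level choices are assumptions for illustration.

```python
import math


def fsq_quantize(z, levels):
    """Round each channel of z to one of levels[i] integer values.

    Simplified sketch: odd level counts only, so the grid is symmetric
    around zero.
    """
    codes = []
    for value, num_levels in zip(z, levels):
        if num_levels % 2 == 0:
            raise ValueError("this sketch only handles odd level counts")
        half = (num_levels - 1) // 2
        # tanh bounds the value to (-1, 1); scaling maps it to (-half, half).
        codes.append(round(math.tanh(value) * half))
    return codes


def fsq_index(codes, levels):
    """Flatten per-channel codes into one token id (mixed-radix encoding)."""
    index, base = 0, 1
    for code, num_levels in zip(codes, levels):
        half = (num_levels - 1) // 2
        index += (code + half) * base
        base *= num_levels
    return index


levels = [5, 5, 5]                      # illustrative: 3 channels, 5 levels each
codebook_size = math.prod(levels)       # 125 implicit codes
codes = fsq_quantize([0.3, -2.0, 0.0], levels)
token_id = fsq_index(codes, levels)
print(codes, token_id, codebook_size)   # [1, -2, 0] 53 125
```

With only three channels the latent is already "single digit" in dimension, yet it still yields a usable token id for a softmax over 125 classes.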

However, discretizing visual latent features causes information loss (see the experiment results from Du et al., Table 1) and is known to make training harder. Here’s an example:

  • Assume the information entropy of each value in a single image channel is around 3 bits (see ImageNet).

  • Given an encoded image of size 28x28 (8x downsampling from the typical 224 image input size), its entropy is then 28x28x3x3 = 7056 bits.

  • Using a VQ scheme with a vocabulary size of V and a sequence length of L, we need to satisfy L x log2(V) >= 7056.

  • If we use a sequence length L of 256, then log2(V) must be at least 7056/256 ≈ 27.6, so V needs to be roughly 2^27.6 ≈ 2e8! This is huge.

  • Scaling V is inefficient since it sits under a logarithm, and large codebooks are also subject to a vanishing-gradient effect (due to inefficient clustering). Scaling L (the sequence length) increases LLM cost.

  • In contrast, the information entropy of continuous-valued images comes from Gaussian noise, and continuous values can in theory represent an unbounded amount of it.
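The arithmetic in the example above can be checked with a few lines. The sketch simply restates the assumptions from the list (3 bits per value, a 28x28 latent with 3 channels) and shows how the required bits per token change with sequence length; the alternative sequence lengths are illustrative choices, not values from any particular model.

```python
bits_per_value = 3          # assumed entropy per value, as in the example
height = width = 28         # encoded image size
channels = 3

total_bits = height * width * channels * bits_per_value
print(total_bits)           # 7056


def min_bits_per_token(total_bits, seq_len):
    """Smallest log2(V) that satisfies seq_len * log2(V) >= total_bits."""
    return total_bits / seq_len


for seq_len in (256, 1024, 4096):
    bits = min_bits_per_token(total_bits, seq_len)
    # The required vocabulary size is 2 ** bits.
    print(f"L={seq_len}: log2(V) >= {bits:.2f}, V >= {2 ** bits:.3g}")
```

For L = 256 the bound is about 27.6 bits per token, i.e. a vocabulary in the hundreds of millions, while quadrupling L to 1024 brings it down to roughly 7 bits at the price of a 4x longer sequence for the LLM.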

To use a smaller vocabulary size and improve quality, people often reduce the latent patch size (which then further compresses the images) and apply additional losses such as L2, perceptual, and GAN losses (e.g. VQGAN) to make discrete tokens work and compensate for the information loss.

The second D: Diffusion

Diffusion (and its variants like Flow Matching) is one of the most widely adopted methods for image generation. The original diffusion method is applied directly on images; later improvements include latent diffusion to compress features and reduce cost, classifier-free guidance to improve quality, flow matching to ease training and inference, DiT to unify diffusion and transformers, etc.

[Figure: Stable Diffusion 3 Architecture]

One of the most common pipelines is Stable Diffusion 3 (SD3), where the diffusion transformer consumes encoded text tokens, the timestep t, position encodings, and the encoded noisy latent as inputs, then generates denoised latents.
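As a minimal sketch of what such a model is trained on, the snippet below builds one flow-matching training pair, assuming the straight-line (rectified-flow style) interpolation between a clean latent and Gaussian noise that SD3 is usually described with. The helper name and the toy latent are illustrative assumptions; the actual transformer, text conditioning, and timestep sampling are left out.

```python
import random


def flow_matching_pair(x0, t, rng):
    """Build a noisy input and its regression target for one timestep.

    Assumes the linear path x_t = (1 - t) * x0 + t * noise, whose velocity
    along t is (noise - x0). Hypothetical helper for illustration.
    """
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [(1.0 - t) * a + t * e for a, e in zip(x0, noise)]
    velocity_target = [e - a for a, e in zip(x0, noise)]
    return x_t, velocity_target, noise


rng = random.Random(0)
x0 = [0.5, -1.0, 2.0, 0.0]       # toy "clean latent"
t = 0.3                          # timestep in [0, 1]
x_t, velocity_target, noise = flow_matching_pair(x0, t, rng)

# A model v(x_t, t, text) would be trained to regress velocity_target.
# Stepping back along the true velocity recovers the clean latent:
recovered = [xt - t * v for xt, v in zip(x_t, velocity_target)]
print(all(abs(r - a) < 1e-9 for r, a in zip(recovered, x0)))  # True
```

Nothing in this construction cares whether `x0` came from a continuous VAE latent or from embedded discrete tokens.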

Diffusion itself places no constraint on the input format. Both discrete tokens and continuous tokens will work.

On the other hand, there has always been a line of diffusion-free image/video generation that uses autoregressive models. Predicting text tokens one by one is natural, but pixel-by-pixel prediction is not natural in the image space and usually yields compromised performance. Meta’s Transfusion attempts to perform patch-by-patch image prediction in the MLLM transformer, but still requires diffusion to denoise the patches. It also demonstrates that, with a proper setup, we can even feed original-size patches into the LLM/decoder without a downsampling encoder.

No more DD?

For MLLMs that only handle understanding tasks, since discretizing the visual features will almost inevitably lead to information loss, it’s intuitive to remove the quantization step and directly use continuous visual features in the LLM/decoder.

For MLLMs that handle both understanding and generation tasks, when the generation uses diffusion-like methods, we can also drop the quantization step. Recent MLLMs like Meta’s Movie Gen, Meta’s Transfusion, Tencent’s Hunyuan Video, and DeepSeek’s JanusFlow all simply use a VAE to encode and decode continuous visual latent embeddings.

For MLLMs that handle both understanding and generation tasks without any diffusion, models like Emu3 can perform autoregressive next-token prediction by converting visual embeddings to discrete tokens. It’s very straightforward: each 512x512 image/4-frame video is encoded to or decoded from 4096 discrete tokens, which are processed and predicted token by token.
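The token-by-token loop can be sketched as follows. The `toy_logits` function is a hypothetical stand-in for the transformer (it just favors "previous token + 1"), and the tiny vocabulary and sequence length are illustrative; in the Emu3 setting described above the loop would run for 4096 visual tokens.

```python
import math
import random


def softmax(logits):
    """Turn arbitrary scores into a distribution that sums to 1."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_logits(prefix, vocab_size):
    """Hypothetical stand-in for the LLM/decoder: scores for the next token."""
    last = prefix[-1] if prefix else 0
    return [2.0 if tok == (last + 1) % vocab_size else 0.0
            for tok in range(vocab_size)]


def generate(num_tokens, vocab_size, rng):
    """Sample tokens one at a time, each conditioned on all previous ones."""
    tokens = []
    for _ in range(num_tokens):
        probs = softmax(toy_logits(tokens, vocab_size))
        next_token = rng.choices(range(vocab_size), weights=probs, k=1)[0]
        tokens.append(next_token)
    return tokens


rng = random.Random(0)
tokens = generate(num_tokens=16, vocab_size=8, rng=rng)
print(tokens)  # 16 discrete ids, which a VQ decoder would map back to pixels
```

The softmax is what makes the discrete case easy: the predicted distribution is normalized by construction, so sampling the next visual token works exactly like sampling the next text token.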

The last piece is then autoregressive AND quantization-free MLLM understanding and generation. This is a relatively new area; here we introduce two methods.

The first method is an autoregressive model with a diffusion loss. The quantization-free Transfusion and JanusFlow both integrate the diffusion process into the MLLM, i.e. they use the LLM/decoder itself to perform diffusion, and train it with both understanding-task losses and diffusion losses. However, since the diffusion happens in the transformer, denoising is very slow. DeepMind’s MAR introduced a method that drastically reduces the denoising cost: the model autoregressively predicts a condition vector z, and an extremely small and shallow diffusion MLP decodes the noisy token x_t, trained with a diffusion loss on p(x|z).

[Figure: DeepMind’s MAR, diffusion loss]
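A per-token diffusion loss of this kind can be sketched as below, assuming the standard noise-prediction objective: noise the ground-truth token x to x_t, then ask a small network conditioned on z to predict the noise. The `toy_denoiser` is a hypothetical linear stand-in for the small MLP, and the weights and vectors are made-up values.

```python
import math
import random


def noisy_sample(x, alpha_bar, noise):
    """x_t = sqrt(alpha_bar) * x + sqrt(1 - alpha_bar) * noise."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * xi + b * ei for xi, ei in zip(x, noise)]


def toy_denoiser(x_t, t, z, weights):
    """Hypothetical stand-in for the small diffusion MLP: predicts the noise."""
    w_x, w_z, w_t = weights
    return [w_x * xi + w_z * zi + w_t * t for xi, zi in zip(x_t, z)]


def diffusion_loss(x, z, t, alpha_bar, weights, rng):
    """Mean squared error between the true and predicted noise for one token."""
    noise = [rng.gauss(0.0, 1.0) for _ in x]
    x_t = noisy_sample(x, alpha_bar, noise)
    predicted = toy_denoiser(x_t, t, z, weights)
    return sum((e - p) ** 2 for e, p in zip(noise, predicted)) / len(x)


rng = random.Random(0)
x = [0.2, -0.7, 1.1]     # continuous ground-truth token (toy values)
z = [0.5, 0.1, -0.3]     # condition vector from the autoregressive backbone
loss = diffusion_loss(x, z, t=0.4, alpha_bar=0.6, weights=(0.5, 0.2, 0.0), rng=rng)
print(loss >= 0.0)       # True
```

The expensive transformer runs once per token to produce z; only the small denoiser is evaluated repeatedly across the denoising steps, which is where the speedup comes from.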

Another issue with Transfusion’s patch-by-patch prediction is the patch size: larger patches, used to compress the latent embedding, often cause a quality drop, while with smaller patches the raster-scan sampling order is counter-intuitive and loses spatial information (adding 2D RoPE can help, but makes training harder and slower). To tackle this, DeepMind’s MAR proposed a masked, random-order generation scheme that iteratively recovers the image. Their latest paper, FLUID, shows that continuous visual tokens + MAR can outperform SD3 on the GenEval benchmark (0.68 vs 0.70), and that continuous tokens are indeed better than discrete tokens in the autoregressive setup (see its Figure 6).

[Figure: DeepMind’s MAR, masked random order auto regression]
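The masked, random-order idea can be sketched as a schedule: shuffle the token positions, then reveal them in a few steps, predicting several tokens per step. The cosine schedule used here to decide how many tokens stay masked after each step is an assumption borrowed from MaskGIT-style decoding, and the function name and sizes are illustrative.

```python
import math
import random


def masked_generation_order(num_tokens, num_steps, rng):
    """Split a random permutation of positions into per-step groups."""
    order = list(range(num_tokens))
    rng.shuffle(order)

    steps, revealed = [], 0
    for step in range(1, num_steps + 1):
        # Assumed cosine schedule: fraction of tokens still masked after this step.
        mask_ratio = math.cos(math.pi / 2 * step / num_steps)
        target = num_tokens - int(num_tokens * mask_ratio)
        if step == num_steps:
            target = num_tokens          # the final step reveals everything left
        target = max(target, revealed)
        steps.append(order[revealed:target])
        revealed = target
    return steps


rng = random.Random(0)
steps = masked_generation_order(num_tokens=16, num_steps=4, rng=rng)
print([len(group) for group in steps])   # [2, 3, 5, 6]
```

Each group would be predicted in one pass, conditioned on every position revealed so far, so a 16-token latent takes 4 passes instead of 16 raster-scan steps, and no position depends on an arbitrary left-to-right order.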

The second method is diffusion-free multi-step decoding. Unlike MAR’s masked random-order prediction, ByteDance’s VAR proposed a “next-scale” prediction scheme that generates the image step by step, from small size/low resolution to large size/high resolution. For each scale’s image/embeddings of size (h, w), all (i, j) tokens are generated in parallel, conditioned on the previous scales and their position embeddings. Although the paper still adopts VQ and discrete tokens, we can apply the methods described above to replace them with continuous tokens.

[Figure: VAR architecture]
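A small sketch of the bookkeeping behind next-scale prediction: each step emits a full (side x side) token map in parallel, conditioned on all tokens from the coarser scales. The scale list below is an illustrative assumption for a 16x16 final latent, and `next_scale_schedule` is a hypothetical helper, not the paper's code.

```python
def next_scale_schedule(scales):
    """For each scale, record how many tokens are generated and conditioned on."""
    plan, context = [], 0
    for side in scales:
        new_tokens = side * side           # all (i, j) positions, in parallel
        plan.append({
            "side": side,
            "new_tokens": new_tokens,
            "context_tokens": context,     # tokens from all coarser scales
        })
        context += new_tokens
    return plan


scales = (1, 2, 3, 4, 5, 6, 8, 10, 13, 16)   # illustrative coarse-to-fine sizes
plan = next_scale_schedule(scales)
total_tokens = sum(step["new_tokens"] for step in plan)

print(len(plan), total_tokens)               # 10 680
print(plan[-1])   # {'side': 16, 'new_tokens': 256, 'context_tokens': 424}
```

The model handles more tokens in total (680 here versus 256 for the final map alone), but needs only 10 sequential steps rather than 256 raster-scan steps.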

Maybe in the near future, we will see more elegant methods that unify understanding and generation for all modalities in the same format, without the need for quantization or slow diffusion.
