CTB-1: Vision Language Models in 2024

1 Introduction

Vision language models (VLMs) are generative models that learn simultaneously from images and text to tackle a wide range of tasks.

[Figure: a VLM example, source]

VLMs are broadly defined as multimodal models that can learn from images and text. They are a type of generative model that takes image and text inputs and generates text (and/or image) outputs. Large vision language models have good zero-shot capabilities, generalize well, and can work with many types of images, including documents, web pages, and more [source].

There have been plenty of survey papers on VLMs [survey1, survey2]. In this blog, we will skip the basics and move directly to the latest research trends in 2024.

2 Multimodal Design

Generally speaking, there are two main types of designs for VLMs [source]:

Type A: multimodal LLM, or MLLM (visual encoder -> multimodal projector -> LLM / text decoder)

  • First we encode the image or video with an encoder. The visual encoder is often pre-trained, such as with SimCLR, CLIP or SigLIP.

  • Then we pass the compressed vision representations to a multimodal projector to align them with text representations.

  • Finally, the projected vision tokens, often together with text tokens, will be processed by the pre-trained LLM to generate text.

Type A VLMs are usually easier to train, since they utilize a pre-trained visual encoder and LLM. Their main focus is to align the visual features with the text features, and they rely mostly on the LLM to understand the compressed vision features. However, the text-based LLM cannot easily generate multimodal outputs without additional designs and training.
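As a rough illustration of the Type A data flow (a minimal sketch, not any specific model's implementation; the encoder, projector dimensions, and LLM interface here are all placeholder assumptions):

```python
import torch
import torch.nn as nn

class TypeAVLM(nn.Module):
    """Sketch of a Type A (MLLM) forward pass: visual encoder -> projector -> LLM.
    All components and dimensions are illustrative placeholders."""

    def __init__(self, vision_encoder, llm, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.vision_encoder = vision_encoder   # e.g. a frozen CLIP/SigLIP ViT
        self.projector = nn.Sequential(        # aligns vision features with the LLM embedding space
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )
        self.llm = llm                         # a pre-trained decoder-only LLM

    def forward(self, pixel_values, text_embeds):
        # (B, N_patches, vision_dim) patch features from the visual encoder
        vision_feats = self.vision_encoder(pixel_values)
        # project them into "soft" vision tokens living in the LLM embedding space
        vision_tokens = self.projector(vision_feats)               # (B, N_patches, llm_dim)
        # prepend vision tokens to the text embeddings and decode as usual
        inputs_embeds = torch.cat([vision_tokens, text_embeds], dim=1)
        return self.llm(inputs_embeds=inputs_embeds)               # HF-style call, assumed interface
```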

Type B: multimodal tokenized transformer (discrete tokenizer / encoder -> multimodal transformer / decoder)

  • First we use a discrete tokenizer, such as a VQ-VAE, to directly encode the image or video into a sequence of tokens (see the sketch after this list).

  • Then we train a multimodal transformer to process the vision (and text) tokens.

  • With proper decoding and/or diffusion designs, we can directly generate multimodal outputs.
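As a rough sketch of the discrete tokenization step referenced above (nearest-codebook lookup only; real VQ-VAE training also involves commitment losses and codebook updates, which are omitted here):

```python
import torch

def vq_tokenize(latents, codebook):
    """Map continuous encoder latents to discrete token ids via nearest codebook entry.
    latents:  (B, N, D) continuous features from the image encoder
    codebook: (K, D) learned embedding table
    Returns (B, N) integer token ids."""
    distances = torch.cdist(latents, codebook.unsqueeze(0).expand(latents.size(0), -1, -1))  # (B, N, K)
    return distances.argmin(dim=-1)  # pick the closest codebook entry per latent position

# toy usage
latents = torch.randn(2, 16, 64)    # 2 images, 16 latent positions, 64-dim features
codebook = torch.randn(512, 64)     # 512 discrete visual "words"
print(vq_tokenize(latents, codebook).shape)  # torch.Size([2, 16])
```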

Type B VLMs natively support multimodal output generation. However, they usually need to be trained from scratch and are slower and harder to train.

  • A typical example is the closed-source Gemini-1.5 (its open-source counterpart, PaliGemma, follows the Type A design using a pre-trained Gemma-2B).

  • Another example is Transfusion, which follows a VAE encoder -> Transformer -> VAE decoder design. The image tokens are patch-based, and within the transformer the attention mask is full (bidirectional) across the patches of an image, while attention across different input sequences (and within each text sequence) remains causal, as sketched below.
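As an illustration of that mixed masking scheme (a toy sketch based on the description above, not Transfusion's actual implementation), one can start from a causal mask and open up bidirectional attention within each image's patch span:

```python
import torch

def mixed_attention_mask(seq_len, image_spans):
    """Causal mask over the whole sequence, with full (bidirectional) attention
    inside each image patch span. image_spans is a list of (start, end) indices.
    Returns a boolean (seq_len, seq_len) mask where True = attention allowed."""
    mask = torch.tril(torch.ones(seq_len, seq_len)).bool()  # causal by default
    for start, end in image_spans:
        mask[start:end, start:end] = True  # patches of one image attend to each other freely
    return mask

# toy example: 4 text tokens, one image occupying positions 4..7, then 2 more text tokens
print(mixed_attention_mask(10, [(4, 8)]).int())
```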

In this blog, we will focus mostly on the Type A VLMs, since the research community rarely trains the multimodal transformer from scratch. Below we also show some recent research papers that attempt to unify both designs (e.g. VILA-U, Libra).

3 Architecture

[Figure: a typical MLLM design, source]

Here is a typical MLLM design. With the default setup, in the pre-training stage we only train the projector (also called the connector). In the post-training stage, we sometimes unfreeze the LLM and/or the visual encoder to improve performance on instruction following. In Llama3.2, the LLM is kept frozen even in the post-training stage to preserve text-only performance; in MM1.5, both the LLM and the visual encoder are unfrozen. We will dive into the training details in the training sections below.
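In PyTorch terms (a generic sketch, assuming a model object with `vision_encoder`, `projector`, and `llm` submodules as in the Type A sketch earlier, not any specific codebase), these choices amount to toggling `requires_grad` per component:

```python
import torch

def set_trainable(module, trainable):
    """Freeze or unfreeze all parameters of a submodule."""
    for p in module.parameters():
        p.requires_grad = trainable

# pre-training (alignment) stage: train only the projector
set_trainable(model.vision_encoder, False)
set_trainable(model.llm, False)
set_trainable(model.projector, True)

# post-training (SFT) stage: optionally unfreeze the LLM and/or the visual encoder;
# keeping the LLM frozen (Llama3.2-style) preserves text-only performance,
# unfreezing it (MM1.5-style) can improve instruction following.
set_trainable(model.llm, True)
set_trainable(model.vision_encoder, True)

# the optimizer only sees whatever is currently trainable
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)
```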

3.1 Visual Encoder

The visual encoder, as experiments in MM1 and LLaVa-Next suggest, primarily serves as a semi-lossless feature compressor. Scaling the encoder size does not have a significant impact on MLLM performance as long as the compressed features remain semi-lossless. However, scaling the image resolution is critical.

One of the main challenges for the visual encoder is supporting images with high or varying resolution and text-rich content. With a ViT-based CLIP visual encoder, the image resolution is usually 224x224 or 336x336, which cannot support large images. The last layer of the ViT, often used as the vision features, is already a large tensor (576 or 768), making it hard to support multiple images. (There are also other challenges for video-primary encoders, which are beyond the scope of this blog; we may revisit them in a future post.)
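To make the token-count pressure concrete (a simple back-of-the-envelope calculation for a ViT with 14x14 patches; the exact numbers depend on the encoder):

```python
def num_patch_tokens(height, width, patch=14):
    """Number of ViT patch tokens for a given input resolution (CLS token excluded)."""
    return (height // patch) * (width // patch)

print(num_patch_tokens(336, 336))   # 576 tokens for a single 336x336 image
print(num_patch_tokens(672, 672))   # 2304 -- doubling the resolution quadruples the tokens
```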

Let’s take a look at how recent research papers tackle this issue:

[Figure: LLaVa-OneVision visual representations]

  • MiniCPM-V adaptively partitions the original image into separate parts (1x6, 2x3, …, 4x2, 8x1) and adopts 2D positional embedding interpolation to support image partitions.

  • LLaVa-OneVision first down-samples the original image into the visual encoder input shape, then also partitions the original images into different crops, with bilinear interpolation to reduce input size.

  • InternVL1.5 adopts a very similar method to LLaVa-OneVision, with PixelShuffle to reduce the number of visual tokens (see the sketch after this list).

  • MM1.5 also adopts a very similar method to LLaVa-OneVision, with special treatments on image padding to generate 378×378 image patches with minimal resolution loss.

  • Qwen2-VL supports larger image inputs with an additional MLP after the ViT to reduce the number of visual tokens, and uses 2D RoPE as the positional encoding.

  • An earlier paper, NaViT, also summarizes how to patch and pack image inputs of large and varying sizes.
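For reference, a generic pixel-shuffle-style token reduction (an illustration of the idea, not InternVL1.5's exact implementation) folds spatially adjacent tokens into the channel dimension, so the token count shrinks by the square of the shuffle factor:

```python
import torch

def pixel_shuffle_tokens(vision_tokens, grid, factor=2):
    """Reduce the number of visual tokens by folding factor x factor neighborhoods
    into the channel dimension.
    vision_tokens: (B, grid*grid, D) patch features laid out on a grid x grid map
    Returns (B, (grid//factor)**2, D*factor*factor)."""
    b, n, d = vision_tokens.shape
    x = vision_tokens.view(b, grid, grid, d)
    x = x.view(b, grid // factor, factor, grid // factor, factor, d)
    x = x.permute(0, 1, 3, 2, 4, 5).reshape(b, (grid // factor) ** 2, d * factor * factor)
    return x

tokens = torch.randn(1, 24 * 24, 1024)              # e.g. 576 tokens from a 336x336 input
print(pixel_shuffle_tokens(tokens, grid=24).shape)  # torch.Size([1, 144, 4096])
```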

Note that different methods focus on different aspects of the challenge, and therefore have pros and cons on different eval tasks. Ultimately, they all aim to capture as much information from the original image as possible without drastically increasing the feature or token count.

Some additional techniques can also be applied to improve text-rich understanding without increasing train-time and/or inference-time model complexity, such as using multiple small vision encoders or mixture-of-experts (MoEs) to improve multitask performance (e.g. BRAVE), mixing convolutions with transformers (e.g. ViTamin), etc.

[Figure: ROSS model design]

A recent paper, ROSS, also proposes a method to apply extra reconstructive visual supervision to help preserve the visual context in the LLM.

3.2 Multimodal Projector

The multimodal projector typically processes the last feature layer of the visual encoder (e.g. last layer of transformer without the CLS token) and generates 1D token-like features for the LLM to consume. Common techniques include:

  • Pooling: apply average or attention pooling directly on the visual features, e.g. MM1

  • A small NN with MLPs or convolutions: learn to adapt modalities, often easier to train, e.g. Honeybee (C/D-Abstractor), MM1, LLaVa-OneVision, InternVL1.5

  • Resampling with cross-attention: combined with 2D position encoding, it learns the domain adaptation and also helps compress the token count (for multi-frame or large-image support), e.g. MiniCPM-V (a minimal resampler sketch follows this list)
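A minimal cross-attention resampler might look like the following (a generic Perceiver/Q-Former-style sketch with made-up dimensions, not any specific model's module):

```python
import torch
import torch.nn as nn

class CrossAttentionResampler(nn.Module):
    """Compress N visual tokens into a fixed number of query tokens via cross-attention."""
    def __init__(self, vision_dim=1024, llm_dim=4096, num_queries=64, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, llm_dim) * 0.02)  # learned latent queries
        self.kv_proj = nn.Linear(vision_dim, llm_dim)                          # lift vision features to LLM width
        self.attn = nn.MultiheadAttention(llm_dim, num_heads, batch_first=True)

    def forward(self, vision_feats):                       # (B, N, vision_dim)
        kv = self.kv_proj(vision_feats)                    # (B, N, llm_dim)
        q = self.queries.unsqueeze(0).expand(vision_feats.size(0), -1, -1)
        out, _ = self.attn(q, kv, kv)                      # (B, num_queries, llm_dim)
        return out

# usage: 576 visual tokens compressed into 64 LLM-ready tokens
feats = torch.randn(2, 576, 1024)
print(CrossAttentionResampler()(feats).shape)              # torch.Size([2, 64, 4096])
```

The learned queries act as a fixed-length bottleneck, so the LLM sees the same number of vision tokens per image regardless of the input resolution or frame count.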

Papers like MM1 and LLaVa-OneVision claim that the multimodal projector is less critical to overall model performance than other architectural factors. Still, as other parts of the model scale up, the multimodal projector may also need adjustments; we may see this trend play out next year.

3.3 LLM / Text Decoder

Since most research papers leverage pre-trained LLMs to reduce training cost, such as Llama3.2, Mixtral-8x7B, Qwen2.5 and their variants, we won’t dive into the details of LLM design. To reiterate, scaling the LLM is still critical to model performance, because the visual encoder and projector essentially convert multimodal inputs into text-like tokens, while the LLM does the core understanding and generation.

3.4 Visual Decoder, or Maybe yet another Multimodal Transformer?

To further generate multimodal outputs, a naive solution is to append a separate text-to-image decoder (diffusion transformer, consistency model, etc) after the MLLM, such as DALL·E 3. Some recent work also proposed methods to integrate the visual decoder into the end-to-end MLLM pipeline:

[Figure: Janus model, the “transformer” here is still a LLM (DeepSeek-LLM 1.3B)]

  • Janus directly adds an image decoder to the LLM next-token prediction output. It also has two separate visual encoders, one focusing on understanding and the other focusing on generation.

[Figure: VisionLLM v2 model]

  • VisionLLM v2 (used in InternVL2) adds a trainable “super-link” to bridge the LLM and task-specific decoders. The decoders also require an additional fine-tuning round to gain diverse capabilities for visual tasks while maintaining effectiveness in global vision understanding.

[Figure: VILA-U model]

  • VILA-U uses a VQ-VAE discrete tokenizer to encode visual inputs, and adds an RQ-VAE visual decoder. Although it uses the LLaMA-2-7B LLM as the multimodal transformer, its architectural design is very similar to the Type B VLMs we introduced above.

[Figure: Libra model]

  • Libra adopts a similar architecture, with a hybrid visual encoder that generates both continuous signals and discrete IDs for the visual input, and modifies each LLM layer with a “routed visual expert module” that applies cross-attention between vision and language.

We can see that as the visual decoder fits more smoothly into the multimodal model, the gap between Type A and Type B VLMs is getting smaller. With post-training on the LLM and hybrid visual tokens, they may even be unified into one monolithic design.

4 Pre-training

[Figure: common pre-training datasets, source]

As studied extensively in MM1 and MM1.5, pre-training a VLM (assuming not freezing any component) usually needs

  • billions of image-text pairs

  • billions of interleaved image-text samples (e.g. Flamingo, for better multimodal understanding and better few-shot and text-only performance)

  • trillions of text-only tokens

  • [optionally] synthetic data

A typical data sampling ratio is 5:1:4, roughly half-and-half for image-centric and text-centric data. A typical batch size is 256 or 512.
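As a toy illustration of such a mixture (assuming the 5:1:4 ratio above maps to image-text pairs, interleaved data, and text-only data; the source names below are placeholders):

```python
import random

# hypothetical data sources and the 5:1:4 mixture ratio from above
sources = ["image_text_pairs", "interleaved", "text_only"]
weights = [5, 1, 4]

def sample_batch_sources(batch_size=256):
    """Pick which source each example in a batch is drawn from, following the mixture ratio."""
    return random.choices(sources, weights=weights, k=batch_size)

batch = sample_batch_sources()
print({s: batch.count(s) for s in sources})   # roughly 50% / 10% / 40%
```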

For models that use a pre-trained visual encoder and LLM, both are usually frozen at the beginning (sometimes called the warm-up stage or alignment stage), and only a proper multimodal projector is trained, with millions to hundreds of millions of low-resolution, low-to-mid-quality image-text pairs.

After training the multimodal projector, researchers also find it helpful to add a second pre-training stage (continual pre-training or high-quality knowledge learning), with millions to hundreds of millions of high-resolution, mid-quality, and/or text-rich image-text pairs to boost image understanding. Multilingual captioning data can also be added to improve the model’s performance across languages (e.g. LLaVa-OneVision, MiniCPM-V). Filtering is critical here to ensure the data quality is better than that of the original large-scale internet-mined datasets.

Unless high-quality text-centric data is included, the LLM is commonly kept frozen in the pre-training stage.

Besides the standard next-token prediction loss, reconstruction loss (on the visual encoder, as in AIM) and contrastive loss (for image-text alignment) are also commonly used in model training.

5 Post-training

[Figure: common fine-tuning datasets, source]

Data quality and variety are essential in the supervised fine-tuning (SFT) stage for model post-training.

  • MM1.5 mixes high quality single-frame image, multi-frame image, and text-only data, focusing on text-rich, refer-and-ground, general knowledge, math, and code categories.

  • LLaVa-OneVision and MiniCPM-V adopt a similar SFT scheme to MM1.5 with millions of data samples. They also incorporate additional multilingual data and increase the maximum number of input image patches to support high-resolution images.

In the SFT stage, some models (e.g. Llama3.2) freeze the LLM to avoid regressions on text-centric or text-only tasks, while others (e.g. MM1.5, MiniCPM-V, LLaVa-OneVision) claim fine-tuning the LLM improves its ability to understand visual features and follow instructions. Similarly, freezing or unfreezing (parts of) the visual encoder also has pros and cons. With different model setups, data amounts, and data distributions, there may not be a single optimal post-training strategy, and decisions have to be made case by case (or, equivalently, scaling other aspects is probably more critical to model performance).

[Figure: RLAIF-V framework for hallucination reduction]

Reinforcement learning (RL) based alignment is relatively less studied in the VLM research community, but papers like MiniCPM-V acknowledge the importance of alignment and deploy RL-style methods (e.g. DPO in RLAIF-V) to improve grounding and reliability in high-stakes scenarios.
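For context, the DPO objective behind pipelines like RLAIF-V reduces to a logistic loss on the difference of policy-vs-reference log-probability ratios between preferred and rejected responses; a minimal sketch (per-sequence log-probabilities assumed to be precomputed):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Direct Preference Optimization loss on (chosen, rejected) response pairs.
    All inputs are summed per-sequence log-probabilities, shape (B,)."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp        # how much the policy prefers the chosen answer vs. the reference
    rejected_ratio = policy_rejected_logp - ref_rejected_logp  # same for the rejected answer
    logits = beta * (chosen_ratio - rejected_ratio)
    return -F.logsigmoid(logits).mean()
```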

6 Efficient VLM

One common use case for VLMs is on-device multimodal understanding, which typically comes with tight constraints on memory consumption and inference speed. Similar to LLMs, many techniques can help reduce memory cost, improve inference speed, and optimize model serving: quantization, KV caching, compilation and configuration optimization (e.g. computation allocation across cores and NPUs, see MiniCPM-V), Flash Attention, Paged Attention, local attention with KV sharing, grouped-query and multi-query attention (GQA and MQA), in-flight batching, speculative decoding (SPD), RMSNorm, and LLM pruning.
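As one concrete example, post-training dynamic quantization of linear layers is among the lowest-effort options in PyTorch (a generic sketch on a toy module; whether it suffices depends on the target device and backend):

```python
import torch
import torch.nn as nn

# toy stand-in for an LLM block; in practice this would be the full language model
model = nn.Sequential(nn.Linear(4096, 11008), nn.GELU(), nn.Linear(11008, 4096))

# quantize the weights of Linear layers to int8; activations are quantized dynamically at inference time
quantized = torch.ao.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 4096)
print(quantized(x).shape)   # same interface, roughly 4x smaller Linear weights
```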

Train-time optimizations can also reduce the cost of training and fine-tuning models, such as structured sparsity, parallelism, quantization-aware training, LoRA, QLoRA, and the LOMO optimizer.

For on-device tasks, the LLM backbone is usually between 1B and 3B parameters (e.g. Llama3.2 and Gemma-2B), with the optimizations mentioned above and optionally mixture-of-experts (MoE) support. To further reduce model size, or to inherit capabilities from high-quality large models, model distillation with both soft labels (from a teacher model, with temperature adjustments) and hard labels can be applied, reducing model size while maintaining comparable performance. New architectures like Mamba also show up to 43% parameter reduction with comparable performance (e.g. Cobra).
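The soft-plus-hard-label distillation mentioned above typically combines a temperature-scaled KL term against the teacher with the usual cross-entropy on ground-truth labels (a generic sketch, not tied to any particular on-device recipe):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend soft-label KD (teacher logits, temperature T) with hard-label cross-entropy.
    logits: (B, vocab), labels: (B,)"""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                        # standard T^2 scaling so gradient magnitudes stay comparable
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```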

For server-side tasks with large throughput and multiple rounds of queries, distributed system designs and additional stateful caching (e.g. LRU on host memory) are helpful.

7 Benchmarking

[Figure: MMMU data category, 11.5K meticulously collected multimodal questions, with 30 subjects and 183 subfields]

Classical VQA benchmarks suffer from limited variety and small sample sizes. Benchmarks like MathVista, MME, SEED-Bench and POPE enrich one or more categories of visual understanding. Recent benchmarks like MMBench, MM-Vet, MMMU and MMMU-Pro (with more answer choices and better vision data) attempt to provide all-around analyses of VLM performance. See this survey for more breakdowns of benchmarks.

VLM research papers commonly evaluate their models on multiple benchmarks, instead of just one or two, to showcase performance under different dataset preferences. Text-centric and text-only benchmarks (e.g. MMLU) are also helpful for evaluating the LLM itself. Some papers separate 0-shot and few-shot evals for better ablation studies as well.

As of today, on the MMMU benchmark, OpenAI o1 achieves the best performance and even surpasses the low human-expert baseline, thanks to its capability to perform slow, retrospective, step-based thinking. Below it come GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro, and Qwen2-VL-72B, all of which are closed-source. The best open-source model is InternVL2-Pro, followed by Llama 3.2 90B and NVLM-H 1.0 72B.

As for image generation quality, people still heavily rely on qualitative human evaluation (e.g. VisionLLM v2) or compare their models against state-of-the-art diffusion models like Stable Diffusion or autoregressive models like LlamaGen. Quantitative metrics like FID and CLIPScore can also be used to evaluate generated image quality against reference images or text prompts.
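For instance, a CLIPScore-style image-text alignment check can be computed from CLIP embeddings with the Hugging Face transformers library (a minimal sketch; the published CLIPScore metric additionally rescales the cosine similarity by a constant factor):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image: Image.Image, caption: str) -> float:
    """Cosine similarity between CLIP image and text embeddings (higher = better aligned)."""
    inputs = processor(text=[caption], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                           attention_mask=inputs["attention_mask"])
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    return float((image_emb * text_emb).sum())
```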

One More Thing…

End-to-end (E2E) VLMs are the new foundation models for autonomous driving. With proper prompt engineering, including inputs such as ego status, traffic scene summaries, high-recall bounding-box proposals, previous-frame predictions, and/or high-level router (path planning) commands, the E2E VLM consumes sensor data sequences and processed text prompts, and generates next-token predictions for various tasks, such as object detection, spatial reasoning, road/scene understanding, and motion planning (with chain-of-thought reasoning).

[Figure: Emma model design]

Waymo’s Emma is a typical E2E model that unifies perception and planning with a VLM. In another paper, DriveVLM, the E2E driving task is separated into two pipelines: one with a VLM for slow, hierarchical planning, and the other with a classical perception-planning ML pipeline for fast trajectory planning.

About the Author

Chang Gao (@changgy__), author of CTB (changgy tech blog). Chang has rich research and professional experience in the tech industry, including generative AI, machine learning, computer vision, robotics, on-device software, server-side software, and more.
