CTB-1: Vision Language Models in 2024
1 Introduction
Vision language models (VLMs) are generative models that can learn simultaneously from images and texts to tackle many tasks.
There have been plenty of survey papers on VLMs [survey1, survey2]. In this blog, we will skip introducing the basics and move directly to the latest research trend in 2024.
2 Multimodal Design
Generally speaking, there are two main types of designs for VLMs [source]:
Type A: multimodal LLM, or MLLM (visual encoder -> multimodal projector -> LLM / text decoder)
First we encode the image or video with a visual encoder, which is often pre-trained with methods such as SimCLR, CLIP, or SigLIP.
Then we pass the compressed vision representations to a multimodal projector to align them with text representations.
Finally, the projected vision tokens, often together with text tokens, will be processed by the pre-trained LLM to generate text.
Type A VLMs are usually easier to train, since they reuse a pre-trained visual encoder and LLM. Their main focus is aligning visual features with text features, and they rely mostly on the LLM to understand the compressed vision features. However, the text-based LLM cannot easily generate multimodal outputs without additional designs and training.
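To make the Type A data flow concrete, here is a minimal PyTorch-style sketch (module names, dimensions, and the LLM interface are illustrative placeholders, not any specific model's API):

```python
import torch
import torch.nn as nn

class TypeAVLM(nn.Module):
    """Minimal Type A sketch: visual encoder -> multimodal projector -> LLM."""
    def __init__(self, visual_encoder, llm, vision_dim=1024, text_dim=4096):
        super().__init__()
        self.visual_encoder = visual_encoder   # e.g. a pre-trained CLIP/SigLIP ViT
        self.projector = nn.Sequential(        # aligns vision features with the text space
            nn.Linear(vision_dim, text_dim),
            nn.GELU(),
            nn.Linear(text_dim, text_dim),
        )
        self.llm = llm                         # pre-trained text decoder

    def forward(self, pixel_values, text_embeds):
        # [B, N_patches, vision_dim] patch features from the (often frozen) encoder
        vision_feats = self.visual_encoder(pixel_values)
        # project into "vision tokens" that live in the LLM embedding space
        vision_tokens = self.projector(vision_feats)       # [B, N_patches, text_dim]
        # prepend vision tokens to the text token embeddings and decode as usual
        inputs = torch.cat([vision_tokens, text_embeds], dim=1)
        return self.llm(inputs_embeds=inputs)              # assumes an HF-style interface
```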
Type B: multimodal tokenized transformer (discrete tokenizer / encoder -> multimodal transformer / decoder)
First we use a tokenizer, such as VQ-VAE, to directly encode the image or video into a sequence of tokens.
Then we train a multimodal transformer to process the vision (and text) tokens.
With proper decoding and/or diffusion designs, we can directly generate multimodal outputs.
Type B VLMs natively support multimodal output generation. However, they usually need to be trained from scratch, and they are slower and harder to train.
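A correspondingly minimal sketch of the Type B flow, where images become discrete tokens in a shared vocabulary (the tokenizer and transformer below are stand-ins, not a real VQ-VAE implementation):

```python
import torch

def type_b_step(image, text_ids, vq_tokenizer, mm_transformer, image_vocab_offset=32000):
    """One step of a tokenized multimodal transformer (sketch with placeholder modules)."""
    # 1) discretize the image into codebook indices, e.g. a 32x32 grid -> 1024 ids
    image_ids = vq_tokenizer.encode(image)        # [B, N_image_tokens]
    # 2) shift image ids into a shared vocabulary so they do not collide with text ids
    image_ids = image_ids + image_vocab_offset
    # 3) a single transformer models the joint sequence; decoding can emit either modality
    tokens = torch.cat([text_ids, image_ids], dim=1)
    logits = mm_transformer(tokens)               # next-token prediction over the joint vocab
    return logits
```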
A typical example is the closed-source Gemini-1.5 (its open-source counterpart, PaliGemma, follows the Type A design using a pre-trained Gemma-2B).
Another example is Transfusion, which follows a VAE encoder -> Transformer -> VAE decoder design. The image tokens are patch-based; within the transformer, attention among the patches of the same image is full (bidirectional), while attention across different input sequences (and within each text sequence) remains causal.
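A small sketch of how such a hybrid mask could be built (the image patch positions are assumed to be known; this illustrates the idea rather than reproducing Transfusion's actual code):

```python
import torch

def hybrid_attention_mask(seq_len, image_spans):
    """Causal mask everywhere, but full (bidirectional) attention inside each image span.

    image_spans: list of (start, end) index pairs covering the patch tokens of each image.
    Returns a [seq_len, seq_len] boolean mask where True means "may attend".
    """
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))  # causal baseline
    for start, end in image_spans:
        mask[start:end, start:end] = True   # patches of the same image attend to each other fully
    return mask

# example: 4 text tokens, then an image of 6 patches, then 3 more text tokens
mask = hybrid_attention_mask(13, image_spans=[(4, 10)])
```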
In this blog, we will focus mostly on Type A VLMs, since the research community rarely trains a multimodal transformer from scratch. Later we also mention some recent research papers that attempt to unify both designs (e.g. VILA-U, Libra).
3 Architecture
3.1 Visual Encoder
The visual encoder, as shown by ablations in MM1 and LLaVa-Next, primarily serves as a semi-lossless feature compressor. Scaling the encoder size does not have a significant impact on MLLM performance as long as the compressed features remain semi-lossless. Scaling the image resolution, however, is critical.
One of the main challenges for the visual encoder is supporting images with high or varying resolution and text-rich content. With a ViT-based CLIP visual encoder, the image resolution is usually 224x224 or 336x336, which cannot handle large images. The last ViT layer, often used as the features of the vision input, is already a large tensor (576 or 768), making it hard to support multiple images. (There are also other challenges for video-primary encoders, which are out of scope for this blog; we may revisit them in a future post.)
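As a rough back-of-the-envelope, the visual token count grows quadratically with resolution, which is why naive resolution scaling quickly blows up the LLM context (the numbers below assume a standard 14x14-patch ViT and are purely illustrative):

```python
def vit_token_count(resolution, patch_size=14):
    """Number of patch tokens for a square input with a standard ViT patchifier."""
    return (resolution // patch_size) ** 2

print(vit_token_count(224))   # 256 tokens
print(vit_token_count(336))   # 576 tokens
print(vit_token_count(672))   # 2304 tokens, 4x the 336px cost for a single image
```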
Let’s take a look at how recent research papers tackle this issue:
Note that different methods focus on different aspects of the challenge, and therefore have pros and cons on different eval tasks. Invariably, they all aim to capture as much information from the original image as possible without drastically increasing the feature or token size.
Some additional techniques can also be applied to improve text-rich understanding without increasing train-time and/or inference-time model complexity, such as using multiple small vision encoders or mixture-of-experts (MoEs) to improve multitask performance (e.g. BRAVE), mixing convolutions with transformers (e.g. ViTamin), etc.
3.2 Multimodal Projector
The multimodal projector typically processes the last feature layer of the visual encoder (e.g. last layer of transformer without the CLS token) and generates 1D token-like features for the LLM to consume. Common techniques include:
Pooling: apply average or attention pooling directly on the visual features, e.g. MM1 (see the sketch after this list)
A small NN with MLPs or convolutions: learns a modality adaptation mapping and is often easier to train, e.g. Honeybee (C/D-Abstractor), MM1, LLaVa-OneVision, InternVL1.5
Resampling with cross-attention: combined with 2D position encoding, it learns the domain adaptation and also helps compress the token count (for multi-frame or large-image support), e.g. MiniCPM-V
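To make the first two options concrete, here is a minimal PyTorch-style sketch (dimensions and module names are illustrative placeholders, not any particular model's implementation):

```python
import torch.nn as nn

class MLPProjector(nn.Module):
    """A simple two-layer MLP projector: one visual token in, one LLM-space token out."""
    def __init__(self, vision_dim=1024, text_dim=4096):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(vision_dim, text_dim), nn.GELU(),
                                 nn.Linear(text_dim, text_dim))

    def forward(self, vision_feats):           # [B, N, vision_dim]
        return self.net(vision_feats)          # [B, N, text_dim]

class AvgPoolProjector(nn.Module):
    """Average-pool k x k neighborhoods of patch tokens to shrink the token count."""
    def __init__(self, vision_dim=1024, text_dim=4096, k=2):
        super().__init__()
        self.pool = nn.AvgPool2d(k)
        self.proj = nn.Linear(vision_dim, text_dim)

    def forward(self, vision_feats):                        # [B, N, vision_dim], N = H*W
        b, n, c = vision_feats.shape
        h = w = int(n ** 0.5)
        x = vision_feats.transpose(1, 2).reshape(b, c, h, w)
        x = self.pool(x)                                    # [B, C, H/k, W/k]
        x = x.flatten(2).transpose(1, 2)                    # [B, N/k^2, C]
        return self.proj(x)                                 # fewer, projected tokens
```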
Papers like MM1 and LLaVa-OneVision claim that the multimodal projector is less critical to overall model performance than other factors in the model architecture. Still, as other parts of the model scale up, the multimodal projector may also need adjustments; we may see this trend in the coming year.
3.3 LLM / Text Decoder
Since most research papers leverage pre-trained LLMs to reduce training cost, such as Llama3.2, Mixtral-8x7B, Qwen2.5, and their variants, we won't dive into the details of LLM model design. To reiterate, scaling the LLM is still critical to model performance: the visual encoder + projector essentially convert multimodal inputs into text-like tokens, and the LLM is the core component that performs understanding and generation.
3.4 Visual Decoder, or Maybe yet another Multimodal Transformer?
To further generate multimodal outputs, a naive solution is to append a separate text-to-image decoder (diffusion transformer, consistency model, etc.) after the MLLM, as with DALL·E 3. Some recent work has also proposed methods to integrate the visual decoder into the end-to-end MLLM pipeline:
We can see that with the visual decoder fitting smoothly into the multimodal model, the gap between Type A and Type B VLMs is getting smaller. With post-training on the LLM and hybrid visual tokens, they can even be unified into one monolithic design.
4 Pre-training
As studied extensively in MM1 and MM1.5, pre-training a VLM (assuming no component is frozen) usually needs:
billions of image-text pairs
billions of interleaved image-text data (e.g. Flamingo, for better multimodal understanding and better few-shot and text-only performance)
trillions of text-only tokens
[optionally] synthetic data
A typical data sampling ratio is 5:1:4, roughly half-and-half for image-centric and text-centric data. A typical batch size is 256 or 512.
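As a hedged illustration, such a mixture could be realized with a simple weighted sampler like the one below (the dataset interface and next_example() method are placeholders):

```python
import random

def sample_batch(datasets, weights=(5, 1, 4), batch_size=256):
    """Draw a pre-training batch from {paired, interleaved, text-only} sources
    according to a fixed sampling ratio (here 5:1:4)."""
    names = ["image_text_pairs", "interleaved", "text_only"]
    picks = random.choices(names, weights=weights, k=batch_size)
    # next_example() stands in for whatever iterator API the datasets expose
    return [datasets[name].next_example() for name in picks]
```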
Models that use a pre-trained visual encoder and LLM usually freeze both at the beginning (sometimes called the warm-up stage or alignment stage), and only train a proper multimodal projector with millions to hundreds of millions of low-resolution, low-to-mid-quality image-text pairs.
After training the multimodal projector, researchers also find it helpful to add a second pre-training stage (continual pre-training or high-quality knowledge learning) with millions to hundreds of millions of high-resolution, mid-quality, and/or text-rich image-text pairs to boost image understanding. Multilingual captioning data can also be added to improve the model's performance on different languages (e.g. LLaVa-OneVision, MiniCPM-V). Filtering is critical here to ensure the data quality is better than that of the original large-scale internet-mined datasets.
Unless high-quality text-centric data is available, the LLM is commonly kept frozen during the pre-training stage.
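In a PyTorch-style setup, this staged freezing boils down to toggling requires_grad per component (a minimal sketch, assuming a model with the visual_encoder / projector / llm attributes from the hypothetical Type A example earlier):

```python
def set_trainable(module, trainable):
    for p in module.parameters():
        p.requires_grad = trainable

# Stage 1 (warm-up / alignment): only the projector learns
set_trainable(model.visual_encoder, False)
set_trainable(model.llm, False)
set_trainable(model.projector, True)

# Stage 2 (continual pre-training): optionally unfreeze the visual encoder for
# high-resolution / text-rich data; the LLM often stays frozen at this point.
set_trainable(model.visual_encoder, True)
```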
Both reconstruction loss (on the visual encoder, like AIM) and contrastive loss (on VQA tasks) are commonly used in model training.
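For reference, a CLIP-style contrastive (InfoNCE) loss over a batch of image and text embeddings looks roughly like this (a generic sketch, not tied to any specific model above):

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_embeds, text_embeds, temperature=0.07):
    """Symmetric InfoNCE: matched image/text pairs sit on the diagonal."""
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    logits = image_embeds @ text_embeds.t() / temperature   # [B, B] similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```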
5 Post-training
Data quality and variety are essential in the supervised fine-tuning (SFT) stage for model post-training.
MM1.5 mixes high quality single-frame image, multi-frame image, and text-only data, focusing on text-rich, refer-and-ground, general knowledge, math, and code categories.
LLaVa-OneVision and MiniCPM-V adopt an SFT scheme similar to MM1.5, with millions of data samples. They also incorporate additional multilingual data and increase the maximum number of input image patches to support high-resolution images.
In the SFT stage, some models (e.g. Llama3.2) freeze the LLM to avoid regressions on text-centric or text-only tasks, while others (e.g. MM1.5, MiniCPM-V, LLaVa-OneVision) claim that fine-tuning the LLM improves its ability to understand visual features and follow instructions. Similarly, freezing or unfreezing (parts of) the visual encoder has its own pros and cons. With different model setups, data amounts, and data distributions, there may not be a single optimal post-training strategy, and decisions have to be made case by case (or, equivalently, scaling other aspects is probably more critical to model performance).
6 Efficient VLM
One common use case for VLMs is on-device multimodal understanding, which typically has tight constraints on memory consumption and inference speed. Similar to LLMs, techniques like quantization, KV caching, compilation and configuration optimization (e.g. computation allocation on different cores, utilizing NPUs, see MiniCPM-V), Flash Attention, Paged Attention, local attention with KV-sharing, grouped-query and multi-query attention (GQA and MQA), in-flight batching, speculative decoding (SPD), RMSNorm, and LLM pruning can help reduce memory cost, improve inference speed, and optimize model serving.
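As one concrete example of these techniques, symmetric per-channel int8 weight quantization can be sketched in a few lines (a simplified illustration; production runtimes use calibrated, fused kernels):

```python
import torch

def quantize_int8_per_channel(weight):
    """Quantize a [out, in] weight matrix to int8 with one scale per output channel."""
    scales = weight.abs().amax(dim=1, keepdim=True) / 127.0   # symmetric range
    q = torch.clamp((weight / scales).round(), -127, 127).to(torch.int8)
    return q, scales

def dequantize(q, scales):
    return q.to(torch.float32) * scales
```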
Train-time optimizations can also reduce the cost of training and fine-tuning models, such as structured sparsity, parallelism, quantization-aware training, LoRA, QLoRA, and the LOMO optimizer.
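Among the train-time options, LoRA is simple enough to sketch: a frozen linear layer plus a low-rank trainable update (a generic sketch, not a specific library's API):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """y = W x + (alpha / r) * B A x, with W frozen and only A, B trained."""
    def __init__(self, base_linear, r=8, alpha=16):
        super().__init__()
        self.base = base_linear
        for p in self.base.parameters():
            p.requires_grad = False                      # freeze the pre-trained weight
        in_f, out_f = base_linear.in_features, base_linear.out_features
        self.A = nn.Parameter(torch.randn(r, in_f) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_f, r))     # zero init: training starts at W
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * (x @ self.A.t() @ self.B.t())
```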
For on-device tasks, the LLM backbone is usually between 1B and 3B parameters (e.g. Llama3.2 and Gemma-2B), with the optimizations mentioned above and optionally mixture-of-experts (MoE) support. To further reduce model size, or to inherit from high-quality large models, model distillation with both soft labels (from the teacher model, with temperature adjustments) and hard labels can be applied to shrink the model while maintaining comparable performance. New architectures like Mamba also show up to 43% parameter reduction with comparable performance (e.g. Cobra).
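The soft-label part of such distillation is typically a temperature-scaled KL term blended with the hard-label cross-entropy (a standard sketch, not any specific paper's recipe):

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend soft-label KL (teacher) with hard-label cross-entropy (ground truth)."""
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)   # T^2 keeps the gradient scale comparable
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```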
For server-side tasks with large throughput and multiple rounds of queries, distributed system designs and additional stateful caching (e.g. LRU on host memory) are helpful.
7 Benchmarking
VLM research papers commonly evaluate their model performance on multiple benchmarks, instead of just one or two, to showcase their performance under different dataset preferences. Text-centric and text-only benchmarks (e.g. MMLU) are also helpful to evaluate the performance of the LLM. Some papers separate 0-shot and few-shot evals for better ablation study as well.
As of today, on the MMMU Benchmark, OpenAI o1 achieves the best performance and even surpasses human experts (low), with its capability to perform slow, retrospective, and step-based thinking. Below it come GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro, and Qwen2-VL-72B, all of which are closed-source. The best open-source model is InternVL2-Pro, followed by Llama 3.2 90B and NVLM-H 1.0 72B.
As for image generation quality, evaluation still relies heavily on qualitative human judgment (e.g. VisionLLM v2) or on comparisons against state-of-the-art diffusion-based models like Stable Diffusion or autoregressive models like LlamaGen. Quantitative metrics such as FID and CLIPScore can also be used to evaluate generated image quality against reference images or text prompts.
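For instance, a CLIPScore-style text-image alignment score is just a scaled cosine similarity between CLIP embeddings (a sketch; the clip_model interface here is a placeholder following the common encode_image / encode_text convention):

```python
import torch
import torch.nn.functional as F

def clip_score(clip_model, image, caption, w=2.5):
    """CLIPScore-style metric: w * max(0, cos(E_image, E_text))."""
    with torch.no_grad():
        img_emb = F.normalize(clip_model.encode_image(image), dim=-1)
        txt_emb = F.normalize(clip_model.encode_text(caption), dim=-1)
    return w * torch.clamp((img_emb * txt_emb).sum(dim=-1), min=0)
```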
One More Thing…
End-to-end (E2E) VLMs are the new foundation models for autonomous driving. With proper prompt engineering, such as ego status, traffic scene summaries, high-recall bounding box proposals, previous-frame predictions, and/or high-level router (path planning) commands, the E2E VLM consumes sensor data sequences and the processed text prompts, and generates next-token predictions for various tasks such as object detection, spatial reasoning, road/scene understanding, and motion planning (with chain-of-thought).
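A hypothetical prompt-assembly helper for such a driving VLM could look like the following (all field names and the output format are illustrative, not taken from any specific system):

```python
def build_driving_prompt(ego_status, scene_summary, box_proposals, prev_predictions, route_command):
    """Assemble the text prompt that accompanies the sensor token sequence (illustrative)."""
    boxes = "; ".join(f"{b['label']} at {b['xyz']}" for b in box_proposals)
    return (
        f"Ego status: {ego_status}\n"
        f"Scene: {scene_summary}\n"
        f"Detected objects (high-recall proposals): {boxes}\n"
        f"Previous frame predictions: {prev_predictions}\n"
        f"Route command: {route_command}\n"
        "Task: reason step by step about the scene, then output the motion plan."
    )
```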
About the Author
Chang Gao (@changgy__), author of CTB (changgy tech blog). Chang has rich research and professional experience in the tech industry, including generative AI, machine learning, computer vision, robotics, on-device software, server-side software, and more.