VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding
In this paper, we propose VideoLLaMA3, a more advanced multimodal foundation model for image and video understanding. The core design philosophy of VideoLLaMA3 is vision-centric. The meaning of "vision-centric" is two-fold: the vision-centric training paradigm and the vision-centric framework design. The key insight of our vision-centric training paradigm is that high-quality image-text data is crucial for both image and video understanding. Instead of preparing massive video-text datasets, we focus on constructing large-scale and high-quality image-text datasets. VideoLLaMA3 has four training stages: 1) the vision-centric alignment stage, which warms up the vision encoder and projector; 2) the vision-language pretraining stage, which jointly tunes the vision encoder, projector, and LLM with large-scale image-text data covering multiple types (including scene images, documents, and charts) as well as text-only data; 3) the multi-task fine-tuning stage, which incorporates image-text SFT data for downstream tasks and video-text data to establish a foundation for video understanding; 4) the video-centric fine-tuning stage, which further improves the model's capability in video understanding. As for the framework design, to better capture fine-grained details in images, the pretrained vision encoder is adapted to encode images of varying sizes into a corresponding number of vision tokens, rather than a fixed number of tokens. For video inputs, we reduce the number of vision tokens according to their similarity so that the representation of videos becomes more precise and compact. Benefiting from these vision-centric designs, VideoLLaMA3 achieves compelling performance on both image and video understanding benchmarks.
Discussion
Host: Hey everyone, and welcome back to the podcast! I'm your host, Leo, and I'm super excited about today's episode. We're diving into some cutting-edge research in the world of AI, specifically looking at how machines are learning to see and understand the world around us through images and videos. It’s gonna be a really interesting one, so buckle up!
Host: Today, we're going to be unpacking a fascinating paper called 'VideoLLaMA3: Frontier Multimodal Foundation Models for Image and Video Understanding'. Now, that's quite a mouthful, right? But what it boils down to is a new AI model that's pushing the boundaries of how we teach machines to process visual data. It’s not just about recognizing objects anymore, it’s about understanding the nuances, the context, and even the temporal relationships in what they’re seeing.
Host: I've been digging into this research, and what’s particularly interesting is that these models aren't just passively observing; they're actively learning and drawing connections, which kind of mirrors how we humans understand the world, you know, building up from single images to complex video sequences. So, we’ll take this journey together today and try to make it a little easier to understand.
Host: Alright, let's jump right in, shall we? The paper itself is a bit of a beast, so we’re going to be focusing on the core ideas behind VideoLLaMA3. The authors introduce this idea of a 'vision-centric' approach, and it's really the backbone of how the entire model is built, so I think grasping it is key to understanding the rest of the work.
Host: So when they say vision-centric, they mean it in two main ways. First, there’s the training approach, where they emphasize high-quality image data. They say, “Look, videos are just sequences of images, right?” So, instead of going after vast amounts of video data, which can be messy and hard to annotate, they’re focusing on getting super clean and detailed image datasets. It kind of makes sense, right? If they get the basics right, then it should extrapolate to the complexity of video.
Host: And then secondly, there's the framework design. So how they've actually structured the model. Here, they’re trying to build an architecture that’s flexible. It's about getting the most out of images of different sizes and shapes, not just forcing everything into some fixed box. They're also trying to handle videos efficiently, kind of pruning out the repetitive stuff, you know, like when nothing much is changing from frame to frame.
Host: It’s like, imagine trying to explain a movie to someone. You wouldn’t just describe every single frame in painstaking detail. You’d focus on the key moments, the turning points, the stuff that actually matters to the story. That's what this model is trying to do, but with AI. So, first, the training paradigm. The model goes through four stages, all aimed at making it great at images first and then at videos.
Host: The first one is what they call the 'vision-centric alignment stage'. This is where they get the vision encoder, the part of the model that actually processes images, up to speed by aligning its output with the large language model. It’s like making sure they are speaking the same language, you know? They use high-quality scene images with short captions, which strengthens the encoder's basic performance, and then they add in document images and scene text to get it to pick up on finer details like text and structure, which is really crucial when it comes to documents and charts.
Host: The second stage is 'vision-language pretraining'. This is where they throw a bunch of image-text data at it. I mean a lot. They are adding scene images with detailed captions, documents with detailed explanations, charts, and even some bounding boxes to help with spatial reasoning, and a little bit of text-only data. All of this with all parameters unfrozen. It's kind of like they're building up a really rich, multimodal knowledge base, getting the model to understand the world through both visual and textual cues, and how they relate to each other. This is really foundational for the next steps.
Host: Okay, so now, stage three is the 'multi-task fine-tuning stage'. Here's where things start to get more specific. They're taking all that general knowledge they built up and starting to fine-tune it on actual tasks. So they're using image-text data with questions and answers, which teaches the model how to be more interactive, and they also use general video caption data. Now, this is really interesting: even though this stage is mainly about images, they use video captioning here to lay the groundwork for video understanding. It's kind of like they're saying, 'Hey, let's start learning how to tell the story of a video while we're still sharpening our image skills'. It's a bit unexpected, but it makes sense when you think about it.
Host: And finally, there’s the fourth stage: 'video-centric fine-tuning'. So, after all that preparation, they're now focusing specifically on video. They're using data like general videos, streaming videos, videos with temporal annotations to show what’s happening in time, and even image-only and text-only data as well. By doing this final stage, they get the model to really understand dynamic content and all that video complexity.
Host: So, now let’s take a deeper look into the architecture and how it's designed. The core thing they're trying to get across is that a vision encoder should be able to take in images at any resolution, because in real life you’re going to have images of all kinds of sizes and shapes, right? So rather than a fixed-size positional embedding, the model uses Rotary Position Embedding, also known as RoPE, which adapts to varying resolutions and makes the model flexible.
Host: And then for videos, they're trying to be smart about it. Videos carry a lot of information, but also a lot of redundancy, so they propose to compress the video tokens, meaning reducing their number. It’s like saying, 'Hey, we don't need every single bit of detail from each frame. Let’s focus on what’s really changing'. So, they compress the video data so that it’s more compact and the model can focus on the dynamic, moving parts, while also saving on computation during both training and inference.
Host: Now that we’ve covered how the model is trained and the overall framework, let’s zoom into the core technical bits. The model has two main components they want to highlight: Any-resolution Vision Tokenization (AVT) and the Differential Frame Pruner (DiffFP). These two components set this model apart from other models and help it achieve higher performance.
Host: So first, let's tackle the Any-resolution Vision Tokenization, or AVT, part. Many other models use a vision encoder trained for one specific resolution, meaning they only accept images of a fixed size, which leads to information loss or distortion. And even though there are techniques that split images into fixed patches, that's still inflexible and ignores the positional relationships within the image. So the authors come up with AVT: they take a ViT-based encoder and use RoPE to replace the absolute position embedding, which lets them process images of any resolution. And by fine-tuning this encoder, it becomes compatible with variable resolutions and can take advantage of fine details within images.
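Host: To make that a bit more concrete, here's a rough sketch in code of the idea behind any-resolution tokenization. This is not the authors' implementation: the patch size, dimensions, and the toy 2D RoPE helper below are assumptions of mine, just to show how the token count scales with image size and how per-patch (row, column) coordinates can replace a fixed-size position table (in a real ViT, the rotation would be applied to queries and keys inside attention).

```python
# Conceptual sketch of any-resolution vision tokenization (not the authors' code).
import torch
import torch.nn.functional as F

PATCH = 14  # assumed patch size

def patchify(image: torch.Tensor):
    """Split a (C, H, W) image of any size into flattened patches plus
    their (row, col) grid coordinates."""
    c, h, w = image.shape
    pad_h, pad_w = (-h) % PATCH, (-w) % PATCH           # pad to a multiple of the patch size
    image = F.pad(image, (0, pad_w, 0, pad_h))
    gh, gw = image.shape[1] // PATCH, image.shape[2] // PATCH
    patches = image.unfold(1, PATCH, PATCH).unfold(2, PATCH, PATCH)   # (C, gh, gw, P, P)
    patches = patches.permute(1, 2, 0, 3, 4).reshape(gh * gw, -1)     # one token per patch
    coords = torch.stack(torch.meshgrid(
        torch.arange(gh), torch.arange(gw), indexing="ij"), dim=-1).reshape(-1, 2)
    return patches, coords

def rope_2d(x: torch.Tensor, coords: torch.Tensor) -> torch.Tensor:
    """Toy 2D rotary embedding: rotate half the channels by the row index
    and the other half by the column index."""
    d = x.shape[-1] // 2
    def rotate(v, pos):
        half = v.shape[-1] // 2
        freqs = 1.0 / (10000 ** (torch.arange(half, dtype=torch.float32) / half))
        angles = pos[:, None].float() * freqs[None, :]
        v1, v2 = v[..., :half], v[..., half:]
        return torch.cat([v1 * angles.cos() - v2 * angles.sin(),
                          v1 * angles.sin() + v2 * angles.cos()], dim=-1)
    return torch.cat([rotate(x[..., :d], coords[:, 0]),
                      rotate(x[..., d:], coords[:, 1])], dim=-1)

img = torch.randn(3, 392, 518)                  # an arbitrary resolution
tokens, coords = patchify(img)
feats = rope_2d(torch.nn.Linear(tokens.shape[-1], 64)(tokens), coords)
print(tokens.shape, feats.shape)                # token count scales with image size
```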
Host: And then there's the Differential Frame Pruner, or DiffFP. This is the video compression part. You see, after videos are tokenized, they usually end up with a lot of tokens, which gets computationally expensive. So, first they downsample each frame spatially to limit the context length. But consecutive frames also tend to overlap heavily, which means redundant tokens. That's where DiffFP comes in: it compares patches from consecutive frames in pixel space, and if the difference is below a certain threshold, meaning not much has changed, those patches are removed. It's like getting rid of the repeated information and focusing on the things that are moving or changing. This makes the video representation more compact and precise.
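Host: And here's a minimal sketch of that pruning rule. The distance metric (mean absolute pixel difference) and the threshold value are assumptions for illustration, not the paper's exact settings.

```python
# Minimal sketch of frame-difference-based patch pruning (illustrative only).
import torch

def prune_redundant_patches(frames: torch.Tensor, patch: int = 14,
                            threshold: float = 0.1) -> torch.Tensor:
    """frames: (T, C, H, W) clip with H, W divisible by `patch`.
    Returns a boolean mask of shape (T, num_patches): keep every patch of the
    first frame, plus patches whose mean absolute pixel difference from the
    same location in the previous frame exceeds the threshold."""
    t, c, h, w = frames.shape
    gh, gw = h // patch, w // patch
    patches = frames.unfold(2, patch, patch).unfold(3, patch, patch)   # (T, C, gh, gw, P, P)
    patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(t, gh * gw, -1)
    diff = (patches[1:] - patches[:-1]).abs().mean(dim=-1)             # (T-1, gh*gw)
    keep = torch.ones(t, gh * gw, dtype=torch.bool)
    keep[1:] = diff > threshold                                        # drop near-duplicates
    return keep

clip = torch.rand(8, 3, 224, 224)                 # 8 frames
clip[1] = clip[0]                                 # make frame 1 a duplicate of frame 0
mask = prune_redundant_patches(clip)
print(mask.sum().item(), "of", mask.numel(), "patches kept")
```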
Host: Alright, so they’ve got this amazing model, but it's only as good as the data it's trained on. The paper emphasizes the creation of a high-quality re-captioned image dataset, called VL3-Syn7M. This shows their commitment to quality over quantity. They source the images from the COYO-700M dataset and they put them through a rigorous cleaning process. So, they apply different filters to remove the messy, low-quality image data.
Host: The first filter is about the aspect ratio: they remove images that are excessively long or wide, because those unusual shapes can throw off the model's understanding. Then there’s aesthetic score filtering, where they remove low-quality images. Next, they calculate the text-image similarity and drop images with a low score, meaning their content is hard to describe concisely. Finally, there’s visual feature clustering, where they group the images into clusters and select a fixed number of images from each one. This way, they make sure the model is trained on a dataset that's diverse but still balanced.
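Host: Just to show the shape of that pipeline, here's a small sketch. The score fields and thresholds are placeholders I made up; they stand in for whatever aesthetic predictor, CLIP-style similarity model, and clustering setup the team actually used.

```python
# Illustrative filtering pipeline for image-caption pairs (hypothetical fields/thresholds).
from dataclasses import dataclass

@dataclass
class Sample:
    image_id: str
    width: int
    height: int
    aesthetic_score: float          # e.g. from a learned aesthetic predictor
    text_image_similarity: float    # e.g. CLIP-style score between image and caption
    cluster_id: int                 # e.g. from k-means over visual features

def filter_samples(samples, max_aspect=3.0, min_aesthetic=4.5,
                   min_similarity=0.25, per_cluster=1000):
    kept, per_cluster_count = [], {}
    for s in samples:
        aspect = max(s.width, s.height) / max(1, min(s.width, s.height))
        if aspect > max_aspect:                     # drop excessively long/wide images
            continue
        if s.aesthetic_score < min_aesthetic:       # drop low-quality images
            continue
        if s.text_image_similarity < min_similarity:  # caption doesn't match the image well
            continue
        n = per_cluster_count.get(s.cluster_id, 0)
        if n >= per_cluster:                        # cap each visual cluster for balance
            continue
        per_cluster_count[s.cluster_id] = n + 1
        kept.append(s)
    return kept
```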
Host: And lastly, after all the filtering and cleaning comes the re-captioning. They generate both a brief and a detailed caption using InternVL2 models, which provides more comprehensive and robust textual data. Through this whole process, they end up with a clean and diverse dataset of 7 million image-caption pairs, which shows their dedication to high-quality training data and how central it is to the training paradigm.
Host: Now, let’s move on to how the model is actually put together and trained. The VideoLLaMA3 model is composed of four core components: a vision encoder, a video compressor, a projector, and a large language model, or LLM. The vision encoder takes in the visual inputs. The video compressor, as we talked about, reduces the number of tokens for video inputs. The projector bridges the vision encoder and the LLM, making sure they live in the same feature space. And for the LLM part, they use Qwen2.5 models as the base language model.
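Host: Here's a schematic of how those four pieces could be wired together. The class name, dimensions, and nn.Identity placeholders are mine; this only illustrates the data flow, not the released code.

```python
# Schematic composition of the four components (illustrative, not the released model).
import torch
import torch.nn as nn

class VideoLLMSketch(nn.Module):
    def __init__(self, vision_dim=1024, llm_dim=3584):
        super().__init__()
        self.vision_encoder = nn.Identity()      # stands in for the ViT-based AVT encoder
        self.video_compressor = nn.Identity()    # stands in for DiffFP token pruning
        self.projector = nn.Sequential(          # maps vision features into the LLM space
            nn.Linear(vision_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim))
        self.llm = nn.Identity()                 # stands in for the Qwen2.5 decoder

    def forward(self, vision_tokens: torch.Tensor, is_video: bool) -> torch.Tensor:
        feats = self.vision_encoder(vision_tokens)
        if is_video:
            feats = self.video_compressor(feats)  # prune redundant video tokens
        return self.llm(self.projector(feats))    # visual tokens enter the LLM like text

model = VideoLLMSketch()
out = model(torch.randn(1, 729, 1024), is_video=False)
print(out.shape)                                  # torch.Size([1, 729, 3584])
```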
Host: They follow the four-stage approach we covered in the training paradigm. The first stage is ‘vision encoder adaptation’, which is mainly about fine-tuning the vision encoder to process images of different resolutions while also aligning its features with the LLM. In this stage, the vision encoder is made trainable and the language decoder is kept frozen. This way, the vision encoder becomes an adaptable, dynamic-resolution processor.
Host: Then, there's the ‘vision-language alignment’ stage. This is where all the parameters become trainable, and the goal is to inject multimodal knowledge into both the vision encoder and the LLM. This stage helps integrate visual and textual information and improves overall multimodal understanding. After that comes the ‘multi-task fine-tuning’ stage, where they fine-tune the model on a diverse set of data, including image and video question answering and general video captions. This is all about making the model better at understanding and following instructions. It's also in this stage that the video compressor comes into play to reduce video tokens.
Host: And finally, we have ‘video-centric fine-tuning’, and the name speaks for itself. Here they focus on really enhancing the video understanding capabilities. It’s like, from the model’s perspective, everything from now on revolves around video. All parameters are unfrozen, and they train with video-text data, image-only data, and text-only data. It’s like turning the model into a video specialist.
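Host: If you want to picture how trainability shifts across those stages, it could look roughly like this. It's a simplification of the schedule described above, with trivial stand-in modules so the snippet runs on its own; it's not the actual training code.

```python
# Simplified per-stage freeze/unfreeze schedule (illustrative).
import torch.nn as nn

class _Parts(nn.Module):
    """Trivial stand-ins; in practice these would be the real encoder, projector, and LLM."""
    def __init__(self):
        super().__init__()
        self.vision_encoder = nn.Linear(8, 8)
        self.projector = nn.Linear(8, 8)
        self.llm = nn.Linear(8, 8)

def configure_stage(model: nn.Module, stage: int) -> None:
    def set_trainable(module: nn.Module, flag: bool) -> None:
        for p in module.parameters():
            p.requires_grad = flag
    if stage == 1:
        # Stage 1 (vision encoder adaptation): warm up encoder + projector, freeze the LLM.
        set_trainable(model.vision_encoder, True)
        set_trainable(model.projector, True)
        set_trainable(model.llm, False)
    else:
        # Stages 2-4 (alignment, multi-task, video-centric): all parameters are trainable.
        for m in (model.vision_encoder, model.projector, model.llm):
            set_trainable(m, True)

m = _Parts()
configure_stage(m, 1)
print(any(p.requires_grad for p in m.llm.parameters()))   # False in stage 1
```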
Host: Okay, so we know how it's trained. Now, let’s talk about how they feed the data to the model. They organize images, videos, and streaming videos into very specific sequences. Images are represented as image tokens, with a ‘\n’ to separate different images, and the text tokens that follow the image tokens are separated by ‘\n’ as well. This ensures a proper mix of visual and textual data.
Host: For videos, each frame is represented as a frame token, and right before each frame's tokens there’s a timestamp token, like “Time: xxs”, which keeps track of the time that frame corresponds to. Frames are separated by commas, and videos are separated by ‘\n’. And lastly, for streaming videos, video and text tokens are interleaved. Similar to regular videos, timestamps are added before the frame tokens, and, like in a real conversation, there are answer tokens, like “GPT: xxx”, within the sequence, which mimics the interactive nature of streaming content.
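Host: As a toy rendering of that layout, here's roughly what a formatted video prompt could look like. The '<frame>' placeholder and the helper function are made up for readability; in the real model, actual vision token embeddings sit at those positions.

```python
# Rough rendering of the video interleaving scheme described above.
def format_video_prompt(timestamps, question: str) -> str:
    """Prepend a 'Time: xxs' timestamp to each frame, separate frames with
    commas, and end the video block with a newline before the text."""
    frame_parts = [f"Time: {t}s<frame>" for t in timestamps]
    return ",".join(frame_parts) + "\n" + question

print(format_video_prompt([0.0, 1.0, 2.0], "What happens in this clip?"))
# Time: 0.0s<frame>,Time: 1.0s<frame>,Time: 2.0s<frame>
# What happens in this clip?
```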
Host: Now, we have to talk about the massive amounts of data they’re using and the different tasks in each training stage. In the vision encoder adaptation stage, they use scene images, scene text images, and document data. The scene images include data such as VL3-Syn7M-short and LLaVA-Pretrain-558K, which boost overall performance, while the scene text and document images enable the model to capture fine-grained details. The data labeled with ‘Recap’ has been generated using InternVL2-8B.
Host: Then, in the vision-language alignment stage, they include five different types of data to cover everyday scenarios: scene images, scene text images, documents, charts, and fine-grained data, plus a bunch of text-only data. The scene images include high-quality sources such as COCO-2017 and ShareGPT4o. The scene text images come from diverse Chinese and English text datasets, each of which also includes a bounding box for the text within the image. They also have a document dataset containing accurate synthetic images, chart images, since charts are similar in spirit to documents, and a fine-grained image dataset that includes region captions and grounded captions.
Host: In the multi-task fine-tuning stage, there are six different types of data: general image data, document data, chart/figure data, OCR data, grounding data, and multi-image data. In each category, they use high-quality datasets and a rigorous filtering process to make sure they’re effective. The OCR data covers common real-world situations, like development environments and natural scenes, and the instruction-tuning data for OCR even spans five subtasks. The multi-image data also helps the model handle more complex situations where there are multiple images to comprehend at once.
Host: And finally, the last training stage is the video-centric fine-tuning stage. Here, they aim to tune the model into a video expert by using large amounts of high-quality video instruction-following data. There’s a lot of general video data, which they expand with dense captions and video question-answer pairs, and they also include data for streaming video understanding, as well as temporal grounding data so the model learns when things happen across frames. And lastly, they mix in image-only and text-only data to make sure the model doesn’t forget its earlier capabilities.
Host: So, that’s a deep dive into the core methodologies behind VideoLLaMA3. I know it's a lot to digest but it really helps paint the picture of how the model works, and why it’s able to achieve such high performance. Let’s move on now to the actual experimentation that was done. They evaluate this model using a variety of benchmarks.