MMVU: Measuring Expert-Level Multi-Discipline Video Understanding
We introduce MMVU, a comprehensive expert-level, multi-discipline benchmark for evaluating foundation models in video understanding. MMVU includes 3,000 expert-annotated questions spanning 27 subjects across four core disciplines: Science, Healthcare, Humanities & Social Sciences, and Engineering. Compared to prior benchmarks, MMVU features three key advancements. First, it challenges models to apply domain-specific knowledge and perform expert-level reasoning to analyze specialized-domain videos, moving beyond the basic visual perception typically assessed in current video benchmarks. Second, each example is annotated by human experts from scratch, and we implement strict data quality controls to ensure the high quality of the dataset. Finally, each example is enriched with expert-annotated reasoning rationales and relevant domain knowledge, facilitating in-depth analysis. We conduct an extensive evaluation of 32 frontier multimodal foundation models on MMVU. The latest System-2-capable models, o1 and Gemini 2.0 Flash Thinking, achieve the highest performance among the tested models. However, they still fall short of matching human expertise. Through in-depth error analyses and case studies, we offer actionable insights for future advancements in expert-level, knowledge-intensive video understanding for specialized domains.
Discussion
Host: Hey everyone, and welcome back to the podcast! I'm your host, Leo, and I'm super excited about today's episode. We've got a really interesting topic to dive into, something that's been buzzing around in the AI research community. It's all about how well AI models can understand and reason about videos, especially when it comes to specialized knowledge.
Host: Yeah, so think beyond the usual cat videos or simple actions. We're talking about videos that require expert-level knowledge – like a chemical reaction in a lab, a surgical procedure, or even understanding a complex historical event depicted visually. It's not just about seeing what's happening, but truly understanding the 'why' and the 'how' behind it all. And to help me explore this fascinating topic, we have a special guest with us today!
Guest: Hey Leo, thanks for having me! I'm excited to be here and talk about this, it’s a really crucial area in the development of AI. It’s not enough to just have models that can recognize objects or actions. We need AI that can understand the underlying principles and concepts that are often only clear to specialists or experts in a particular field.
Host: Exactly! It's like, could an AI watch a video of an engine being assembled and not just see the parts moving, but also understand the mechanics, the torque, the purpose of each part, and so on? That’s the kind of understanding we’re getting at. And today, we’re going to be discussing a benchmark named MMVU, which aims to measure exactly this kind of expert-level video understanding.
Guest: Yeah, MMVU is a very interesting project. It really tries to push the boundaries of what we expect from AI models. The researchers who built it, they're not just focusing on surface-level understanding. They are trying to get AI to reason deeply about the content of the videos. And I think it is important, because to this day, many models can't truly grasp those specialized scenarios. They might be good at identifying objects or basic actions, but when you get to those complex domain-specific videos, it’s often a different story.
Host: Right, so it's not just about seeing what's in front of you, but understanding the implications, the scientific principles, the cultural nuances, and everything else that comes with expert knowledge. So let's get into this benchmark. From what I understand, the MMVU dataset is pretty extensive and well-designed, right? That's not something easily done.
Guest: Absolutely. The dataset includes 3,000 expert-annotated questions covering 27 subjects across four core disciplines: Science, Healthcare, Humanities & Social Sciences, and Engineering. It's not just random videos either. Each video is carefully selected to require a deep understanding of the subject matter. And the questions aren’t simple multiple choice questions you’d see in a high school textbook. They are designed to test the model's expert-level reasoning, and are validated by human experts to confirm that video comprehension is truly necessary to answer accurately. That's a pretty big deal because it weeds out a lot of the shortcuts AI can take with normal datasets.
Host: That’s a really important point. The validation aspect, I mean. It prevents models from just relying on textual cues or single frames and forces them to really analyze the video. I've seen so many benchmarks where AI can do okay just by looking at the question itself, which totally defeats the purpose of evaluating video understanding. And the disciplines you mentioned, that's quite a diverse set. It means this benchmark is pushing models to be versatile in their reasoning.
Guest: Exactly. And the annotation process is very meticulous. They didn't just throw random videos and questions together. It’s a textbook-guided process. First, the expert annotators go through textbooks in their respective fields. Then they pinpoint key concepts that are best explained visually through video, and find videos that illustrate those concepts. Finally, they create the questions. This ensures that the dataset is rooted in actual subject-matter expertise rather than being some superficial, arbitrary collection. Each question is then complemented with expert-annotated reasoning and domain-specific knowledge. That's great, because it enables researchers to understand how the model reaches its answer and where it goes wrong, not just whether it got a simple right or wrong. You see, the goal is to make the evaluation more transparent and fine-grained.
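To make that structure concrete, here is a minimal sketch of what one such annotated record might look like. The field names and the example question are illustrative assumptions for this write-up, not the actual MMVU schema.

```python
# Illustrative sketch only: field names and content are assumptions, not the
# real MMVU schema. It shows the kind of record described above: a question
# grounded in a video, plus expert-annotated reasoning and domain knowledge
# that make fine-grained error analysis possible.
example_record = {
    "video": "videos/engineering/heat_exchanger_demo.mp4",  # hypothetical path
    "discipline": "Engineering",
    "subject": "Mechanical Engineering",
    "question": "Based on the flow directions shown, which configuration "
                "does the heat exchanger in the video use?",
    "options": {"A": "Parallel flow", "B": "Counter flow", "C": "Cross flow"},
    "answer": "B",
    # Expert-annotated rationale: the step-by-step reasoning a specialist
    # would follow, used to diagnose where a model's reasoning breaks down.
    "rationale": [
        "Identify the inlet and outlet of each fluid stream in the video.",
        "Observe that the two streams move in opposite directions.",
        "Opposite flow directions define a counter-flow configuration.",
    ],
    # Relevant domain knowledge the question depends on.
    "domain_knowledge": [
        "Counter-flow heat exchangers route the two fluids in opposite "
        "directions, which yields a higher mean temperature difference."
    ],
}

print(example_record["question"],
      example_record["options"][example_record["answer"]])
```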
Host: That's incredible! So, the experts are not just providing answers; they're providing the step-by-step reasoning and the knowledge needed to arrive at the correct answer. That level of detail is essential for truly evaluating where these models are lacking. When you just have the ground truth answer, you don't really know what the model is struggling with, which limits progress in this field. And it seems like they really thought about data quality. In many data creation processes, annotators are paid per example completed, which can lead to rushed or compromised datasets. But for this one, they took a different approach.
Guest: Yes, that’s right. They compensated annotators based on the time they spent on the annotation task, rather than the number of questions they completed. This way, it encourages them to dedicate their time to ensuring the quality of each example. The researchers understood that creating these examples is inherently time-consuming, especially when finding Creative Commons licensed videos that fit the criteria. So they wanted to incentivize quality over speed, which is quite sensible. They also had experts validate each example. This step is vital for ensuring the questions are accurate and that visual comprehension is indeed needed to get the correct answer. That is, if an example could be answered with just text or a single frame, they either revised it or removed it entirely.
Host: It sounds like they were really thorough about ensuring that the dataset is top-notch. So, with this benchmark, they have 3,000 questions spread across those four main disciplines, right? And they've evaluated a number of models on it. What were some of the key findings from these experiments? Did any model particularly impress, or, on the flip side, disappoint?
Guest: Well, when they tested 32 different models, including some very recent and high-profile ones, they found that even the best ones fell short of human expert performance. For example, the latest System-2-capable models like o1 and Gemini 2.0 Flash Thinking did perform better than the rest, but they still couldn’t match the accuracy of a human expert. The gap with human expertise was quite substantial; it’s not as if AI is closing it just yet. For instance, GPT-4o achieved a score of 66.7% on the benchmark, while human experts in an open-book setting scored about 86.8%. That’s a big difference! It really shows how challenging it is for models to actually understand videos and reason with expert knowledge simultaneously.
Host: That's quite a gap, actually. I mean, we often see models doing really well on standard datasets, sometimes even surpassing human-level performance, but with MMVU, it's pretty clear that the field has still got a long way to go, especially when it comes to combining visual understanding with specialized knowledge. Now, you mentioned the 'System-2' models. Can you elaborate on that? Because it's not something that I hear often in general AI discussions.
Guest: Sure, ‘System-2’ refers to the type of thinking that’s more deliberate and analytical. Unlike ‘System-1,’ which is fast, intuitive, and mostly unconscious, System-2 is slow, logical, and requires conscious effort. Models that utilize System-2 thinking often employ a chain-of-thought (CoT) approach, where they try to reason step-by-step before producing a final answer. In this study, the models that employed longer CoT reasoning and more ‘thinking’ time showed higher performance than models that simply generate the final answer without a reasoning process. It highlights that this ‘thinking’ style is quite effective. You see, when tackling complex problems, you can't just jump to the answer. You really need that intermediate reasoning, and this is what System-2 and CoT try to emulate.
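As a rough illustration of the two prompting styles contrasted here, below is a minimal sketch of a direct-answer prompt versus a CoT prompt for a multiple-choice video question. The prompt wording and function names are assumptions for this write-up, not MMVU's actual evaluation code.

```python
# Minimal sketch of direct-answer vs. chain-of-thought (CoT) prompting for a
# multiple-choice video question. In a real evaluation, these prompts would be
# sent to a model together with the video frames.

def build_direct_prompt(question: str, options: dict[str, str]) -> str:
    """Ask for the answer letter only, with no intermediate reasoning."""
    opts = "\n".join(f"({k}) {v}" for k, v in options.items())
    return (
        f"Question: {question}\n{opts}\n"
        "Answer with the option letter only."
    )

def build_cot_prompt(question: str, options: dict[str, str]) -> str:
    """Ask the model to reason step by step before committing to an answer."""
    opts = "\n".join(f"({k}) {v}" for k, v in options.items())
    return (
        f"Question: {question}\n{opts}\n"
        "Think step by step: describe the relevant visual evidence from the "
        "video, state the domain knowledge it involves, then conclude with "
        "'Answer: <letter>'."
    )

if __name__ == "__main__":
    question = "Which configuration does the heat exchanger in the video use?"
    options = {"A": "Parallel flow", "B": "Counter flow", "C": "Cross flow"}
    print(build_direct_prompt(question, options))
    print("---")
    print(build_cot_prompt(question, options))
```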
Host: That makes a lot of sense. It’s like showing your work in math class, you have to go through the thought process, you don’t just jump to the answer. Now, from the data, it seems like this CoT method, it's not always a magical upgrade for every model. Some improve a lot, some improve very little, right?
Guest: Exactly. Some models, like Claude 3.5 Sonnet, saw a massive improvement when using CoT reasoning, with gains of around 11% in accuracy. That’s quite substantial. However, models like GPT-4o only had a marginal improvement. This suggests that CoT is not universally beneficial, that is, not every model can take advantage of it equally well. Some models seem to have a better built-in ability to reason through these steps, while others still struggle, even with the step-by-step approach. It really brings home the fact that there’s more to it than simply telling the AI to think step-by-step. The architecture and pre-training of the model also play a vital role.
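To pin down what a "CoT gain" means here, a tiny sketch: it is simply the difference between a model's accuracy with step-by-step prompting and its accuracy when answering directly. The model names and figures below are placeholders for illustration, not results from the paper.

```python
# "CoT gain" = accuracy with chain-of-thought prompting minus accuracy with
# direct answering. Placeholder numbers only, used to illustrate the idea.
results = {
    # model_name: (direct_answer_accuracy, cot_accuracy), in percent
    "model_a": (55.0, 66.0),   # large gain from CoT
    "model_b": (64.0, 65.5),   # only a marginal gain
}

for model, (direct_acc, cot_acc) in results.items():
    print(f"{model}: CoT gain = {cot_acc - direct_acc:+.1f} points")
```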
Host: That’s a very crucial observation, actually. It really highlights that there isn’t a one-size-fits-all solution in AI. The effectiveness of techniques like CoT depends heavily on the underlying capabilities of the model itself. So, when it comes to those open-source models, how did they perform? Were there any that were close to matching the performance of the big proprietary models, or is there still a significant gap in that space?
Guest: The open-source models, while they’re definitely improving, generally lagged behind the proprietary models. However, there are a couple of standout examples. Qwen2-VL-72B and DeepSeek-VL2 showed quite promising results. They performed at levels that surpassed human performance in closed-book settings. Which is fantastic considering the limited resources that open-source projects usually have. They were also getting quite close to the performance of some of the leading proprietary models. So, there is progress in the open-source space, and it’s important to note that.
Host: That’s encouraging to hear. It’s really important that high-quality models are available to the research community. It allows more people to explore and contribute to the field. Now, when the researchers analyzed the errors that models were making, what sort of problems were they encountering? Was it mostly a lack of domain knowledge, or were there other recurring issues?
Guest: The error analysis was quite detailed and very revealing. They looked at the common mistakes from the top-performing models and found six main types of errors, and they were all telling. About 18% of errors were related to visual perception, where the model simply failed to accurately interpret the visual information in the video. It might misread the spatial or temporal aspects, or even hallucinate objects or events that weren’t there. Another 20% of errors came from missing or misapplied domain knowledge during visual perception of the video. For instance, a model might identify an object but fail to recognize the technical term for it. A further 27% of errors involved misuse of domain knowledge during reasoning, where models failed to apply the right equations or concepts. So it’s clear that the issue is not just the visual input or the language processing, but also how the model connects that input with specialized knowledge.
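For anyone who wants to produce this kind of breakdown on their own model outputs, here is a minimal sketch that tallies human-assigned error labels into percentages. The label strings are illustrative and only loosely mirror the categories discussed above; the labeling itself still has to be done by human annotators.

```python
from collections import Counter

# Minimal sketch: aggregate per-example error labels (assigned by human
# annotators) into a percentage breakdown like the one discussed above.
# The label strings are illustrative, not an official taxonomy.
error_labels = [
    "visual_perception", "domain_knowledge_in_perception",
    "domain_knowledge_in_reasoning", "over_reliance_on_text",
    "logical_reasoning", "visual_perception",
    # ...one label per incorrectly answered example...
]

counts = Counter(error_labels)
total = sum(counts.values())
for category, n in counts.most_common():
    print(f"{category:35s} {100 * n / total:5.1f}%  ({n} errors)")
```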
Host: So, it's a mix of visual perception issues and problems with integrating domain knowledge into that perception, both during the initial interpretation of the visual input and later during reasoning. That really illustrates how this task goes beyond just simple vision. I’m curious though, were there models that relied too heavily on the text itself, rather than looking at the video?
Guest: Yes, that was another big one. About 20% of the errors were due to the models relying too much on the textual information provided in the questions. They might focus on specific words in the multiple-choice options, without actually considering the video content at all. This shows a gap in multimodal reasoning, where they're not effectively combining the video and the text. It’s almost like they’re trying to solve a text-based problem when they’re supposed to be solving a video-based problem. There were also a number of logical reasoning errors, where models would contradict themselves in their reasoning process. These errors just show that we are still far from having models that truly understand the world like humans do. This benchmark is definitely showing us where the models are struggling and where research should focus.
Host: Yeah, it really paints a comprehensive picture of the challenges involved here. It’s not just about building better vision models or better language models but creating models that can actually integrate them together to understand and reason with video data. And the fact that all of these mistakes are happening, it’s really indicative of the limitations we’re dealing with. Now, before we wrap up, do you have any final thoughts on this benchmark, and how it can help guide future research in the area?
Guest: I think that MMVU is a very important contribution to the field. It clearly exposes the limits of current AI models in dealing with expert-level, knowledge-intensive video understanding. It is a very challenging task that requires a high-quality dataset, so the researchers put a lot of effort into the annotation and evaluation processes. The fact that it includes detailed reasoning and knowledge annotations also sets it apart from other video datasets, which allows us to really understand why the models fail, not just that they fail. By showing us these limitations and providing us with the necessary data, it really helps us figure out what the next steps need to be, whether it's better visual perception capabilities, stronger knowledge integration, or more efficient reasoning algorithms. This will surely help us push the boundaries of what AI can achieve with video understanding, especially in specialized and expert-level tasks.