Critic-V: VLM Critics Help Catch VLM Errors in Multimodal Reasoning
Vision-language models (VLMs) have shown remarkable advancements in multimodal reasoning tasks. However, they still often generate inaccurate or irrelevant responses due to issues like hallucinated image understanding or unrefined reasoning paths. To address these challenges, we introduce Critic-V, a novel framework inspired by the Actor-Critic paradigm to boost the reasoning capability of VLMs. This framework decouples the reasoning process from the critique process by integrating two independent components: the Reasoner, which generates reasoning paths based on visual and textual inputs, and the Critic, which provides constructive critiques to refine these paths. In this approach, the Reasoner generates reasoning responses according to text prompts, which evolve iteratively as a policy based on feedback from the Critic. The interaction is theoretically grounded in a reinforcement learning framework in which the Critic offers natural-language critiques instead of scalar rewards, enabling more nuanced feedback that strengthens the Reasoner on complex reasoning tasks. The Critic model is trained with Direct Preference Optimization (DPO), leveraging a preference dataset of critiques ranked by a Rule-based Reward (RBR) to enhance its critique capabilities. Evaluation results show that the Critic-V framework significantly outperforms existing methods, including GPT-4V, on 5 out of 8 benchmarks, especially in terms of reasoning accuracy and efficiency. Combining a dynamic text-based policy for the Reasoner with constructive feedback from the preference-optimized Critic enables a more reliable and context-sensitive multimodal reasoning process. Our approach offers a promising way to enhance the reliability of VLMs, improving their performance in reasoning-heavy real-world multimodal applications such as autonomous driving and embodied intelligence.
Discussion
Host: Hey everyone, and welcome back to another episode of the podcast! Today, we're diving into the fascinating world of Vision-Language Models, or VLMs for short. It's a field that's exploding with innovation, but also facing some significant challenges. We've got a fantastic guest today to unpack all of this for us – Dr. Dongzhan Zhou. Dr. Zhou, welcome to the show!
Guest: Thanks for having me, Leo! Excited to be here.
Host: So, Dr. Zhou, your recent work on Critic-V is generating quite a buzz. Could you give our listeners a quick overview of what Critic-V is all about?
Guest: Sure. Critic-V is a new framework we developed to improve the reasoning abilities of VLMs. These models are amazing at understanding images and text together, but they can sometimes make mistakes – hallucinating details, or just not making logical connections. Critic-V works by using a two-part system: a 'Reasoner' and a 'Critic'. The Reasoner is the main VLM, trying to answer questions based on images and text prompts. The Critic then steps in and provides feedback, not just a simple right/wrong, but actual natural language critiques. It’s like having a smart editor looking over the Reasoner's shoulder.
Host: That's a really interesting approach. The 'actor-critic' method has been around in reinforcement learning for a while, but applying it to VLMs with natural language critiques seems novel. Can you elaborate on the methodology? How does this feedback loop work practically?
Guest: Absolutely. The Reasoner starts by generating a response based on a prompt. Think of this prompt as the instructions. Then, the Critic evaluates the response and provides a critique. This critique isn’t a simple score, but actual descriptive feedback: 'You missed a key detail in the image', or 'Your reasoning here isn't entirely logical'. The Reasoner then refines its response based on this critique, incorporating it into an updated prompt. This iterative process, inspired by reinforcement learning, enables the Reasoner to continuously learn and improve its responses. We're essentially guiding the VLM's reasoning process with more sophisticated feedback than traditional methods allow.
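To make that loop concrete, here is a minimal Python sketch of how such a critique-and-refine cycle could be wired up. The helper names (`reasoner_generate`, `critic_evaluate`), the prompt template, and the stopping condition are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the Reasoner-Critic feedback loop (illustrative only).
# `reasoner_generate` and `critic_evaluate` stand in for calls to the
# underlying VLMs; their names and signatures are assumptions.

def critic_v_loop(image, question, max_rounds=3):
    prompt = question  # the initial instruction acts as the Reasoner's "policy"
    answer = None
    for _ in range(max_rounds):
        # Reasoner: produce an answer from the image and the current prompt.
        answer = reasoner_generate(image, prompt)

        # Critic: return a natural-language critique rather than a scalar score.
        critique = critic_evaluate(image, question, answer)
        if critique is None:  # no issues found, stop refining
            break

        # Fold the critique back into the prompt so the next round can address it.
        prompt = (
            f"{question}\n"
            f"Previous answer: {answer}\n"
            f"Critique of that answer: {critique}\n"
            "Revise the answer to address the critique."
        )
    return answer
```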
Host: So, how did you train this Critic model? What kind of data did you use, and how did you evaluate its effectiveness?
Guest: Training the Critic was a key challenge. We used a technique called Direct Preference Optimization, or DPO. This means we didn't just give it right/wrong answers, but instead pairs of critiques – one better than the other – and trained it to rank them. To generate this data, we employed what we call the 'Vision Error Insertion Technique', or VEST. We used GPT-4 to introduce errors into correct answers and then had several VLMs generate critiques identifying these errors. We used a Rule-based Reward system and the Jaccard index to score the quality of these critiques and create preferences. The result was our critique-VQA dataset, a large collection of question-answer pairs along with the ranked critiques – quite extensive, 29,012 multimodal question-answer pairs to be exact.
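As a rough illustration of how critiques might be scored and paired up for DPO, here is a small Python sketch that ranks candidate critiques by a token-level Jaccard index against the known inserted errors and emits (chosen, rejected) pairs. The data layout and helper names are assumptions for illustration; it isolates only the Jaccard-overlap component mentioned above, not the full rule-based reward pipeline.

```python
# Illustrative sketch: score candidate critiques with a Jaccard index against
# the inserted errors, then keep (chosen, rejected) pairs for DPO training.
from itertools import combinations

def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def build_preference_pairs(question, inserted_errors, candidate_critiques):
    """Rank critiques by overlap with the inserted errors and emit DPO pairs."""
    reference = " ".join(inserted_errors)
    scored = sorted(
        ((jaccard(c, reference), c) for c in candidate_critiques),
        reverse=True,
    )
    pairs = []
    # `combinations` on the descending list always yields (higher, lower) pairs;
    # keep only those with a strictly better "chosen" critique.
    for (s_hi, chosen), (s_lo, rejected) in combinations(scored, 2):
        if s_hi > s_lo:
            pairs.append({"prompt": question, "chosen": chosen, "rejected": rejected})
    return pairs
```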
Host: Wow, 29,012 pairs! That's a substantial dataset. And what were the results? How did Critic-V perform against other state-of-the-art VLMs?
Guest: We tested Critic-V with several models, and the results were impressive. Across numerous benchmarks, including RealWorldQA, MMBench, and several focused on mathematical reasoning like MathVista and MathVerse, Critic-V consistently improved the performance of the VLMs it was paired with, such as Qwen2-VL-7B and DeepSeek-VL-7B, which significantly outperformed their baseline versions in most cases. In fact, we saw improvements of up to 17.8% on MathVista. Even against strong closed-source models like GPT-4V, Critic-V often yielded better results. This shows that our approach isn't just about tweaking existing models; it fundamentally improves their reasoning process through external feedback.
Host: That’s incredibly compelling. It sounds like the token consumption for the Critic is relatively low, so this is not adding a huge computational burden, right?
Guest: Correct. We found that the additional token consumption for the Critic's feedback is surprisingly low – only a few dozen tokens per critique, on average. This means Critic-V can significantly improve performance without adding a large computational overhead. We've detailed this in the appendix, along with visualizations of the training process and examples from our critique-VQA dataset. We also ran ablation studies to confirm that the improvements aren't just from our specially designed prompts, but are indeed due to the Critic-V framework.
Host: This is truly groundbreaking work, Dr. Zhou. It seems to address a critical limitation of current VLMs, and offers a practical and scalable solution. Let's delve a little deeper into the specific details of your…
Guest: ...