The Stochastic Parrot on LLM's Shoulder: A Summative Assessment of Physical Concept Understanding
We systematically investigate a widely asked question: do LLMs really understand what they say? This relates to the more familiar term Stochastic Parrot. To this end, we propose a summative assessment over a carefully designed physical concept understanding task, PhysiCo. Our task alleviates the memorization issue by using grid-format inputs that abstractly describe physical phenomena. The grids represent varying levels of understanding, from the core phenomenon, through application examples, to analogies with other abstract patterns in the grid world. A comprehensive study on our task demonstrates: (1) state-of-the-art LLMs, including GPT-4o, o1, and Gemini 2.0 Flash Thinking, lag behind humans by ~40%; (2) the stochastic parrot phenomenon is present in LLMs: they fail on our grid task yet can describe and recognize the same concepts well in natural language; (3) our task challenges LLMs due to its intrinsic difficulty rather than the unfamiliar grid format, as in-context learning and fine-tuning on data in the same format add little to their performance.
Discussion
Host: Hello everyone, and welcome back to the podcast! Today, we're diving into the fascinating world of Large Language Models, or LLMs, and trying to answer a pretty fundamental question: do these things really understand what they're saying, or are they just really good at mimicking?
Guest: That's the million-dollar question, isn't it? We've seen LLMs achieve incredible feats, even surpassing humans on some tasks. It's natural to wonder if there's actual understanding happening or if it's all just clever pattern recognition.
Host: Exactly! And that's where our guest's research comes in. They've been working on a project called 'PhysiCo' to try and get to the bottom of this 'Stochastic Parrot' phenomenon. Welcome to the show! Can you give us a bit of background on the 'Stochastic Parrot' idea for listeners who might not be familiar with it?
Guest: Sure thing! The 'Stochastic Parrot' argument, famously put forth by Bender and colleagues, suggests that LLMs, despite their impressive abilities, might simply be repeating words based on correlations without possessing true understanding or meaning. It's like a parrot mimicking human speech: it can produce the sounds, but it doesn't necessarily grasp the concepts behind them. It's also important to note that the stochastic parrot argument is not merely about an LLM's lack of consciousness or subjective experience; it questions the LLM's ability to genuinely process information, reason, and make inferences based on understanding. True understanding entails that the LLM can connect concepts, apply them in new contexts, and recognize the implications of its statements, abilities that the stochastic parrot argument suggests are absent.
Host: So, it's not just about spitting out text that sounds good; it's about whether the model actually grasps the underlying concepts. And until now, there hasn't been a really solid way to prove whether this is happening or not, right? It's been more of a theoretical debate.
Guest: That's right. There's been plenty of discussion, but a lack of concrete experiments providing paired evidence of understanding versus lack thereof. Many studies show LLMs failing at challenging tasks, but that doesn't necessarily prove they understand the underlying concepts when they succeed elsewhere. That's the gap our research aims to fill, by building on the concept of 'summative assessment'.
Host: Summative assessment? That sounds like something straight out of a classroom! How does that apply to evaluating LLMs?
Guest: Precisely! Summative assessment is used to measure students' understanding and knowledge acquisition after a period of learning. Think of it as a final exam. We've adapted this approach to assess how well LLMs grasp concepts. To evaluate whether an LLM truly understands the concept 'Gravity', you would design a series of questions that probe comprehension, covering properties such as the inverse-square law and examples such as orbital motion. If a student struggles to answer many of these questions, the teacher may conclude that the student has a poor grasp of the concept. Similarly, we design different tasks that test varying levels of understanding of a specific concept, and if the LLM struggles in certain aspects but excels in others, we can infer its level of understanding. We design the various tasks based on Bloom's taxonomy, which covers remembering, understanding, applying, analyzing, evaluating, and creating.
Host: Okay, so instead of just asking a general question about gravity, you're designing a whole suite of tests that probe different aspects of understanding. It sounds comprehensive. How did you then translate this into something an LLM can work with?
Guest: That's where PhysiCo comes in. It's essentially a physical concept understanding task focused on 52 common high school physics concepts, things like gravity, light reflection, acceleration, buoyancy, and inertia. For each concept, we created both 'low-level understanding' and 'high-level understanding' subtasks.
Host: Ah, so those are the two levels you use to measure understanding. What does 'low-level' entail in this context?
Guest: Low-level understanding focuses on the ability to recall and rephrase knowledge. We have tasks like 'Physical Concept Selection,' where we give the LLM a Wikipedia definition of a concept with key terms masked out and ask it to identify the concept from a multiple-choice list. We also tested the LLM's ability to recognize physical concepts presented as real-life pictures. Finally, we had the LLMs generate descriptions of the concept based on core properties and representative examples. So in essence, testing the LLM's memory and ability to express those memorized facts.
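To make the 'Physical Concept Selection' task concrete, here is a minimal sketch of how such a masked-definition, four-choice prompt could be constructed. The toy definition, the masking scheme (only the concept name itself is masked here, for simplicity), the helper name, and the distractor list are all illustrative assumptions, not the paper's actual data or code.

```python
# Hypothetical sketch of a masked-definition, four-choice selection prompt.
import random

def build_selection_prompt(definition: str, concept: str, distractors: list[str]) -> tuple[str, str]:
    """Mask the concept name in its definition and wrap it in a 4-way multiple-choice question."""
    masked = definition.replace(concept, "[MASK]")
    options = distractors[:3] + [concept]
    random.shuffle(options)
    letters = ["A", "B", "C", "D"]
    choice_lines = [f"{letter}. {opt}" for letter, opt in zip(letters, options)]
    prompt = (
        "Which physical concept does the following definition describe?\n\n"
        f"{masked}\n\n" + "\n".join(choice_lines) + "\n\nAnswer with a single letter."
    )
    gold = letters[options.index(concept)]
    return prompt, gold

# Example usage with a toy definition:
prompt, gold = build_selection_prompt(
    definition="Gravity is the force by which a planet or other body draws objects toward its center.",
    concept="Gravity",
    distractors=["Buoyancy", "Inertia", "Friction"],
)
print(prompt, "\nGold answer:", gold)
```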
Host: So basically, testing whether the LLM has memorized the textbook definition. Makes sense. And what about the high-level subtasks? How do you test for understanding beyond memorization?
Guest: This is where it gets interesting. We needed to design tasks that required a deeper understanding of the concepts and avoided simply relying on memorized information. We drew inspiration from the Abstraction and Reasoning Corpus, or ARC, which uses grids – matrices of colored squares – to represent concepts in an abstract way.
Host: Grids? So, instead of text, you're using visual patterns to represent physical concepts? That sounds like a pretty clever way to bypass the LLM's language skills and tap into something else.
Guest: Exactly. The LLM is less likely to have seen grids directly associated with specific physical concepts during its training. So we designed two sets of high-level subtasks using these grids. The first, 'PhysiCo-Core,' focused on representing the core properties and examples of each concept in an abstract grid format. Five annotators created pairs of input and output grids that illustrated the transformations related to the concept. For example, for gravity, we might have a grid showing an object moving downwards. The key is that it's an abstract visualization, not a literal picture.
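As a toy illustration of what an abstract grid pair might look like, the snippet below stores one hypothetical PhysiCo-Core item for 'gravity': an input/output grid pair whose transformation depicts an object falling to the bottom row, plus candidate labels. The grids, color codes, and field names are made up for illustration and are not drawn from the actual dataset.

```python
# Hypothetical PhysiCo-Core item: a grid transformation abstractly depicting "gravity".
gravity_item = {
    "concept": "gravity",
    "input_grid": [
        [0, 2, 0],   # 2 = a colored object near the top
        [0, 0, 0],
        [0, 0, 0],
    ],
    "output_grid": [
        [0, 0, 0],
        [0, 0, 0],
        [0, 2, 0],   # the object has "fallen" to the bottom row
    ],
    "options": ["gravity", "light reflection", "buoyancy", "magnetism"],
}
```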
Host: I can see how that would force the LLM to actually think about the concept rather than just regurgitating a definition. What was the other set of high-level tasks?
Guest: The second set is called 'PhysiCo-Associative.' This is where we took existing grid patterns from the ARC dataset and asked annotators to identify which physical concepts they associated with those patterns. This task is designed to be more subjective and challenging because the grids might contain distracting information, and the association with a specific physical concept might not be immediately obvious.
Host: So, it's like asking the LLM to make a more abstract connection between a visual pattern and a physical concept. It sounds like you're really pushing the models to go beyond rote memorization. How did you turn these grid representations into something the LLMs could actually process?
Guest: For text-based LLMs, we represented the grids as matrices, encoding the colors as numbers and presenting them as a token sequence in the prompt. For multi-modal LLMs, we could directly input the grids as visual images. We then presented the LLMs with three examples of these input-output grid pairs and asked them to choose the correct physical concept from a list of four options, turning it into a four-choice classification task.
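The sketch below shows one plausible way to serialize grid pairs into a text prompt for a text-only LLM, following the setup just described: colors encoded as numbers, three input/output example pairs, and a four-way choice. The exact prompt wording and helper names are assumptions, not the experiments' actual code.

```python
# Hedged sketch: turning grid pairs into a text prompt for a text-only LLM.

def grid_to_text(grid: list[list[int]]) -> str:
    """Serialize a grid as rows of space-separated color indices."""
    return "\n".join(" ".join(str(cell) for cell in row) for row in grid)

def build_classification_prompt(examples: list[tuple[list[list[int]], list[list[int]]]],
                                options: list[str]) -> str:
    """Show three input->output grid pairs, then ask for the concept as a 4-way choice."""
    parts = ["Each example below transforms an input grid into an output grid.",
             "All examples illustrate the same physical concept.\n"]
    for i, (inp, out) in enumerate(examples, start=1):
        parts.append(f"Example {i} input:\n{grid_to_text(inp)}")
        parts.append(f"Example {i} output:\n{grid_to_text(out)}\n")
    letters = ["A", "B", "C", "D"]
    parts.append("Which physical concept do these transformations illustrate?")
    parts.extend(f"{letter}. {opt}" for letter, opt in zip(letters, options))
    parts.append("Answer with a single letter.")
    return "\n".join(parts)
```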
Host: Okay, so you've got your task, you've got your data… what models did you run these PhysiCo tasks on?
Guest: We tested a range of models, both commercial and open-source. On the commercial side, we used GPT-3.5, GPT-4, GPT-4o, and Gemini 2.0. For open-source, we included Llama-3, Mistral, InternVL-Chat, and LLaVA-NeXT. This gave us a good cross-section of different architectures and capabilities to see how they performed on our benchmark.
Host: That’s quite a comprehensive lineup! Before we dive into the head-to-head matchups, let's focus on the initial question: did the LLMs perform well on the low-level tasks? Were they able to recall definitions and identify concepts when presented in natural language?
Guest: This is where the 'stochastic parrot' narrative starts to take shape. The short answer is: yes, they performed remarkably well. On the 'Physical Concept Selection' task, GPT models, both text-based and vision-based, achieved near-perfect accuracy, above 95%, in recognizing concepts from Wikipedia definitions and real-life images. Open-source models, like Mistral and Llama-3, also did reasonably well, though not quite as flawlessly as the closed-source models. We suspect their training is less extensive, which leads to lower performance.
Host: So, at least when it comes to recalling and identifying concepts based on their textbook definitions, these LLMs seem to have a pretty solid grasp. What about the 'Physical Concept Generation' task? Could they accurately describe the concepts in their own words?
Guest: Yes, in general the descriptions of the concepts are satisfactory, though measuring text generation is actually pretty difficult. We had human annotators evaluate the generated descriptions for factual accuracy, and the results were impressive: GPT-3.5 and GPT-4 generated descriptions with no factual errors, and Mistral had only minor issues. We also developed a self-play evaluation metric, where the LLM tries to identify the concept from its own generated description. The LLMs could accurately recognize the concepts from the descriptions they themselves had written, further confirming the accuracy and sufficiency of their knowledge.
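A minimal sketch of the self-play idea just described: the model first writes a description of a concept, then is asked to pick the concept back out of that description. The `ask_llm` stub and the prompt wording are placeholders, not a specific vendor API or the paper's actual implementation.

```python
# Hypothetical self-play evaluation loop: describe a concept, then recognize it.

def ask_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")  # placeholder stub

def self_play_correct(concept: str, options: list[str]) -> bool:
    """Return True if the model recognizes the concept from its own description."""
    description = ask_llm(
        f"Describe the physical concept '{concept}' in 3-4 sentences, covering its core "
        "properties and one representative example, without naming the concept itself."
    )
    letters = ["A", "B", "C", "D"]
    choice_block = "\n".join(f"{letter}. {opt}" for letter, opt in zip(letters, options))
    answer = ask_llm(
        "Which physical concept does this description refer to?\n\n"
        f"{description}\n\n{choice_block}\n\nAnswer with a single letter."
    )
    return answer.strip().upper().startswith(letters[options.index(concept)])
```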
Host: Okay, so the LLMs are acing the low-level understanding tasks. They can recall definitions, identify concepts from descriptions and images, and even generate accurate descriptions themselves. It sounds like they have a pretty good handle on the textbook knowledge of these physical concepts. But that's where the easy part ends, right? What happened when you threw the high-level, grid-based tasks at them?
Guest: That's where the wheels start to come off, unfortunately. The performance on the high-level tasks was… significantly lower. It really highlighted the difference between memorizing information and actually understanding it. First, we verified whether the task is actually easy for humans. For each instance in PhysiCo, we asked three independent annotators who were not involved in the task design to perform the same classification task presented to the LLMs. The results indicate that our tasks are largely solvable by people with a college-level education.
Host: So the questions themselves are valid. What happened when the machines tried to answer them? I suppose there's a significant gap between the human annotators, whose performance indicates good understanding, and the machines?
Guest: Exactly! On the PhysiCo-Core tasks, humans achieved an accuracy higher than 90%. The PhysiCo-Associative tasks present greater challenge and subjectivity, since the annotations reflect the annotators' individual perspectives and experiences, but humans still achieve a notable average accuracy of 77.8% on these tasks. In contrast, even for the otherwise remarkable GPT-4, GPT-4o, and GPT-4V, performance is far from decent, and in particular there is a huge gap between them and humans. Even worse, GPT-3.5 and Llama-3 failed to show significant improvement over random performance.
Host: Wow, that's a pretty stark contrast. So, the LLMs that were acing the low-level tasks are now struggling to perform much better than random chance when faced with abstract visual representations of the same concepts. Can you explain the details of such differences to me?
Guest: For the high-level tasks in PhysiCo, we found GPT-3.5, Mistral, and Llama-3 performing near random chance, while even the best LLMs fall significantly below the understanding level of human annotators. This really emphasizes the presence of the stochastic parrot phenomenon in LLMs. Basically, while these LLMs are seemingly familiar with the definitions of those physics terms and have memorized the related knowledge, they do not actually have a good understanding of the core properties and meanings of those terms. This becomes most apparent when the definitions and common knowledge are decoupled from the grid-based problems, which the LLMs fail to solve or generalize to.