Gold-medalist Performance in Solving Olympiad Geometry with AlphaGeometry2
We present AlphaGeometry2, a significantly improved version of AlphaGeometry introduced in Trinh et al. (2024), which has now surpassed an average gold medalist in solving Olympiad geometry problems. To achieve this, we first extend the original AlphaGeometry language to tackle harder problems involving movements of objects, and problems containing linear equations of angles, ratios, and distances. This, together with other additions, has markedly improved the coverage rate of the AlphaGeometry language on International Math Olympiads (IMO) 2000-2024 geometry problems from 66% to 88%. The search process of AlphaGeometry2 has also been greatly improved through the use of Gemini architecture for better language modeling, and a novel knowledge-sharing mechanism that combines multiple search trees. Together with further enhancements to the symbolic engine and synthetic data generation, we have significantly boosted the overall solving rate of AlphaGeometry2 to 84% for all geometry problems over the last 25 years, compared to 54% previously. AlphaGeometry2 was also part of the system that achieved silver-medal standard at IMO 2024 https://dpmd.ai/imo-silver. Last but not least, we report progress towards using AlphaGeometry2 as a part of a fully automated system that reliably solves geometry problems directly from natural language input.
Discussion
Host: Hello everyone, and welcome back to the podcast! I'm your host, Leo, and I'm super excited about today's episode. We're diving into the fascinating world of AI and its ability to tackle incredibly complex mathematical problems. Specifically, we'll be unpacking a groundbreaking piece of research from Google DeepMind about a new system called AlphaGeometry2, which is making waves in the field of automated theorem proving.
Guest: Hey Leo, thanks for having me! Yeah, AlphaGeometry2 is seriously impressive. It's basically an AI that can solve Olympiad-level geometry problems, and not just solve them, but solve them at a gold medalist level. That's a huge leap forward.
Host: Absolutely! It's mind-blowing. So, for our listeners who might not be familiar, the International Mathematical Olympiad, or IMO, is a super prestigious competition for high school students. These problems are notoriously difficult, requiring deep understanding and creative application of mathematical concepts. And geometry, being one of the core areas, is often seen as a good benchmark for testing AI reasoning abilities. Before AlphaGeometry, these problems were seen as largely insurmountable.
Guest: Exactly. The difficulty comes from needing both a solid grasp of geometric principles and the ability to come up with non-obvious solutions. It's not just about plugging in formulas; it's about clever deductions, insightful constructions, and, honestly, a bit of creative problem-solving. That's what makes it so challenging for AI.
Host: Right, so this isn't just about brute-force computation. It's about mimicking, and even surpassing, human-level intuition and ingenuity. Now, the team at DeepMind has built on their previous work with AlphaGeometry1 to create AlphaGeometry2. What exactly are the key improvements that allow this new version to achieve such impressive results?
Guest: Well, the paper highlights several crucial areas. Firstly, they expanded the domain language of AlphaGeometry to handle a broader range of problems. Basically, they taught it more ways to describe geometric concepts and relationships.
Host: Okay, so they gave it a richer vocabulary. What does that vocabulary consist of? What are the 'words' in this geometry language?
Guest: In the original AlphaGeometry, there were only nine basic 'predicates' -- like 'cong' for congruence, 'perp' for perpendicular, 'para' for parallel, 'coll' for collinear, and so on. These predicates were enough to cover about 66% of the IMO geometry problems from 2000 to 2024. However, the original language couldn't deal with linear equations involving angles, ratios, or distances, or with problems requiring movements of points or lines. It also couldn't solve problems that simply ask you to "Find the angle...". AlphaGeometry2 adds predicates to solve for angles and ratios, as well as predicates to express linear equations.
Host: So the upgrade was necessary to deal with more advanced problems?
Guest: Exactly! For instance, to handle linear equations, they introduced predicates like 'distmeq', 'distseq', and 'angeq'. These let the system express relationships like 't1 * log(A1B1) + t2 * log(A2B2) + ... + y = 0', which is crucial for problems involving complex angle or distance relationships. AG2 also addresses 'locus problems' -- problems about the movement of points, lines, or circles -- with a new predicate syntax that includes a '*' token serving as a fixed-point placeholder. So it can now reason about how the position of one element affects others as it moves along a defined path.
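To make the log-linear relationship above concrete, here is a minimal numeric sketch of how a 'distseq'-style relation could be checked against a diagram. The helper name and its numeric formulation are illustrative assumptions; the real AG2 predicate is symbolic, not a floating-point check.

```python
import math

def distseq_holds(coeffs, segments, y, tol=1e-9):
    """Check a log-distance linear relation:
    t1 * log|A1B1| + t2 * log|A2B2| + ... + y == 0.

    coeffs   -- the coefficients t_i
    segments -- ((ax, ay), (bx, by)) endpoint pairs for each A_iB_i
    y        -- the constant term
    (Hypothetical helper; AG2's actual predicate is symbolic.)
    """
    total = y
    for t, (a, b) in zip(coeffs, segments):
        length = math.hypot(b[0] - a[0], b[1] - a[1])
        total += t * math.log(length)
    return abs(total) < tol

# Example: log|AB| - log|CD| = 0 is equivalent to AB = CD,
# so plain congruence falls out as a special case.
ab = ((0.0, 0.0), (3.0, 4.0))   # length 5
cd = ((1.0, 1.0), (4.0, 5.0))   # length 5
print(distseq_holds([1.0, -1.0], [ab, cd], 0.0))  # True
```

The same form with different coefficients covers ratio constraints like AB = 2 * CD, since that is just log|AB| - log|CD| - log(2) = 0.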
Host: That makes a lot of sense. These problems can be hard for humans, as well. So by introducing these additional tokens, the AI can describe the problem more fully. What else was added?
Guest: They've added more explicit predicates to represent diagram checks -- so-called topological conditions for non-degeneracy. Things like 'sameclock', to check the clockwise direction between points, 'noverlap', to ensure points are distinct, and 'lessthan', for comparisons of segment lengths. These checks are important for avoiding invalid deductions in degenerate cases. Consider a scenario where you mistakenly assume that the intersection point of two lines exists, but the lines are actually parallel. These additional predicates make the AI more robust to such errors.
Host: So these are smaller additions that mostly double-check the work. Now, what about problems where the diagram construction isn't straightforward? The paper mentions that AG1 was limited to 'constructive problems' where you could build the diagram step-by-step. How does AG2 handle non-constructive problems?
Guest: That's a great question! In AG1, each point had to be definable by at most two predicates -- say, as the intersection of two objects. This limited it to problems where the diagram could be built step-by-step. In AG2, they relaxed this constraint, allowing a point to be defined by three or more predicates, which means the diagram construction becomes non-trivial.
Host: How do they automate this non-trivial diagram construction process?
Guest: The AI uses a numerical optimization approach, similar to previous work, but with a critical upgrade. It represents all point coordinates as a single vector and encodes the diagram constraints as non-linear functions, then searches for a coordinate configuration that satisfies those constraints. The AG2 method has two stages. First, Adam gradient-descent optimization is run on the mean-squared error of the constraints to reach an approximate solution. Then that solution is refined further with the Gauss-Newton-Levenberg method. This two-stage procedure avoids fiddly tuning of the gradient-descent optimization, improving efficiency and reducing the error margin. Overall, these new predicates increased language coverage from 66% to 88%.
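The two-stage idea can be sketched on a toy diagram. This is a minimal illustration, not DeepMind's implementation: it uses plain gradient descent (rather than Adam) for stage one and an undamped Gauss-Newton step for stage two, and the toy constraint set (place C at unit distance from two fixed points) is invented for the example.

```python
import numpy as np

# Toy diagram: fix A=(0,0), B=(1,0); find C with |CA| = |CB| = 1.
A, B = np.array([0.0, 0.0]), np.array([1.0, 0.0])

def residuals(x):
    """One residual per diagram constraint; all zero at a valid diagram."""
    return np.array([np.linalg.norm(x - A) - 1.0,
                     np.linalg.norm(x - B) - 1.0])

def jacobian(x, h=1e-6):
    """Numerical Jacobian of the residuals via central differences."""
    J = np.zeros((2, 2))
    for j in range(2):
        d = np.zeros(2); d[j] = h
        J[:, j] = (residuals(x + d) - residuals(x - d)) / (2 * h)
    return J

x = np.array([0.3, 0.2])  # rough initial guess for C

# Stage 1: gradient descent on the mean-squared constraint error
# gets us into the neighborhood of a solution.
for _ in range(200):
    r, J = residuals(x), jacobian(x)
    x -= 0.1 * (J.T @ r)          # gradient of 0.5 * ||r||^2

# Stage 2: Gauss-Newton refinement drives the error far lower
# than gradient descent alone would in a reasonable time.
for _ in range(10):
    r, J = residuals(x), jacobian(x)
    x -= np.linalg.solve(J.T @ J + 1e-12 * np.eye(2), J.T @ r)

print(residuals(x))  # both residuals near zero; C lands near (0.5, sqrt(3)/2)
```

The division of labor mirrors the description above: the first-order stage is robust from a bad start, while the second-order stage converges much faster once close, shrinking the final error margin.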
Host: That sounds like a clever solution! It's like telling the AI, 'Here are the rules of the game; now find a configuration that obeys them.' So, expanding the domain language and automating diagram generation were key improvements. What about the symbolic engine itself? The paper mentions it being 'stronger and faster'.
Guest: The symbolic engine, which they call DDAR, Deductive Database Arithmetic Reasoning, is the core component of AlphaGeometry. It's what actually does the logical deductions based on the geometric rules. Making this faster and more robust was crucial.
Host: And how did they achieve that?
Guest: There were three main improvements. The first was enabling it to handle 'double points'. This is essential for tackling complex problems. Imagine a problem where you intersect two lines at a point X, and you want to prove X lies on a circle. You could re-formulate the problem by constructing a new point X' as the intersection of one of the lines and the circle. If you can prove X' also lies on the other line, you've shown that X and X' are the same point, and therefore X lies on the circle.
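The double-point reformulation just described can be illustrated numerically. In this invented example, X is the intersection of two lines, and instead of working with X directly we construct X' on the circle and check it lies on the second line; AG2 of course carries out this argument symbolically, not with coordinates.

```python
import math

def line_through(p, q):
    """Line a*x + b*y = c through points p and q."""
    a, b = q[1] - p[1], p[0] - q[0]
    return a, b, a * p[0] + b * p[1]

def on_line(pt, line, tol=1e-9):
    a, b, c = line
    return abs(a * pt[0] + b * pt[1] - c) < tol

def line_circle_intersections(p, q, center, r):
    """Intersect the line through p, q with the circle (center, r)."""
    dx, dy = q[0] - p[0], q[1] - p[1]
    fx, fy = p[0] - center[0], p[1] - center[1]
    a = dx * dx + dy * dy
    b = 2 * (fx * dx + fy * dy)
    c = fx * fx + fy * fy - r * r
    disc = b * b - 4 * a * c
    if disc < 0:
        return []
    ts = [(-b + s * math.sqrt(disc)) / (2 * a) for s in (1, -1)]
    return [(p[0] + t * dx, p[1] + t * dy) for t in ts]

# Goal: show X = l1 ∩ l2 lies on circle w (center O=(0,0), radius sqrt(2)).
# Reformulation: build X' = l1 ∩ w instead, then check X' lies on l2.
l1 = ((0, 0), (1, 1))                 # the line y = x
l2 = line_through((1, 0), (1, 2))     # the vertical line x = 1
candidates = line_circle_intersections(*l1, (0, 0), math.sqrt(2))
x_prime = [pt for pt in candidates if on_line(pt, l2)]
print(x_prime)  # [(1.0, 1.0)] -- X' coincides with X, so X is on w
```

Since X' was built to lie on both l1 and the circle, showing it also lies on l2 pins it to the same point as X, which is exactly the re-formulation trick.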
Host: So you're basically proving that two points coincide. That makes the problem much easier. What are the others?
Guest: The second improvement involved algorithm optimization. The original DDAR algorithm had a candidate-searching step that was polynomial in the number of points, and a clause-matching step that was exponential. To speed things up, they hard-coded the search for essential rules, reducing the number of queries. And thirdly, they completely re-implemented the DDAR core in C++, resulting in over a 300x speedup compared to the previous Python implementation! A faster symbolic engine also allows for more aggressive filtering during proof search.
Host: Wow, 300x! That's a massive performance boost! So, faster deductions and better handling of double points. What about the training data for the language model? The paper mentions 'better synthetic training data'.
Guest: That's right. AlphaGeometry relies heavily on a language model that's trained on a huge amount of algorithmically generated synthetic data. This data teaches the model how to approach geometry problems and suggest relevant deductions. The synthetic data starts by sampling a random diagram. Then, the symbolic engine is used to deduce all possible facts. This way, the system is not limited to human problem-solving abilities.
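The recipe the guest describes -- sample a diagram, deduce everything that holds, turn deduced facts into training problems -- can be sketched in miniature. Every name here is illustrative: the construction (two points plus their midpoint) and the stand-in "engine" are far simpler than AG2's actual random constructions and DDAR.

```python
import math
import random

def sample_random_diagram(rng):
    """Two free points plus their midpoint -- a tiny stand-in for
    AG2's much richer random constructions."""
    A = (rng.uniform(-1, 1), rng.uniform(-1, 1))
    B = (rng.uniform(-1, 1), rng.uniform(-1, 1))
    M = ((A[0] + B[0]) / 2, (A[1] + B[1]) / 2)
    return {'A': A, 'B': B, 'M': M}

def deduce_facts(d, tol=1e-9):
    """Stand-in symbolic engine: emit the facts that hold numerically."""
    A, B, M = d['A'], d['B'], d['M']
    facts = []
    if abs(math.dist(M, A) - math.dist(M, B)) < tol:
        facts.append(('cong', 'M A', 'M B'))        # MA = MB
    cross = (B[0]-A[0]) * (M[1]-A[1]) - (B[1]-A[1]) * (M[0]-A[0])
    if abs(cross) < tol:
        facts.append(('coll', 'A', 'M', 'B'))       # A, M, B collinear
    return facts

rng = random.Random(0)
examples = []
for _ in range(3):
    diagram = sample_random_diagram(rng)
    for conclusion in deduce_facts(diagram):
        # Each deduced fact becomes a (premises -> conclusion) problem.
        examples.append({'premises': 'M = midpoint(A, B)',
                         'conclusion': conclusion})
print(len(examples))  # 6: two deduced facts for each of 3 diagrams
```

The key property carries over from the sketch to the real system: nothing here depends on human-written problems, so the generated set is limited only by the construction rules and the engine's deductive reach.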
Host: What improvements were made to the way that synthetic data was generated?
Guest: They scaled up the resources for data generation to extract much more complex problems. The AG2 data-generation algorithm produces problems and theorems that are more complex, i.e., they contain more points and premises, with proofs containing up to 10 times more proof steps. Additionally, AG2 has a more balanced data distribution between problems with and without auxiliary points. They also improved the algorithm's speed and covered more types of theorems.
Host: I guess it's like feeding the AI a much richer and more diverse diet of geometric scenarios. Does that lead to more creative problem solving?
Guest: Definitely! A more comprehensive training set prepares the model for a broader range of challenges. The new data also includes 'locus'-type problems, which the previous version of AlphaGeometry did not support at all, so the new synthetic training data is a significant upgrade!
Host: Now that's fascinating. This paper also talks about a novel search algorithm. Can you tell me more about that?
Guest: Of course! In AlphaGeometry1, they used a simple beam search to discover proofs. In AG2, they designed a novel search algorithm they call Shared Knowledge Ensemble of Search Trees, or SKEST. It involves running several differently configured beam searches in parallel and letting them help each other through a knowledge-sharing mechanism.
Host: It sounds like the AIs are working together to learn from their mistakes.
Guest: Yes, it does! Here's how it works. In each search tree, a node corresponds to one attempt at an auxiliary construction, followed by a run of the symbolic engine. If the engine finds a proof, all searches terminate. If it fails, the facts the engine managed to deduce are written to a shared database, so other searches can build on them. Working together this way makes the hunt for a proof more efficient. To improve system robustness, they also use multiple language models for each search-tree configuration.
Host: So, they are basically exploring multiple avenues at the same time, and sharing information between them? What are these different search trees?
Guest: Yes! Let me tell you about the various types of search trees that make up the Shared Knowledge Ensemble of Search Trees. First, they have what they call a 'classic' search tree, the same beam search used in AlphaGeometry1, where the AI produces one auxiliary point at each node. Then they have a tree that can predict multiple auxiliary points, which increases the tree search depth. They also have a tree that predicts different types of auxiliary points, another that is deep but narrow, and one that is shallow but wide. So, overall, the system is robust and can explore the search space.
Host: This system sounds like it involves some real team work!
Guest: Right? Well, the final major improvement in AlphaGeometry2 is the language model itself. As you know, the language model is critical for suggesting potential steps and auxiliary constructions. They needed to make it even better.
Host: How did they enhance the language model?
Guest: For AlphaGeometry1, a custom transformer was trained in two phases. For AG2, they leverage the Gemini training pipeline and simplify training to a single phase of unsupervised learning on all the data. The new language model is a sparse mixture-of-experts Transformer that builds on Gemini and is trained on the larger AlphaGeometry2 dataset. In essence, these changes let the model leverage the Gemini foundation to solve complex geometry problems.