Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning
Reasoning abilities, especially those for solving complex math problems, are crucial components of general intelligence. Recently, proprietary models such as OpenAI's o-series have made remarkable progress on reasoning tasks. However, the complete technical details remain undisclosed, and the only techniques widely believed to be adopted are reinforcement learning (RL) and long chains of thought. This paper proposes a new RL framework, termed OREAL, to pursue the performance limit that can be achieved through Outcome REwArd-based reinforcement Learning for mathematical reasoning tasks, where only binary outcome rewards are easily accessible. We theoretically prove that behavior cloning on positive trajectories from best-of-N (BoN) sampling is sufficient to learn the KL-regularized optimal policy in binary feedback environments. This formulation further implies that the rewards of negative samples should be reshaped to ensure gradient consistency between positive and negative samples. To alleviate the long-standing difficulties of sparse rewards in RL, which are further exacerbated by the partial correctness of long chains of thought in reasoning tasks, we further apply a token-level reward model to sample important tokens in reasoning trajectories for learning. With OREAL, for the first time, a 7B model can obtain 94.0 pass@1 accuracy on MATH-500 through RL, on par with 32B models. OREAL-32B also surpasses previous 32B models trained by distillation, with 95.0 pass@1 accuracy on MATH-500. Our investigation also indicates the importance of initial policy models and training queries for RL. Code, models, and data will be released to benefit future research: https://github.com/InternLM/OREAL.
Discussion
Host: Hey everyone, and welcome back to the podcast! I'm your host, Leo, and I'm super excited about today's topic. We're diving deep into the world of AI and mathematical reasoning – specifically, how we can push the limits of what AI can achieve when solving complex math problems. It's a fascinating area, and it has implications far beyond just acing your calculus exam.
Guest: Absolutely, Leo! Math reasoning is a cornerstone of general intelligence, both for humans and AI. The ability to break down a complex problem, identify the relevant information, and apply the right formulas and techniques is crucial in so many fields, from scientific research to financial analysis. Plus, it's a great testbed for AI because we can easily verify the final answer.
Host: Exactly! And what's really interesting is that while companies like OpenAI have made huge strides in this area with their models, a lot of the specific details about how they achieve those results are kept under wraps. But it's generally believed that reinforcement learning and 'chain of thought' techniques are playing key roles. So, we're going to unpack a recent research paper that tackles this challenge head-on, exploring how far we can go with reinforcement learning and outcome-based rewards.
Guest: Sounds great! I think it's really important to have open research in this area. Relying solely on proprietary models limits progress and makes it harder for the wider AI community to contribute and innovate. It also creates a dependency on specific companies, which isn't ideal for long-term development.
Host: I couldn't agree more. Okay, let's get into the paper itself. It's titled 'Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning,' and it's by Chengqi Lyu, Songyang Gao, and a whole team of researchers. The core idea they're exploring is how to effectively train AI models to solve math problems using only a simple 'right' or 'wrong' outcome as feedback. Think about it – that's often the only information you get when you're solving a problem yourself. You either get the correct answer, or you don't.
Guest: That's a really interesting and challenging setup. Because in a complex math problem, there can be many steps involved. So, if you only get a binary outcome reward at the end, it can be tough for the model to figure out which steps were helpful and which ones led it astray. It's like trying to teach someone to bake a cake by only telling them if the finished cake tastes good or bad, without giving them any feedback on their individual actions like mixing the batter or setting the oven temperature.
Host: A perfect analogy! That's exactly the problem of sparse rewards that the researchers are trying to address. Their proposed solution is a new reinforcement learning framework they call OREAL – Outcome REward-based reinforcement Learning. And the exciting thing is, they've achieved some really impressive results, even surpassing some much larger models. They actually got a 7B model to perform on par with 32B models on the MATH-500 benchmark, which is remarkable for a model of that size!
Guest: Yeah, parameter count isn't everything, it seems! So how does OREAL actually work? What are the key components of their approach that allow them to get such good results with relatively smaller models?
Host: Okay, so the paper outlines a few key methods. First, they theoretically prove that behavior cloning on positive trajectories from best-of-N (BoN) sampling is sufficient to learn the KL-regularized optimal policy in binary feedback environments. Now, that's a mouthful, but basically, it means that if you sample a bunch of different solution paths and then focus on learning from the ones that led to the correct answer, you can actually achieve the best possible performance, at least within certain constraints.
Guest: Ah, okay, that makes sense. So, it's like saying that if you try a problem multiple times and only study the solutions that worked, you'll eventually learn the optimal approach. The 'best-of-N' sampling is just a way to generate a diverse set of potential solutions to choose from. And the 'KL-regularized optimal policy' part is just a way of saying they're trying to find the best policy while also keeping it similar to some initial, reference policy to prevent wild, unstable learning.
Host: You nailed it! And that's actually a really clever insight because in math problem solving, every correct answer is, in a sense, equally valid. As long as the logic is sound, it doesn't matter how you arrived at the solution. So, focusing on positive examples makes a lot of sense in this context.
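To make the sampling-and-filtering step concrete, here is a minimal Python sketch, assuming toy stand-ins `sample_solution` and `verify` for the policy model and the outcome verifier (this is an illustration, not the paper's code); the surviving trajectories become behavior-cloning targets.

```python
import random

def sample_solution(question: str) -> str:
    # Stand-in for sampling one chain-of-thought solution from the policy model.
    return f"candidate answer {random.randint(0, 9)} for: {question}"

def verify(question: str, solution: str) -> bool:
    # Stand-in for the binary outcome reward (True = final answer verified correct).
    return random.random() < 0.3   # pretend ~30% of samples are correct

def best_of_n_positives(question: str, n: int = 16) -> list[str]:
    """Sample n trajectories and keep only the ones the verifier marks correct."""
    samples = [sample_solution(question) for _ in range(n)]
    return [s for s in samples if verify(question, s)]

# Behavior cloning then maximizes the log-likelihood of these positive
# trajectories under the policy (an SFT-style loss); OREAL's analysis says this
# is enough to recover the KL-regularized optimal policy under binary feedback.
positives = best_of_n_positives("What is 3 + 4?")
print(f"kept {len(positives)} of 16 sampled trajectories for behavior cloning")
```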
Guest: But what about the incorrect solutions? Surely there's information to be gained from analyzing why a particular approach failed.
Host: That's a great question, and the researchers address that directly. They argue that when learning from negative samples, it's crucial to reshape the rewards to ensure consistency between the sampling distribution and the target distribution. In simpler terms, they need to adjust how they penalize incorrect solutions to account for the fact that they're not sampling them in the same way they're sampling the correct ones. It’s about maintaining consistent gradient estimation.
Guest: So, they're not just treating all incorrect solutions the same. They're weighting them differently based on how likely they were to be sampled in the first place. That sounds like a way to compensate for the fact that 'best-of-N' sampling inherently under-samples negative gradients – the information you get from incorrect attempts. Smart! It's like saying, 'Okay, this solution was wrong, but it was a really plausible wrong solution, so we need to learn from it more than we would from a completely nonsensical attempt.'
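As a rough, hedged illustration of that re-weighting idea, one simple rule is to scale a wrong trajectory's contribution by the query's empirical success rate; the specific `p / (1 - p)` weight below is our simplification, not the paper's exact shaping function.

```python
def shaped_negative_weight(success_rate: float, eps: float = 1e-6) -> float:
    """Illustrative weight for a wrong trajectory, based on the query's empirical
    success rate over the sampled rollouts (NOT the paper's exact rule).

    Intuition: if the policy usually solves the query, a wrong answer is
    surprising and informative; if the query is almost never solved, a single
    failure tells us little, so its gradient should count for less.
    """
    p = min(max(success_rate, eps), 1.0 - eps)   # clamp away from 0 and 1
    return p / (1.0 - p)                          # one simple monotone choice

# Example: 4 correct out of 16 rollouts -> success rate 0.25
print(shaped_negative_weight(4 / 16))   # ~0.33: negatives get a modest weight
```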
Host: Exactly. By doing that, they can optimize their policy using both successful and failed trajectories, making the learning process more efficient and robust. But there's still the challenge of the sparse reward. Even with reward shaping, you're still only getting a single 'right' or 'wrong' signal at the end of a potentially very long chain of reasoning.
Guest: Right. And that's where the third key component of OREAL comes in, right? The token-level reward model. It's got to be tricky to assign credit or blame to individual tokens.
Host: You got it. To tackle this, OREAL uses a lightweight credit assignment scheme based on a token-level reward model. This model is trained to estimate the importance of individual tokens in the reasoning trajectory, essentially breaking down the overall advantage into step-wise contributions. Each token gets an importance weight, which helps focus the learning signal on the parts of the trajectory that matter most. It's not exactly a value function, but it's closely related.
Guest: So, it's trying to figure out which words or phrases were most crucial in leading to the correct or incorrect answer. That's a really interesting approach. How does that token-level reward model actually work? Is it analyzing the semantics of the tokens or looking for specific keywords or patterns?
Host: From what I understand, the token-level reward model is trained using the outcome rewards, but it learns to predict the probability of correctness based on the sequence of tokens generated so far. By doing this, it implicitly learns which tokens are most indicative of a successful or failed solution path. And notably, training this reward model doesn't require a KL-divergence constraint against the reference model.
Guest: Okay, so it's learning to associate specific tokens or token sequences with higher or lower chances of getting the right answer. It's like a more fine-grained version of the overall outcome reward. And then they use those token-level scores to re-weight the original training loss, focusing the learning on the most critical reasoning steps or errors.
Host: Exactly. This allows the model to learn more efficiently from the sparse outcome rewards by focusing on the parts of the reasoning process that matter the most. It's a clever way to bridge the gap between the limited feedback signal and the dense policy optimization requirements.
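A small sketch of how such token-level scores could re-weight a per-token policy loss; the tensor shapes, normalization, and loss form here are our assumptions rather than the released implementation.

```python
import torch

def reweighted_policy_loss(token_logprobs: torch.Tensor,
                           token_weights: torch.Tensor,
                           advantage: float) -> torch.Tensor:
    """token_logprobs: (T,) log-probs of the sampled tokens under the policy.
    token_weights: (T,) non-negative importance scores from the token-level model.
    advantage: scalar credit for the whole trajectory, e.g. +1 if verified correct.
    """
    weights = token_weights / (token_weights.sum() + 1e-8)  # normalize over the trajectory
    # Policy-gradient-style objective: scale each token's log-prob by its weight
    # and by the trajectory-level advantage, then minimize the negative sum.
    return -(advantage * weights * token_logprobs).sum()

# Toy usage with random tensors standing in for a real rollout.
T = 8
logps = torch.log(torch.rand(T) + 1e-6)   # fake per-token log-probabilities
w = torch.rand(T)                         # fake token-level importance scores
print(reweighted_policy_loss(logps, w, advantage=1.0))
```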
Guest: It sounds like they've created a pretty sophisticated framework that combines theoretical insights with practical techniques to address the challenges of outcome-based reinforcement learning for math reasoning. So, they have the theoretical framework, the compensation for negative samples and the token weighting. What's the implementation like?
Host: Let's move into the Implementation details. So, for the implementation, they started with the Qwen2.5-7B and Qwen2.5-32B models as their base models. These are already pretty powerful language models, but they wanted to see how much further they could push them with OREAL. They fine-tuned them initially using long chain-of-thought data obtained through rejection sampling – basically, generating a bunch of solutions and keeping the good ones.
Guest: Okay, so they're starting with a solid foundation by fine-tuning the models on a dataset of well-reasoned solutions. That makes sense – it gives the models a good understanding of how to approach math problems and generate coherent reasoning steps.
Host: Right. These 'rejection sampling fine-tuned' (RFT) models then serve as the initialization for the policy model in their RL framework. They also explored using DeepSeek-R1-Distill-Qwen-7B as the initial policy model, which I think is a great idea to see how OREAL performs with different starting points. They used a mix of in-house datasets and open-source datasets for training the RFT models.
Guest: It's smart to use a combination of datasets. The in-house datasets could provide more specific or targeted training examples, while the open-source datasets offer a broader range of problems and solutions. Using DeepSeek-R1-Distill-Qwen-7B lets them test the algorithm starting from what is arguably the strongest available 7B model. It would be interesting to see how performance differs!
Host: Exactly. Now, let's talk about the reinforcement learning process itself. During on-policy RL, they used questions from various sources, including Numina, the MATH training sets, and historical AMC/AIME competitions. For each question, they independently sampled 16 trajectories from the RFT models.
Guest: So, they're generating 16 different attempts to solve each problem. That seems like a good balance between exploration and computational cost. What do they do with those trajectories?
Host: Well, first, they averaged the correctness across those 16 trajectories to estimate a correctness rate for each query. Then, to keep the training queries appropriately challenging, they only kept questions with correctness rates between 0 and 0.8 for further training.
Guest: Ah, that's a clever way to focus the training on problems that the model is struggling with but not completely failing at. It's like saying, 'Okay, we already know how to solve these easy problems, so let's focus on the ones that are just challenging enough to help us improve.' It's like a curriculum learning approach.
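In code, that filtering step might look like the sketch below; the helper name and the handling of the 0 and 0.8 endpoints are our reading of the description.

```python
def filter_queries(rollout_correctness: dict[str, list[bool]],
                   low: float = 0.0, high: float = 0.8) -> list[str]:
    """Keep questions whose empirical correctness rate over the rollouts
    is above `low` and at most `high`."""
    kept = []
    for question, outcomes in rollout_correctness.items():
        rate = sum(outcomes) / len(outcomes)   # fraction of the 16 rollouts that were correct
        if low < rate <= high:                 # drop unsolved and already-easy questions
            kept.append(question)
    return kept

rollouts = {
    "q1": [True] * 15 + [False],        # rate 0.94 -> too easy, dropped
    "q2": [False] * 16,                 # rate 0.0  -> unsolved, dropped
    "q3": [True] * 6 + [False] * 10,    # rate 0.375 -> kept
}
print(filter_queries(rollouts))         # ['q3']
```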
Host: Precisely! It helps the model learn more efficiently by focusing on the 'sweet spot' of difficulty. Now, for the outcome reward signal, they used a combination of the Qwen2.5-72B-Instruct model as a generative verifier and a rule-based verifier. This is to enhance the robustness of the correctness assessment.
Guest: So, they're not just relying on a simple rule-based check to see if the answer is correct. They're also using another large language model to verify the solution. This makes sense, as it could catch errors that a simple rule-based system might miss, like different but equivalent forms of the correct answer. Mitigating false negatives is critical.
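Here is a minimal sketch of that two-stage check, with a hypothetical `llm_judge` standing in for the actual Qwen2.5-72B-Instruct prompt; real verifiers also normalize LaTeX, fractions, and other equivalent answer forms.

```python
def rule_based_match(predicted: str, reference: str) -> bool:
    # Simple normalization-based comparison; a real rule-based verifier does much more.
    return predicted.strip().lower() == reference.strip().lower()

def llm_judge(question: str, predicted: str, reference: str) -> bool:
    # Hypothetical stand-in for prompting a strong model to judge mathematical equivalence.
    return False  # placeholder

def outcome_reward(question: str, predicted: str, reference: str) -> int:
    """Return 1 if either verifier accepts the answer, reducing false negatives."""
    if rule_based_match(predicted, reference):
        return 1
    return 1 if llm_judge(question, predicted, reference) else 0

# Prints 0 here because the placeholder judge always says no; a real judge
# should recognize that 1/2 and 0.5 are the same answer.
print(outcome_reward("Simplify 2/4", "1/2", "0.5"))
```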
Host: Exactly. Now, for training the token-level reward model, they used the binary outcome rewards from the verifier and optimized using the cross-entropy loss. As a reminder, the token-level reward model assigns token-wise importance scores across the chain-of-thought reasoning process.
Guest: So the token level reward model learns from the outcome rewards to determine the contributions of the intermediate reasoning steps and weigh the tokens appropriately during the training process.
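A short PyTorch sketch of that objective, under our assumption that the reward model emits one logit per token and every token inherits the trajectory's binary outcome as its label:

```python
import torch
import torch.nn.functional as F

def token_rm_loss(token_logits: torch.Tensor, outcome: float) -> torch.Tensor:
    """token_logits: (T,) one score per token of the chain of thought.
    outcome: 1.0 if the final answer was verified correct, else 0.0.
    """
    targets = torch.full_like(token_logits, outcome)
    return F.binary_cross_entropy_with_logits(token_logits, targets)

# After training, sigmoid(token_logits) can be read as per-token estimates of
# eventual correctness, from which token-wise importance scores are derived.
logits = torch.randn(12)                 # fake scores for a 12-token trajectory
print(token_rm_loss(logits, outcome=1.0))
```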
Host: Let's talk about the training algorithm. The loss function for the policy model follows the formulation described earlier, incorporating the KL-constrained maximum-likelihood objective over positive examples and the reward shaping for negative samples. The complete RL training procedure is detailed in Algorithm 1 in the paper.
Guest: Alright, so they're putting all the pieces together: the initial policy model, the best-of-N sampling, the reward shaping, and the token-level reward model. What about the specific hyperparameters they used? Those can often make or break an RL experiment.
Host: Good point! The policy model was trained with a learning rate of 5e-7, while the token-level reward model had a learning rate of 2e-6. The latter also underwent a 10-step warm-up phase before training began. Both models used a cosine annealing learning rate schedule, decaying to 1/5 of the initial learning rate over time.
Guest: Those learning rates seem pretty small, which suggests they were trying to avoid drastic changes to the policy during training and promote more stable learning. The warm-up phase for the token-level reward model is also a good idea, as it allows the model to gradually learn the relationships between tokens and outcomes without disrupting the policy model too early.
Host: I think so too. They optimized both models using the AdamW optimizer, and the total number of training steps was 80, with evaluation conducted every 10 steps. The KL coefficient was set to 0.01. They selected the best-performing model based on evaluation metrics. Now, this is interesting. They also implemented a skill-based enhancement approach.
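(Before turning to that, here is a configuration sketch that pulls the reported hyperparameters together; the placeholder `nn.Linear` modules stand in for the real models, and reading the 10-step warm-up as a linear learning-rate ramp for the reward model is our interpretation.)

```python
from torch import nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR, LinearLR, SequentialLR

TOTAL_STEPS = 80        # total RL training steps, evaluated every 10 steps
KL_COEF = 0.01          # weight of the KL penalty against the reference (RFT) policy
RM_WARMUP_STEPS = 10    # warm-up phase reported for the token-level reward model

policy = nn.Linear(8, 8)      # placeholder for the policy model
token_rm = nn.Linear(8, 1)    # placeholder for the token-level reward model

policy_opt = AdamW(policy.parameters(), lr=5e-7)
rm_opt = AdamW(token_rm.parameters(), lr=2e-6)

# Cosine annealing, decaying each learning rate to 1/5 of its initial value.
policy_sched = CosineAnnealingLR(policy_opt, T_max=TOTAL_STEPS, eta_min=5e-7 / 5)
rm_sched = SequentialLR(
    rm_opt,
    schedulers=[
        LinearLR(rm_opt, start_factor=0.1, total_iters=RM_WARMUP_STEPS),
        CosineAnnealingLR(rm_opt, T_max=TOTAL_STEPS - RM_WARMUP_STEPS,
                          eta_min=2e-6 / 5),
    ],
    milestones=[RM_WARMUP_STEPS],
)
# Training loop (omitted): each step computes the OREAL losses, calls
# optimizer.step() then scheduler.step(), and evaluates every 10 steps
# to keep the best checkpoint.
```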
Guest: Oh, tell me more! What's that all about?
Host: During the RL training procedure, they noticed that the model consistently struggled with certain types of questions, particularly those involving specific knowledge and skill areas, like trigonometric constant transformations or probability statistics. It probably just reflects a lack of exposure in the training dataset.
Guest: That makes sense. Even with a large dataset, there are bound to be gaps in the model's knowledge. So, how did they address those skill deficiencies?
Host: To address this, they implemented a skill-based enhancement approach, restricting it to the MATH dataset to keep the cost of skill annotation manageable. They annotated each question in the training set with its core skill. For questions that the model repeatedly failed to answer correctly during the RL phase, they performed data augmentation by adding similar questions from the training set that shared the same skill.
Guest: So, they're essentially creating a mini-curriculum for the model, focusing on the specific skills that it's struggling with. That's a really targeted and efficient way to improve performance. It's like having a tutor who knows exactly what your weaknesses are and gives you extra practice on those areas.
Host: Precisely! And those augmented questions were then added to the training data during the RFT stage to help the model better internalize these skills. So, they're going back and improving the initial policy model with the new skill-specific data.
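A tiny sketch of that augmentation loop, with data structures and names of our own choosing:

```python
def augment_by_skill(failed_questions: list[str],
                     skill_of: dict[str, str],
                     training_pool: dict[str, str]) -> list[str]:
    """skill_of: question -> annotated core skill (e.g. 'trig identities').
    training_pool: question -> skill, for the annotated MATH training set.
    Returns extra questions sharing a skill with any repeatedly-failed question,
    to be folded back into the RFT data."""
    weak_skills = {skill_of[q] for q in failed_questions if q in skill_of}
    return [q for q, skill in training_pool.items()
            if skill in weak_skills and q not in failed_questions]

pool = {"qA": "trig identities", "qB": "probability", "qC": "trig identities"}
print(augment_by_skill(["qA"], {"qA": "trig identities"}, pool))  # ['qC']
```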
Guest: That makes total sense. By addressing those knowledge gaps in the initial policy model, they're setting the stage for even better performance during the reinforcement learning phase. It's a holistic approach that combines both broad pre-training with targeted skill refinement.
Host: Exactly! Now that we've covered the methods and implementation, let's talk about the experiments. They evaluated their approach against a range of baselines, including some pretty impressive models like GPT-4o, Claude-Sonnet, and OpenAI's o1 series.
Guest: Okay, so they're really putting OREAL to the test against the state-of-the-art. What benchmarks did they use to measure performance?
Host: They used some well-established mathematical datasets, including MATH-500, AIME2024, AIME2025 (Part1), LiveMathBench, and OlympiadBench. These datasets cover a wide range of mathematical topics and difficulty levels, so they provide a comprehensive evaluation of the model's reasoning abilities.
Guest: Those are some pretty challenging benchmarks, especially the AIME and Olympiad problems. What metric did they use to measure performance?
Host: They used pass@1 as the metric for evaluation under the zero-shot chain-of-thought setting. This means they're evaluating the model's ability to solve the problem correctly on its first attempt, without any prior examples or fine-tuning on the specific dataset.
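For concreteness, under this single-attempt setting pass@1 reduces to the percentage of problems whose one attempt is verified correct (trivial sketch below).

```python
def pass_at_1(first_attempt_correct: list[bool]) -> float:
    """Percentage of problems solved on the single, first attempt."""
    return 100.0 * sum(first_attempt_correct) / len(first_attempt_correct)

print(pass_at_1([True, True, False, True]))   # 75.0
```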
Guest: Okay, so it's a pretty strict evaluation setup. What were the overall results? How did OREAL stack up against the baselines?
Host: Well, the results were pretty impressive. At the 7B scale, OREAL-7B achieved a remarkable pass@1 accuracy of 91.0 on the MATH-500 dataset and 59.9 on OlympiadBench. This is the first time a model of this size has reached such a high level of accuracy using RL instead of distillation!
Guest: That's huge! Reaching that level with pure RL rather than distillation is a real milestone. How does it stack up against the much larger models?
Host: Exactly! It even surpasses significantly larger models, including QwQ-32B-Preview and OpenAI-o1-mini. What's even more impressive is that after applying OREAL to the DeepSeek-R1-Distill-Qwen-7B model, it achieved 94.0 and 66.1 pass@1 accuracy on MATH-500 and OlympiadBench, respectively. This sets new records among all 7B models.
Guest: So it is effective even with stronger initialization. What does that mean for the algorithm's prospects?
Host: I'm pretty impressed, and I expect the approach to be adopted much more widely. And it scales: OREAL-32B also reaches 95.0 pass@1 on MATH-500, surpassing previous 32B models trained by distillation.