DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
We introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1. DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step, demonstrates remarkable reasoning capabilities. Through RL, DeepSeek-R1-Zero naturally emerges with numerous powerful and intriguing reasoning behaviors. However, it encounters challenges such as poor readability and language mixing. To address these issues and further enhance reasoning performance, we introduce DeepSeek-R1, which incorporates multi-stage training and cold-start data before RL. DeepSeek-R1 achieves performance comparable to OpenAI-o1-1217 on reasoning tasks. To support the research community, we open-source DeepSeek-R1-Zero, DeepSeek-R1, and six dense models (1.5B, 7B, 8B, 14B, 32B, 70B) distilled from DeepSeek-R1 based on Qwen and Llama.
Discussion
Host: Hey everyone, and welcome back to the podcast! Today, we're diving into something really fascinating – the world of large language models and their reasoning capabilities. I'm your host, Leo, and I'm super excited about this topic. We've all seen how quickly AI is advancing, and it's really exciting to unpack some of the latest research.
Guest: Hey Leo, thanks for having me! I'm stoked to be here and chat about LLMs. It’s wild how fast things are moving, right? It feels like every few weeks there’s a new breakthrough or paper that completely changes the landscape. I’ve been digging deep into this particular paper, and I’m really looking forward to sharing my insights.
Host: Absolutely! The pace is incredible, and that's why it's so important to stay on top of these developments. Today, we're going to focus on a very cool paper from DeepSeek-AI, titled 'DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning.' It’s a really deep dive into how they're getting these models to reason more effectively. It is a bit dense and technical, so we'll try to break down the key points. How about we start with the introduction and talk about what’s new here?
Guest: Sounds great, Leo. So, the introduction really sets the stage by highlighting the rapid progress we've seen in Large Language Models, or LLMs. They're getting closer and closer to what we might call Artificial General Intelligence, or AGI. These models aren't just regurgitating information; they're actually starting to show a real capacity to reason, which is a massive step forward. This paper focuses specifically on enhancing reasoning capabilities. A key trend they discuss is 'post-training', which is where you take a pre-trained model and further refine it for specific tasks, such as reasoning. This has become a crucial component of the full training pipeline. It's far cheaper computationally than pre-training and is proving to be really effective at improving performance. This is also where Chain-of-Thought reasoning comes in, and it's really important to understand it here. It involves getting the model to lay out its reasoning steps, which is a game-changer, not just for getting better answers but also for understanding how the model is actually thinking.
Host: Yeah, the Chain-of-Thought (CoT) approach is a huge deal. It's like, instead of just getting a final answer, you're actually seeing the model's thought process, which is super insightful. It's not just about getting the right answer, but understanding how the model got there. And it opens up a lot of possibilities. What I find interesting about this paper's intro is that it also points out the existing challenges with test-time scaling. I think this is super important, because we can't just keep making LLMs bigger forever to get better results, and simply scaling up the model and lengthening the CoT might not be sustainable in the long run. The paper frames effective test-time scaling as an open research question, which really sets the stage for their approach. They mention a few prior methods, like process-based reward models and search algorithms, but the researchers at DeepSeek-AI emphasize that none of these have quite matched the performance of models like OpenAI's o1 series. That's a pretty high bar to clear. So they've decided to tackle this problem with a different approach: pure reinforcement learning.
Guest: Exactly, Leo. That's the core innovation here. They're saying, 'let's not just rely on supervised data, let's see if we can get LLMs to develop reasoning skills purely through reinforcement learning.' This is a pretty bold move. They aim to demonstrate the potential of LLMs to self-evolve their reasoning capabilities through RL alone, which means no supervised fine-tuning before the RL starts. They begin with DeepSeek-V3-Base as their base model and use something called GRPO (Group Relative Policy Optimization) as their RL framework. And this is where they create two versions: DeepSeek-R1-Zero and DeepSeek-R1. DeepSeek-R1-Zero is where they've applied pure RL, and it's really interesting. They found it developed these remarkable reasoning behaviors. For instance, its score on the AIME 2024 math benchmark jumped from 15.6% to 71%, and with majority voting it shot up to 86.7%, which is on par with OpenAI's o1-0912 model. That's an impressive leap just through reinforcement learning, and I think it proves the power of this approach.
Host: That's an incredible jump! Going from 15.6% to 86.7% using just RL is a testament to the potential of this approach. It really highlights the capabilities of pure reinforcement learning. But, as always, there's a catch, right? The paper mentions that DeepSeek-R1-Zero had some drawbacks, like poor readability and language mixing. This makes sense, because it was focused on just getting the right answers, not necessarily on communicating clearly. That's where DeepSeek-R1 comes in, which was developed using a multi-stage training approach to address these limitations and further enhance performance. This process really underscores how a carefully staged training pipeline can combine the best aspects of various approaches.
Guest: Precisely, Leo. So, DeepSeek-R1 is their attempt to refine the process and address those issues. They're using a multi-stage approach, which combines a small amount of 'cold-start' data with two stages of SFT (supervised fine-tuning) and two stages of RL to optimize the model for both reasoning and general capabilities. The 'cold-start' data refers to data collected prior to starting the RL process. They first fine-tune their base model on thousands of long Chain-of-Thought examples, and then they go back to their RL approach. Once the RL run approaches convergence, they create new SFT data using rejection sampling on the RL checkpoint and combine it with supervised data from DeepSeek-V3. After fine-tuning with this new data, the model goes through an additional RL process over a wide variety of prompts. This model, DeepSeek-R1, ends up reaching performance comparable to OpenAI-o1-1217, which is really the most significant result of the paper. They also explore distillation from DeepSeek-R1 to smaller dense models, which is another crucial aspect: it shows that this pipeline can build a state-of-the-art model whose reasoning can then be transferred to much smaller models for wider use. This is a very strategic way to approach the field, as it is not sustainable to keep building larger and larger models.
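To keep those stages straight, here is a compact summary of the pipeline as just described, written as a small runnable Python snippet; the stage labels are descriptive shorthand, not terminology from the paper.

```python
# Compact summary of the DeepSeek-R1 training pipeline as described above.
# Labels are descriptive shorthand, not the paper's own terminology.
PIPELINE = [
    {"stage": 1, "kind": "SFT", "what": "cold-start fine-tuning on thousands of long-CoT examples"},
    {"stage": 2, "kind": "RL",  "what": "reasoning-oriented RL (GRPO) until convergence"},
    {"stage": 3, "kind": "SFT", "what": "rejection-sampled reasoning data + DeepSeek-V3 general data"},
    {"stage": 4, "kind": "RL",  "what": "RL over diverse prompts for helpfulness and harmlessness"},
]

for step in PIPELINE:
    print(f"Stage {step['stage']} ({step['kind']}): {step['what']}")
```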
Host: The multi-stage approach really seems like a well-thought-out strategy, combining the strengths of different methods. It's like they're trying to build the best possible reasoning model by layering different training techniques: starting with a base of well-reasoned examples, then refining with reinforcement learning and further fine-tuning. And this idea of distillation is really crucial. We are hitting the limits of computation and can't scale up models endlessly, so the ability to transfer the reasoning patterns of a larger model into smaller models is vital. They found that distilling from DeepSeek-R1 into smaller models outperformed applying RL directly to those smaller models, which tells us that the reasoning patterns discovered by larger models really matter. This is such an important point for the entire field of LLMs. By open-sourcing their models, they're also contributing greatly to the research community, allowing others to build on their work. And from what I understand, their distilled 14B model even outperformed QwQ-32B-Preview by a large margin, and the distilled 32B and 70B models set new records on reasoning benchmarks among dense models, which is wild! I think that summarizes the key points from the introduction, right? Should we get into their approach in the next section?
Guest: Absolutely, Leo. I think we've covered the core ideas from the introduction well. The way they've structured their models and the overall approach makes a lot of sense. Let's dive into the 'Approach' section, where they outline the specific methods used to develop these models, particularly DeepSeek-R1-Zero and DeepSeek-R1, and it's crucial to understand the differences between the two. This section really details how they've implemented those ideas from the introduction, and it's where the technical details start to come into play: how these models were trained and how the RL algorithms were used. The introduction laid out the big picture, and now we get into the nitty-gritty of how they executed this research. This section is actually quite detailed, so we might need to break it down step by step, starting with the overall method and then moving into the specifics of each of their approaches.
Host: Sounds good! I think breaking it down step-by-step is a good idea because there’s a lot of information to unpack here. The core idea from this approach, like they mention in the beginning, is to improve reasoning capabilities through large scale Reinforcement Learning or RL, and that is exactly what is new here. They start by highlighting how past approaches rely on lots of supervised data, while here they demonstrate that RL can lead to significant improvements without this. And they really emphasize that performance can be further enhanced by introducing a small amount of ‘cold-start data’. That really highlights the difference between DeepSeek-R1-Zero and DeepSeek-R1. DeepSeek-R1-Zero is where they are applying the RL directly to the base model without any supervised fine-tuning. DeepSeek-R1, on the other hand, uses RL with a checkpoint that has been fine-tuned with cold-start data. So, let's start by unpacking DeepSeek-R1-Zero. This one is particularly fascinating because it tries to build reasoning capabilities from almost nothing through pure RL.
Guest: Okay, so let's break down DeepSeek-R1-Zero. The key here, as you pointed out Leo, is that they directly apply reinforcement learning to the base model without any preliminary supervised fine-tuning. This is a significant departure from previous methods and it's really what makes this paper so interesting. They're exploring the idea that reasoning can emerge as a natural outcome of the learning process driven by RL. They mention how previous works heavily depended on time-consuming supervised data, and this paper is trying to challenge that. They start with a brief overview of their RL algorithm, which is based on what they call Group Relative Policy Optimization, or GRPO. This is crucial because it's designed to reduce the computational costs associated with RL. Unlike traditional RL approaches, GRPO doesn't require a critic model that is the same size as the policy model; instead, it estimates the baseline from the scores of a group of sampled outputs. This significantly cuts down on the resources needed for training. I think this is really strategic in the overall design: if they want to push the boundaries, they also need to figure out ways to cut costs, which is what they've done with GRPO. And then they move on to explaining how rewards are assigned, which is really critical for any RL setup.
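To make the 'baseline from group scores' idea concrete, here is a minimal sketch of the group-relative advantage computation at the heart of GRPO: sample several outputs for one prompt, score them, and normalize each reward against the group's mean and standard deviation instead of querying a learned critic. This is an illustration of the general mechanism, not DeepSeek's implementation.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Advantage estimates for a group of sampled outputs on one prompt.

    GRPO replaces a separate critic model with a group baseline: each output's
    advantage is its reward normalized by the group's mean and standard deviation.
    """
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Example: four sampled outputs for one prompt, scored by a rule-based reward.
print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
# -> roughly [ 1., -1.,  1., -1.]: correct outputs reinforced, wrong ones discouraged.
```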
Host: Yeah, this GRPO method is really smart. It's all about making the RL process more efficient, allowing them to push the boundaries without breaking the bank. And as you pointed out, the reward system is the heart of any RL process, because it tells the model what to aim for. For DeepSeek-R1-Zero, they adopted a rule-based reward system with two types of rewards: accuracy rewards and format rewards. Accuracy rewards simply check if the answer is correct. For example, for math questions, the model is required to output a final answer in a specified format, usually within a box, which can then be easily checked using a rule-based verification process. For code problems, the code is run against pre-defined test cases to determine its correctness. And they also use format rewards to ensure the model puts its reasoning process between '<think>' and '</think>' tags. This is a simple way to get the structure right. What I think is really important is that they did not use any neural-network-based reward models, which is very different. They believe those models can suffer from reward hacking, and retraining reward models also requires a lot of extra computation, which they wanted to avoid. So instead, they used simple, direct rule-based rewards, which is really interesting.
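As a toy illustration of what such a rule-based check might look like for a math problem, here is a sketch that scores the boxed final answer against a reference and verifies that the '<think>' tags are present. The regexes and the 0/1 scoring are illustrative assumptions, not the paper's actual rules.

```python
import re

def format_reward(response: str) -> float:
    """1.0 if the response wraps its reasoning in <think>...</think> tags."""
    return 1.0 if re.search(r"<think>.*?</think>", response, re.DOTALL) else 0.0

def accuracy_reward(response: str, reference: str) -> float:
    """1.0 if the final boxed answer matches the reference (math-style check)."""
    match = re.search(r"\\boxed\{([^}]*)\}", response)
    return 1.0 if match and match.group(1).strip() == reference.strip() else 0.0

response = "<think>2+2=4, so the answer is 4.</think> The answer is \\boxed{4}."
print(format_reward(response) + accuracy_reward(response, "4"))  # 2.0
```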
Guest: That's a really good point, Leo, about the rule-based reward system. It's a key design decision that simplifies the training process but also might limit the kinds of complex reasoning the model can learn. I think they made this decision because they wanted to really isolate and observe the core reasoning capabilities without being influenced by more complex reward models. It also highlights that the goal is not just accuracy. The structure of the thought process and the way its reasoning is laid out is also really important to be able to explain how these models are working, especially for research. They mention a specific training template that they used to get the model to comply with their requirements. The model is instructed to first generate a reasoning process enclosed within the <think> tags and then provide the answer within <answer> tags. The interesting part is that the researchers intentionally limited the constraints only to this particular structural format. They really tried to avoid any content specific biases and this also ensures that they can accurately observe how the model naturally progresses during the RL process. It’s very fascinating to see how the model responds to only these structural instructions. And it is not as if the model is being told what to think, it is just being told how to present its thoughts.
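The template they describe is roughly of this shape. The wording below is a paraphrase for illustration; the paper's exact prompt text may differ, and only the <think>/<answer> tag structure is the point.

```python
# Illustrative reconstruction of the kind of structural template described above.
# The phrasing is a paraphrase, not the paper's verbatim prompt.
R1_ZERO_TEMPLATE = (
    "A conversation between User and Assistant. The Assistant first thinks through the "
    "reasoning process in its mind and then provides the user with the answer. The "
    "reasoning process and answer are enclosed within <think> </think> and "
    "<answer> </answer> tags, respectively.\n"
    "User: {question}\n"
    "Assistant:"
)

print(R1_ZERO_TEMPLATE.format(question="What is 17 * 24?"))
```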
Host: Yeah, it's like they're setting up a very controlled environment to observe the raw evolution of reasoning capabilities. And this focus on a simple format highlights their goal, which is to understand how the model's reasoning emerges. They weren't trying to force specific problem-solving or reflective reasoning methods; they just wanted to see what happens. And this section also goes into the performance, the self-evolution process, and what they call an 'aha moment' that they saw with DeepSeek-R1-Zero, which is really fascinating. Starting with the performance of this model, they've compared it with OpenAI's o1 models. They show that RL empowers DeepSeek-R1-Zero to gain really strong reasoning capabilities without any need for supervised fine-tuning. And they also highlight that DeepSeek-R1-Zero's performance can be increased further by using majority voting: on the AIME benchmark, the score goes up from 71% to 86.7%, which surpasses o1-0912. This highlights the model's really strong foundation and its potential for even further improvement.
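Majority voting here is the simple self-consistency trick: sample several independent answers to the same question and take the most common final answer. A minimal sketch, with made-up answer strings:

```python
from collections import Counter

def majority_vote(answers):
    """Pick the most frequent final answer among independently sampled responses."""
    return Counter(answers).most_common(1)[0][0]

# e.g. five sampled completions for one problem, reduced to their final answers
print(majority_vote(["204", "204", "198", "204", "210"]))  # "204"
```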
Guest: Exactly, Leo. The fact that DeepSeek-R1-Zero achieves such competitive performance with and without majority voting highlights its really strong capabilities. It shows the underlying potential of this approach. But the performance numbers are not the whole story here. The paper also discusses the model's 'self-evolution process', which is really about how DeepSeek-R1-Zero's reasoning abilities develop over time just through RL. And the most interesting observation is that the model's thinking time increases consistently during training. That is purely an internal development of the model, not an external adjustment. It's learning to dedicate more test-time computation to complex tasks, generating anywhere from hundreds to thousands of reasoning tokens for a single problem. That leads to the spontaneous emergence of complex reasoning behaviors, such as reflection, where the model re-evaluates its previous steps, and exploring alternative approaches to problem-solving, all on its own. I think that highlights the potential of RL to facilitate this kind of internal development in LLMs. It's very different from just relying on human-labelled data, and that is a very strategic way to push the boundaries. It's learning to think deeper by itself.
Host: That's really fascinating: the model is, by itself, learning to think harder and re-evaluate its own steps. And this is what they're calling self-evolution. It then leads to what they call the 'aha moment'. They saw an instance where an intermediate version of DeepSeek-R1-Zero learns to allocate more thinking time to a problem by re-evaluating its initial approach; the model suddenly realizes it needs to rethink what it did at the start. And this is where we can also see a kind of anthropomorphic tone come out. This 'aha moment' is not just for the model, but also for the researchers. The paper really emphasizes the power and beauty of RL: they didn't have to tell the model how to solve the problem, they just had to give it the right incentives, and the model developed its own problem-solving strategies. It really shows the potential of RL to reach new levels of intelligence. But despite all of the great results, they do acknowledge the drawbacks of DeepSeek-R1-Zero, specifically things like poor readability and language mixing, which really points to why they moved on to developing DeepSeek-R1. It's also an acknowledgment that just getting the reasoning right is not the end goal; we also need to be able to explain it clearly to humans.
Guest: Exactly, Leo. The 'aha moment' is really powerful. It's not just about performance metrics; it's about the qualitative changes in the model's behavior, which are really interesting. It proves that with the right incentives, LLMs can discover their own reasoning processes. But then, as you rightly pointed out, the drawbacks of DeepSeek-R1-Zero are significant, and they really led the team to explore DeepSeek-R1. The issues with readability and language mixing make it hard for humans to benefit from the model's complex reasoning abilities. They even point out that the responses would often mix languages or lack the markdown formatting needed to highlight answers, which made the reasoning hard to follow. So, that's really where DeepSeek-R1 comes into the picture. It's still all about RL, but kicked off with human-friendly data. It's designed to address these limitations of DeepSeek-R1-Zero by incorporating cold-start data and a multi-stage pipeline. DeepSeek-R1 is their solution to making the insights from these complex models more accessible and understandable to us.
Host: Yeah, the transition from DeepSeek-R1-Zero to DeepSeek-R1 really highlights the practical considerations in developing these models. It's not just about pushing performance but also about making them useful and understandable. It's also proof that the model isn't simply an isolated black box; some of its behaviors can be shaped and steered by human intervention. So let's break down the approach the researchers took for DeepSeek-R1. In this method they incorporate 'cold-start' data consisting of long Chain-of-Thought examples, which they use to fine-tune the model before RL. This is different from DeepSeek-R1-Zero, which starts the RL process directly from the base model. They also follow a multi-stage approach focused on improving both reasoning and general capabilities. This is where they're directly addressing the challenges they found in DeepSeek-R1-Zero, which I think is a smart approach. What's your take on the cold-start approach?
Guest: I agree, Leo. The cold-start data is a crucial part of DeepSeek-R1's methodology. It's their way of jump-starting the model with high-quality, human-friendly data. The aim here is to prevent the unstable cold-start phase they encountered in DeepSeek-R1-Zero's pure RL training, so the RL stage starts from an already improved model rather than from the raw base model. They explain that they explored several approaches to collecting this data. One approach is few-shot prompting with long chain-of-thought examples; another is directly prompting models to generate detailed answers with reflection and verification; and they even gather outputs from DeepSeek-R1-Zero and refine them with human annotators. This multi-pronged approach really shows the researchers' commitment to getting the highest quality data, which is critical for the fine-tuning process. The advantages this data brings include improved readability and higher overall performance in the next phase of training, compared with DeepSeek-R1-Zero. And the output format is specifically designed to be user-friendly, with a <reasoning_process> section followed by a <summary>.
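For a sense of what a single cold-start example of that shape might look like, here is a toy illustration; the section tags follow the description above, but the exact special tokens and field names are assumptions.

```python
# Toy illustration of a readable cold-start training example, per the format
# described above. The exact delimiters around the sections are assumptions.
cold_start_example = {
    "prompt": "Prove that the sum of two even numbers is even.",
    "response": (
        "<reasoning_process>Write the two numbers as a = 2m and b = 2n for integers "
        "m and n. Then a + b = 2m + 2n = 2(m + n), which is divisible by 2.</reasoning_process>"
        "<summary>The sum of two even numbers is even because it can be written as "
        "2(m + n).</summary>"
    ),
}

print(cold_start_example["response"])
```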
Host: Yeah, it's clear they put a lot of effort into making that cold-start data high quality and also user friendly. The readable format they use, with a summary at the end of each response, is very smart, and they specifically filter out responses that are not reader-friendly. Then, after fine-tuning on this cold-start data, they follow it up with reasoning-oriented reinforcement learning, where they focus on enhancing reasoning capabilities on complex tasks such as coding, math, science, and logic puzzles. It's really focused on problems that have a clear, verifiable solution. During this stage they noticed some language mixing when the prompts involved multiple languages, so they introduced a language consistency reward to fix that. This is another example of how they are tuning the model for human use. They also mention that while this reward did lead to a slight decrease in raw performance, the output is more in line with human preferences, and it does not significantly compromise the model. And this really emphasizes that the focus is not just about absolute performance but also about usability.
Guest: That's a really important distinction, Leo. It’s not just about how high the model's score is, but also how useful and understandable it is for humans. The language consistency reward is a great example of them prioritizing usability along with performance. They also mention that they combine the accuracy of the reasoning tasks and the reward for language consistency by adding them up to create a final reward. And they apply the RL training on the fine-tuned model until it has reached convergence on reasoning tasks. After this reasoning focused RL converges, they use the resulting checkpoint to then collect data for the next supervised fine-tuning stage. But what is interesting here is that this stage doesn’t just focus on the reasoning aspect. In this second round of SFT, it also includes data from other domains such as writing, role-playing, and other general purpose tasks, which really adds versatility to the model. This is where they incorporate additional data which was not included in the cold start SFT, which focused only on the reasoning aspect. This second SFT stage really makes sure the model is able to perform well in general areas and also the complex reasoning tasks. This is another crucial step for building a more robust and versatile model. It really gives the model a range of abilities, which is really important for practical use.
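To picture how those two signals combine, here is a minimal sketch: a language-consistency score based on the proportion of chain-of-thought tokens in the target language, summed with the accuracy reward. The token-level language check below is a crude stand-in for illustration.

```python
def language_consistency_reward(cot_tokens, in_target_language):
    """Fraction of chain-of-thought tokens judged to be in the target language."""
    if not cot_tokens:
        return 0.0
    return sum(in_target_language(tok) for tok in cot_tokens) / len(cot_tokens)

def final_reward(accuracy: float, consistency: float) -> float:
    """Final reward as described: accuracy and language consistency, summed."""
    return accuracy + consistency

# Toy example: a correct answer whose CoT contains one stray non-English token.
is_english = lambda tok: tok.isascii()
cot = ["the", "sum", "of", "the", "roots", "is", "4", "因此"]
print(final_reward(1.0, language_consistency_reward(cot, is_english)))  # 1.875
```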
Host: Absolutely! It shows that they're not just focused on making a reasoning machine, but also an all-around useful model. They've used a lot of different data sources: some of the data is generated by the model itself using rejection sampling from the RL checkpoint, and some comes from other sources such as DeepSeek-V3's supervised data. For some of these tasks they also used DeepSeek-V3 itself to generate a chain of thought before answering the question, while for simpler prompts they skipped the chain of thought entirely, which shows how the pipeline adapts to the question. They also filter out chaotic and difficult-to-read chains of thought, which further shows how important readability was to the overall design. They collected around 600k reasoning-related samples and 200k non-reasoning samples for this second SFT phase. And this is then followed by another reinforcement learning phase, where they try to align the model with human preferences, making it helpful and harmless while maintaining its reasoning abilities. So they are still using reinforcement learning to fine-tune the model at the end. They use a combination of reward signals and diverse prompt distributions to achieve this goal: the rule-based reward system for reasoning data, similar to what they used in DeepSeek-R1-Zero, and reward models for general data. They've also separated the helpfulness and harmlessness assessments, focusing only on the final summary for helpfulness, while assessing the whole response, including the reasoning, for harmlessness. It's like they're layering different techniques to optimize performance while also making sure the model is safe and helpful. It's really a comprehensive design.
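Here is a rough sketch of that rejection-sampling idea: sample a few candidate responses per prompt from the RL checkpoint, keep only the ones that pass correctness and readability filters, and use the survivors as SFT data. All the callables here are hypothetical placeholders.

```python
def collect_sft_data(prompts, sample_fn, is_correct, is_readable, k=4):
    """Rejection-sampling sketch: keep only correct, readable generations.

    sample_fn(prompt)      -> one candidate response from the RL checkpoint (placeholder)
    is_correct(prompt, r)  -> rule-based or model-based correctness check (placeholder)
    is_readable(r)         -> filter for mixed languages, messy formatting, etc. (placeholder)
    """
    kept = []
    for prompt in prompts:
        for _ in range(k):
            response = sample_fn(prompt)
            if is_correct(prompt, response) and is_readable(response):
                kept.append({"prompt": prompt, "response": response})
    return kept

# Toy usage with stand-in callables:
data = collect_sft_data(
    prompts=["What is 2 + 2?"],
    sample_fn=lambda p: "<think>2 + 2 = 4</think> The answer is 4.",
    is_correct=lambda p, r: r.rstrip(".").endswith("4"),
    is_readable=lambda r: "<think>" in r,
)
print(len(data))  # 4 samples kept in this toy case
```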
Guest: You've hit the nail on the head, Leo. This is really a comprehensive design, and that last RL stage is focused on aligning the model with human preferences, ensuring it's not just accurate, but also helpful and safe. And finally, in this 'Approach' section, they touch on how they distill the reasoning capabilities from DeepSeek-R1 into smaller models. The point is that it's more cost-effective and practical to distill a powerful model into smaller ones. They directly fine-tune open-source models like Qwen and Llama using the 800K curated training samples from the previous SFT stage. And what they find is that this very simple, straightforward distillation method significantly improves the reasoning abilities of the smaller models. They point out that they did not use RL at this stage; the distilled models were trained with SFT only, precisely to demonstrate the effectiveness of distillation on its own. It's such a crucial part of the research, because it makes these incredibly advanced capabilities more accessible and cost-effective for widespread use. And it's a really promising and powerful way to move the field of LLMs forward. So that's the breakdown of their approach. It's clear they put a lot of thought and work into each stage, focusing not just on performance but also on usability and safety. What's your take on it before we move on?
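Since distillation here is just supervised fine-tuning on R1-generated samples, one training step amounts to standard next-token cross-entropy, sketched below in generic PyTorch. This is an illustration under that assumption, not DeepSeek's training code, and it assumes the model returns raw logits.

```python
import torch.nn.functional as F

def distillation_sft_step(model, optimizer, input_ids, labels):
    """One SFT step on an R1-generated sample: plain next-token cross-entropy.

    input_ids, labels: (batch, seq_len) token tensors. Typically labels is a copy
    of input_ids with prompt positions set to -100, so only the teacher-written
    response (reasoning + answer) contributes to the loss; the one-token shift
    is applied below. Assumes model(input_ids) returns (batch, seq_len, vocab) logits.
    """
    logits = model(input_ids)
    loss = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),  # predictions for positions 0..n-2
        labels[:, 1:].reshape(-1),                    # targets for positions 1..n-1
        ignore_index=-100,
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```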