10 Things You Need to Know about DeepSeek R1 (As an ML Engineer)
At first, it seems like I’m a bit late to the party.
DeepSeek R1 was introduced this past month as an LLM that rivals state-of-the-art models from companies like OpenAI.
Needless to say, DeepSeek R1 has been getting a lot of attention recently.
Over the past two years, we’ve seen a hyper-obsession with LLMs, after OpenAI released the chatbot known as ChatGPT in late 2022. Ever since, there has been a very publicized race to release the fastest and most accurate model. Of course, the players have been some of the biggest companies and research centers in the world.
Now, DeepSeek R1 is a model that rivals and even sometimes outperforms the state-of-the-art models we’ve seen so far. And that’s not even why DeepSeek R1 is so exceptional compared to all the other state-of-the-art LLMs. DeepSeek R1 is revolutionary because its algorithm and its open source release are integral to AI’s democratization.
That’s why every ML/AI professional needs to become familiar with DeepSeek R1 - both its inner workings and the potential effects it has on the future of AI.
As a starter, here are ten things every ML Engineer needs to know about DeepSeek R1.
Reinforcement Learning is a game changer for LLMs.
Previous state-of-the-art LLMs have also incorporated reinforcement learning into their algorithms. However, the paper that introduced DeepSeek R1 showed that pure reinforcement learning can greatly improve a model’s reasoning capabilities over the course of training. Meanwhile, the traditional method for training LLMs relies on reinforcement learning from human feedback (RLHF).
Why is pure reinforcement learning any better?
The paper that introduced DeepSeek R1 addressed a major challenge in AI: training models without relying on large datasets of labeled reasoning examples. The model that only used reinforcement learning showed that strong reasoning can emerge without extensive supervised training.
After all, large datasets with the right information are difficult to obtain and expensive to compile. Why not build a model that you can teach over time as you gain access to more information? Reasoning through reinforcement learning is a huge contribution which can help researchers build flexible LLMs that learn more on demand.
There is another major difference between a model that uses pure reinforcement learning and one that uses RLHF. The paper compares the results of two variations of the former to those of two OpenAI-o1 models. A line graph shows the accuracy of all four models over the course of training.
The results from the OpenAI-o1 models are static baselines. At first, those two models greatly outperform the two models developed by the DeepSeek researchers.
However, the two models that only use reinforcement learning are dynamic. They get better over time, and eventually rival or even beat the OpenAI-o1 models. Given more training time, they might even surpass the two models that performed better initially.
It has an invisible sibling known as DeepSeek R1-Zero.
Did you know that the DeepSeek R1 paper actually introduces two models? In addition to the model with which we are familiar, R1, it introduces another model known as R1-Zero. The paper establishes that this model is truly the backbone of R1.
There are several major differences between R1-Zero and R1. The former uses pure reinforcement learning, which is the strategy that we talked about in the previous section. Yes, it was actually R1-Zero that outperformed the OpenAI-o1 models and proved that reinforcement learning can remove the need for large datasets.
R1 does not use pure reinforcement learning, but it doesn’t rely on huge datasets either.
The training pipeline for R1-Zero is very simple. It starts from a pretrained base model (DeepSeek-V3-Base), so the first step is essentially unsupervised pretraining. The next and last step is GRPO, an optimization process that improves the model through reward signals.
For R1, the training pipeline is similar but more complex, adding cold-start finetuning and supervised finetuning (SFT) around the reinforcement learning stages. Cold-start finetuning trains on a few thousand examples of reasoning problems, some of which were generated by R1-Zero and then filtered. Supervised finetuning presents the model with training examples in the form of a prompt and a correct completion.
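For a bird’s-eye view, here is a rough sketch of the two pipelines written out as plain Python data. The stage names are simplified paraphrases of what the paper describes, not an official specification.

# Simplified outline of the two training pipelines described above.
r1_zero_pipeline = [
    "pretrained base model (DeepSeek-V3-Base)",
    "reinforcement learning with GRPO, driven by rule-based reward signals",
]

r1_pipeline = [
    "pretrained base model (DeepSeek-V3-Base)",
    "cold-start finetuning on a few thousand curated reasoning examples",
    "reasoning-focused reinforcement learning with GRPO",
    "supervised finetuning (SFT) on prompt/correct-completion pairs",
    "a final reinforcement learning stage for general-purpose alignment",
]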
Clearly the researchers went through a lot of effort to build the R1 model. Why was R1 even necessary, especially considering that R1-Zero demonstrates very high performance and learning ability?
The reason is that even though R1-Zero is capable of eventually learning how to answer questions, it isn’t very usable in practice. For example, its outputs suffer from poor readability and it even mixes languages. Thus, R1 was introduced as the more usable model.
Nonetheless, R1 and R1-Zero are both publicly available as open source models.
DeepSeek R1 uses GRPO, which improves upon the PPO algorithm used by OpenAI.
The study uses Group Relative Policy Optimization (GRPO) to score how well the model responds to a question by comparing several sampled responses against one another. This is an improvement upon Proximal Policy Optimization (PPO), the reinforcement learning algorithm commonly used in RLHF, which OpenAI used for its models. PPO is a method that trains LLMs by using reward signals with the goal of improving results.
For the record, GRPO also aims to improve a model through reward signals.
A key component of PPO is something called a value model. The challenge in RLHF is that the model only sees the reward after it has generated the full response, but it needs feedback for each token it generates. The value model addresses this challenge by learning to predict future rewards at each token position.
However, GRPO gets rid of the value model while still ensuring that its implementation of reinforcement learning is highly effective.
Rather than using a value model, GRPO does the following. For each query, it samples a group of outputs instead of the single output used in PPO, and it computes a reward for each one. Then each output gets an advantage value that measures how its reward compares to the average reward of the group (normalized by the group’s spread).
For example, if you sample several outputs for a math problem, and one solution has a reward of 0.9 while the group average is 0.7, then that solution gets a positive advantage (0.2 before normalization).
GRPO therefore gives a natural way to decide whether an output is better or worse than average, so you don’t have to go through the trouble of training a value model to make predictions about future rewards.
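As a concrete (and heavily simplified) sketch, here is the group-relative advantage computation in Python. The reward numbers are invented for illustration; in real GRPO training the rewards come from rule-based checks on each sampled completion.

import numpy as np

# Each output's advantage is its reward relative to the group's mean,
# normalized by the group's standard deviation.
def group_relative_advantages(rewards, eps=1e-8):
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: four sampled solutions to the same math problem.
rewards = [0.9, 0.7, 0.6, 0.6]             # group mean is 0.7
print(group_relative_advantages(rewards))  # the 0.9 solution gets the largest positive advantage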
Chain of thought greatly reduces the chances of the model making mistakes.
Chain of thought (CoT) is one of the three main concepts in the DeepSeek R1 paper, along with reinforcement learning and model distillation. It can be defined as the step during which the model reasons before presenting the solution.
The concept is also highly relevant to prompt engineering. We can ask the model to essentially think out loud. To at least some extent, the model will return an explanation of its reasoning before reaching a conclusion.
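A quick, invented example of such a prompt (any wording that asks the model to show its work has a similar effect):

prompt = (
    "A store sells pencils in packs of 12. If I need 100 pencils, how many "
    "packs should I buy? Think through the problem step by step before "
    "giving your final answer."
)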
The R1-Zero version of DeepSeek actually reveals its chain of thought in its answer, as shown in the study. Other similarly performing models, such as the OpenAI models, don’t do this.
There is a major advantage to this type of response. If the model makes a mistake, we can easily pinpoint where in its reasoning it was incorrect. Then, we can re-prompt the model so that it doesn’t make the same mistake again.
As a result, the model tends to produce a more accurate response than if you just let it answer directly without chain of thought reasoning. Ultimately, this can lead to LLMs giving more reliable answers and even avoiding hallucinations.
A variety of posts on Twitter and YouTube have shown that DeepSeek R1 can even solve the infamous “strawberry” problem (counting how many times the letter “r” appears in the word “strawberry”, which trips up many LLMs because of how they tokenize text).
A major priority is for the model to maximize its reward, while placing a limit on the change in its policy.
As we discussed in an earlier section, the optimization process is given by PPO or GRPO. If you recall, GRPO is the optimization process in the DeepSeek R1 study.
Within the optimization process, there is a policy model. It is the policy model that takes the query and generates the output - a single output in PPO and a group of outputs in GRPO.
The model’s policy is defined as how the model behaves. In reinforcement learning, the goal is to optimize the model’s policy while training the model. Optimizing the model’s policy means maximizing its reward.
The idea of a model gaining more information is akin to an agent exploring its environment. Over time, the model learns which behaviors maximize the reward, and it updates its policy accordingly.
For example, consider that there may be two ways to solve an equation. However, one solution can be attained much more quickly and easily than the other solution. Thus, the quicker and easier solution has a much higher reward.
The GRPO optimization process in the DeepSeek R1 study is given by an equation. If you look at the equation in the paper, the π variable represents the policy. Because we want to optimize the policy, we essentially want to change it so the model can yield better answers.
However, we don’t want to change the policy too much, because that can cause a lot of instability during training. The goal is for our model to be as stable as possible and to avoid a roller coaster of policy changes.
This is when clipping becomes relevant. Clipping restricts the ratio between the new policy and the old policy to the range from 1 - ε to 1 + ε. This clipping function appears in the study’s main equation, which uses GRPO to score the model’s responses to a given query relative to one another.
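To make the clipping concrete, here is a minimal sketch of the clipped surrogate objective in Python. The function and variable names are my own, it works on whole responses rather than per token, and GRPO’s KL-divergence penalty against a reference model is omitted.

import numpy as np

def clipped_objective(logp_new, logp_old, advantages, eps=0.2):
    ratio = np.exp(np.asarray(logp_new) - np.asarray(logp_old))  # pi_new / pi_old
    clipped = np.clip(ratio, 1 - eps, 1 + eps)                   # bound the policy change
    # Take the more pessimistic of the two terms, as in PPO/GRPO.
    return np.minimum(ratio * advantages, clipped * advantages).mean()

# Example: two sampled outputs with invented log-probabilities and advantages.
print(clipped_objective([-1.0, -2.0], [-1.1, -1.8], advantages=np.array([1.6, -0.5])))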
Unlike the o1 model by OpenAI, DeepSeek R1 reveals what happens between the <think> tags.
For those familiar with HTML and other markup languages, the <think> tag will be especially easy to understand. The user experience for an LLM prompt and response looks something like this: the user asks a question, the LLM reasons about it and encloses that reasoning in <think> tags, and then the LLM directly answers the question.
The chain of thought which we mentioned in an earlier section is the content inside of the <think> tags. Depending on the complexity of the task, the chain of thought can be quite long.
The o1 model by OpenAI no doubt uses chain of thought to answer challenging questions, including coding tasks and Math Olympiad problems. However, OpenAI has never disclosed the equivalent of what happens within the <think> tags; its reasoning stays hidden from the user.
The DeepSeek R1 model and its variations are different. The paper shows an example of the R1-Zero model’s chain of thought when it answers a math question. The model even demonstrates an “aha moment” which shows that it can re-evaluate its thought process in an anthropomorphic tone.
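For illustration, here is an invented example of what a response in this format looks like. Only the <think>...</think> structure reflects what the paper describes; the content is made up.

# Invented example of an R1-style response, stored as a Python string.
example_response = """<think>
The user asks for 17 * 24. 17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408.
Let me double-check: 408 / 24 = 17, so the multiplication is consistent.
</think>
The answer is 408."""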
Distilled versions of the model can outperform state-of-the-art models like GPT-4o and Claude 3.5 Sonnet at a small fraction of the memory and storage.
This is where we get into the third main concept in the paper, after reinforcement learning and chain of thought. Model distillation involves training a smaller model (the student) to behave similarly to a larger model (the teacher).
The advantage of model distillation is that it enables the student model to perform comparably at a small fraction of the size. For example, DeepSeek R1 has 671 billion parameters, while one of its distilled versions has only 7 billion parameters. The goal here is to make the model even more accessible to people who don’t have a relatively high-cost server (e.g. one that is worth ten thousand dollars).
In the experiments, the DeepSeek researchers found that the smaller distilled models outperform larger models like GPT-4o and Claude 3.5 Sonnet on math, coding, and scientific reasoning tasks, and they accomplish this at a small fraction of the memory and storage needed to run them.
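As a rough sketch of the recipe: the paper distills by plain supervised finetuning on reasoning traces generated with R1, not by matching logits. The function names below are placeholders, not a real API.

# Conceptual sketch of distillation as supervised finetuning on teacher outputs.
def looks_correct_and_readable(completion: str) -> bool:
    # Placeholder filter; the paper describes curation/rejection sampling here.
    return len(completion) > 0

def distill(teacher_generate, student, prompts, sft_train):
    data = []
    for prompt in prompts:
        completion = teacher_generate(prompt)        # teacher (R1) writes a reasoning trace
        if looks_correct_and_readable(completion):   # keep only usable traces
            data.append((prompt, completion))
    return sft_train(student, data)                  # ordinary SFT on (prompt, completion) pairs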
There are four reasons why the model runs so efficiently.
It is believed that DeepSeek made training 45x more efficient. There is an ongoing debate as to whether that’s even possible. However, there appear to be four reasons why DeepSeek R1 and its variations run so efficiently.
The first reason is that the training process used 8-bit instead of 32-bit floating-point numbers, which saved a great deal of memory.
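Some rough back-of-the-envelope arithmetic shows why this matters for the weights alone (activations, gradients, optimizer state, and the mixed-precision details of the real training setup are ignored here):

params = 671e9                                            # reported parameter count
print(params * 4 / 1e9, "GB of weights at 32 bits each")  # ~2684 GB
print(params * 1 / 1e9, "GB of weights at 8 bits each")   # ~671 GB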
The second reason is the model’s multi-token prediction system. Most Transformer-based LLMs carry out inference by predicting the next token, one token at a time, whereas DeepSeek’s architecture predicts an additional future token at each step. The quality of multi-token prediction turned out to be no worse than that of single-token prediction.
This approach appears to roughly double inference speed, with the extra predicted token being accepted about 85-90% of the time.
The third reason involves the compression of the key-value (KV) cache, which eats up much of the VRAM. The keys and values are the per-token representations that the attention layers store so they don’t have to be recomputed for every new token. The model learns to compress them into a much smaller latent representation that captures the essential information and uses far less memory.
Thus, DeepSeek’s work suggests that it’s wasteful to store the full key-value cache, even though that’s what most other models do.
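As a rough illustration of the idea (not DeepSeek’s exact mechanism, which is called Multi-head Latent Attention and has more moving parts), here is a low-rank compression sketch with invented dimensions:

import numpy as np

d_model, d_latent = 4096, 512
rng = np.random.default_rng(0)
W_down = rng.standard_normal((d_model, d_latent))   # compress hidden state -> small latent
W_up_k = rng.standard_normal((d_latent, d_model))   # latent -> key
W_up_v = rng.standard_normal((d_latent, d_model))   # latent -> value

hidden = rng.standard_normal(d_model)    # one token's hidden state
latent = hidden @ W_down                 # this 512-dim vector is what gets cached
k, v = latent @ W_up_k, latent @ W_up_v  # keys/values are reconstructed on the fly

print(latent.size, "cached values per token instead of", 2 * d_model)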
The fourth reason is the incorporation of Mixture of Experts into the DeepSeek R1 architecture. Specifically, the model is a stack of 61 Transformer decoder blocks. The first three are dense, and the remaining 58 are Mixture of Experts layers.
Because the model’s architecture includes Mixture of Experts, the large model is effectively decomposed into many smaller expert networks, and only a handful of them are activated for each token. This keeps the compute per token to a small fraction of what a dense model of the same size would need.
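Here is a toy sketch of Mixture-of-Experts routing: a gate picks the top-k experts for each token, and only those experts run. The sizes, gate, and experts are invented, and DeepSeek’s MoE (shared experts, load balancing, and so on) is considerably richer.

import numpy as np

def moe_layer(x, experts, gate_weights, k=2):
    scores = x @ gate_weights                        # one routing score per expert
    top_k = np.argsort(scores)[-k:]                  # indices of the chosen experts
    weights = np.exp(scores[top_k]) / np.exp(scores[top_k]).sum()
    # Only the selected experts do any work for this token.
    return sum(w * experts[i](x) for w, i in zip(weights, top_k))

d, n_experts = 8, 4
rng = np.random.default_rng(0)
experts = [lambda x, W=rng.standard_normal((d, d)): x @ W for _ in range(n_experts)]
gate = rng.standard_normal((d, n_experts))
print(moe_layer(rng.standard_normal(d), experts, gate).shape)   # (8,)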
You can probably run a version of the model on your local machine(s), but it may be a challenge.
Several versions of DeepSeek R1 are available on Hugging Face.
DeepSeek R1 and DeepSeek R1-Zero are available, as I mentioned in an earlier section. The current version of each has 685 billion parameters.
There are also several distilled versions at 70 billion parameters, 32 billion parameters, 14 billion parameters, 8 billion parameters, 7 billion parameters, and 1.5 billion parameters. Of course, these are supposed to run on machines that are more affordable and accessible.
David Plummer, a retired software engineer formerly at Microsoft, talked about running DeepSeek R1 in a YouTube video. He said that the main version, which had 671 billion parameters at the time, can run on an AMD Threadripper equipped with an NVIDIA RTX 6000 Ada GPU that has 48 GB of VRAM (total cost is around ten to fifteen thousand USD).
However, he also says that the 32 billion parameter distilled version runs nicely on a MacBook Pro.
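If you want to try one of the distilled checkpoints yourself, a minimal sketch with the Hugging Face transformers library looks something like this. The repository id below is the 7 billion parameter distilled model listed on Hugging Face at the time of writing, and you still need enough RAM or VRAM to hold it.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

inputs = tokenizer("How many r's are in the word 'strawberry'?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))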
There is an effort to re-build DeepSeek R1 and share it with the open source community.
Although the open source model itself is available online, the company never published the training code or the datasets.
There is a project called Open-R1 whose goal is to replicate all parts of the DeepSeek R1 pipeline. It is a community-driven effort to fully understand what algorithms, code, and data are necessary to reproduce the high performance of DeepSeek R1. The best part is that this is a build-in-public project, so all of the information will be publicly available.
Also, Open-R1 will likely eliminate the concern that your data might go to DeepSeek even when running the model locally.
In conclusion, DeepSeek R1 is not perfect, and there are still concerns even with regard to running the model locally. However, I think this study will pave the way for democratization and privacy in AI.