commit cbbf5d28bb32334f3666df69be6b397232c4bf31
Author: eartharollins1
Date: Mon Feb 10 00:00:31 2025 +0800

    Add Understanding DeepSeek R1

diff --git a/Understanding-DeepSeek-R1.md b/Understanding-DeepSeek-R1.md
new file mode 100644
index 0000000..81d29f1
--- /dev/null
+++ b/Understanding-DeepSeek-R1.md
@@ -0,0 +1,92 @@
DeepSeek-R1 is an open-source language model built on DeepSeek-V3-Base that has been making waves in the AI community. Not only does it match, or even surpass, OpenAI's o1 model on many benchmarks, it also comes with fully MIT-licensed weights. This makes it the first non-OpenAI/Google model to deliver strong reasoning capabilities in an open and accessible manner.
+
What makes DeepSeek-R1 especially exciting is its openness. Unlike the less-open approaches from some industry leaders, DeepSeek has published a detailed training methodology in their paper.
The model is also very cost-effective, with input tokens costing just $0.14-0.55 per million (vs o1's $15) and output tokens at $2.19 per million (vs o1's $60).
+
Until ~GPT-4, the conventional wisdom was that better models required more data and compute. While that's still true, models like o1 and R1 demonstrate an alternative: inference-time scaling through reasoning.
+
The Essentials
+
The DeepSeek-R1 paper presented several models, but the main ones among them are R1 and R1-Zero. Alongside these is a series of distilled models that, while interesting, I will not discuss here.
+
DeepSeek-R1 relies on two major ideas:
+
1. A multi-stage pipeline where a small set of cold-start data kickstarts the model, followed by large-scale RL.
2. Group Relative Policy Optimization (GRPO), a reinforcement learning approach that relies on comparing multiple model outputs per prompt to avoid the need for a separate critic.
+
R1 and R1-Zero are both reasoning models. This essentially means they do Chain-of-Thought before answering. For the R1 series of models, this takes the form of thinking within a tag before responding with a final summary.
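As a rough illustration, here is how the reasoning trace can be separated from the final answer, assuming the `<think>...</think>` tag convention used by the R1 family (a minimal sketch, not DeepSeek's code):

```python
import re

def split_reasoning(output: str) -> tuple[str, str]:
    """Split a model response into (reasoning trace, final answer).

    Assumes the R1-style convention of wrapping the chain-of-thought
    in a <think>...</think> block before the final summary.
    """
    match = re.search(r"<think>(.*?)</think>", output, flags=re.DOTALL)
    if match is None:
        return "", output.strip()          # no explicit reasoning block
    reasoning = match.group(1).strip()
    answer = output[match.end():].strip()  # everything after </think>
    return reasoning, answer

raw = "<think>2 + 2 is 4, so the answer is 4.</think>The answer is 4."
reasoning, answer = split_reasoning(raw)
print(reasoning)  # -> 2 + 2 is 4, so the answer is 4.
print(answer)     # -> The answer is 4.
```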
+
R1-Zero vs R1
+
R1-Zero applies Reinforcement Learning (RL) directly to DeepSeek-V3-Base with no supervised fine-tuning (SFT). RL is used to optimize the model's policy to maximize reward.
R1-Zero attains excellent accuracy but often produces confusing outputs, such as mixing multiple languages in a single response. R1 fixes that by incorporating limited supervised fine-tuning and several RL passes, which improves both accuracy and readability.
+
It is fascinating how some languages may express certain concepts better, which leads the model to pick the most expressive language for the task.
+
Training Pipeline
+
The training pipeline that DeepSeek published in the R1 paper is immensely interesting. It showcases how they developed such strong reasoning models, and what you can expect from each stage. This includes the problems that the resulting models from each stage have, and how they solved them in the next stage.
+
It's interesting that their training pipeline differs from the usual one:
+
The usual training approach: pretraining on a large dataset (training to predict the next word) to get the base model → supervised fine-tuning → preference tuning via RLHF
R1-Zero: Pretrained → RL
R1: Pretrained → Multistage training pipeline with multiple SFT and RL stages
+
Cold-Start Fine-Tuning: Fine-tune DeepSeek-V3-Base on a few thousand Chain-of-Thought (CoT) samples to ensure the RL process has a decent starting point. This gives a good model to begin RL from.
First RL Stage: Apply GRPO with rule-based rewards to improve reasoning correctness and formatting (such as forcing chain-of-thought into thinking tags). When they were near convergence in the RL process, they moved to the next step. The result of this step is a strong reasoning model but with weak general capabilities, e.g., poor formatting and language mixing.
Rejection Sampling + general data: Create new SFT data through rejection sampling on the RL checkpoint (from step 2), combined with supervised data from the DeepSeek-V3-Base model. They collected around 600k high-quality reasoning samples.
Second Fine-Tuning: Fine-tune DeepSeek-V3-Base again on 800k total samples (600k reasoning + 200k general tasks) for broader capabilities. This step resulted in a strong reasoning model with general capabilities.
Second RL Stage: Add more reward signals (helpfulness, harmlessness) to refine the final model, in addition to the reasoning rewards. The result is DeepSeek-R1.
They also did model distillation for several Qwen and Llama models on the reasoning traces to get distilled-R1 models.
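To make the flow of the stages above easier to follow, here is a structural sketch in Python. Every function is a hypothetical stand-in, not DeepSeek's code; the point is only the order of the stages and the fact that the second fine-tuning starts again from DeepSeek-V3-Base rather than from the RL checkpoint:

```python
def sft(model: str, data: str) -> str:                 # supervised fine-tuning (stub)
    return f"{model} -> SFT on {data}"

def grpo_rl(model: str, rewards: list[str]) -> str:    # GRPO RL stage (stub)
    return f"{model} -> RL with rewards {rewards}"

def rejection_sample(model: str) -> str:               # keep only high-quality samples (stub)
    return f"~600k reasoning samples from ({model})"

base = "DeepSeek-V3-Base"

# Stage 1: cold-start SFT on a few thousand curated CoT samples.
cold_start = sft(base, "a few thousand cold-start CoT samples")

# Stage 2: first RL stage with rule-based rewards (accuracy + formatting).
reasoning_ckpt = grpo_rl(cold_start, ["accuracy", "format"])

# Stage 3: rejection sampling on the RL checkpoint + ~200k general SFT data,
# then fine-tune the *base* model again on the combined ~800k samples.
sft_data = rejection_sample(reasoning_ckpt) + " + ~200k general samples"
stage3 = sft(base, sft_data)

# Stage 4: second RL stage with extra helpfulness/harmlessness rewards.
deepseek_r1 = grpo_rl(stage3, ["accuracy", "format", "helpfulness", "harmlessness"])
print(deepseek_r1)
```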
+
Model distillation is a technique where you use a teacher model to improve a student model by generating training data for the student model.
The teacher is typically a larger model than the student.
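A toy sketch of that idea (both helpers are hypothetical stand-ins, not a real training loop): the teacher generates reasoning traces, and the student is simply fine-tuned on them with ordinary SFT.

```python
def teacher_generate(prompt: str) -> str:
    """Stand-in for sampling a reasoning trace + answer from R1 (the teacher)."""
    return f"<think>working through: {prompt}</think> final answer"

def supervised_fine_tune(student: str, dataset: list[dict]) -> str:
    """Stand-in for a normal SFT run of the student on the teacher's traces."""
    return f"{student} fine-tuned on {len(dataset)} teacher traces"

prompts = ["Solve x^2 - 5x + 6 = 0.", "Is 91 prime?"]
traces = [{"prompt": p, "completion": teacher_generate(p)} for p in prompts]
print(supervised_fine_tune("Qwen-7B (student)", traces))
```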
+
Group Relative Policy Optimization (GRPO)
+
The fundamental idea behind using reinforcement learning for LLMs is to fine-tune the model's policy so that it naturally produces more accurate and useful answers.
They used a reward system that checks not only for correctness but also for proper formatting and language consistency, so the model gradually learns to favor responses that meet these quality criteria.
+
In this paper, they encourage the R1 model to generate chain-of-thought reasoning through RL training with GRPO.
Rather than adding a separate module at inference time, the training process itself nudges the model to produce detailed, step-by-step outputs, making the chain-of-thought an emergent behavior of the optimized policy.
+
What makes their approach particularly interesting is its reliance on straightforward, rule-based reward functions.
Instead of depending on expensive external models or human-graded examples as in standard RLHF, the RL used for R1 uses simple criteria: it might give a higher reward if the answer is correct, if it follows the expected formatting, and if the language of the answer matches that of the prompt.
Not relying on a reward model also means you don't have to spend time and effort training one, and it doesn't take memory and compute away from your main model.
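To make this concrete, here is a minimal sketch of what such rule-based reward functions could look like. The specific checks, the tag format, and the equal weighting are my assumptions for illustration, not DeepSeek's actual rules:

```python
import re

def format_reward(completion: str) -> float:
    """1.0 if the completion is one <think>...</think> block followed by a
    non-empty final answer, else 0.0. (Tag convention assumed.)"""
    m = re.fullmatch(r"\s*<think>.+?</think>\s*(.+)", completion, flags=re.DOTALL)
    return 1.0 if m and m.group(1).strip() else 0.0

def accuracy_reward(completion: str, reference: str) -> float:
    """1.0 if the final answer matches the reference, e.g. for math problems
    where the ground truth can be checked with a simple string comparison."""
    answer = completion.split("</think>")[-1].strip()
    return 1.0 if answer == reference.strip() else 0.0

def language_reward(completion: str) -> float:
    """Crude proxy for 'answer in the prompt's language' for an English prompt:
    penalize completions with a large share of non-ASCII characters."""
    non_ascii = sum(ord(ch) > 127 for ch in completion)
    return 1.0 if non_ascii / max(len(completion), 1) < 0.05 else 0.0

def total_reward(completion: str, reference: str) -> float:
    # Equal weighting is arbitrary; it is only an illustration.
    return (accuracy_reward(completion, reference)
            + format_reward(completion)
            + language_reward(completion))

print(total_reward("<think>7 * 6 = 42</think>42", reference="42"))  # -> 3.0
```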
+
GRPO was introduced in the DeepSeekMath paper. Here's how GRPO works:
+
1. For each input prompt, the model generates a group of different responses.
2. Each response receives a scalar reward based on factors like accuracy, formatting, and language consistency.
3. Rewards are adjusted relative to the group's performance, essentially measuring how much better each response is compared to the others.
4. The model updates its policy slightly to favor responses with higher relative rewards. It only makes small adjustments, using techniques like clipping and a KL penalty, to make sure the policy doesn't stray too far from its initial behavior.
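Step 3 is the "group-relative" part: rewards are normalized within each group of sampled responses, which is what removes the need for a learned critic. A minimal sketch of that computation, following the advantage formula from the DeepSeekMath paper as I understand it:

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """A_i = (r_i - mean(r)) / std(r), computed within one group of responses
    sampled for the same prompt. Responses that beat their siblings get
    positive advantages; worse-than-average ones get negative advantages."""
    mu = mean(rewards)
    sigma = pstdev(rewards) or 1.0   # avoid division by zero if all rewards are equal
    return [(r - mu) / sigma for r in rewards]

# One prompt, four sampled responses, scalar rewards from rule-based checks:
print(group_relative_advantages([3.0, 1.0, 2.0, 0.0]))
# -> [1.34, -0.45, 0.45, -1.34] (approximately)
```

The policy update then weights each response by its advantage, with PPO-style clipping and a KL penalty toward the reference model keeping the step small (step 4 above).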
+
A cool aspect of GRPO is its flexibility. You can use simple rule-based reward functions, for instance awarding a reward when the model correctly uses the expected syntax, to guide the training.
+
While DeepSeek used GRPO, you could use alternative methods instead (PPO or PRIME).
+
For those looking to dive deeper, Will Brown has written quite a nice implementation of training an LLM with RL using GRPO. GRPO has also already been added to the Transformer Reinforcement Learning (TRL) library, which is another good resource.
Finally, Yannic Kilcher has a great video explaining GRPO by going through the DeepSeekMath paper.
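For reference, a GRPO run with TRL looks roughly like the sketch below. The model name, dataset, and reward function are placeholders, and the exact GRPOTrainer/GRPOConfig API may have changed since I checked, so treat this as a starting point rather than a recipe:

```python
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

dataset = load_dataset("trl-lib/tldr", split="train")   # any prompt dataset

def format_reward(completions, **kwargs):
    # Rule-based reward, no learned critic: favor completions that wrap
    # their reasoning in <think> tags (tag convention assumed, as above).
    return [1.0 if "<think>" in c and "</think>" in c else 0.0
            for c in completions]

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",                 # small model for a smoke test
    reward_funcs=format_reward,
    args=GRPOConfig(output_dir="grpo-smoke-test", logging_steps=10),
    train_dataset=dataset,
)
trainer.train()
```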
+
Is RL on LLMs the path to AGI?
+
As a final note on explaining DeepSeek-R1 and the methodologies they have presented in their paper, I want to highlight a passage from the DeepSeekMath paper, based on a point Yannic Kilcher made in his video.
+
These findings indicate that RL enhances the model's overall performance by rendering the output distribution more robust; in other words, it seems that the improvement is attributed to boosting the correct response from TopK rather than to the enhancement of fundamental capabilities.
+
In other words, RL fine-tuning tends to shape the output distribution so that the highest-probability outputs are more likely to be correct, even though the overall capability (as measured by the diversity of correct answers) is largely already present in the pretrained model.
+
This suggests that reinforcement learning on LLMs is more about refining and "shaping" the existing distribution of responses rather than endowing the model with entirely new capabilities.
Consequently, while RL techniques such as PPO and GRPO can produce substantial performance gains, there appears to be an inherent ceiling determined by the underlying model's pretrained knowledge.
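One way to quantify this distinction is to compare top-1 accuracy with pass@k (whether any of k samples is correct): RL mainly pulls the former up toward the latter. Below is the standard unbiased pass@k estimator (from the Codex/HumanEval paper) with made-up numbers purely for illustration:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: given n samples of which c are correct,
    the probability that at least one of k randomly chosen samples is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Made-up numbers: the pretrained model already solves the task in 8 of 64 samples.
print(pass_at_k(n=64, c=8, k=1))    # ~0.13 -> roughly what single-shot accuracy looks like
print(pass_at_k(n=64, c=8, k=16))   # ~0.92 -> the capability is largely "already there"
```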
+
It is unclear to me how far RL will take us. Perhaps it will be the stepping stone to the next big milestone. I'm excited to see how it unfolds!
+
Running DeepSeek-R1
+
I have used DeepSeek-R1 via the official chat interface for various problems, which it seems to solve well enough. The added search functionality makes it even nicer to use.
+
Interestingly, o3-mini(-high) was released as I was writing this post. From my initial testing, R1 seems stronger at math than o3-mini.
+
I also rented a single H100 via Lambda Labs for $2/h (26 CPU cores, 214.7 GB RAM, 1.1 TB SSD) to run some experiments.
The main goal was to see how the model would perform when running on a single H100 GPU, not to extensively test the model's capabilities.
+
671B via llama.cpp
+
DeepSeek-R1 1.58-bit (UD-IQ1_S) quantized model by Unsloth, with a 4-bit quantized KV-cache and partial GPU offloading (29 layers running on the GPU), running via llama.cpp:
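For anyone who wants to reproduce a setup like this, here is a rough sketch using the llama-cpp-python bindings (the GGUF path is a placeholder for the first Unsloth shard, and parameter names may differ between llama-cpp-python versions):

```python
from llama_cpp import Llama

llm = Llama(
    model_path="DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf",  # placeholder path
    n_gpu_layers=29,   # partial offload: 29 layers on the H100, the rest on CPU
    n_ctx=4096,        # modest context window to keep memory usage in check
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "How many r's are in 'strawberry'?"}],
    max_tokens=1024,
)
print(out["choices"][0]["message"]["content"])
```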
+
29 layers seemed to be the sweet spot given this configuration.
+
Performance:
+
An r/localllama user reported that they were able to get over 2 tok/sec with DeepSeek R1 671B, without using their GPU, on their local gaming setup.
Digital Spaceport wrote a full guide on how to run DeepSeek R1 671B fully locally on a $2000 EPYC server, on which you can get ~4.25 to 3.5 tokens per second.
+
As you can see, the tokens/s isn't really usable for any serious work, but it's fun to run these big models on accessible hardware.
+
What matters most to me is a combination of usefulness and time-to-usefulness in these models. Since reasoning models need to think before answering, their time-to-usefulness is usually higher than for other models, but their usefulness is also usually higher.
We need to both maximize usefulness and minimize time-to-usefulness.
+
70B via Ollama
+
70.6B params, 4-bit KM quantized DeepSeek-R1 running via Ollama:
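A quick way to poke at this model programmatically is the ollama Python client; a minimal sketch (the model tag is assumed to be the 70B distill on the Ollama registry, so double-check it with `ollama list`):

```python
import ollama

stream = ollama.chat(
    model="deepseek-r1:70b",   # assumed tag for the 4-bit 70B distill
    messages=[{"role": "user", "content": "How many r's are in 'strawberry'?"}],
    stream=True,
)
for chunk in stream:
    print(chunk["message"]["content"], end="", flush=True)
```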
+
GPU usage soars here, as expected, compared to the mostly CPU-powered run of the 671B model that I showcased above.
+
Resources
+
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
[2402.03300] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
DeepSeek R1 - Notion (Building a fully local "deep researcher" with DeepSeek-R1 - YouTube)
DeepSeek R1's recipe to replicate o1 and the future of reasoning LMs
The Illustrated DeepSeek-R1 - by Jay Alammar
Explainer: What's R1 & Everything Else? - Tim Kellogg
DeepSeek R1 Explained to your grandmother - YouTube
+
DeepSeek
+
- Try R1 at chat.deepseek.com.
- GitHub - deepseek-ai/DeepSeek-R1.
- deepseek-ai/Janus-Pro-7B · Hugging Face (January 2025): Janus-Pro is a novel autoregressive framework that unifies multimodal understanding and generation. It can both understand and generate images.
- DeepSeek-R1: Incentivizing Reasoning Capability in Large Language Models via Reinforcement Learning (January 2025): This paper introduces DeepSeek-R1, an open-source reasoning model that matches the performance of OpenAI's o1. It provides a detailed methodology for training such models using large-scale reinforcement learning techniques.
- DeepSeek-V3 Technical Report (December 2024): This report discusses the implementation of an FP8 mixed-precision training framework validated on an extremely large-scale model, achieving both accelerated training and reduced GPU memory usage.
- DeepSeek LLM: Scaling Open-Source Language Models with Longtermism (January 2024): This paper delves into scaling laws and presents findings that facilitate the scaling of large-scale models in open-source configurations. It introduces the DeepSeek LLM project, dedicated to advancing open-source language models with a long-term perspective.
- DeepSeek-Coder: When the Large Language Model Meets Programming - The Rise of Code Intelligence (January 2024): This research introduces the DeepSeek-Coder series, a range of open-source code models trained from scratch on 2 trillion tokens. The models are pre-trained on a high-quality project-level code corpus and employ a fill-in-the-blank task to improve code generation and infilling.
- DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model (May 2024): This paper presents DeepSeek-V2, a Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference.
- DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence (June 2024): This research introduces DeepSeek-Coder-V2, an open-source Mixture-of-Experts (MoE) code language model that achieves performance comparable to GPT-4 Turbo in code-specific tasks.
+
Interesting events
+
- Hong Kong University replicates R1 results (Jan 25, '25).
- Hugging Face announces huggingface/open-r1: a fully open reproduction of DeepSeek-R1, completely open source (Jan 25, '25).
- An OpenAI researcher confirms the DeepSeek team independently found and used some core ideas the OpenAI team used on the way to o1.
+
Liked this post? Join the newsletter.
\ No newline at end of file