Dear readers! I’m back from a short break. Things got very busy at Groq ⚡️ and in my personal life (I bought a house in Amsterdam, yay!). We’re very back though, and this week’s edition is packed with goodies, not least the release of the DeepSeek-R1 70B Distill model that I personally worked on at Groq, running at crazy average speeds of 1600 tokens per second. Reasoning models have a reputation for being slow, but with Groq, no more :-).
It was a great project to work on and I hope you will all build cool things with it. It’s available to paid Groq users as deepseek-r1-distill-llama-70b-specdec (a credit card and a few dollars are all you need).
Enjoy this week’s CoWI: Coding with Intelligence 🧠
📰 News
Groq launches support for DeepSeek-R1 70B Distill running at over 1600 tokens per second
DeepSeek-R1 70B Distill is the DeepSeek-R1 distillation on top of Llama 3.3 70B. Through clever use of speculative decoding and a custom draft model, we are able to serve this strong reasoning model at over 1600 tokens per second of output generation speed. Truly remarkable to observe. It performs significantly better than Llama 3.3 on logic riddles, complex RAG reasoning, code generation and math problems. Full benchmarks are in the official HF repo. In addition, we’re launching DeepSeek-R1 Qwen 32B Distill, Qwen 2.5 32B and Qwen 2.5 Coder 32B, three very capable models based on the Qwen 2.5 series. Qwen 2.5 32B in particular is a very capable general-purpose model for tasks like RAG.
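For reference, here's a minimal sketch of calling the new model through the Groq Python SDK; only the model ID comes from this announcement, so treat the prompt and surrounding code as an illustrative assumption rather than an official snippet.

```python
# Minimal sketch: querying deepseek-r1-distill-llama-70b-specdec via the Groq
# Python SDK (pip install groq). Assumes GROQ_API_KEY is set in the environment.
from groq import Groq

client = Groq()
response = client.chat.completions.create(
    model="deepseek-r1-distill-llama-70b-specdec",
    messages=[{"role": "user", "content": "Is 9.11 larger than 9.9? Reason it through."}],
)
# Reasoning models emit a <think>...</think> block before the final answer.
print(response.choices[0].message.content)
```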
DeepSeek-R1: #1 open source reasoning model
This model beats OpenAI's full o1 model (o1-1217) on various benchmarks like AIME 2024 (a math competition), SWE-bench Verified (a software engineering benchmark that requires producing GitHub PRs implementing scoped tasks for open source projects like Django), and MATH-500. The techniques used in R1 (reinforcement learning on a strong base model) mainly benefit performance on coding, math and logic problems. An incredible release and gift to the community.
Also noteworthy is the addition of several distilled models, such as Qwen (1.5B up to 32B) and Llama 3.1 8B / Llama 3.3 70B. Those models seem to benefit greatly from being fine-tuned (no reinforcement learning) on traces generated by the full R1 model.
As for the model details: R1 is an MoE model with 671B total parameters and 37B active parameters, a 128K context length and basic support for function calling. Being a reasoning model, it tends to generate many more output tokens for a given user prompt. The long Chain of Thought is encapsulated in a <think></think> XML tag, after which the model provides the final answer.
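If you build on top of a reasoning model you'll usually want to separate the chain of thought from the answer before showing it to users. A minimal sketch, assuming the <think></think> format described above (the helper name is mine):

```python
import re

def split_reasoning(completion: str) -> tuple[str, str]:
    """Split an R1-style completion into (chain_of_thought, final_answer)."""
    match = re.search(r"<think>(.*?)</think>(.*)", completion, re.DOTALL)
    if match is None:
        return "", completion.strip()  # model skipped the think block
    return match.group(1).strip(), match.group(2).strip()

cot, answer = split_reasoning("<think>0.9 > 0.11, so 9.9 > 9.11</think>9.9 is larger.")
print(answer)  # -> "9.9 is larger."
```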
Grok-3 released by xAI: #1 in LMSYS leaderboard in all categories
Impressive release by xAI. Their latest Grok 3 reasoning models reach o3-mini (high), o1 and DeepSeek-R1 levels, and on some benchmarks, like LiveCodeBench (v5) and AIME’25, beat all of those by a decent margin. It was trained on a cluster of 200k H100s, although it's not clear whether all of those GPUs were used for a single run. xAI follows a similar strategy of having both reasoning and non-reasoning models in their latest frontier line-up, each in a big and a small variant. API access is not yet available, so benchmark reproductions haven't happened broadly yet, although the consensus is that this model seems very good.
📦 Repos
verl: Volcano Engine Reinforcement Learning for LLMs
This stack for large-scale reinforcement learning training of LLMs, by ByteDance, is quickly rising in popularity. It's a neat addition alongside Hugging Face's TRL trainer.
Scaling Retrieval-Based Language Models with a Trillion-Token Datastore
A very cool large-scale retrieval system that can index and query a very large text corpus. They introduce this system in the context of exploring compute-optimal scaling laws (see paper). They find that "a smaller model augmented with a large datastore outperforms a larger LM-only model on knowledge-intensive tasks", implying it's more efficient to make large amounts of data available at inference time to dynamically add knowledge, instead of relying purely on pre-training.
Interestingly, this is aligned with earlier remarks from Wen-mei Hwu (of Programming Massively Parallel Processors book fame) at a GPU-MODE (kernel programming community) meetup, where he said he expects models to "remember" less factual data and retrieve more factual data at inference time, as a path to total system optimization (traditional LLMs put a lot of pressure on memory bandwidth because of their size). In a way this is "just RAG", but applied much more widely than retrieving a small number of relevant documents.
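To make the "wide RAG" point concrete, here's a toy sketch of the retrieve-then-prepend pattern. The passages and embeddings are placeholders, and a trillion-token datastore would of course use an approximate nearest-neighbour index (FAISS, ScaNN, ...) rather than brute force:

```python
import numpy as np

# Placeholder datastore: (embedding, passage) pairs standing in for billions
# of embedded chunks behind an ANN index.
passages = [
    "Amsterdam is the capital of the Netherlands.",
    "Speculative decoding uses a small draft model to propose tokens.",
]
embeddings = np.random.randn(len(passages), 384)
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

def retrieve(query_embedding: np.ndarray, k: int = 1) -> list[str]:
    """Return the top-k passages by cosine similarity."""
    query_embedding = query_embedding / np.linalg.norm(query_embedding)
    scores = embeddings @ query_embedding
    return [passages[i] for i in np.argsort(-scores)[:k]]

# Knowledge is added at inference time by prepending retrieved text to the
# prompt instead of baking it into model weights during pre-training.
context = "\n".join(retrieve(np.random.randn(384), k=1))
prompt = f"Context:\n{context}\n\nQuestion: What is speculative decoding?"
```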
Verifiers: a set of tools for reinforcement learning with LLMs in verifiable environments
This is an awesome project showing how to reproduce DeepSeek-R1's proposed method of performing reinforcement learning on strong base models using the GRPO algorithm. The unique value of this repo is in offering an end-to-end training script to run and letting you define the reward signals for the GRPO algorithm to use. The author, Will Brown, an AI researcher at Morgan Stanley, describes this kind of reward signal creation as rubric engineering. Check out his excellent talk on this idea and play with the code!
If you want to build good intuition on the GRPO algorithm itself, first check out the GRPO explainer video linked in this newsletter's edition.
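For a flavour of what rubric engineering looks like in practice, here's a toy sketch of reward functions you could hand to a GRPO trainer; the rubric items and weights are made up for illustration and are not taken from the Verifiers repo.

```python
import re

def format_reward(completion: str) -> float:
    """Small reward for following the <think>...</think> output format."""
    return 0.25 if re.search(r"<think>.*?</think>", completion, re.DOTALL) else 0.0

def correctness_reward(completion: str, reference: str) -> float:
    """Large reward when the verifiable final answer matches the reference."""
    match = re.search(r"</think>\s*(.*)", completion, re.DOTALL)
    final = match.group(1).strip() if match else completion.strip()
    return 1.0 if final == reference.strip() else 0.0

def rubric_reward(completion: str, reference: str) -> float:
    # GRPO samples a group of completions per prompt and uses their relative
    # rewards within the group as the advantage signal, so the rubric only
    # needs to order completions sensibly.
    return format_reward(completion) + correctness_reward(completion, reference)
```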
Uncensored DeepSeek-R1 1776 by Perplexity
They say they've "post-trained it to remove Chinese Communist Party censorship". A useful contribution for everyone who wants their AI features powered by DeepSeek to be a little more Western and a little less CCP. This might have implications for how the CCP views model drops by Chinese companies like the High-Flyer hedge fund-backed DeepSeek. Hopefully there's a limited chilling effect, as Chinese teams are great at innovating in this space. The team from Perplexity carefully analyzed benchmark scores to avoid performance degradation as a result of the uncensoring, and it looks like they've been able to maintain full model performance.
A simple baseline reproduction of o1-like scaling by performing strictly SFT (not RL tuning) on a strong instruct model (they use Qwen 2.5 32B). They show that a tiny amount of high quality curated reasoning data (1K samples) can make a big difference.
📄 Papers
Optimizing Large Language Model Training Using FP4 Quantization
Frontier teams, in this case Microsoft, continue to push the efficiency of model training. In this paper they show how to mitigate the quantization errors that appear in naive FP4 LLM training. They show BF16 and FP8 levels of quality in experiments at the 100B token/13B parameter scale.
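To build intuition for where those quantization errors come from, here's a toy fake-quantization sketch that rounds a tensor onto the FP4 (E2M1) value grid with a per-tensor scale; this is a didactic simulation, not the mixed-precision training recipe from the paper.

```python
import numpy as np

# Representable magnitudes of FP4 E2M1 (the sign bit is handled separately).
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fake_quantize_fp4(x: np.ndarray) -> np.ndarray:
    """Round x to the nearest FP4-representable value using a per-tensor scale."""
    scale = np.max(np.abs(x)) / FP4_GRID[-1] + 1e-12  # map max |x| onto 6.0
    magnitudes = np.abs(x) / scale
    nearest = np.argmin(np.abs(magnitudes[..., None] - FP4_GRID), axis=-1)
    return np.sign(x) * FP4_GRID[nearest] * scale

weights = np.random.randn(4, 4).astype(np.float32)
error = np.abs(weights - fake_quantize_fp4(weights))
print("mean absolute quantization error:", error.mean())
```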
Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach
So far, most test-time compute implementations have scaled inference in "token space": they keep generating tokens in the final output stream to produce long Chain-of-Thought traces that improve final answer accuracy. This approach instead reasons in latent space, deeper within the model architecture, which "can capture types of reasoning that are not easily represented in words". They show that their 3.5B model can reach the quality of 50B-sized models on certain problem classes.
If you've heard of the 🔄 shape rotators 🔄 > 🗣️ wordcels 🗣️ meme then you'll probably be glad to hear that shape rotating can sometimes indeed be the winning approach.
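For intuition, the core idea boils down to re-applying a shared block to the hidden state a variable number of times before decoding, so depth becomes a test-time knob. A toy PyTorch sketch of that loop (not the paper's actual architecture):

```python
import torch
import torch.nn as nn

class RecurrentDepthCore(nn.Module):
    """Toy recurrent-depth block: more loop iterations = more test-time compute."""

    def __init__(self, d_model: int = 256):
        super().__init__()
        self.block = nn.Sequential(
            nn.Linear(d_model, d_model), nn.GELU(), nn.Linear(d_model, d_model)
        )
        self.norm = nn.LayerNorm(d_model)

    def forward(self, h: torch.Tensor, num_iterations: int = 4) -> torch.Tensor:
        # The same weights are reused every iteration, so reasoning depth is
        # chosen at inference time instead of being fixed by parameter count.
        for _ in range(num_iterations):
            h = self.norm(h + self.block(h))
        return h

hidden = torch.randn(1, 16, 256)  # (batch, sequence, d_model)
deeper = RecurrentDepthCore()(hidden, num_iterations=16)
```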
Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention
DeepSeek releases a sparse attention technique that is trained into the model to improve long-context inference performance. They mention scaling up to 64K context lengths in the paper; the DeepSeek-R1 model supports a 128K context window, but the official DeepSeek inference API caps out at 64K, most likely because they're using NSA to improve the efficiency with which they host the model. Very kind of them to share this innovation so openly to help others host DeepSeek better 🙌
Training Software Engineering Agents and Verifiers with SWE-Gym
RL gyms are hot again! They were hot back in 2016 too (check out this throwback banger by OpenAI). This time it's because various labs (OpenAI, DeepSeek) have found that performing RL on a very strong base LLM (crucial) with verifiable rewards (gyms) actually works.
📚 Resources
The Ultra-Scale Playbook: Training LLMs on GPU Clusters
An incredibly detailed guide from the kind folks at Hugging Face explaining the details of large-scale distributed LLM training. They go in-depth on topics like data parallelism (ZeRO-1|2|3) and advanced topics like FP8 training, Flash Attention and other kernel optimizations. A great gift to the community. Now all you need is to be GPU rich 🤓
NaturalReasoning: Reasoning in the Wild with 2.8M Challenging Questions
Meta open sources a large reasoning dataset that they've constructed purely by searching in pre-training data for reasoning traces and by generating relevant matching questions/prompts. Their full data pipeline is LLM based and uses their latest Llama 3.3 70B Instruct model.
AgentStack: start your agent project
A neat scaffolding tool for building agents. It supports the CrewAI, LangGraph, OpenAI Swarm and LlamaIndex frameworks, in addition to multiple LLM, tool and observability providers (although, as it's a project from AgentOps, it biases towards their own provider for the observability part).
Tool Use, Unified by Hugging Face
Neat write-up on tool use formatting by Matthew Carrigan from Hugging Face. The post explains how tool definitions and tool call signatures are standardized, but that more work is required on parsing tool call formats. Overall, tool use is still a complex formatting issue, especially for advanced features like constrained decoding and streaming. Luckily my work at Groq aims to make this problem go away magically when you use tools with Groq inference APIs.
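The gist of the unified approach: define the tool as a plain Python function with type hints and a docstring, and let the chat template render the model-specific format. A minimal sketch (the model choice and the stubbed weather function are my own, not from the post):

```python
from transformers import AutoTokenizer

def get_weather(city: str) -> str:
    """Get the current weather for a city.

    Args:
        city: Name of the city to look up.
    """
    return "sunny, 21C"  # stub; a real tool would call a weather API

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-32B-Instruct")
messages = [{"role": "user", "content": "What's the weather in Amsterdam?"}]

# apply_chat_template converts the function signature + docstring into the
# model-specific tool definition format, keeping your code model-agnostic.
prompt = tokenizer.apply_chat_template(
    messages,
    tools=[get_weather],
    add_generation_prompt=True,
    tokenize=False,
)
print(prompt)
```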
Model Context Protocol continues to grow: a talk about what's next by Anthropic
A recording of the talk should go up later and I will share it once it does. The TLDR is that MCP is Anthropic's bet on how agent protocols will take shape, and more advanced MCP use cases, like MCP servers acting as both client and server, remote MCP servers, and complex authentication support, are coming soon. I'd recommend checking the MCP docs every now and then as it's developing quickly.
I think going into why MCP could be the basis of agents warrants an entire separate blog post, but the TLDR is that MCP makes it easy to delegate tool execution, so an agent (client program) doesn't need to implement the tools itself (e.g. a tool call to fetch Linear tickets can now be delegated to an MCP server that the actual Linear team maintains).
It also allows for complex delegation patterns (sub-agents) where an MCP server handling a tool call itself delegates to other MCP servers and/or makes use of LLM inference in order to satisfy the tool call. This leads to tool calls themselves becoming more "high-level": instead of pure code-style function calling (e.g. fetch JSON weather data for city X), they look more like "fetch my Uber order's status" (perhaps using a delegate browser agent implemented as an MCP server). I like to think of MCP servers as a "GraphQL"-like layer that is optimized on the frontend (what the LLM sees) for natural language, but on the backend can call regular REST APIs to perform a requested tool call.
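To make the delegation idea concrete, here's a minimal sketch of an MCP server exposing a single tool, written against the Python SDK's FastMCP helper; the server name and tool body are hypothetical, and since the spec is evolving quickly the exact SDK surface may differ, so check the MCP docs.

```python
# Minimal MCP server sketch (pip install mcp); the tool implementation is a stub.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("linear-tickets")  # hypothetical server name

@mcp.tool()
def list_tickets(assignee: str) -> list[str]:
    """Return open ticket titles for an assignee."""
    # A real server (e.g. one maintained by the Linear team) would call the
    # Linear API here; the agent/client never implements this itself.
    return [f"Fix login flow (assigned to {assignee})"]

if __name__ == "__main__":
    mcp.run()  # serves over stdio so any MCP client can connect and call the tool
```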
General Reasoning releases large scale reasoning dataset
123k outputs from popular reasoning models like R1, R1-Zero, LIMO, DeepHermes, OpenThoughts, R1-Distil-L70B and DeepScaleR. They include comparison answers from o3-mini and gemini-flash-thinking.
[GRPO Explained] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Excellent Yannic Kilcher video, as always, about GRPO. He makes the RL algorithm, which is a derivative of PPO (invented by OpenAI co-founder John Schulman), surprisingly easy to understand.
Jeff Dean & Noam Shazeer – 25 years at Google: from PageRank to AGI
Two OG Googlers discuss AGI and the technologies to get there. Fascinating listen.
SuperGPQA: large scale GPQA-like benchmark
The numbers: 26,529 questions spanning 285 graduate-level disciplines, organized into 72 fields and 13 top-level categories. Some top model rankings: DeepSeek-R1 (62) > o3-mini (high reasoning effort) (55) >= Doubao-1.5-Pro (55) > Claude-3.5-Sonnet (48).
Want more? Follow me on X! @ricklamers
And no, RL does not mean 'Rick Lamers' 🤣 but Reinforcement Learning. Someone pointed this out to me; can't believe I didn't see that hahahaha