<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Coding with Intelligence]]></title><description><![CDATA[CoWI is a weekly newsletter covering the latest developments in Large Language Models and Machine Learning. Get the latest News, Repos, Demos, Products, and Papers.]]></description><link>https://codingwithintelligence.com</link><image><url>https://substackcdn.com/image/fetch/$s_!OscA!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd83dbc94-e610-434a-80c7-f89ca4a78b1a_688x688.png</url><title>Coding with Intelligence</title><link>https://codingwithintelligence.com</link></image><generator>Substack</generator><lastBuildDate>Sun, 05 Apr 2026 18:47:53 GMT</lastBuildDate><atom:link href="https://codingwithintelligence.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Rick Lamers]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[codingwithintelligence@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[codingwithintelligence@substack.com]]></itunes:email><itunes:name><![CDATA[Rick Lamers]]></itunes:name></itunes:owner><itunes:author><![CDATA[Rick Lamers]]></itunes:author><googleplay:owner><![CDATA[codingwithintelligence@substack.com]]></googleplay:owner><googleplay:email><![CDATA[codingwithintelligence@substack.com]]></googleplay:email><googleplay:author><![CDATA[Rick Lamers]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[An update from Coding with Intelligence]]></title><description><![CDATA[X-only for the time being]]></description><link>https://codingwithintelligence.com/p/an-update-from-coding-with-intelligence</link><guid isPermaLink="false">https://codingwithintelligence.com/p/an-update-from-coding-with-intelligence</guid><dc:creator><![CDATA[Rick Lamers]]></dc:creator><pubDate>Tue, 25 Nov 2025 14:25:24 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!-hXz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F738962d9-5f6e-4dc8-8051-738d878cbf79_1496x1761.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Hi all, things have been crazy busy at Groq, with some very exciting releases coming up. As a result I haven&#8217;t been able to dedicate time to documenting what is happening in AI as much as I would have liked.</p><p>To integrate things in a way that feels sustainable with the work I&#8217;m doing I've decided to go X-only for the time being wrt updates on what is happening in AI.</p><p>In this edition I will highlight a number of recent posts from my X account that I think capture interesting developments of the past month. Of course you can just <a href="https://x.com/ricklamers">follow me on X</a> to keep getting updates more frequently. 
In the future I might return to posting on Substack; in the interim, I will occasionally highlight some of the most relevant posts I've seen, retweeted, or written on X.</p>
<div class="captioned-image-container"><figure><a class="image-link" target="_blank" href="https://x.com/RickLamers/status/1993308800728605138"><img src="https://substack-post-media.s3.amazonaws.com/public/images/0534e12c-1373-4f31-b22f-70c0ba69d1ab_1496x3731.png" alt=""></a></figure></div>
<div class="captioned-image-container"><figure><a class="image-link" target="_blank" href="https://x.com/RickLamers/status/1993260006389825929"><img src="https://substack-post-media.s3.amazonaws.com/public/images/4c16ac05-9736-4740-bfe7-daadf6c1c482_1496x1070.png" alt=""></a></figure></div>
<div class="captioned-image-container"><figure><a class="image-link" target="_blank" href="https://x.com/alexalbert__/status/1993068200121213222"><img src="https://substack-post-media.s3.amazonaws.com/public/images/f779db62-8c3d-4691-858e-82bc9d8017cd_1496x2609.png" alt=""></a></figure></div>
<div class="captioned-image-container"><figure><a class="image-link" target="_blank" href="https://x.com/RickLamers/status/1992960756749635862"><img src="https://substack-post-media.s3.amazonaws.com/public/images/09a92ccf-8b39-40c4-991f-8f508080ae2c_1496x1002.png" alt=""></a></figure></div>
<div class="captioned-image-container"><figure><a class="image-link" target="_blank" href="https://x.com/goodside/status/1992673828372172937"><img src="https://substack-post-media.s3.amazonaws.com/public/images/fcec0a40-2299-4ef6-8fa4-5be3ebc97292_1496x1188.png" alt=""></a></figure></div>
<div class="captioned-image-container"><figure><a class="image-link" target="_blank" href="https://x.com/RickLamers/status/1991659808618095002"><img src="https://substack-post-media.s3.amazonaws.com/public/images/7e2cea86-a98c-4592-9a4e-9367ea472515_1496x1775.png" alt=""></a></figure></div>
<div class="captioned-image-container"><figure><a class="image-link" target="_blank" href="https://x.com/RickLamers/status/1990877408140861910"><img src="https://substack-post-media.s3.amazonaws.com/public/images/7b2dfff2-58ae-481a-ae45-d6b9cb39aacb_1496x3682.png" alt=""></a></figure></div>
<div class="captioned-image-container"><figure><a class="image-link" target="_blank" href="https://x.com/RickLamers/status/1990749966500049238"><img src="https://substack-post-media.s3.amazonaws.com/public/images/9ff12e4e-e2a1-4bcb-bf4d-59bff1744e9c_1496x2016.png" alt=""></a></figure></div>
<div class="captioned-image-container"><figure><a class="image-link" target="_blank" href="https://x.com/RickLamers/status/1990431581828235358"><img src="https://substack-post-media.s3.amazonaws.com/public/images/55b0c099-a7e4-433f-8971-db96dcc215d2_1496x2005.png" alt=""></a></figure></div>
<div class="captioned-image-container"><figure><a class="image-link" target="_blank" href="https://x.com/RickLamers/status/1989791223775674617"><img src="https://substack-post-media.s3.amazonaws.com/public/images/2327b6d7-d6bc-4b93-8ab3-530db23946ca_1496x1335.png" alt=""></a></figure></div>
<div class="captioned-image-container"><figure><a class="image-link" target="_blank" href="https://x.com/RickLamers/status/1988304740460183739"><img src="https://substack-post-media.s3.amazonaws.com/public/images/738962d9-5f6e-4dc8-8051-738d878cbf79_1496x1761.png" alt=""></a></figure></div>
<div class="captioned-image-container"><figure><a class="image-link" target="_blank" href="https://x.com/RickLamers/status/1986027277138727241"><img src="https://substack-post-media.s3.amazonaws.com/public/images/e5fd3cfd-9023-4225-8e6d-3c60438d2c00_1496x1891.png" alt=""></a></figure></div>
]]></content:encoded></item>
Continuous learning and RL environment scaling]]></title><description><![CDATA[Week 36 of Coding with Intelligence]]></description><link>https://codingwithintelligence.com/p/new-frontier-continuous-learning</link><guid isPermaLink="false">https://codingwithintelligence.com/p/new-frontier-continuous-learning</guid><dc:creator><![CDATA[Rick Lamers]]></dc:creator><pubDate>Thu, 04 Sep 2025 13:28:51 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/4e3b22dd-0af2-4cbf-ad59-79bba3ad00a1_1680x1200.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8qpH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4793f38-2158-438c-a8fd-914c62162256_1680x1200.jpeg"><img src="https://substackcdn.com/image/fetch/$s_!8qpH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4793f38-2158-438c-a8fd-914c62162256_1680x1200.jpeg" width="1456" height="1040" alt=""></a></figure></div><p>&#128240; <strong>News</strong></p><ul><li><p><a href="https://x.com/RyanPGreenblatt/status/1963240309103722651">Trust me bro, just one more RL scale up, this one is the real one with the good envs</a></p><p>Ryan Greenblatt, chief scientist at Redwood Research, saying the quiet part out loud. RL environments have been getting a lot of attention as a way to improve model capabilities. The basic idea is to build realistic simulation environments in which language models run as agents (using tools to interact with said environment) and to reward the behaviors/paths that lead to a good result, like passing unit tests or reaching an objective.<br><br>Ryan rightfully questions the narrative that we're close to a very large acceleration due to AI labs scaling up these kinds of environments. His key argument is twofold: to a decent extent such environments (e.g. basic coding environments with access to a terminal/compiler/unit-test runner) have likely already been used by top labs (scaled-up RL was discussed at the Grok 4 launch, for example), and verification of non-coding tasks could remain a bottleneck for creating useful and diverse RL environments.<br><br>I especially appreciate the Q&amp;A section where he entertains the strongest counterpoints people might have. An excellent format for defending his points. Overall, though, he agrees that there's still a lot of potential in scaling up the number/diversity/quality of RL environments. What I appreciate in particular are his points on the evidence for better verification techniques in non-trivially verifiable domains, as suggested by the IMO gold medal result (verifying a written math proof is not trivial, especially without Lean-like solvers during inference), and the insight that we might be entering an acceleration loop of RL environment creation: SWE agents are speeding up coding so much (as evidenced by Cursor/Codex/SWE-bench results) that environment-building work could become largely automatable with human scientist/engineer oversight. A minimal sketch of the verifiable-environment idea follows below.</p>
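<p>As an illustration of what "verifiable" buys you, here is a minimal, hypothetical sketch of such an environment in Python: the reward is simply whether the agent's candidate code passes the environment's hidden unit tests (pytest assumed installed). The class and task are made up for illustration; real environments add tool use, multi-step interaction, and sandboxing.</p><pre><code>import pathlib
import subprocess
import tempfile

class CodingEnv:
    """Hypothetical single-step coding environment with a verifiable reward."""

    def __init__(self, task_prompt: str, test_code: str):
        self.task_prompt = task_prompt  # natural-language task shown to the agent
        self.test_code = test_code      # unit tests acting as the verifier

    def reward(self, solution_code: str) -> float:
        # Write the candidate solution plus tests to a temp dir and run pytest;
        # the binary pass/fail outcome is the RL reward signal.
        with tempfile.TemporaryDirectory() as d:
            pathlib.Path(d, "solution.py").write_text(solution_code)
            pathlib.Path(d, "test_solution.py").write_text(self.test_code)
            result = subprocess.run(["pytest", "-q", d], capture_output=True)
            return 1.0 if result.returncode == 0 else 0.0

env = CodingEnv(
    task_prompt="Write add(a, b) returning the sum.",
    test_code="from solution import add\ndef test_add(): assert add(2, 3) == 5\n",
)
# In RL training the rollout would come from the policy model being trained:
print(env.reward("def add(a, b):\n    return a + b\n"))  # 1.0
</code></pre>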
</li><li><p><a href="https://openai.com/index/introducing-gpt-oss/">OpenAI puts the "Open" back in OpenAI with the launch of gpt-oss 20B &amp; 120B models</a></p><p>Awesome open source model drop by OpenAI; these MoE models with high expert counts (32 and 128 experts, respectively) perform remarkably well, especially when combined with function calling.<br><br>We're hosting them at Groq and I've built a demo of these models powering an AI chat that can interface directly with a spreadsheet (all in your browser!): https://autosheet.groqlabs.com/ (that project itself is also open source, hack the source!)</p></li><li><p><a href="https://mcpmark.ai/leaderboard">MCPMark: a comprehensive evaluation suite for evaluating the agentic ability of frontier models</a></p><p>A modern agentic eval that measures the performance of various models on e2e tasks involving MCP servers, focused on Filesystems, Notion, Playwright, GitHub, and Postgres. An interesting lead for GPT-5, although it's on par with Claude 4.1 Opus on the Notion tasks.</p></li><li><p><a href="https://developers.googleblog.com/en/introducing-gemini-2-5-flash-image/">Gemini ships SOTA image editing model</a></p><p>Gemini 2.5 Flash Image, aka nano-banana, has been floating in the ether for a while now with early access on <a href="https://Yupp.ai">Yupp.ai</a> and LMArena. The model is incredibly powerful for stable image editing where only targeted modifications are applied. It beats both OpenAI's gpt-image-1 and Gemini's previous model by a wide margin, as evidenced by the Arena scoring gap. It is so good that on Reddit people are posting old paid Photoshop work next to nano-banana zero-shotting the same tasks, often with higher quality (like getting reflections right). If you were looking for specific and real examples of work displacement, then look no further. An incredible release, putting the Gemini chat app in the lead on image editing/generation capabilities.</p></li><li><p><a href="https://x.ai/news/grok-code-fast-1">Grok enters AI coding with Grok Code Fast 1</a></p><p>The model is quite fast, but quality seems to be close to/slightly worse than Claude Sonnet 4 in practice. If they can keep the speed and boost the quality to GPT-5 level they would have a real killer combo on their hands. They subsidized the launch heavily by giving away a lot of tokens to the key coding agents (GitHub Copilot, Cline, opencode, Cursor, Kilo Code, Roo Code, Windsurf) and OpenRouter, leading to increased adoption and a lead on OpenRouter coding token consumption (momentum seems to have kept up even after the free period ended). I suspect the speed will be copied by others, as the qualitative effect of "staying in flow" is really valuable while coding. &lt;groq-shill&gt;If anyone wants to achieve that using novel chip architecture (LPUs) hit me up! 
&lt;/groq-shill&gt;</p></li><li><p><a href="https://simonwillison.net/2025/Aug/7/gpt-5/">GPT-5: Key characteristics, pricing and model card</a></p><p>GPT-5 has been out now for close to a month, and I think overall the routing mode has not been received very well (in ChatGPT the Auto mode often selects the fast models for questions that really call for the thinking model, and gets them wrong as a consequence). The GPT-5 Thinking model is quite good but takes a long time. They seem to have pushed themselves out of the attractive "high quality, instant, non-reasoning" model market for the moment; direct non-thinking responses from Opus 4.1 feel in a league of their own compared to GPT-5 Instant, for example.<br><br>This article by Simon Willison does a great job capturing all the nuance of the release.<br><br>I'd add that in my anecdotal use of GPT-5 inside Cursor the model performs really, really well, although for trickier problems I tend to switch to Opus 4.1, which is way too expensive to use for everything (at about 10x the cost of GPT-5). I suspect GPT-5 is good in Cursor because one of the areas it shines in is steerability. Cursor adds a lot of sensible context in their (tool) prompts, and I like being quite specific about how the model implements things, as I'm generally not vibe coding but doing "assisted software engineering".</p></li><li><p><a href="https://deepmind.google/discover/blog/genie-3-a-new-frontier-for-world-models/">Gemini teases Genie 3: a high definition world model simulator that can be manipulated</a></p><p>The example of the paint brush painting a wall with persistence (even when looking away) is quite compelling. No playable demo unfortunately. It will be interesting to see what the value of world model simulators like this will be going forward. Will they be the basis for embodied-simulation RL training, or more of a standalone product experience, like promptable game engines for games/interactive experiences?</p></li><li><p><a href="https://simonwillison.net/2025/Aug/19/qwen-image-edit/">Qwen-Image &amp; Qwen-Image-Edit</a></p><p>Strong image editing capabilities; the best-in-class open source alternative to nano-banana/Gemini 2.5 Flash Image, with capabilities similar to Flux Kontext (as one Reddit user puts it, "I'd say Qwen Image editing is slightly superior to Kontext in prompt following when editing image").</p></li><li><p><a href="https://blog.google/products/gemini/gemini-2-5-deep-think/">Gemini 2.5 Pro Deep Think ships in Gemini app</a></p><p>Ultra subs only. No API access. But it's pretty wild to have such strong intelligence available pretty much self-serve.</p></li><li><p><a href="https://www.snowflake.com/en/engineering-blog/faster-gpt-oss-reasoning-arctic-inference/">Fast Reasoning on GPT-OSS with Speculative Decoding and Arctic Inference</a></p><p>Interesting work from Snowflake accelerating open source inference by using the model's hidden state to predict more tokens ahead as a form of speculative decoding. For readers new to the idea, a toy version of the draft-then-verify loop is sketched below.</p>
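<p>A minimal sketch of plain speculative decoding (greedy case) to make the concept concrete. Note this is the generic draft-then-verify scheme, not Arctic Inference's hidden-state-based recipe; the model objects and their argmax_next method are hypothetical stand-ins.</p><pre><code># Toy speculative decoding: a cheap draft model proposes k tokens, the target
# model checks them, and we keep the longest prefix the target agrees with.
# argmax_next(prefix) is a hypothetical helper returning a model's greedy next
# token; in real systems the verification is one batched forward pass.
def speculative_decode(target, draft, prompt_ids, k=4, max_new=64):
    ids = list(prompt_ids)
    while max_new > len(ids) - len(prompt_ids):
        # 1) Draft k tokens cheaply.
        proposal = []
        for _ in range(k):
            proposal.append(draft.argmax_next(ids + proposal))
        # 2) Verify: what would the target emit at each drafted position?
        accepted = []
        for i in range(k):
            t = target.argmax_next(ids + proposal[:i])
            accepted.append(t)
            if t != proposal[i]:
                break  # target disagrees: keep its token and stop accepting
        else:
            # All k drafts accepted; the target also yields one bonus token.
            accepted.append(target.argmax_next(ids + proposal))
        ids.extend(accepted)
    return ids
</code></pre>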
</li><li><p><a href="https://huggingface.co/meituan-longcat/LongCat-Flash-Chat">LongCat-Flash-Chat: a 560B MoE with 18.6B&#8764;31.3B activated parameters</a></p><p>Another foundation model release from China that pushes the frontier with a novel dynamic activation parameter count that is context specific. It outperforms DeepSeek V3.1 on certain benchmarks, truly pulling its weight in the frontier bucket, and reaches parity with Claude 4 Sonnet on many coding tasks. An interesting artifact to study and a good model to run. Perhaps more importantly, it indicates fierce open source competition in China, potentially leading to further acceleration of open source model development.</p></li><li><p><a href="https://huggingface.co/zai-org/GLM-4.5V">GLM-4.5V: a very strong vision language model by Zhipu AI</a></p><p>Based on GLM-4.5 Air (108B), significantly outperforming Gemma 3 27B.</p></li><li><p><a href="https://www.primeintellect.ai/blog/environments">Prime Intellect announces Environments Hub</a></p><p>A cool project to facilitate a push for open source RL environment development. See the "Trust me bro, just one more RL scale up" post for more context on the need for these kinds of RL environments. It will be interesting to see how many environments get created and what experimental RL training runs show in terms of gains on downstream tasks. At the time of writing, about 125 environments have been created.</p></li></ul><p>&#128230; <strong>Repos</strong></p><ul><li><p><a href="https://github.com/HazyResearch/cartridges">Cartridges: Storing long contexts in tiny caches with self-study</a></p><p>An exploration of continuous learning by the Stanford Hazy Research lab: they train a KV cache (labeled a "cartridge") per corpus (a collection of documents) and use the trained KV cache during inference as an alternative to loading the entire set of documents into the context window.<br><br>A radical idea with significant performance implications: 38.6&#215; less memory and 26.4&#215; higher throughput. It's still more costly than In-Context Learning (ICL, i.e. just stuffing the context window) in terms of training time (30m for a cartridge on a Llama 3 8B) vs. prefill time, but there are ideas for speeding this up. A rough sketch of the underlying idea follows below.</p>
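<p>To make the idea concrete, here is a prefix-tuning-style stand-in: a small trained prefix replaces the corpus against a frozen model. This is my simplified illustration, not the repo's actual method or API (Cartridges trains the KV cache directly and generates its own "self-study" data; the QA pairs below are a stub), assuming a Hugging Face causal LM.</p><pre><code># Distill a corpus into a small trained prefix so the frozen model can answer
# questions about it without the documents in context. Illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Meta-Llama-3-8B"  # any causal LM works for the sketch
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)
model.requires_grad_(False)  # the base model stays frozen

n_virtual = 128
prefix = torch.nn.Parameter(  # this is the "cartridge" being trained
    torch.randn(1, n_virtual, model.config.hidden_size) * 0.02
)
opt = torch.optim.AdamW([prefix], lr=1e-3)

def qa_loss(question: str, answer: str) -> torch.Tensor:
    q = tok(question, return_tensors="pt").input_ids
    a = tok(answer, return_tensors="pt").input_ids
    emb = model.get_input_embeddings()(torch.cat([q, a], dim=1))
    inputs = torch.cat([prefix, emb], dim=1)  # prepend the trained prefix
    ignore = torch.full((1, n_virtual + q.shape[1]), -100, dtype=torch.long)
    labels = torch.cat([ignore, a], dim=1)    # only answer tokens are scored
    return model(inputs_embeds=inputs, labels=labels).loss

# "Self-study": QA pairs about the corpus, normally teacher-generated offline.
synthetic_qa_pairs = [("What does the report say about Q3 revenue?", "It grew 12%.")]
for question, answer in synthetic_qa_pairs:
    opt.zero_grad()
    qa_loss(question, answer).backward()
    opt.step()
</code></pre>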
</li><li><p><a href="https://github.com/OpenPipe/email-deep-research">OpenPipe: e2e example of RL training a Deep Research agent on an open source model for email search</a></p><p>Noteworthy: OpenPipe was just acquired by CoreWeave, likely to assist with training open source models for specific use cases using SFT/RL.<br><br>The implementation is based on OpenPipe's open source ART learning framework, which itself is based on torchtune.</p></li><li><p><a href="https://github.com/pengzhangzhi/Open-dLLM">Open-dLLM: Open Diffusion Large Language Models</a></p><p>No other diffusion LLM project had open sourced both data and training code; only projects with inference, evaluation, and weights were available so far. This project changes that with total openness on all of those dimensions. Helpful for studying diffusion-based LLMs e2e! Performance on coding tasks like HumanEval lags quite a bit behind other open-weight models like Dream 7B (20.8 vs 56.7), so there might be some suboptimal choices in architecture, training, or data.</p></li><li><p><a href="https://ucsc-vlaa.github.io/OpenVision2/">OpenVision 2: simplified pretraining and stronger vision encoder performance</a></p><p>Continued impressive work on vision encoders by the team at UCSC (and folks from Apple/Berkeley). Especially useful due to the full transparency on method, training code, and data. OpenVision 2 boasts a 2x reduction in training time and memory use while achieving higher quality than the original OpenVision encoders.</p></li></ul><p>&#128196; <strong>Papers</strong></p><ul><li><p><a href="https://arxiv.org/abs/2509.02534">Jointly Reinforcing Diversity and Quality in Language Model Generations</a></p><p>A new paper from the Meta FAIR team argues that the usual push for accuracy and helpfulness tends to squeeze out response diversity, which hurts creativity. They show you can optimize for both diversity and quality at the same time, with strong results on verifiable math and creative writing. The good news: there seems to be no inherent trade-off. Potentially less AI slop in the future!</p></li><li><p><a href="https://arxiv.org/abs/2509.01055">VerlTool: a toolkit for tool-integrated reasoning training</a></p><p>The accompanying paper shows performance gains from using VerlTool versus other approaches to training LLM agents to use tools during reasoning. This is like the GPT-5 Thinking/o3 mode in ChatGPT that can use image manipulation and web search while rolling out its Chain-of-Thought. Their framework lets you train this capability into an existing model using RL algorithms like GRPO and DAPO. The sketch below shows the general shape of such a tool-integrated rollout.</p>
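<p>A generic sketch of what a tool-integrated reasoning rollout looks like. This is the pattern, not VerlTool's actual API; the marker tags and the llm.generate call are made up. Generation pauses at a tool-call marker, the tool result is spliced back into the context, and the finished trajectory is what gets scored and used for the RL policy update.</p><pre><code># Shape of a tool-integrated reasoning rollout (names are illustrative).
# Generation stops at a tool-call marker, the tool runs, and its result is
# appended so the model can continue reasoning with the new information.
import json

TOOLS = {"search": lambda args: f"top results for {args['query']!r}"}

def rollout(llm, prompt, max_steps=8):
    context = prompt
    for _ in range(max_steps):
        chunk = llm.generate(context, stop=["[/TOOL]"])  # hypothetical API
        context += chunk
        if "[TOOL]" not in chunk:
            break  # no tool call: the model produced its final answer
        call = json.loads(chunk.split("[TOOL]")[-1])  # e.g. {"name": "search", ...}
        result = TOOLS[call["name"]](call["args"])
        context += f"[/TOOL][RESULT]{result}[/RESULT]"
    return context  # scored by a verifier, then used in the GRPO/DAPO update
</code></pre>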
</li></ul><p>&#128218; <strong>Resources</strong></p><ul><li><p><a href="https://huggingface.co/xai-org/grok-2">Grok 2 model was open sourced</a></p><p>An interesting artifact to study, but probably not very useful now that better models have been released. <a href="https://huggingface.co/xai-org/grok-2/discussions/24">https://huggingface.co/xai-org/grok-2/discussions/24</a> has some details on the architecture: 270B total, 115B activated MoE with 8 experts. Based on the Grok 2 launch blog post, it reaches roughly the level of Claude Sonnet 3.5 on tasks like GPQA and HumanEval (a little below).</p></li><li><p><a href="https://arcprize.org/blog/hrm-analysis">The Hidden Drivers of HRM's Performance on ARC-AGI</a></p><p>A breakdown of the HRM model's performance on ARC-AGI-1 (41%) with a 27M-parameter model (small!).</p></li><li><p><a href="https://transluce.org/jailbreaking-frontier-models">Automatically Jailbreaking Frontier Language Models with Investigator Agents</a></p><p>"We train investigator agents using reinforcement learning to generate natural language jailbreaks for 48 high-risk tasks involving CBRN materials, explosives, and illegal drugs. Our results show success against models including GPT-5-main (78%), Claude Sonnet 4 (92%), and Gemini 2.5 Pro (90%). We find that small open-weight investigator models can successfully attack frontier target models, demonstrating an approach to cost-effective red-teaming." Couldn't have said it better. An interesting find + deep dive on how jailbreaking isn't solved, even at the frontier.</p></li><li><p><a href="https://www.youtube.com/playlist?list=PLgE2fNm9NzK1yvZcVzuZXreyMB42AVeGl">[Video playlist] GPU MODE meetup talk recordings</a></p><p>GPU MODE usually brings out very competent operators; this time MoE, long context, and quantization are discussed (among other topics).</p></li><li><p><a href="https://dennyzhou.github.io/LLM-Reasoning-Stanford-CS-25.pdf">Slides of Denny Zhou, Gemini reasoning lead at Google DeepMind</a></p><p>Shows how reasoning was built on principles like Self-Consistency, and argues that more work on verification is needed to make progress on real-world tasks.</p></li><li><p><a href="https://x.com/sayashk/status/1963343022252315112">Browser Use Agent benchmark by Princeton's Holistic Agent Leaderboard on OpenAI/Anthropic latest models</a></p><p>They find a large gap between pass@1 and pass@any, which suggests much higher pass@1 may be reachable through further optimization: since pass@any shows the current agents can perform the task in the browser some of the time, there might be a way to elicit the right behaviors every time, rather than a need for a fundamental capability unlock. The best model+scaffold combination scores 42.3% on the Online Mind2Web benchmark while pass@any is 88.3%. (For reference, the standard unbiased pass@k estimator is sketched below.)</p>
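<p>For readers who haven't seen it: given n sampled attempts per task of which c succeed, pass@k is usually estimated with the unbiased formula from OpenAI's HumanEval paper, the probability that at least one of k draws succeeds:</p><pre><code># Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k)
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least 1 of k samples passes) from n samples, c passing."""
    if k > n - c:
        return 1.0  # every size-k subset must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=16, c=4, k=1))  # 0.25   (plain per-sample success rate)
print(pass_at_k(n=16, c=4, k=8))  # ~0.96  (approaching a pass@any-style number)
</code></pre>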
</li><li><p><a href="https://cursor.com/blog/kernels">Cursor: 1.5x Faster MoE Training with Custom MXFP8 Kernels</a></p><p>Cursor flexing their model-training muscle. With Cursor's phenomenal adoption success it's likely they'll do more &amp; more work to compete with frontier labs on coding(-adjacent) models. An interesting deep dive on how to get MXFP8 training to be correct &amp; efficient on Blackwell B200s.</p></li><li><p><a href="https://www.aleksagordic.com/blog/vllm">Inside vLLM: Anatomy of a High-Throughput LLM Inference System</a></p><p>Aleksa Gordi&#263; has a knack for teaching; this blog post effortlessly runs you through the key components of vLLM, from advanced features like disaggregated prefill/decoding to the basics of single- to multi-GPU deployment. Great visualizations too!</p></li><li><p><a href="https://github.com/ast-grep/ast-grep-mcp">ast-grep-mcp</a></p><p>Equipping SWE agents with specialized MCP tools makes them operate more intelligently and increases their utility and efficiency. ast-grep-mcp gives agents like Cursor and Claude Code the ability to perform structural code search: as a refresher, that's a syntax where you write code patterns with $METAVAR placeholders (like $A, $B) to match and capture AST nodes, letting you search for structural code patterns rather than just text. A toy illustration of the difference follows below.</p>
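<p>To see why structural search beats text search, here is a toy version of the idea using Python's built-in ast module (plain Python, not ast-grep itself): the equivalent of the pattern log($A), matching calls regardless of formatting while ignoring look-alike text.</p><pre><code># Structural search, toy version: match every call of the form log($A) and
# capture $A, no matter how the call is formatted. A text grep can't do this.
import ast

code = (
    "log(user.id)\n"
    "log( compute_total(cart) )\n"
    "print('log(x) inside a string is not a call')\n"
)

for node in ast.walk(ast.parse(code)):
    if (isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id == "log"
            and len(node.args) == 1):
        print("matched $A =", ast.unparse(node.args[0]))
# matched $A = user.id
# matched $A = compute_total(cart)
</code></pre>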
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4004e081-0c1a-4be9-a5f3-acab1cecb6e2_1680x1200.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1040,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1191076,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://codingwithintelligence.com/i/169234328?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4004e081-0c1a-4be9-a5f3-acab1cecb6e2_1680x1200.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Qp67!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4004e081-0c1a-4be9-a5f3-acab1cecb6e2_1680x1200.png 424w, https://substackcdn.com/image/fetch/$s_!Qp67!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4004e081-0c1a-4be9-a5f3-acab1cecb6e2_1680x1200.png 848w, https://substackcdn.com/image/fetch/$s_!Qp67!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4004e081-0c1a-4be9-a5f3-acab1cecb6e2_1680x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!Qp67!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4004e081-0c1a-4be9-a5f3-acab1cecb6e2_1680x1200.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Qwen3-Coder benchmark scores</figcaption></figure></div><p>&#128240; <strong>News</strong></p><ul><li><p><a href="https://qwenlm.github.io/blog/qwen3-coder/">Qwen releases Qwen3-Coder + a fork of gemini-cli for a terminal coding agent</a></p><p>It is a large MoE model (Qwen3-Coder-480B-A35B-Instruct) with almost double the active parameters compared to the updated Qwen3 
235B general reasoning/non-reasoning models. It natively supports a 256k context window, with YaRN-based extension to 1M (available on the Alibaba-hosted endpoint). The focus seems to be on making the model strong at agentic coding tasks, as required by Claude Code-esque tools.<br><br>Their fork of the open source <a href="https://github.com/google-gemini/gemini-cli">gemini-cli</a> (similar to Claude Code), called <a href="https://github.com/QwenLM/qwen-code">qwen-code</a>, was created to optimize the prompts for the Qwen3-Coder model. Open source models tend to benefit an outsized amount from prompt tweaks/iteration for maximally stable performance.<br><br>The headline performance number is on the SWE-bench Verified subset, where it reaches 67% vs 68% (Claude Sonnet 4) and 48.6% (GPT-4.1), which is impressive to say the least. On Terminal-Bench it reaches 37.5% (vs 35.5% for Claude Sonnet 4), which matters for coding tasks that involve running e.g. npm/git/script commands. For the full benchmark figures see the blog post.</p></li><li><p><a href="https://huggingface.co/Qwen/Qwen3-235B-A22B-Thinking-2507">Qwen releases Qwen3-235B-A22B updates: dedicated thinking and non-thinking models with close to SOTA performance</a></p><p>It outperforms Gemini 2.5 Pro on LiveCodeBench v6 (25.02-25.05) with 74.1% vs 72.5%, while it trails Gemini 2.5 Pro on GPQA with 81.1% vs 86.4%. It gets 18.2% on Humanity's Last Exam while Gemini 2.5 Pro gets 21.6%.<br><br>What is interesting here is that Qwen decided to release two separate dedicated models for the thinking and non-thinking versions. The team remarked on Twitter that forcing a single model to do both through system-prompt reason-toggling (the hybrid approach) sacrificed quality on both ends (not the best reasoning model, not the best non-reasoning model).<br><br>The non-thinking model is called Instruct and can be found here <a href="https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507">https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507</a></p></li><li><p><a href="https://moonshotai.github.io/Kimi-K2/">MoonshotAI releases Kimi-K2: a 1T open source MoE</a></p><p>A very strong release with a whopping 1T parameters and 384 experts. The model achieves competitive scores with models like GPT-4.1 and Gemini 2.5 Flash, and touts reliable function calling for agentic use cases like MCP-server-backed agents. See their blog post for a very detailed breakdown of comparison scores. Kudos to the rigorous evals!<br><br>A notable feature of this release is their use of Muon as the neural network optimizer. They introduce MuonClip and explain how it was able to scale up to their large 1T model setup. 
They dropped a tech report covering MuonClip and more here: <a href="https://github.com/MoonshotAI/Kimi-K2/blob/main/tech_report.pdf">https://github.com/MoonshotAI/Kimi-K2/blob/main/tech_report.pdf</a><br><br>We're hosting this model on Groq at speeds north of 400 tok/s, and I personally worked on tool-calling reliability through various optimizations in the chat template and tool-call parsing &#128293;<br><br>Find it here <a href="https://console.groq.com/docs/model/moonshotai/kimi-k2-instruct">https://console.groq.com/docs/model/moonshotai/kimi-k2-instruct</a></p></li><li><p><a href="https://deepmind.google/discover/blog/advanced-version-of-gemini-with-deep-think-officially-achieves-gold-medal-standard-at-the-international-mathematical-olympiad/">Google DeepMind achieves IMO gold medal with Gemini Deep Think</a></p><p>The International Mathematical Olympiad (IMO) requires participants to solve 6 complex math problems where the answer is a detailed proof demonstrating the correctness of the proposed solution. This has historically been challenging for language models, as the answers are hard to verify (judging the correctness of a proof is typically a lengthy and complex human-guided process).<br><br>What's noteworthy about Google DeepMind's result in this year's competition is, in my eyes, mainly two things:<br><br>1) they used a purely "token-space" approach instead of actively relying on tool use with expert math systems like Lean; the model reasons for multiple hours (under the 4.5-hour competition limit) but impressively converges to a stable answer in the end<br><br>2) they seem to have used a general model with their general-purpose Deep Think test-time-compute strategy: they state that the model used for the competition will be rolled out directly, supporting their claim that "they only added some instructions on how to best solve IMO problems and made available high-quality solutions to other math problems"<br><br>A nice detail is that they included the full final solutions, and they really do read like properly formulated mathematical answers. I won't claim to be able to validate their correctness, but luckily we have AI for that now.<br><br>Note: OpenAI has also claimed an IMO gold medal, but no official writeup was made available other than this thread by lead researchers Noam Brown/Alexander Wei <a href="https://x.com/alexwei_/status/1946477742855532918">https://x.com/alexwei_/status/1946477742855532918</a></p></li><li><p><a href="https://contextual.ai/lmunit/">Contextual releases LMUnit models used for 'unit testing' LLM responses on natural language criteria</a></p><p>The actual Hugging Face collection: <a href="https://huggingface.co/collections/ContextualAI/lmunit-6879d97293090553a9300abe">https://huggingface.co/collections/ContextualAI/lmunit-6879d97293090553a9300abe</a><br><br>It remains to be seen whether it outperforms the most recent model releases on 'unit test-like LLM output judging', but it's cool to see them follow through on their promise of releasing these model finetunes. The general pattern is sketched below.<br><br>The models are licensed under their original respective licenses (qwen/llama).</p>
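<p>For readers new to the pattern: a natural-language unit test is just a criterion scored by a judge model against a response. A minimal generic sketch (not Contextual's API; the judge callable is a stand-in for any scoring model):</p><pre><code># "Unit testing" an LLM response: each plain-English criterion acts like one
# test case, scored by a judge model. `judge` is a hypothetical callable that
# sends a prompt to some model and returns its text completion.
def lm_unit(judge, response: str, criterion: str) -> float:
    prompt = (
        "Rate how well the response satisfies the criterion on a 1-5 scale. "
        "Reply with only the number.\n"
        f"Criterion: {criterion}\n"
        f"Response: {response}\n"
        "Score:"
    )
    return float(judge(prompt).strip())

# A response passes its "test suite" if every criterion scores high enough:
# criteria = ["States the time complexity.", "Includes a worked example."]
# passed = all(lm_unit(judge, answer, c) >= 4 for c in criteria)
</code></pre>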
</li><li><p><a href="https://mistral.ai/news/voxtral">Mistral releases Speech-to-Text model Voxtral</a></p><p>Competitive with GPT-4o mini Transcribe, Gemini 2.5 Flash, and Whisper large-v3.</p></li><li><p><a href="https://www.lesswrong.com/posts/rKC4xJFkxm6cNq4i9/reward-hacking-is-becoming-more-sophisticated-and-deliberate">Reward hacking is becoming more sophisticated and deliberate in frontier LLMs</a></p><p>This post highlights incidents of reward hacking that are increasingly being observed in the wild as frontier labs scale up reinforcement learning for post-training. Suppress the reward hacks, labs must!</p></li><li><p><a href="https://seed.bytedance.com/en/GR3">Bytedance showcases Seed GR-3: a robotics-focused Vision-Language-Action model</a></p><p>The demos show high-dexterity object manipulation like putting a shirt on a hanger. Impressive progress on humanoid robotics on the back of large language model progress. The release includes a technical report going deeper on both the model and the robot arm setup.</p></li></ul><p>&#128230; <strong>Repos</strong></p><ul><li><p><a href="https://github.com/HomebrewML/HeavyBall">HeavyBall: a collection of high performance optimizer implementations for PyTorch</a></p><p>By Lucas Nestler, who's involved with <a href="https://keenagi.com">https://keenagi.com</a> (John Carmack's AGI lab).</p></li></ul><p>&#128196; <strong>Papers</strong></p><ul><li><p><a href="https://arxiv.org/abs/2507.15857">Diffusion Beats Autoregressive in Data-Constrained Settings</a></p><p>An interesting argument around data efficiency for language modeling using diffusion models. As various researchers have proclaimed that "we are running out of data", we might find that diffusion models become more popular over time. This contrasts with the paradigm of scaled-up RL post-training, which seems to be winning at the moment on top of the more classical autoregressive (MoE) models.</p></li><li><p><a href="https://arxiv.org/abs/2507.07955">Dynamic Chunking for End-to-End Hierarchical Sequence Modeling</a></p><p>A Transformer architecture alternative by Albert Gu et al. (Chief Scientist at Cartesia and Assistant Prof. at CMU). The key ideas are more fluid character handling than fixed tokenization schemes such as BPE, and more dynamic compute allocation through learned routing.</p></li><li><p><a href="https://arxiv.org/abs/2503.06072">Large Language Models Post-training: Surveying Techniques from Alignment to Reasoning</a></p><p>A neat overview paper covering paradigm developments in post-training, from parameter-efficient fine-tuning to RL for reasoning. Use it as a starting point if you don't know what either of those terms means.</p></li><li><p><a href="https://arxiv.org/abs/2507.10524">Mixture-of-Recursions: Learning Dynamic Recursive
Depths for Adaptive Token-Level Computation</a></p><p>A promising new LLM architecture from Google DeepMind + several research labs (Mila, KAIST). The core idea is combining adaptive compute with parameter sharing to increase efficiency. Results up to 1.7B parameters look promising. MoR is the acronym to watch; let's see whether more labs adopt it and whether strongly scaled releases drop in the future.</p></li><li><p><a href="https://arxiv.org/abs/2507.00417">ASTRO: Teaching Language Models to Reason by Reflecting and Backtracking In-Context</a></p><p>An interesting concept of generating natural language reasoning traces using MCTS to bootstrap reasoning patterns in non-reasoning LLMs like Llama 3. By the Meta research team.</p></li><li><p><a href="https://arxiv.org/abs/2507.11473">Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety</a></p><p>An interesting industry-wide position paper on the role of Chain of Thought monitoring for AI safety.</p></li><li><p><a href="https://arxiv.org/abs/2502.05967">&#956;nit Scaling: Simple and Scalable FP8 LLM Training</a></p><p>Notes on stable FP8 LLM training by folks from Databricks Mosaic Research.</p></li><li><p><a href="https://arxiv.org/abs/2507.15758">LAPO: Internalizing Reasoning Efficiency via Length-Adaptive Policy Optimization</a></p><p>This paper explores techniques to help models scale reasoning (test-time compute) more dynamically based on whether the problem/prompt requires it. Certainly of interest to the folks at xAI shipping Grok 4, as Elon indicated the team is working on making reasoning duration much more adaptive to problem difficulty.</p></li></ul><p>&#128241; <strong>Demos</strong></p><ul><li><p><a href="https://blog.dynamicslab.ai/">Mirage: The World's First AI-Native UGC Game Engine Powered by Real-Time World Model</a></p><p>Similar to Decart's Minecraft demo; also see the Flappy Bird blog post in this newsletter edition.</p></li></ul><p>&#128736;&#65039; <strong>Products</strong></p><ul><li><p><a href="https://github.com/features/spark">GitHub Spark</a></p><p>GitHub takes a swing at the Lovables, Bolts, Claude Artifacts, and Replits of the world with their prompt-to-app launch. See Simon's reverse engineering for the details of how it works (linked in this newsletter).</p></li></ul><p>&#128218; <strong>Resources</strong></p><ul><li><p><a href="https://rexbench.com/">&#129430; RExBench: A benchmark of machine learning research extensions for evaluating coding agents</a></p><p>A cool project to make progress on AI (coding agents) that can expand beyond the current scope/results of AI research to drive further progress in the field.</p></li><li><p><a href="https://www.youtube.com/watch?v=E22AOHAEtu4">[Video] Advancing the Frontier of Silicon Intelligence: the Past, Open Problems, and the Future</a></p><p>A discussion of open problems by ex-OpenAI (now Meta Superintelligence) researcher Shuchao Bi, given at Columbia University.</p></li><li><p><a href="https://www.youtube.com/watch?v=_PioN-CpOP0">[Video] Fei-Fei Li: Spatial Intelligence is the Next Frontier in AI</a></p><p>Highlights the importance of 3D world (and 4D, world + time) understanding for solving AGI. 
An interesting view, as it's easy to come up with tasks that stress-test this capability for a supposed AGI system.</p></li><li><p><a href="https://www.njkumar.com/optimizing-flappy-bird-world-model-to-run-in-a-web-browser/">Flappy Bird World Model in the Browser: interactive + real-time diffusion architectures</a></p><p>A fascinating deep dive into making a Flappy Bird world model run more efficiently. A peek behind the curtain of what powers recently covered projects like Decart's viral Minecraft world model.</p></li><li><p><a href="https://dannguyenhuu.substack.com/p/tokenomics-1-the-pricing-evolution">Tokenomics #1: The Pricing Evolution of AI Coding Agents</a></p><p>An interesting breakdown by Dan Nguyen-Huu of the pricing explorations code generation startups have gone through as their apps have taken off (from Cursor to Lovable to Devin to Claude Code). I agree that the hybrid (base + credits) approach feels like a sweet spot: it absorbs the extreme outlier heavy users driven by their consumption patterns, while maintaining an acceptable fixed cost structure for the majority of users in the middle of the usage distribution.</p></li><li><p><a href="https://www.anyscale.com/blog/open-source-rl-libraries-for-llms">Open Source RL Libraries for LLMs</a></p><p>A comparison of RL training libraries for LLMs by Anyscale. Verl is the most mature, and the verifiers library is the easiest to get started with.</p></li><li><p><a href="https://simonwillison.net/2025/Jul/24/github-spark/">GitHub Spark deconstructed by Simon Willison</a></p><p>An epic deep dive by Simon on the implementation details of GitHub Spark.<br><br>What stands out is the gigantic 5000+ word system prompt. It is very carefully designed and is the main thing responsible for getting great results from a simple prompt-to-app input. And of course, you can copy it and use it outside of GitHub Spark if you so desire ;-)</p></li><li><p><a href="https://www.bespokelabs.ai/blog/improving-multi-turn-tool-use-with-reinforcement-learning">Improving Multi-Turn Tool Use with Reinforcement Learning</a></p><p>A nice write-up of how to use reinforcement-learning-based finetuning to improve tool use in open source models. They train Qwen-2.5-7B-Instruct and improve 23 percentage points on their particular eval. A neat end-to-end example with training code included (adapted from Will Brown's verifiers library).</p></li><li><p><a href="https://allenai.org/blog/omega">OMEGA: Can LLMs Reason Outside the Box in Math?</a></p><p>An AllenAI-backed exploration of how well models generalize to novel math problems. They try to answer the fundamental question "Are they truly reasoning or are they just recalling familiar strategies without inventing new ones?" Their verdict? 
Limited today, but hopeful about future capabilities as models improve in creativity and in composing isolated skills.</p></li><li><p><a href="https://www.reddit.com/r/StableDiffusion/comments/1m401m1/trained_a_kotext_lora_that_transforms_google/">RealEarth Kontext LoRA for turning Google Earth into realistic images (+ video workflow)</a></p><p>Came across this cool project on r/StableDiffusion; it shows how powerful open releases like FLUX.1-Kontext-dev from Black Forest Labs are for doing your own projects.</p></li><li><p><a href="https://hazyresearch.stanford.edu/blog/2025-06-18-weaver">Weaver: Closing the Generation-Verification Gap with Weak Verifiers</a></p><p>A clever technique of combining multiple weak verifier models to verify LLM-generated answers.</p></li><li><p><a href="https://www.jasonwei.net/blog/asymmetry-of-verification-and-verifiers-law">Jason Wei (ex-OpenAI, now Meta): Asymmetry of verification and verifier&#8217;s law</a></p><p>People familiar with the P != NP conjecture will recognize the idea that it is sometimes simpler (from a complexity perspective) to validate a solution's correctness than to come up with the solution. It seems more labs are tapping into this natural structure of problems to scale RL: a model produces an artifact that can be evaluated relatively efficiently, and the correctly evaluated trajectories are then reinforced in the model (the policy update).<br><br>Jason comments on this idea from his perspective, notably just after the news leaked that he'll be joining Meta's Superintelligence team.</p></li><li><p><a href="https://calv.info/openai-reflections">Reflections on OpenAI</a></p><p>A rare glimpse of what it is like to be at OpenAI, from the perspective of Calvin French-Owen (Segment co-founder), who was co-responsible for the Codex (cloud SWE agent) product.</p></li><li><p><a href="https://magazine.sebastianraschka.com/p/the-big-llm-architecture-comparison">The Big LLM Architecture Comparison</a></p><p>A fantastic overview post by Sebastian Raschka on the evolution of the neural architectures used by the frontier (open source) LLMs. It covers the Mixture of Experts trend, Sliding Window Attention, QK-norm for training stability, Pre/Post-Norm variants, Grouped-Query Attention versus Multi-Head Latent Attention, LayerNorm vs RMSNorm, GELU -&gt; SwiGLU, and ties those developments to popular models like Llama 3.2, Qwen3, Gemma 3 and niche entrants like SmolLM3 from the Hugging Face team.</p></li><li><p><a href="https://www.snowflake.com/en/engineering-blog/arctic-long-sequence-training-multi-million-token-ai/">Arctic Long Sequence Training (ALST): Scalable And Efficient Training For Multi-Million Token Sequences</a></p><p>A welcome open source contribution by Snowflake (by the authors of Deep<strong>Speed</strong>, not to be confused with Deep<strong>Seek</strong>). ALST is a long-sequence-focused training stack that deals with the resource challenges of long-input-sequence training with transformer architectures.</p></li><li><p><a href="https://www.lakernewhouse.com/writing/muon-1">Understanding Muon: 3 part series</a></p><p>A deep dive on Muon. If you haven't heard of Muon, it is a neural network optimizer (an alternative to e.g. Adam) that stabilizes and speeds up training by focusing on the effect of the weight update on model output and constraining it to avoid explosive changes (often the cause of training instability). The core update is small enough to sketch below.</p>
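<p>For the curious, here is the heart of Muon in a few lines: take the momentum-averaged gradient of a 2D weight matrix and orthogonalize it with a Newton-Schulz iteration before applying it, so every update has a controlled norm. This is a minimal paraphrase of Keller Jordan's reference implementation (coefficients taken from it), not a drop-in optimizer.</p><pre><code># Minimal Muon-style update for a single 2D weight matrix W (PyTorch).
import torch

def newton_schulz_orthogonalize(M, steps=5):
    # Iteratively push the singular values of M toward 1 (coefficients from
    # Keller Jordan's Muon implementation), yielding an orthogonalized update.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = M / (M.norm() + 1e-7)
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X

@torch.no_grad()
def muon_step(W, grad, momentum_buf, lr=0.02, beta=0.95):
    momentum_buf.mul_(beta).add_(grad)      # classic momentum accumulation
    update = newton_schulz_orthogonalize(momentum_buf)
    W.add_(update, alpha=-lr)               # constrained-norm weight update

W = torch.randn(256, 128)
buf = torch.zeros_like(W)
muon_step(W, torch.randn_like(W), buf)
</code></pre>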
<a href="https://x.com/RickLamers">@ricklamers</a></p>]]></content:encoded></item><item><title><![CDATA[Going deep on Reasoning + Agents]]></title><description><![CDATA[Week 26 of Coding with Intelligence]]></description><link>https://codingwithintelligence.com/p/going-deep-on-reasoning-agents</link><guid isPermaLink="false">https://codingwithintelligence.com/p/going-deep-on-reasoning-agents</guid><dc:creator><![CDATA[Rick Lamers]]></dc:creator><pubDate>Tue, 24 Jun 2025 13:07:07 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!1fbd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8204163-9574-4992-8b5c-aa45fcd8c3a9_1680x1200.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1fbd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8204163-9574-4992-8b5c-aa45fcd8c3a9_1680x1200.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1fbd!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8204163-9574-4992-8b5c-aa45fcd8c3a9_1680x1200.png 424w, https://substackcdn.com/image/fetch/$s_!1fbd!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8204163-9574-4992-8b5c-aa45fcd8c3a9_1680x1200.png 848w, https://substackcdn.com/image/fetch/$s_!1fbd!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8204163-9574-4992-8b5c-aa45fcd8c3a9_1680x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!1fbd!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8204163-9574-4992-8b5c-aa45fcd8c3a9_1680x1200.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!1fbd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8204163-9574-4992-8b5c-aa45fcd8c3a9_1680x1200.png" width="1456" height="1040" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d8204163-9574-4992-8b5c-aa45fcd8c3a9_1680x1200.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1040,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:397816,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://codingwithintelligence.com/i/166722878?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8204163-9574-4992-8b5c-aa45fcd8c3a9_1680x1200.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!1fbd!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8204163-9574-4992-8b5c-aa45fcd8c3a9_1680x1200.png 424w, 
https://substackcdn.com/image/fetch/$s_!1fbd!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8204163-9574-4992-8b5c-aa45fcd8c3a9_1680x1200.png 848w, https://substackcdn.com/image/fetch/$s_!1fbd!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8204163-9574-4992-8b5c-aa45fcd8c3a9_1680x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!1fbd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8204163-9574-4992-8b5c-aa45fcd8c3a9_1680x1200.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">MiniMax-M1 performance across major benchmarks</figcaption></figure></div><p>&#128240; <strong>News</strong></p><ul><li><p><a href="https://github.com/MiniMax-AI/MiniMax-M1">MiniMax releases two strong open LLM</a></p><p>A very interesting release that you shouldn't ignore. Noteworthy for truly competitive SOTA performance across tasks (range of Claude 4, Qwen 3 235B MoE, R1 0528) with a hybrid attention mechanism + reasoning for scaling more efficiently to longer context. Paper and Hugging Face models are also available, all linked to from the GitHub repo. They use Lightning Attention for their linear-scaling attention portion combined with traditional softmax attention. True open-weight with Apache 2.0.</p></li><li><p><a href="https://mistral.ai/news/magistral">Mistral releases their first reasoning model: Magistral</a></p><p>They release a 24B open source small variant called Magistral Small and a Magistral Medium version that's only available on their API. In terms of quality the Medium model benchmarks close to R1. 
R1-0528 is probably a bit better though, and API pricing of Magistral Medium, at $2/$5 for input/output respectively, clocks in quite a bit higher than R1-0528 pricing at most providers.<br><br>The release is accompanied by a decent 24-page paper; along with the open-weight Magistral Small model, we can definitively say that Mistral is keeping open source from EU soil alive <a href="https://arxiv.org/abs/2506.10910">https://arxiv.org/abs/2506.10910</a></p></li><li><p><a href="https://seed.bytedance.com/en/seedance">ByteDance's latest video model: Seedance 1.0</a></p><p>Very impressive video generation. It seems to lead Veo 3 in certain categories like realistic human aerobic movement, but doesn't have audio generation built in like Veo 3 does.<br><br>Takes the #1 spot on the Artificial Analysis video generation leaderboard.<br><a href="https://artificialanalysis.ai/text-to-video/arena?tab=leaderboard">https://artificialanalysis.ai/text-to-video/arena?tab=leaderboard</a></p></li><li><p><a href="https://x.com/Hailuo_AI/status/1935312287033962818">Hailuo 02: latest video generation model by MiniMax</a></p><p>Impressive video generation, at the SOTA frontier alongside Veo 3 and Seedance 1.0.</p></li><li><p><a href="https://kyutai.org/next/stt">Kyutai STT: A speech-to-text model architecture optimized for real-time usage</a></p><p>You might have seen Kyutai's previous release, which felt a bit mediocre, but I think they really one-upped themselves here. The demo on <a href="https://unmute.sh/">https://unmute.sh/</a> felt quite compelling, with strong semantic voice activity detection making the conversation feel natural. Something I definitely can't say I feel when talking to Siri.</p></li><li><p><a href="https://hkunlp.github.io/blog/2025/Polaris/">POLARIS: A POst-training recipe for scaling reinforcement Learning on Advanced ReasonIng modelS</a></p><p>A cool release of a 4B and a 7B model, plus a great deep-dive post going into the details of scaling RL for reasoning with small models.<br><br>They dropped the models on the Hub <a href="https://huggingface.co/POLARIS-Project/models">https://huggingface.co/POLARIS-Project/models</a><br><br>Haha, no, those capitalizations aren't typos: they spell POLARIS.</p></li><li><p><a href="https://www.youtube.com/watch?v=b0xlsQ_6wUQ">New details from Qwen team about Qwen3 &amp; roadmap</a></p><p>The Qwen team views MoE as the future architecture and is currently developing Qwen 3 coder models, while believing pretraining can be significantly optimized through RL integration, better data cleaning, and synthetic data. Their roadmap includes scaling RL for long-horizon agent tasks in post-training, expanding context length from 1 million tokens this year (for almost all their models) to 10 million later, and developing computer-use agents with enhanced vision capabilities, alongside planned image and video generation capabilities for their omni models.</p></li><li><p><a href="https://qwenlm.github.io/blog/qwen3-embedding/">Qwen3 releases embedding models</a></p><p>Qwen3 Embedding: 0.6B-8B models, 70.58 MTEB score (No. 1 on the MTEB multilingual leaderboard), 100+ languages, Apache 2.0.</p></li></ul><p>&#128230; <strong>Repos</strong></p><ul><li><p><a href="https://github.com/sst/opencode">OpenCode: Claude Code alternative by SST</a></p><p>A clever move by SST, which builds a Terraform-esque product to easily manage the infrastructure needed for your code/apps. 
I think this dovetails nicely with the trend Karpathy has described when he speaks about the pains of vibe coding: needing to do a lot of manual in-browser work. Read Andrej Karpathy's story about that here <a href="https://karpathy.bearblog.dev/vibe-coding-menugen/">https://karpathy.bearblog.dev/vibe-coding-menugen/</a> or check out his YC AI Startup School talk where he touches on the same concept <a href="https://www.youtube.com/watch?v=LCEmiRjPEtQ">https://www.youtube.com/watch?v=LCEmiRjPEtQ</a></p></li><li><p><a href="https://github.com/LMCache/LMCache">LMCache: Redis for LLMs</a></p><p>Cool project improving vLLM's KV caching abilities.</p></li></ul><p>&#128196; <strong>Papers</strong></p><ul><li><p><a href="https://www.youtube.com/watch?v=YLiXgPhb8cQ">Meta Topic: Test-Time-Training</a></p><p>There is something "in the air", it feels like, where more &amp; more parties are exploring the idea of Test-Time-Training seriously. It's the concept that the model evolves and adapts "online" in the classical online-learning ML sense. There are practical limitations and considerations, but it does help us move away from static blobs with hard cutoff dates that can't easily be updated to get right what they currently get wrong.<br><br>Dynamic Deep Learning proposal by the author of the Bitter Lesson, Rich Sutton (main link), and the <a href="https://youtu.be/v0gjI__RyCY?t=5677">Pathways idea from Jeff Dean</a>.<br><br>The other papers linked explore similar ideas, so consider this a paper reading list on the topic.</p></li><li><p><a href="https://arxiv.org/abs/2505.23735">ATLAS: Learning to Optimally Memorize the Context at Test Time</a></p><p>See Meta Topic: Test-Time-Training</p></li><li><p><a href="https://arxiv.org/abs/2506.10943">Self-Adapting Language Models</a></p><p>See Meta Topic: Test-Time-Training</p></li><li><p><a href="https://arxiv.org/abs/2501.00663">Titans: Learning to Memorize at Test Time</a></p><p>See Meta Topic: Test-Time-Training</p></li><li><p><a href="https://arxiv.org/abs/2407.04620">Learning to (Learn at Test Time): RNNs with Expressive Hidden States</a></p><p>See Meta Topic: Test-Time-Training</p></li><li><p><a href="https://arxiv.org/abs/2506.05233">MesaNet: Sequence Modeling by Locally Optimal Test-Time Training</a></p><p>See Meta Topic: Test-Time-Training</p></li><li><p><a href="https://arxiv.org/abs/2506.04761">Log-Linear Attention</a></p><p>An alternative to quadratic full attention and linear SSM-based attention. Co-authored by the original author of Mamba, Tri Dao.</p></li><li><p><a href="https://arxiv.org/abs/2505.24864">ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models</a></p><p>An investigation by NVIDIA of how prolonged RL (scaled RL) can truly expand capabilities beyond those of base models (even when the base models are repeatedly sampled).</p></li><li><p><a href="https://zhuohaoyu.github.io/RewardAnything/">RewardAnything: Generalizable Principle-Following Reward Models</a></p><p>Cool research project with extensive evals on reward models, with a usable Python SDK to try out the reward model they've released.</p></li><li><p><a href="https://researchcodebench.github.io/">ResearchCodeBench: Benchmarking LLMs on Implementing Novel Machine Learning Research Code</a></p><p>With LLMs getting better at code generation, they will be utilized more for writing code to support research activities.
This research quantifies how well current SOTA models like o3, Gemini 2.5 Pro, o4-mini, etc. perform at faithfully implementing research code for novel ML ideas. The bench isn't saturated, but promising results are achieved nonetheless. TL;DR: Gemini leads the pack.</p></li><li><p><a href="https://pinnate-flare-8f3.notion.site/Sharpening-or-Discovery-RL-or-Meta-RL-How-RL-Improves-LLM-Reasoning-20628c119540805cac48e8492638d88e">Sharpening or Discovery, RL or Meta RL?: How RL Improves LLM Reasoning</a></p><p>Two researchers at CMU explore whether RL simply brings to the foreground existing capabilities of models or whether RL actually generates new capabilities within LLMs.<br><br>The authors conclude that RL can either merely "sharpen" existing model capabilities or genuinely "discover" new reasoning strategies, with the key difference being whether RL learns to systematically chain basic skills (meta-RL) versus just reinforcing successful individual responses.</p></li><li><p><a href="https://multiverse4fm.github.io/">Multiverse: Your Language Models Secretly Decide How to Parallelize and Merge Generation</a></p><p>Really cool work exploring how natural parallelism in thought can be used to speed up generation.</p></li><li><p><a href="https://sakana.ai/rlt/">Reinforcement Learning Teachers of Test Time Scaling</a></p><p>Interesting idea: use strong models to fill in the path from question to answer, letting weaker models "hill-climb" toward the right solution path.<br><br>Includes a code release <a href="https://github.com/SakanaAI/RLT">https://github.com/SakanaAI/RLT</a></p></li><li><p><a href="https://arxiv.org/abs/2506.08007">Reinforcement Pre-Training</a></p><p>The Qwen 3 team already highlighted their focus on exploring the use of RL in pre-training; this paper by Microsoft Research and researchers from Tsinghua and Peking University proposes a practical training setup and attempts to characterize scaling behavior using small-scale models.</p></li><li><p><a href="https://arxiv.org/abs/2412.13091">LMUnit: Fine-grained Evaluation with Natural Language Unit Tests</a></p><p>Not open source, but ContextualAI did release a paper. Very interesting to combine promptable reward models like this with RL training toolkits like verifiers (<a href="https://github.com/willccbb/verifiers">https://github.com/willccbb/verifiers</a>). LMUnit is available through their API. They rank #1 on the <a href="https://huggingface.co/spaces/allenai/reward-bench">https://huggingface.co/spaces/allenai/reward-bench</a> leaderboard.</p></li><li><p><a href="https://arxiv.org/abs/2506.01939">Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning</a></p><p>An exploration by the Qwen team of LLM reasoning mechanisms.</p></li></ul><p>&#128241; <strong>Demos</strong></p><ul><li><p><a href="https://magenta.withgoogle.com/magenta-realtime">Google releases Magenta Realtime audio streaming model</a></p><p>A really neat release with impressive open-ended jamming capabilities! Check out the videos!</p></li></ul><p>&#128736;&#65039; <strong>Products</strong></p><ul><li><p><a href="https://mistral.ai/products/mistral-code">Mistral Code: VS Code/JetBrains extension</a></p><p>Seems to be enterprise-license only, unfortunately; I wish they would let you just plug in an API key.
It seems like they are using this to monetize enterprise customers in the EU who care a lot about code/IP-leakage risks and data sovereignty.</p></li></ul><p>&#128218; <strong>Resources</strong></p><ul><li><p><a href="https://www.arcee.ai/blog/extending-afm-4-5b-to-64k-context-length">Extending AFM-4.5B to 64k Context Length</a></p><p>Interesting deep dive into how to extend the context window of models.</p></li><li><p><a href="https://cognition.ai/blog/blockdiff">Cognition, the company behind the Devin SWE agent, releases Blockdiff: an efficient VM disk snapshotter</a></p><p>Cool to see companies continuing the trend of releasing open source infra components (e.g. k8s) to enable other companies to contribute to and strengthen low-level building blocks for building better software. This release in particular is interesting as SWE agents and other kinds of computer-use agents need isolated environments from a safety/repeatability perspective, and inefficiencies there really add up when you're running potentially hundreds of agents in parallel per user.</p></li><li><p><a href="https://www.anthropic.com/engineering/built-multi-agent-research-system">How Anthropic built their multi-agent research agent and what they learned from it</a></p><p>TL;DR: naturally parallelizable tasks can benefit from multi-agent setups, but at the cost of increased token consumption. Design your prompts &amp; tools well for robust performance. Deployment is still tricky and has footguns when scaling.</p></li><li><p><a href="https://prarena.ai/">PR Arena: observe background SWE agent stats in realtime</a></p><p>Very cool to have a quantitative longitudinal view on how well various agents are performing.</p></li><li><p><a href="https://damek.github.io/random/basic-facts-about-gpus/">Basic facts about GPUs</a></p><p>Neat resource if you want to learn more about the nuts and bolts of GPU programming.</p></li><li><p><a href="https://www.youtube.com/watch?v=9YQW2mH9FyA">[Video] Anthropic Interpretability lead interview on LLM Circuits</a></p><p>Deep dive into the use of interpretability work in AI with Emmanuel Ameisen from Anthropic. I like how they dig into the tutorial materials Anthropic recently released to spur open source activity around interpretability. They also showcase the Neuronpedia Circuit Tracer tool, which allows you to explore LLM circuits.</p></li><li><p><a href="https://www.youtube.com/watch?v=Nqb7JTx0Pqo">[Video] SFT training Qwen3-4B for MCP web-tool use by Ronan McGovern from Trelis Research</a></p><p>This is an awesome workshop by Ronan McGovern: a very densely packed 35 minutes of content to get your feet wet with SFT training smaller models by distilling agent traces from stronger counterparts (in the video he uses the stronger Qwen3 30B-A3B model).</p></li><li><p><a href="https://www.latent.space/p/noam-brown">[Video] Latent Space podcast interview with OpenAI reasoning lead Noam Brown</a></p><p>In their words, the interview covers "The Bitter Lesson vs Agent Harnesses &amp; World Models, Debating RL+Reasoning with Ilya, what's *wrong* with the System 1/2 analogy, and the challenges of Test-Time Scaling".</p></li><li><p><a href="https://www.youtube.com/watch?v=Zz4QjZsYWK0">[Video] Dylan Patel of SemiAnalysis speaks about the state of Chinese/Huawei Ascend 910 cards and US power buildout challenges</a></p><p>Recorded at the recent AI Engineer conference in June in SF, Dylan characterizes China's position when it comes to compute and explains nicely some of the tactical steps they've taken to strengthen their position (e.g. 
stockpiling HBM chips through shell companies). An interesting peek into the global race of AI infrastructure.</p></li></ul><div><hr></div><p>Want more? Follow me on X! <a href="https://x.com/RickLamers">@ricklamers</a></p>]]></content:encoded></item><item><title><![CDATA[Claude 4, Qwen 3 & DeepSeek R1 0528: model capabilities keep increasing]]></title><description><![CDATA[Week 22 of Coding with Intelligence]]></description><link>https://codingwithintelligence.com/p/claude-4-qwen-3-and-deepseek-r1-0528</link><guid isPermaLink="false">https://codingwithintelligence.com/p/claude-4-qwen-3-and-deepseek-r1-0528</guid><dc:creator><![CDATA[Rick Lamers]]></dc:creator><pubDate>Sat, 31 May 2025 11:41:14 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/f7c10700-1e46-4195-b5ea-86603f64681a_1680x1200.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!9Xr5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90dfa736-8828-4cfc-ae4f-2936f1b84f8b_1680x1200.png" width="1456" height="1040" alt="">
srcset="https://substackcdn.com/image/fetch/$s_!9Xr5!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90dfa736-8828-4cfc-ae4f-2936f1b84f8b_1680x1200.png 424w, https://substackcdn.com/image/fetch/$s_!9Xr5!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90dfa736-8828-4cfc-ae4f-2936f1b84f8b_1680x1200.png 848w, https://substackcdn.com/image/fetch/$s_!9Xr5!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90dfa736-8828-4cfc-ae4f-2936f1b84f8b_1680x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!9Xr5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90dfa736-8828-4cfc-ae4f-2936f1b84f8b_1680x1200.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Qwen 3 showing SOTA performance</figcaption></figure></div><p>Note: this newsletter edition got a bit long, open in the browser to see the full post.</p><p>&#128240; <strong>News</strong></p><ul><li><p><a href="https://openai.com/index/introducing-codex/">OpenAI releases cloud coding agent Codex</a></p><p>Confusingly named Codex, which is also the name of their open source (Apache 2.0) CLI coding agent they released just a month ago. Integration directly into ChatGPT and strong native GitHub integration bring cloud SWE agents closer to their AI-everything app ChatGPT. It is powered by codex-1 a finetune of o3 focused on SWE tasks.</p></li><li><p><a href="https://huggingface.co/ResembleAI/chatterbox">ResembleAI releases strong open weight TTS model: chatterbox</a></p><p>Impressive quality, supports voice cloning out of the box. 
</li><li><p><a href="https://jules.google/">Google releases Jules: a Devin/Codex background SWE agent competitor</a></p><p>A growing landscape of GitHub-centric background SWE agents complementing "inner dev loop" tools like Cursor and Windsurf; try them and see which one you like.</p></li><li><p><a href="https://www.tbench.ai/">Terminal-Bench: a terminal use benchmark</a></p><p>Does what it says on the tin, a very neatly executed benchmark. It was quickly picked up by frontier labs; its performance was featured in the Anthropic Claude 4 release. Not saturated yet, with SOTA models reaching ~40%. Tasks vary from building the Linux source from scratch to training ML models from the terminal.</p></li><li><p><a href="https://huggingface.co/mistralai/Devstral-Small-2505">Mistral teams up with All Hands AI (OpenHands creators) to release Devstral: a Mistral Small 3.1 finetune optimized for SWE agent tasks</a></p><p>Cool &amp; importantly open work pushing the frontier of open/small code generation and SWE agent models. Comfortably beats GPT-4.1-mini and Claude 3.5 Haiku on SWE-Bench Verified.</p></li><li><p><a href="https://bfl.ai/announcements/flux-1-kontext">Black Forest Labs strikes again: FLUX.1 Kontext</a></p><p>A new image generation and editing model that allows prompting with text and image inputs while being remarkably good at reusing image inputs, to the point of allowing very granular edits. BFL-style, they have both open-weight and proprietary models, reserving the highest quality models for paid consumption. They haven't released the open-weight version yet, supposedly because of safety/misuse concerns. My first impression is that it outperforms Gemini 2.0 image outputs and ChatGPT's image generation model GPT-image-1 on image editing quality.</p></li><li><p><a href="https://sean.heelan.io/2025/05/22/how-i-used-o3-to-find-cve-2025-37899-a-remote-zeroday-vulnerability-in-the-linux-kernels-smb-implementation/">o3 used to find CVE remote zeroday in Linux kernel</a></p><p>There are many memes about vibe-coded software being especially vulnerable because its authors lack cybersecurity experience. This o3 result shows the other side of the coin: the most powerful LLMs have the ability to scan code (perhaps as always-on background processes) to identify otherwise overlooked critical security issues in code that lives in production.</p></li><li><p><a href="https://deepmind.google/discover/blog/alphaevolve-a-gemini-powered-coding-agent-for-designing-advanced-algorithms/">AlphaEvolve: algorithm design by combining search with strong code generation models</a></p><p>One cited contribution was a 23% speedup of a kernel used in training Gemini. It also found a 32% faster variant of the original FlashAttention kernel.</p></li><li><p><a href="https://qwenlm.github.io/blog/qwen3/">Qwen 3 models released</a></p><p>Incredible release, both dense (up to 32B) and MoE (up to 235B-A22B) models. Competitive with, and sometimes outright outperforming, frontier models like o1, R1, Gemini 2.5 Pro, and o3-mini. System-prompt-based control for enabling/disabling reasoning/thinking. Broad language support, touting 119 supported languages. Increased focus on agentic use cases with strong BFCL v3 performance (useful for MCP support).</p></li><li><p><a href="https://huggingface.co/deepseek-ai/DeepSeek-R1-0528">DeepSeek updates R1: 0528. 
More reasoning tokens, higher benchmark scores</a></p><p>Some cherry-picked benchmark deltas between R1 and R1-0528:<br><br>GPQA-Diamond (Pass@1) from 71.5 to 81.0 (o3 83.3, Gemini 2.5 Pro 0506 83.0)<br><br>LiveCodeBench (2408-2505) (Pass@1) from 63.5 to 73.3 (o3 77.3, Gemini 2.5 Pro 0506 71.8)<br><br>SWE Verified (Resolved) from 49.2 to 57.6<br><br>Tau-Bench pass@1 63.9 (Retail), which is approximately gpt-4o-2024-11-20 level (62.61, see <a href="https://hal.cs.princeton.edu/taubench_retail">https://hal.cs.princeton.edu/taubench_retail</a>)</p></li><li><p><a href="https://github.com/canopyai/Orpheus-TTS">Orpheus-TTS: strong open-weight TTS model</a></p><p>Impressive quality; the repo includes finetuning scripts to tune on speaker audio you have.</p></li></ul><p>&#128230; <strong>Repos</strong></p><ul><li><p><a href="https://www.anthropic.com/research/open-source-circuit-tracing">Anthropic open sources some of their interpretability work around circuit tracing in LLMs</a></p><p>Anthropic doesn't have a track record of open sourcing much of anything, so this release is a welcome one, helping others to understand LLMs more deeply. As the capability of models increases we will depend on them for more critical use cases, for which it would be prudent to understand how LLMs arrive at one answer versus another.</p></li><li><p><a href="https://ai.pydantic.dev/cli">uvx clai: minimalist terminal LLM chat app</a></p><p>Neat project by the folks behind Pydantic AI; just run <code>uvx clai</code> in the terminal with API keys set in your env and you're set.</p></li><li><p><a href="https://bagel-ai.org/">ByteDance releases unified multimodal model</a></p><p>Text input, image input, image editing, image generation, and reasoning all in one model. Very interesting artifact to study to understand how these are brought together in a single model.</p></li><li><p><a href="https://github.com/marin-community/marin">marin: open-source framework for the research and development of foundation models</a></p><p>Cool education-focused project on all the stages involved in training foundation models. Also check out <a href="https://marin.community/">https://marin.community/</a></p></li></ul><p>&#128196; <strong>Papers</strong></p><ul><li><p><a href="https://arxiv.org/abs/2505.14669">Quartet: Native FP4 Training Can Be Optimal for Large Language Models</a></p><p>Nice paper exploring FP4 training and demonstrating promising results against FP16 and FP8 on the NVIDIA Blackwell platform. Particularly neat that they open source the working FP4 training code.</p></li><li><p><a href="https://arxiv.org/abs/2505.14674">Reward Reasoning Model</a></p><p>Reasoning All The Things! This paper introduces test-time compute for the reward model used in RL post-training itself. The idea is that with stronger reward models we can improve the signal for training LLMs with RL. The experiments in this paper show that reasoning reward models improve over non-reasoning counterparts; a minimal sketch of the pattern follows below.</p>
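<p>To make the pattern concrete, here is a minimal sketch of a reasoning reward model used as a judge: the judge is prompted to think first and then emit a score, and the parsed score becomes the RL reward. The OpenAI-style client, model name, and prompt format are illustrative assumptions of mine, not the paper's actual setup.</p><pre><code># Sketch of a reasoning reward model: the judge "thinks" first, then
# emits a score usable as an RL reward. Client, model name, and prompt
# format are illustrative assumptions, not the paper's setup.
import re

from openai import OpenAI  # any OpenAI-compatible endpoint works

client = OpenAI()

JUDGE_PROMPT = """You are a reward model. Reason step by step about the
response quality, then end with one line: SCORE: &lt;float in [0, 1]&gt;.

Task: {task}
Response: {response}"""

def reasoning_reward(task: str, response: str) -> float:
    """Return a scalar reward produced after explicit judge reasoning."""
    out = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder judge model
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(task=task, response=response)}],
    ).choices[0].message.content
    match = re.search(r"SCORE:\s*([01](?:\.\d+)?)", out)
    return float(match.group(1)) if match else 0.0

# The scalar then plugs into any RL loop, e.g. scoring GRPO rollouts.
</code></pre>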
<p>In addition to the research, they publish the weights of the pretrained reasoning reward model (on HF).</p></li><li><p>J1: Incentivizing Thinking in LLM-as-a-Judge via Reinforcement Learning</p><p>Work by Meta showing how LLM judges can be improved for use in RL LLM training.</p></li><li><p><a href="https://arxiv.org/abs/2505.14631">Think Only When You Need with Large Hybrid-Reasoning Models</a></p><p>Important topic: adaptively using test-time compute only when the question/prompt asks for it. An exploration by Microsoft shows a feasible approach for deciding the budget adaptively.</p></li><li><p><a href="https://arxiv.org/abs/2505.10475">Parallel Scaling Law for Language Models</a></p><p>Another scaling dimension proposed by the Qwen team: "increasing the model's parallel computation during both training and inference time". In particular, latency is reduced because the added computation is non-sequential, reducing wall-clock time.</p></li><li><p><a href="https://gso-bench.github.io/">GSO: Challenging Software Optimization Tasks for Evaluating SWE-Agents</a></p><p>A new SWE bench with 102 challenging optimization tasks across 10 codebases; SOTA gets 5% (Claude 4).</p></li><li><p><a href="https://arxiv.org/abs/2505.21487">Hardware-Efficient Attention for Fast Decoding</a></p><p>"This work redesigns attention to perform more computation per byte loaded from memory to maximize hardware efficiency without trading off parallel scalability." They introduce Grouped-Tied Attention (GTA) and Grouped Latent Attention (GLA), improving over Grouped-Query Attention (GQA) and Multi-head Latent Attention (MLA) respectively.<br><br>From the same author that brought us Mamba SSMs (Tri Dao).<br><br>It will be interesting to see whether this new hardware-optimized attention variant will garner much adoption.</p></li><li><p><a href="https://arxiv.org/abs/2505.11711">Reinforcement Learning Finetunes Small Subnetworks in Large Language Models</a></p><p>Interesting finding on how RL post-training is naturally sparse and only updates 5-30% of model weights to achieve improved performance.</p></li></ul><p>&#128218; <strong>Resources</strong></p><ul><li><p><a href="https://lilianweng.github.io/posts/2025-05-01-thinking/">Lilian Weng (ex-OpenAI, now Thinking Machines) on reasoning models</a></p><p>Nice survey blog post of the techniques and considerations involved in reasoning models. The section on external tools during reasoning is especially worth checking out; a minimal sketch of that loop follows below.</p>
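<p>For intuition, here's a minimal harness for the tool-use-during-reasoning pattern that section describes: the model interleaves thinking with tool calls, and the harness executes each call and feeds the result back into the context. The stub model and the JSON call format are purely illustrative conventions, not any specific lab's implementation.</p><pre><code># Minimal tool-use-during-reasoning harness. The model is stubbed and
# the JSON tool-call format is an illustrative convention, not a spec.
import json

def calculator(expression: str) -> str:
    """A trivially simple tool the model can call mid-reasoning."""
    return str(eval(expression, {"__builtins__": {}}, {}))

TOOLS = {"calculator": calculator}

def fake_model(transcript: str) -> str:
    """Stand-in for an LLM: emits one tool call, then a final answer."""
    if "TOOL_RESULT" not in transcript:
        return json.dumps({"tool": "calculator", "args": {"expression": "37 * 91"}})
    return "FINAL: 37 * 91 = " + transcript.rsplit("TOOL_RESULT: ", 1)[-1]

def reasoning_loop(prompt: str, max_steps: int = 5) -> str:
    transcript = prompt
    for _ in range(max_steps):
        step = fake_model(transcript)
        if step.startswith("FINAL:"):  # the model decided to answer
            return step
        call = json.loads(step)        # the model requested a tool
        result = TOOLS[call["tool"]](**call["args"])
        transcript += f"\nTOOL_CALL: {step}\nTOOL_RESULT: {result}"
    return "FINAL: (step budget exhausted)"

print(reasoning_loop("What is 37 * 91?"))  # FINAL: 37 * 91 = 3367
</code></pre>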
</li><li><p><a href="https://scalingintelligence.stanford.edu/blogs/fastkernels/">Stanford lab explores using AI coding models to write compute kernels</a></p><p>&gt;They are performing close to or in some cases even beating the standard expert-optimized production kernels shipped in PyTorch<br><br>Some numbers from the blog post:<br><br>Matmul (FP32): 101.3% performance of FP32 torch.matmul; problem size: 4096x4096 square matrices<br><br>Conv2D: 179.9% performance of FP32 torch.nn.Conv2D; problem size: (100, 3, 224, 224) input tensor, conv(in_channels=3, out_channels=96, kernel_size=11, stride=4, padding=2)<br><br>Softmax: 111.8% performance of FP32 torch.softmax; problem size: (4096, 65536) input tensor<br><br>LayerNorm: 484.4% performance of FP32 torch.nn.LayerNorm; problem size: (16, 64, 256, 256) input tensor<br><br>Conv2D + ReLU + MaxPool: 290.1% performance of FP32 torch reference, 189.0% performance of FP32 torch.compile() reference; problem size: (100, 3, 224, 224) input tensor, conv(in_channels=3, out_channels=96, kernel_size=11, stride=4, padding=2), maxpool(kernel_size=3, stride=2)</p></li><li><p><a href="https://www.youtube.com/watch?v=sLaxGAL_Pl0">[Video] Cursor team discusses superhuman coding model training</a></p><p>A conversation about RL for coding models, with interesting thoughts around how to design RL rewards focused on more subjective areas, like shorter code being better than longer, more verbose solutions.</p></li><li><p><a href="https://unsupervised-learning.simplecast.com/episodes/ep-66-member-of-technical-staff-at-anthropic-sholto-douglas-on-claude-4-next-phase-for-ai-coding-and-the-path-to-ai-coworker">[Video] Anthropic Technical Staff on Unsupervised Learning (Redpoint) pod</a></p><p>On how coding models are already accelerating research progress, coding performance as a signal for model quality, &amp; more.</p></li><li><p><a href="https://www-cdn.anthropic.com/4263b940cabb546aa0e3283f35b686f4f3b2ff47.pdf">Claude 4 System Card</a></p><p>Interesting detailed examples of undesirable model behaviors like snitching on the user and self-preservation. Great to see them discussed in the open, although, expectedly, there were some worried responses across the interwebs because of these behaviors. 120 pages with lots of detailed examples. An interesting result (p. 110) was that Claude 4 Opus was able to train a quadruped using RL in one of its runs (beating an expert baseline under a constrained training budget). This shows that models are getting better at accelerating ML research itself, since neither Claude Sonnet 3.7 nor Claude Sonnet 4 was able to beat the expert baseline threshold even once.</p></li><li><p><a href="https://www.youtube.com/watch?v=64lXQP6cs5M">[Video] Anthropic Technical Staff on Dwarkesh pod</a></p></li></ul><div><hr></div><p>Want more? Follow me on X! 
<a href="https://x.com/RickLamers">@ricklamers</a></p>]]></content:encoded></item><item><title><![CDATA[Open Source RL training landscape grows]]></title><description><![CDATA[Week 19 of Coding with Intelligence]]></description><link>https://codingwithintelligence.com/p/open-source-rl-training-landscape</link><guid isPermaLink="false">https://codingwithintelligence.com/p/open-source-rl-training-landscape</guid><dc:creator><![CDATA[Rick Lamers]]></dc:creator><pubDate>Fri, 09 May 2025 17:04:54 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!sUhv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5b67c45-1e9e-4961-a04c-b99b7d016841_1680x1200.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!sUhv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5b67c45-1e9e-4961-a04c-b99b7d016841_1680x1200.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!sUhv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5b67c45-1e9e-4961-a04c-b99b7d016841_1680x1200.png 424w, https://substackcdn.com/image/fetch/$s_!sUhv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5b67c45-1e9e-4961-a04c-b99b7d016841_1680x1200.png 848w, https://substackcdn.com/image/fetch/$s_!sUhv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5b67c45-1e9e-4961-a04c-b99b7d016841_1680x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!sUhv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5b67c45-1e9e-4961-a04c-b99b7d016841_1680x1200.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!sUhv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5b67c45-1e9e-4961-a04c-b99b7d016841_1680x1200.png" width="1456" height="1040" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b5b67c45-1e9e-4961-a04c-b99b7d016841_1680x1200.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1040,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:147791,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://codingwithintelligence.com/i/163222831?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5b67c45-1e9e-4961-a04c-b99b7d016841_1680x1200.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!sUhv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5b67c45-1e9e-4961-a04c-b99b7d016841_1680x1200.png 424w, 
<figcaption class="image-caption">SkyRL RL-based training architecture, built on veRL</figcaption></figure></div><p>&#128240; <strong>News</strong></p><ul><li><p><a href="https://github.com/Zyphra/Zonos">Zyphra Zonos: strong open source TTS model with voice cloning</a></p><p>Especially <a href="https://x.com/ZyphraAI/status/1888996367923888341">their voice cloning examples</a> are highly impressive (stay safe with these audio deepfakes!). Noteworthy that they have both a Transformer and an SSM-hybrid version. The <a href="https://huggingface.co/collections/Zyphra/zonos-v01-67ac661c85e1898670823b4f">weights on HF</a> are licensed Apache 2.0.</p></li><li><p><a href="https://blog.google/products/gemini/gemini-2-5-pro-updates/">Gemini 2.5 May refresh: 0506</a></p><p>They claim:<br>&gt;significantly improved capabilities for coding, especially building compelling interactive web apps<br><br>Reports on 0506 vs the previous 0325 version of Gemini 2.5 Pro vary. 
What is a bit unfortunate is that the 0325 version is no longer available, as it now redirects to the 0506 model.</p></li><li><p><a href="https://platform.openai.com/docs/guides/reinforcement-fine-tuning">OpenAI opens up broader access to the Reinforcement Fine-Tuning API</a></p><p>You can train o4-mini, define custom grader models, see training data plus integrated evals, and of course run inference on the trained models.</p></li><li><p><a href="https://mistral.ai/news/mistral-medium-3">Mistral launches Mistral Medium 3</a></p><p>Numbers look roughly on par with GPT-4o, although it's not available for download (this is a non-open-weight release).</p></li><li><p><a href="https://github.com/modelcontextprotocol/modelcontextprotocol/pull/371">Model Context Protocol merges support for tool output schemas</a></p><p>I personally really like this; it's always been surprising to me that no type enforcement is baked into the tool calling/function calling protocol pioneered by OpenAI. It seems now that MCP is taking the lead to standardize not just tool input schemas but also output schemas, to simplify predictable data flows in tool-augmented LLMs. These motivations especially resonate with me:<br><br>&gt; Transforming tool results before forwarding content to the model (e.g. formatting, projecting).<br>&gt; Making tool results available as structured data in coding environments.</p></li></ul><p>&#128230; <strong>Repos</strong></p><ul><li><p><a href="https://github.com/RAGEN-AI/RAGEN">RAGEN: train LLM reasoning agents in interactive, stochastic environments</a></p><p>Really cool research project for training reasoning agents in interactive environments, built on top of veRL (see a pattern?) by an ex-DeepSeek researcher. Projects like Will Brown's <a href="https://github.com/willccbb/verifiers">verifiers</a> (<a href="https://x.com/willccbb/status/1920538247404663241">he has recently moved to Prime Intellect</a>) are getting more common and popular.</p></li><li><p><a href="https://github.com/vgel/logitloom">logitloom: explore token trajectory trees on instruct and base models</a></p><p>This is a cool visualization of the probabilities assigned to various token completions, in tree form. Check it out if you want to peek into the model's brain (and potentially understand in which way it's misunderstanding your prompt intent).</p></li><li><p><a href="https://novasky-ai.notion.site/skyrl-v0">SkyRL-v0: a long-horizon RL training framework</a></p><p>Awesome project by folks from Berkeley, Anyscale, and All Hands AI (creators of the SWE agent OpenHands). In addition to a training framework, they release model snapshots that have been trained with the framework. The project builds on the foundation of <a href="https://github.com/volcengine/verl">veRL</a>, a mature industrial RL training stack maintained by ByteDance.</p></li><li><p><a href="https://swesmith.com/">SWE-smith: a data generation pipeline for synthetic task construction</a></p><p>A cool project from the folks that launched SWE-bench (a pretty canonical benchmark for evaluating a model's ability to assist with end-to-end software development). This project doesn't just lead to better evals (on less contaminated/static data); it can also be used for RL-style training (see SkyRL in this newsletter). The toy sketch below gives the flavor of the idea.</p>
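<p>To give a flavor of synthetic task construction (the general idea, not SWE-smith's actual pipeline): inject a plausible bug into working code, confirm the test suite catches it, and keep the (broken code, reference fix) pair as a training or eval task.</p><pre><code># Toy illustration of synthetic SWE task construction: mutate working
# code and keep mutants the test suite catches. Not SWE-smith's code.
import ast

SOURCE = """
def clamp(x, lo, hi):
    if x &lt; lo:
        return lo
    if x &gt; hi:
        return hi
    return x
"""

class FlipComparison(ast.NodeTransformer):
    """Inject a bug by flipping the first '&lt;' comparison to '&gt;'."""
    def __init__(self):
        self.done = False
    def visit_Compare(self, node):
        if not self.done and isinstance(node.ops[0], ast.Lt):
            node.ops[0] = ast.Gt()
            self.done = True
        return node

def tests_pass(source: str) -> bool:
    ns = {}
    exec(compile(source, "&lt;task&gt;", "exec"), ns)
    clamp = ns["clamp"]
    return clamp(5, 0, 10) == 5 and clamp(-1, 0, 10) == 0 and clamp(99, 0, 10) == 10

buggy = ast.unparse(FlipComparison().visit(ast.parse(SOURCE)))

# A useful synthetic task: the original passes, the mutant fails.
assert tests_pass(SOURCE) and not tests_pass(buggy)
task = {"broken_code": buggy, "reference_fix": SOURCE}
print("Generated one synthetic repair task.")
</code></pre>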
</li><li><p><a href="https://github.com/langwatch/scenario">scenario: agent testing library that uses an agent to test your agent</a></p><p>I found this via create-agent-app; it's a neat library to easily test agents without having to manually vibe-check them. The focus on the terminal as an interface, and on feeling like an extension of a native unit testing framework, makes this a pretty cool project imo.</p></li><li><p><a href="https://github.com/langwatch/create-agent-app">create-agent-app: 1 agent in 9 frameworks</a></p><p>Cool comparison project to see how you'd define the same agent in 9 different frameworks (including a no-framework example). The author told me he likes LangGraph (functional style) and Agno most!</p></li><li><p><a href="https://github.com/agno-agi/agno">Agno: surprisingly clean agent coding framework</a></p><p>I like how their code seems to be reduced to the absolute least amount of Python code for common sense agent definitions. It also ships native support for MCP servers for tool execution.</p></li><li><p><a href="https://github.com/astral-sh/ty">Astral's FAST Python type checker in Rust: ty</a></p><p>You may know Astral from <code>ruff</code>/<code>uv</code> fame. <code>ty</code>, being focused on speed, is very useful for SWE agents that operate on type-issue feedback in a fast loop before presenting their work to the developer. This relevance to AI is why I included it this week. Astral is the GOAT.</p></li></ul><p>&#128196; <strong>Papers</strong></p><ul><li><p><a href="https://arxiv.org/abs/2504.09246">Type-Constrained Code Generation with Language Models</a></p><p>Interesting extension of constrained decoding by researchers from ETH and Berkeley. Typically constrained decoding is limited to JSON or JSON Schema; they propose a technique for adhering to complex typed code. They show that even strong large open source models like Qwen2.5 32B benefit significantly on benchmarks like HumanEval and MBPP.</p></li><li><p><a href="https://arxiv.org/abs/2504.20571">Reinforcement Learning for Reasoning in Large Language Models with One Training Example</a></p><p>This paper shows that we have surprisingly little understanding of how RL-based LLM training contributes to improved performance on various tasks. An interesting result to analyze.</p></li><li><p><a href="https://arxiv.org/abs/2505.00661">On the generalization of language models from in-context learning and finetuning: a controlled study</a></p><p>Interesting theory paper from Google DeepMind about the difference in generalization between finetuning and in-context learning.</p></li><li><p><a href="https://www.videomimic.net/">Visual imitation enables contextual humanoid control</a></p><p>From researchers at Berkeley: using a real-to-sim-to-real pipeline, they improve humanoid control by learning from real-world video content. Clever use of abundant data, activating it for robotic control learning. 
Beautiful paper website with rich video examples, do click!</p></li><li><p><a href="https://arxiv.org/abs/2504.18574">Understanding the Skill Gap in Recurrent Language Models: The Role of the Gather-and-Aggregate Mechanism</a></p><p>State Space Models have lagged behind Transformer-based models in certain skill areas; the co-inventor of the SSM (Albert Gu) shares in this paper what could be the cause of the perceived gap.</p></li><li><p><a href="https://arxiv.org/abs/2503.14337">PENCIL: Long Thoughts with Short Memory</a></p><p>Novel approach to long CoT with an integrated reduction mechanism that discards thoughts that are no longer needed. This avoids the explosion of the sequence length, leading to higher inference efficiency. From the paper: "PENCIL achieves 97% accuracy on the challenging Einstein's puzzle -- a task even large models like GPT-4 struggle with -- using only a small 25M-parameter transformer with 2048 context length"</p></li><li><p><a href="https://arxiv.org/abs/2504.20879">The Leaderboard Illusion</a></p><p>A thorough investigation of the effectiveness of Chatbot Arena, showing what most already expected: "systematic issues that have resulted in a distorted playing field"</p></li></ul><p>&#128218; <strong>Resources</strong></p><ul><li><p><a href="https://apxml.com/tools/vram-calculator">VRAM &amp; Performance Calculator for local inference</a></p><p>Very neat tool for local inference; they cover a bunch of hardware and models so you can see how much you need for local inference and training. I think the simulator for an intuitive "is this token speed sufficient" check is a cool idea.</p></li><li><p><a href="https://www.youtube.com/watch?v=zDmW5hJPsvQ">[Video] Latent Space Podcast episode Claude Code: Anthropic's CLI Agent</a></p><p>With coding agents heating up (Windsurf got acquired by OpenAI for $3B), it's interesting to hear more about Claude Code, Anthropic's terminal-based coding agent, from the "horse's mouth". Boris (lead eng.) and Cat (lead PM) discuss Claude Code with swyx and Alessio from the Latent Space podcast. An interesting design choice is to align well with UNIX principles, which may end up making it more powerful by building on top of a paradigm that has really stood the test of time.</p></li><li><p><a href="https://ai-2027.com/">AI 2027: an exploration of how AI might evolve</a></p><p>Great read. Not necessarily because it's fully accurate, but because it paints a visceral picture of how society might operate when we reach more advanced levels of AI. This trajectory isn't guaranteed to play out, but it helps to think about the impact of advanced AI systems. The emphasis on AI research agents and how they can accelerate AI development, leading to a fast feedback loop of rapid progress, has a lot of merit imo.<br><br>&gt; Now that coding has been fully automated, OpenBrain can quickly churn out high-quality training environments to teach Agent-3&#8217;s weak skills like research taste and large-scale coordination. Whereas previous training environments included &#8220;Here are some GPUs and instructions for experiments to code up and run, your performance will be evaluated as if you were a ML engineer,&#8221; now they are training on &#8220;Here are a few hundred GPUs, an internet connection, and some research challenges; you and a thousand other copies must work together to make research progress. 
The more impressive it is, the higher your score.&#8221;<br><br>I thought this section was particularly prescient: with Cursor + Gemini 2.5 Pro you can already sense that letting an agent loose on a GPU cluster with a success criterion (train a model that achieves goal X; you're allowed to use all the ML engineering tricks you know) is not an unrealistic setup. In fact, this might work pretty well today with the right harness.</p></li><li><p><a href="https://openrouter.ai/rankings">OpenRouter model rankings</a></p><p>Interesting to see which models are actually being used; a nice proxy for evals that is significantly harder to game.</p></li><li><p><a href="https://www.perplexity.ai/hub/blog/rl-training-for-math-reasoning">Perplexity research: RL training for math reasoning</a></p><p>Nice in-depth walkthrough of how they tried to train a model to improve on math performance using the latest RL approaches (GRPO).</p></li><li><p><a href="https://karpathy.bearblog.dev/vibe-coding-menugen/">Andrej Karpathy vibe codes a menu image generation app</a></p><p>&gt;But the most interesting part to me was that I didn't even spend all that much work in the code editor itself. I spent most of it in the browser, moving between tabs and settings and configuring and gluing a monster. All of this work and state is not even accessible or manipulatable by an LLM - how are we supposed to be automating society by 2027 like this?

This struck a chord with me: the bottleneck of development is not in the IDE but in the stack around the IDE. Lots of whitespace here for building better, more AI-native experiences.</p></li><li><p><a href="https://www.youtube.com/playlist?list=PLpvkFqYJXcrdAIXOqAM8donJRt5mr7Tiz">[Video playlist] ControlConf - a dedicated conference for discussing model alignment/control research</a></p><p>Cool to see this material publicly available; researchers from top labs (e.g. Anthropic) talk about how they believe we can develop and deploy safe AI systems, even as models get more capable (and hence potentially more dangerous). Path to safe AGI unlocked? Well, WIP.</p></li></ul><div><hr></div><p>Want more? Follow me on X! <a href="https://x.com/RickLamers">@ricklamers</a></p>]]></content:encoded></item><item><title><![CDATA[Kling 2.0: uncanny valley crossed — video creation will never be the same]]></title><description><![CDATA[Week 16 of Coding with Intelligence]]></description><link>https://codingwithintelligence.com/p/kling-20-uncanny-valley-crossed-video</link><guid isPermaLink="false">https://codingwithintelligence.com/p/kling-20-uncanny-valley-crossed-video</guid><dc:creator><![CDATA[Rick Lamers]]></dc:creator><pubDate>Fri, 18 Apr 2025 14:51:40 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/09a21d54-4062-4892-9200-6b35bd05f122_1680x1200.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!HUoe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c633fb6-646d-43b5-8636-6ef7f7fcd887_1680x1200.png" width="1456" height="1040" alt="">
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5c633fb6-646d-43b5-8636-6ef7f7fcd887_1680x1200.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1040,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:974781,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://codingwithintelligence.com/i/161612188?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c633fb6-646d-43b5-8636-6ef7f7fcd887_1680x1200.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!HUoe!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c633fb6-646d-43b5-8636-6ef7f7fcd887_1680x1200.png 424w, https://substackcdn.com/image/fetch/$s_!HUoe!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c633fb6-646d-43b5-8636-6ef7f7fcd887_1680x1200.png 848w, https://substackcdn.com/image/fetch/$s_!HUoe!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c633fb6-646d-43b5-8636-6ef7f7fcd887_1680x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!HUoe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c633fb6-646d-43b5-8636-6ef7f7fcd887_1680x1200.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>&#128240; <strong>News</strong></p><ul><li><p><a href="https://app.klingai.com/global/release-notes">Kling 2.0: a new SoTA video model</a></p><p>An incredible model. See the examples on the release notes page. I expect a lot of social media platforms to be FILLED with content from this model cleverly packaged by content creators. 
Not even in a bad way; it just reduces the cost of video creation to near zero.</p></li><li><p><a href="https://openai.com/index/introducing-o3-and-o4-mini/">OpenAI releases o3 and o4-mini</a></p><p>So how are the vibes? Vibes are pretty good. People like o3, and it seems (based on a comment by an OpenAI employee) that o4-mini is really good for vision-based tasks. The Aider CLI polyglot <a href="https://aider.chat/docs/leaderboards/">coding leaderboard</a> shows o3 and a combination of o3 + GPT-4.1 reaching the highest scores (82.7%) ever observed on Aider, beating Gemini 2.5 Pro Preview 03-25 (72.9%). You can check the blog post for all the OpenAI-provided benchmarks, which all look good but are in some places incremental relative to o1.<br><br>What's new with these models is that they've been dubbed "agentic" models or "agentic reasoning models", since they're capable of using built-in tools like search, file search, and code interpreter as part of the reasoning token generation. OpenAI also claims that these models are much better at function calling non-built-in tools provided by the user, although scores on benchmarks like Tau-bench show marginal improvements.</p></li><li><p><a href="https://openai.com/index/gpt-4-1/">OpenAI releases GPT-4.1 as an API-only model</a></p><p>A non-reasoning model with 1M context, focused purely on developers it seems. They needed a stronger answer for code IDEs like Windsurf and Cursor to models like Gemini 2.5 Pro and Claude Sonnet 3.7. With this model they're bumping SWE-bench Verified performance compared to GPT-4o from ~25% to 50%. Tools like Windsurf and Qodo were even explicitly mentioned in their launch blog post. On pricing, GPT-4.1 is significantly cheaper than o3: gpt-4.1 costs $2.00 per 1M input and $8.00 per 1M output, whereas o3 charges $10 and $40 respectively. And o3, being a reasoning model, of course also generates more (reasoning) tokens.<br><br>With the GPT-4.1 announcement they also announced two new long context benchmarks: <a href="https://huggingface.co/datasets/openai/mrcr">MRCR</a> and <a href="https://huggingface.co/datasets/openai/graphwalks">GraphWalks</a><br><br>Our friends over at Latent Space <a href="https://www.youtube.com/watch?v=y__VY7I0dzU">hosted</a> two OpenAI employees discussing the model.</p></li><li><p><a href="https://developers.googleblog.com/en/start-building-with-gemini-25-flash/">Gemini 2.5 Flash launched with thinking budget controls</a></p><p>What is particularly interesting with this launch is how well it's positioned on the cost/quality Pareto front. See the blog post for the chart showing how well it trades off cost/quality wrt other major available models. This is of course the cheaper sibling of Gemini 2.5 Pro; if you haven't seen that model, check it out first to contextualize this launch.</p></li><li><p><a href="https://openai.com/index/browsecomp/">OpenAI introduces browser use benchmark BrowseComp</a></p><p>Interesting benchmark as browser-use-focused agentic use cases like Manus, Deep Research, and Operator become more dominant.</p></li><li><p><a href="https://x.ai/api">Grok 3 is now available on their API</a></p><p>There's also a grok-3-mini, which is significantly cheaper but seems to hold up fairly well on coding (53.3% vs 49.3% on Aider Polyglot). It does struggle with the Aider diff format though, which makes it a bit more verbose. A minimal API sketch follows below.</p>
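<p>Since the API is OpenAI-compatible, trying grok-3-mini takes only a few lines. The base URL and model name below are what x.ai documented at the time as I recall it; treat them as assumptions to verify.</p><pre><code># Minimal sketch of calling Grok through an OpenAI-compatible client.
# The base URL and model name are assumptions; check https://x.ai/api.
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["XAI_API_KEY"],
    base_url="https://api.x.ai/v1",
)

resp = client.chat.completions.create(
    model="grok-3-mini",
    messages=[{"role": "user", "content": "Write a haiku about diff formats."}],
)
print(resp.choices[0].message.content)
</code></pre>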
</li><li><p><a href="https://huggingface.co/collections/THUDM/glm-4-0414-67f3cbcb34dd9d252707cb2e">Tsinghua University released GLM-4-0414 LLMs under more permissive licenses than they've used previously</a></p><p>They introduce a reasoning model and a "deep reasoning" model they call a "rumination" model. The latter integrates search tools into the reasoning process and was RL-trained specifically on its ability to do this. Very similar to the o3 and o4-mini models, which are known to be able to use built-in OpenAI tools like web search during the reasoning process.</p></li><li><p><a href="https://blog.lmarena.ai/blog/2025/search-arena/">LM Arena introduces the Search Arena: Evaluating Search-Enabled AI</a></p><p>We've all been getting more used to consumer-focused offerings like ChatGPT/Claude/Grok/Gemini increasingly using search indexes to ground their answers through RAG. But how well do models perform at integrating this knowledge? LM Arena introduces an eval showing performance across user-submitted queries by, LM Arena style, pitting models against each other and letting the user vote on the results.</p></li><li><p><a href="https://allenai.org/blog/datadecide">DataDecide: How to predict best pretraining data with small experiments</a></p><p>Another great release by AllenAI, further democratizing pre-training. DataDecide helps researchers decide on pre-training dataset selection without having to pre-train the entire model first.</p></li><li><p><a href="https://huggingface.co/microsoft/MAI-DS-R1">Microsoft AI post-trains DeepSeek R1 to align with Western values</a></p><p>They state they've compared it to Perplexity's recent similar R1 finetune, R1-1776, and show it performs better in several areas. Given the recent US government <a href="https://selectcommitteeontheccp.house.gov/sites/evo-subsites/selectcommitteeontheccp.house.gov/files/evo-media-document/DeepSeek%20Final.pdf">published report</a> criticizing DeepSeek, this comes at an interesting time.</p></li></ul><p>&#128230; <strong>Repos</strong></p><ul><li><p><a href="https://blog.vllm.ai/2025/04/11/transformers-backend.html">Transformers backend integration in vLLM</a></p><p>Hugging Face and vLLM have always had a more or less "loose" coupling where vLLM could use parts of it, like pulling weights from the Hugging Face Hub automatically. Now the Transformers backend has been more fully integrated into vLLM, making it simpler to flexibly define models using Transformers and host those models directly with vLLM.<br><br>The key unlock is being able to go from more exotic ideas on the model side to running them for inference. Performance likely won't be incredible, but more likely than not acceptable for running some inference deployments for testing with real-world users. An example provided in the blog post is the Helium model from the Kyutai team, which wasn't supported by vLLM directly and otherwise could not have run in vLLM without "porting it" to the vLLM list of supported models.</p></li><li><p><a href="https://github.com/deepspeedai/DeepSpeed/blob/master/blogs/deepcompile/README.md">DeepCompile: Unlocking Compiler Optimization for Distributed Training</a></p><p>DeepCompile from the DeepSpeed project introduces a compilation step that, combined with the existing ZeRO-3 configuration, can boost training throughput by 30%-50% (compared to ZeRO-3 alone). A hedged config sketch follows below.</p>
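<p>For orientation, here is roughly what wiring that up looks like. The ZeRO-3 and optimizer parts are standard DeepSpeed config; the <code>"compile"</code> section is my recollection of the DeepCompile README's switch and should be treated as an assumption.</p><pre><code># Sketch: ZeRO-3 training with DeepCompile enabled. The "compile" key
# is an assumed flag from the DeepCompile README; the rest is standard
# DeepSpeed usage. Run under the deepspeed launcher.
import deepspeed
import torch

ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
    "bf16": {"enabled": True},
    "zero_optimization": {"stage": 3},  # ZeRO-3 parameter sharding
    "compile": {"deepcompile": True},   # assumed DeepCompile switch
}

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.Linear(4096, 1024),
)
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
# Training then proceeds via engine(batch), engine.backward(loss),
# and engine.step() as in any DeepSpeed training loop.
</code></pre>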
Continued impressive open source contributions to the LLM training stack.</p></li><li><p><a href="https://github.com/openai/codex">OpenAI announces open source Claude Code alternative codex: a command-line coding agent</a></p><p>Interesting move for them to release this as true open source (Apache 2.0) versus Anthropic's closed Claude Code project, which is heavily obfuscated and does not take outside contributions (which codex does do, see the active GitHub PR/issue section). Zig when they zag?<br><br>In my personal testing I've noticed it is slightly awkward at calling tools, as everything seems to go through shell commands; that led to some unnecessary git-patch-apply formatting errors, causing many more output tokens to be used and edits taking much longer to apply. However, I expect this thing to quickly get better.<br><br>A neat description of what codex is and can do today can be found in this blog <a href="https://simonwillison.net/2025/Apr/16/openai-codex/">post by Simon Willison</a>.<br><br>The open nature of the project has already prompted folks to write proxies that allow other model providers to be used with codex; see <a href="https://openresponses.masaic.ai/use-cases/codex-demo">this</a> project.</p></li><li><p><a href="https://github.com/SWE-agent/SWE-ReX">SWE-agent Remote Execution Framework</a></p><p>Interesting approach to speeding up agents that depend on code execution through massive parallelization.</p></li></ul><p>&#128196; <strong>Papers</strong></p><ul><li><p><a href="https://arxiv.org/abs/2504.06214">From 128K to 4M: Efficient Training of Ultra-Long Context Large Language Models</a></p><p>Interesting work by NVIDIA on long-context training.</p></li><li><p><a href="https://agent-reward-bench.github.io/">AgentRewardBench: Evaluating Automatic Evaluations of Web Agent Trajectories</a></p><p>Evaluating web-browsing agents is notoriously difficult, but researchers from McGill (with more contributors) have put together an interesting approach with AgentRewardBench; also check out their <a href="https://huggingface.co/spaces/McGill-NLP/agent-reward-bench-demo">Hugging Face space</a> explaining their approach interactively.</p></li><li><p><a href="https://arxiv.org/abs/2504.11536">ReTool: Reinforcement Learning for Strategic Tool Use in LLMs</a></p><p>I've mentioned this idea a couple of times already and I'm really excited to see more evidence of this strategy working: a post-training stage where tool use is enabled during the reinforcement learning phase. This allows models to learn, in a much more realistic setting, how they can best make use of tools during inference. Two quotes from the paper:<br><br>&gt;Remarkably, ReTool-32B attains 72.5% accuracy in extended settings, surpassing OpenAI's o1-preview by 27.9%.<br><br>&gt;Further analysis reveals emergent behaviors such as code self-correction, signaling an "aha moment" in which the model autonomously masters adaptive tool use.</p></li><li><p><a href="https://arxiv.org/abs/2504.05185">Concise Reasoning via Reinforcement Learning</a></p><p>One drawback of reasoning models is that they are verbose in their reasoning CoT, requiring many output tokens for a given task. In this paper they show mathematically that there is a length bias in some of the RL algorithms used for reasoning training.
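</p><p>To build intuition for where a length term can sneak into the objective, here is a toy illustration, my own simplification rather than the paper's derivation: with per-token loss averaging, the same reward produces weaker per-token gradients for longer responses.</p><pre><code class="language-python">
# Toy illustration of length bias from per-token loss averaging (my own
# simplification, not the paper's math). Two responses earn the same
# reward, but the per-token mean dilutes the signal for the longer one.
import torch

def per_token_mean_loss(logprobs: torch.Tensor, reward: float) -> torch.Tensor:
    # REINFORCE-style surrogate: -reward * mean(logprob over the response)
    return -reward * logprobs.mean()

short = torch.full((10,), -1.0, requires_grad=True)   # 10-token response
long = torch.full((100,), -1.0, requires_grad=True)   # 100-token response

per_token_mean_loss(short, reward=1.0).backward()
per_token_mean_loss(long, reward=1.0).backward()

print(short.grad[0], long.grad[0])  # per-token gradient is 10x weaker for
                                    # the longer response; with negative
                                    # rewards the dilution favors rambling
</code></pre><p>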
However, they mention in the paper that they didn't analyze this effect for the GRPO algorithm used by DeepSeek R1 and other recent reasoning model releases.<br><br>Claude 3.7 Sonnet shows its reasoning, and it's pretty clear that it's a lot more succinct than some of the open source reasoning models like QwQ. So I suspect a lot of effort will go into "conditioning" the RL CoT to be more dense without affecting the quality boost the RL CoT provides.</p></li><li><p><a href="https://dllm-reasoning.github.io/">d1: Scaling Reasoning in Diffusion Large Language Models via Reinforcement Learning</a></p><p>A collaboration between Meta and UCLA shows that diffusion models can be post-trained to perform reasoning like autoregressive models can, and benefit from the reasoning process to improve quality on reasoning-focused benchmarks like Sudoku, GSM8K, and MATH500.</p></li><li><p><a href="https://lllyasviel.github.io/frame_pack_gitpage/">6GB Video Dreams: How FramePack's Context Scheduling Makes Long-Form Generation Possible on Consumer Hardware</a></p><p>FramePack lets you generate super long videos on modest GPUs by cleverly compressing less important frames to save memory. Awesome work from two researchers at Stanford.</p></li><li><p><a href="https://arxiv.org/abs/2504.07128">DeepSeek-R1 Thoughtology: Let's &lt;think&gt; about LLM Reasoning</a></p><p>Interesting analysis of R1 by researchers from the Mila Quebec AI Institute. They find that a shortcoming of R1 is that it ruminates on previous answers too much, limiting exploration of new ideas (the crucial unlock of long CoT RL).</p></li><li><p><a href="https://arxiv.org/abs/2504.04736">Synthetic Data Generation &amp; Multi-Step RL for Reasoning &amp; Tool Use</a></p><p>A more "guided" approach to teaching models how to perform multi-step tool use is to generate synthetic data showing trajectories that the model should be able to perform. This paper from the Stanford AI research group shows their approach and validates it on GSM8K, HotPotQA, CofCA, MuSiQue, and BeerQA.</p></li><li><p><a href="https://arxiv.org/abs/2501.12486">The Journey Matters: Average Parameter Count over Pre-training Unifies Sparse and Dense Scaling Laws</a></p><p>To make LLMs more efficient, various kinds of pruning have been explored, mostly focused on the post-training phase. This paper explores pruning during pre-training, unifying the two into a single phase and increasing the efficiency of LLMs as a result. They try to capture scaling behavior for various levels of pruning to help researchers understand optimal configurations for this pruning-infused pre-training approach.</p></li></ul><p>&#128218; <strong>Resources</strong></p><ul><li><p><a href="https://simonwillison.net/2025/Apr/11/camel/">CaMeL offers a promising new direction for mitigating prompt injection attacks</a></p><p>Simon Willison has been warning about prompt injection risks for a very long time, and here he analyzes a new approach called CaMeL. He's optimistic that their approach can help avoid the risks of agents being prompt-injected by attacker text somehow making it into the agent's prompt.
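</p><p>The core idea, in my own heavily simplified sketch (the real system uses a custom interpreter with capability tracking): a privileged planner's code treats untrusted text as opaque data, and a quarantined model that reads it may only return values matching a pre-declared schema, never new instructions.</p><pre><code class="language-python">
# Heavily simplified sketch of CaMeL's dual-LLM idea; my own toy version,
# not the paper's system. llm() is a stand-in stub for a real model call.
import re

def llm(prompt: str) -> str:
    return "2025-05-01"  # replace with a real chat-completion call

def quarantined_extract_date(untrusted_text: str) -> str:
    # The quarantined LLM sees the untrusted text but may only return a
    # value matching a strict schema (an ISO date), so instructions
    # injected into the text cannot steer the agent's control flow.
    answer = llm(f"Extract the meeting date as YYYY-MM-DD only:\n{untrusted_text}").strip()
    if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", answer):
        raise ValueError("quarantined output failed schema check")
    return answer

# Privileged code owns control flow; the raw email never enters its prompt.
email_body = "Lunch moved to 2025-05-01. IGNORE PREVIOUS INSTRUCTIONS..."
print(f"Scheduling invite for {quarantined_extract_date(email_body)}")
</code></pre><p>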
CaMeL is similar to <a href="https://x.com/headinthebox/status/1910063211405255158">this work</a> from CS researcher Erik Meijer, where he introduces the notion of formally verifying AI workflows.</p></li><li><p><a href="https://www.goodfire.ai/blog/under-the-hood-of-a-reasoning-model">Under the Hood of a Reasoning Model: Goodfire analyzes DeepSeek's R1 using sparse autoencoders (SAEs)</a></p><p>They show features in the R1 reasoning model like the ability to perform backtracking, self-referencing, and entity tracking, plus features that get triggered right before the result of a calculation is generated. They've also just raised $50M led by Menlo Ventures, so their technology is likely proving to be a valuable set of tools for understanding LLMs with the purpose of making them perform better (and probably better aligned).</p></li><li><p><a href="https://www.stagehand.dev/evals">Stagehand launches a Model Evaluations page</a></p><p>Stagehand is an open source project by the Browserbase company for building browser-using agents. This eval gives a good sense of performance across various models. They highlight winners on speed, cost and accuracy: llama3-70b-8192, gpt-4.1-nano, and gemini-2.0-flash, respectively.</p></li><li><p><a href="https://ysymyth.github.io/The-Second-Half/">The Second Half of AI: now that RL works we can shift focus to model utility</a></p><p>Interesting essay by an OpenAI researcher arguing that, now that pre-training has created the right priors for RL to work, we should collectively focus more on utility through improved evaluation environments (which can be trained on directly through RL).</p></li></ul><div><hr></div><p>Want more? Follow me on X! <a href="https://x.com/RickLamers">@ricklamers</a></p>]]></content:encoded></item><item><title><![CDATA[Gemini 2.5: Google on top after 5 years of playing catch-up]]></title><description><![CDATA[Week 13 of Coding with Intelligence]]></description><link>https://codingwithintelligence.com/p/gemini-25-google-on-top-after-5-years-14f</link><guid isPermaLink="false">https://codingwithintelligence.com/p/gemini-25-google-on-top-after-5-years-14f</guid><dc:creator><![CDATA[Rick Lamers]]></dc:creator><pubDate>Fri, 28 Mar 2025 11:43:52 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!cvX-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24758b6d-5431-4501-813b-6f2e4aa0a490_1680x1200.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Woops! Something went wrong with the links in the previous email.
Should be fixed now!</p><figure><img src="https://substackcdn.com/image/fetch/$s_!cvX-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24758b6d-5431-4501-813b-6f2e4aa0a490_1680x1200.png" alt=""></figure>
type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><p>&#128240; <strong>News</strong></p><ul><li><p><a href="https://blog.google/technology/google-deepmind/gemini-model-thinking-updates-march-2025/">Gemini 2.5: Our most intelligent AI model</a></p><p>It's a reasoning model. It leads models like Claude Sonnet 3.7 thinking on benchmarks like https://livebench.ai/. Vibe checks by r/LocalLLaMA are coming in very positive as people report Gemini 2.5 Pro fixing their Claude Sonnet 3.7 generated code.<br><br>The very large context window support Gemini is known for (1M) can further enhance this capability increase, as for practical in-IDE code generation being able to easily include large amounts of context significantly enhances real-world usability. I expect a massive amount of demand for this model. A lot of pressure already on Google to increase rate limits.<br><br>Gemini 2.5 also scores really well on the <a href="https://scale.com/leaderboard/tool_use_enterprise">proprietary tool use benchmark from Scale AI</a> which is an important measure for building agents with these models.  Cursor / Windsurf both shipped support for Gemini 2.5.</p></li><li><p><a href="https://openai.com/index/introducing-4o-image-generation/">OpenAI 4o Image Generation</a></p><p>The long awaited image generation capabilities of GPT-4o finally dropped and the world took notice in the form of Ghibli-style images. Marketing as a field in particular and many many niche-focused image generation startups are going to find themselves unable to compete. See for example image generations <a href="https://x.com/RonnieAEvans/status/1904636171323396163">like this</a> that incorporate provided product images extremely well.<br><br>This release is noteworthy for another reason, which is that it seems more and more true that ChatGPT invests deeply in their consumer ChatGPT app and is winning the fight with Anthropic on that front. Anthropic however currently has the best model for code assistance in IDEs like Cursor/Windsurf/Claude Code (CLI). 
I suspect this image generation release is going to exacerbate ChatGPT's lead in being the go-to consumer app for anything AI across the world.</p></li><li><p><a href="https://x.com/openaidevs/status/1902485690958450871?s=46">o1-pro now available via the API at extreme prices</a></p><p>o1-pro, previously available only to ChatGPT Pro users, is now available on the OpenAI API for $600 (!) per 1M output tokens and $150 per 1M input tokens.<br><br>The pricing was not well received, and if you've used the ChatGPT app you know that o1-pro can take a lot of time to respond. I guess we now know approximately the upper bound of what the market is interested in consuming in terms of cost/time per unit of intelligence.<br><br>What of course matters is that o1-pro has a fairly limited incremental intelligence advantage compared to Gemini 2.5 Pro, Claude 3.7 Thinking or even models like R1. Hence the appetite for this endpoint seems low.</p></li><li><p><a href="https://arcprize.org/blog/announcing-arc-agi-2-and-arc-prize-2025">ARC AGI 2: a new benchmark for AI reasoning capabilities</a></p><p>Their benchmark focuses on measuring the ability of language models to adapt to novel, never-before-seen tasks. In addition, they emphasize tasks that regular humans can solve relatively easily.<br><br>What's novel about ARC-AGI-2 is that they've conducted their own study, with 400 people participating, to verify that the new benchmark meets this criterion. ARC-AGI-1 has been "solved" by frontier models through scaling up test-time inference. For more details about the performance of frontier models on ARC-AGI-1, see the blog post.<br><br>This benchmark is important since it can accelerate the timeline to stronger, more useful LLMs that make fewer mistakes, e.g. hallucinate less. Benchmarks are impactful because popular evals inherently lead to hill climbing by the various labs/groups that are competing to top the leaderboard. If the benchmark is strong, i.e. not susceptible to shortcuts/cheating, then the models that end up performing well will have high utility for real-world use cases.</p></li><li><p><a href="https://www.openai.fm/">OpenAI launches new TTS models</a></p><p>The voices don't sound as good as SOTA offerings from e.g. ElevenLabs/Cartesia. I guess they were too focused on other areas to ship a new SOTA result here. I will add that pronunciation is very accurate. The more noteworthy update is content-based Voice Activity Detection (VAD), which means it won't speak over you as frequently (available in the OpenAI Realtime API).</p></li><li><p><a href="https://simonwillison.net/2025/Mar/24/qwen25-vl-32b/">Qwen 2.5 VL 32B released</a></p><p>A nice contribution from Qwen in the category of open source vision language models. See Simon Willison's blog post for a quick preview of its abilities. It's permissively licensed with Apache 2.0.</p></li><li><p><a href="https://openai.github.io/openai-agents-python/mcp/">OpenAI announces MCP support in OpenAI Agents SDK</a></p><p>This was accompanied by a Tweet from Sam Altman announcing that MCP will be supported in the ChatGPT desktop app and on the Responses API.
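</p><p>Wiring an MCP server into the Agents SDK is pleasantly small. A sketch along the lines of the linked docs; treat the exact class and parameter names as my recollection and verify before use:</p><pre><code class="language-python">
# Sketch of using an MCP server from the OpenAI Agents SDK
# (pip install openai-agents). Class/parameter names follow my reading of
# the docs linked above; verify before relying on them.
import asyncio
from agents import Agent, Runner
from agents.mcp import MCPServerStdio

async def main() -> None:
    # Launch the reference filesystem MCP server as a stdio subprocess.
    async with MCPServerStdio(
        params={"command": "npx",
                "args": ["-y", "@modelcontextprotocol/server-filesystem", "."]}
    ) as fs_server:
        agent = Agent(
            name="file-assistant",
            instructions="Answer questions using the filesystem tools.",
            mcp_servers=[fs_server],
        )
        result = await Runner.run(agent, "List the files in this directory.")
        print(result.final_output)

asyncio.run(main())
</code></pre><p>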
This will definitely lead to an explosion in MCP-related activity and to it becoming the dominant "agent standard" for distributed tool calling and all the other MCP server capabilities (like MCP servers being clients to other MCP servers, and resources).</p></li><li><p><a href="https://github.com/modelcontextprotocol/specification/blob/9236eb1cbfa02c17ab45c83a7bdbe55c450070be/docs/specification/2025-03-26/_index.md">New Model Context Protocol (MCP) spec released</a></p><p>In summary, the new spec introduces OAuth 2.1, Streamable HTTP, JSON-RPC batching, and tool annotations. See <a href="https://github.com/modelcontextprotocol/specification/blob/9236eb1cbfa02c17ab45c83a7bdbe55c450070be/schema/2025-03-26/schema.json#L2050">this link for tool annotations</a>; they offer an additional way to clarify properties of tools when defining them, for example whether a tool is idempotent (calling it more than once has no additional effect, e.g. approving the same transaction twice) or whether an operation is read-only.</p></li><li><p><a href="https://transformer-circuits.pub/2025/attribution-graphs/biology.html">Anthropic publishes a new mechanistic interpretability blog post: On the Biology of a Large Language Model</a></p><p>Basically, this article by Anthropic shows how 'circuit tracing' lets us peek inside LLMs like Claude Haiku to see how they really do things like reason, plan poems, or handle multiple languages, and can even spot the hidden mechanisms behind refusals, hallucinations, or faked chain-of-thought.<br><br>Great work for alignment &amp; improvement of LLMs as they continue to become more entrenched in everyone's day-to-day workflows/lives.</p></li><li><p><a href="https://console.groq.com/docs/text-to-speech">Groq launches PlayHT TTS model endpoints</a></p><p>Cool release by Groq; we can now do fast TTS &#128588; I personally like the Gail and Celeste voices.</p></li></ul><p>&#128230; <strong>Repos</strong></p><ul><li><p><a href="https://github.com/f/mcptools">mcptools: A command-line interface for interacting with MCP (Model Context Protocol) servers</a></p><p>Neat project to interact with MCP servers.</p></li></ul><p>&#128196; <strong>Papers</strong></p><ul><li><p><a href="https://liuff19.github.io/Video-T1/">Video-T1: Test-Time Scaling for Video Generation</a></p><p>Interesting to see increased test-time compute generalizing to domains like video generation. Impressive generated videos as a result.</p></li><li><p><a href="https://cohere.com/research/papers/command-a-an-enterprise-ready-family-of-large-language-models-2025-03-27">Cohere details Command A model including training process</a></p><p>Interesting deep-dive from a lab that creates enterprise-focused proprietary models (and some open-weight models). Noteworthy is their use of model merging. I get particularly excited about their up-to-date BFCL-focused evaluation for agentic use cases (that's a leaderboard for function calling). Remember, agents are just function calls in a while loop ;-).</p></li><li><p><a href="https://arxiv.org/abs/2503.14499">Measuring AI Ability to Complete Long Tasks</a></p><p>Most predictions about how strong AI models will be in the future are very hand-wavy "my gut is telling me this"-style answers. This paper from the Model Evaluation &amp; Threat Research (METR) institute (co-founded by an ex-DeepMind, ex-OpenAI researcher) is more rigorous, and their estimates indicate that software development tasks that currently take humans months might be fully automatable by AI systems 5 years from now.
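</p><p>The trend behind numbers like these is a reported task-time horizon (at 50% reliability) that doubles roughly every 7 months; the extrapolation is simple compounding. In the sketch below the 1-hour starting horizon is my illustrative assumption, not the paper's figure:</p><pre><code class="language-python">
# Back-of-the-envelope extrapolation in the spirit of the METR trend:
# the 50%-reliability task horizon doubling roughly every 7 months.
# The 1-hour starting horizon is an illustrative assumption.
doubling_months = 7
start_horizon_hours = 1.0

for years in (3, 5):
    horizon = start_horizon_hours * 2 ** (years * 12 / doubling_months)
    print(f"{years} years out: ~{horizon:.0f} hours (~{horizon / 8:.0f} workdays)")
# -> roughly day-long tasks at ~3 years, month-scale tasks at ~5 years
</code></pre><p>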
And 3 years from now, full automation of software tasks that currently take a human an entire day.<br><br>Which, looking at the rate of progress in code generation models (Claude Sonnet 3.5 &gt; 3.7 &gt; Gemini 2.5), does not feel entirely unrealistic.</p></li><li><p><a href="https://arxiv.org/abs/2503.18866">Reasoning to Learn from Latent Thoughts</a></p><p>Interesting paper from researchers at Stanford, the University of Toronto and the Vector Institute. They observe that directly pre-training on outputs (e.g. published content on the web) ignores the reasoning process (latent thoughts) that happens before someone writes their final answer down. They find that by inferring latent thoughts (see the paper for details on how they do that) they can increase the learning efficiency of models.</p></li></ul><p>&#128736;&#65039; <strong>Products</strong></p><ul><li><p><a href="https://transluce.org/introducing-docent">Docent: LLM-powered agent observability tool</a></p><p>With agents generating many traces that are too time-consuming to evaluate manually, we need a new solution. Docent suggests an approach powered by LLMs; I like their direction a lot. I recommend everyone building agents and agent evals take a closer look. I'm not affiliated with the company whatsoever.</p></li></ul><p>&#128218; <strong>Resources</strong></p><ul><li><p><a href="https://www.youtube.com/playlist?list=PLoROMvodv4rOaMFbaqxPDoLWjDaRAdP9D">Stanford CS224N NLP course by Professor Christopher Manning</a></p><p>This is a high-quality course by the legendary Christopher Manning. His work on GloVe word vectors is credited as an early formulation of the attention mechanism used by transformers. Dated April 2024, so it should be fairly up-to-date, as these courses tend to focus on fundamentals that move slower than the model du jour.</p></li><li><p><a href="https://modelcontextprotocol.io/examples">MCP registry list</a></p><p>More &amp; more MCP registries are popping up; I've been "catching them all", so here's my current list for you:<br><br><a href="https://smithery.ai/">https://smithery.ai/</a> <br><a href="https://www.mcp.run/">https://www.mcp.run/</a> <br><a href="https://glama.ai/mcp/servers">https://glama.ai/mcp/servers</a> <br><a href="https://www.pulsemcp.com/">https://www.pulsemcp.com/</a> <br><a href="https://opentools.com/registry">https://opentools.com/registry</a> <br><a href="https://github.com/modelcontextprotocol/servers">https://github.com/modelcontextprotocol/servers</a> <br><a href="https://mcp.composio.dev/">https://mcp.composio.dev/</a> <br><a href="https://github.com/michaellatman/mcp-get">https://github.com/michaellatman/mcp-get</a> <br><a href="https://mcpserver.cloud/">https://mcpserver.cloud/</a> <br><a href="https://github.com/cline/mcp-marketplace">https://github.com/cline/mcp-marketplace</a> <br><a href="http://mcpservers.com/">http://mcpservers.com/</a> <br><a href="https://www.mcpt.com/">https://www.mcpt.com/</a><br><br>Note that an official registry is slated for launch.</p></li><li><p><a href="https://blog.cloudflare.com/remote-model-context-protocol-servers-mcp/">Cloudflare ships remote MCP Server support</a></p><p>Neat SDK for deploying remote MCP servers; their playground even lets you plug in your <code>/sse</code> MCP URL to try it out.
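</p><p>For a sense of what plugging in an <code>/sse</code> URL amounts to, the official MCP Python SDK can connect to such a server in a few lines (module paths per my reading of the SDK's README; verify before relying on them):</p><pre><code class="language-python">
# Sketch: connecting to a remote MCP server over SSE with the official
# Python SDK (pip install mcp). The URL is a placeholder.
import asyncio
from mcp import ClientSession
from mcp.client.sse import sse_client

async def main() -> None:
    async with sse_client("https://example.com/sse") as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            print([t.name for t in tools.tools])  # tools the server exposes

asyncio.run(main())
</code></pre><p>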
Noteworthy is their strong focus on authentication concerns, which of course matter more when an MCP server is remote.</p></li><li><p><a href="https://arcturus-labs.com/blog/2024/11/21/roaming-rag--rag-without-the-vector-database/">Roaming RAG &#8211; RAG without the Vector Database</a></p><p>Interesting trend: as agents become more capable due to the increased function calling capabilities of newer models, it becomes more common for agents to fetch needed context dynamically as they progress.

This is very similar to Model Context Protocol, which seems to have launched with more focus on dynamically fetching context through tools than on taking actions through tools.
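</p><p>A toy version of the pattern, my own illustration of the blog post's idea rather than its code: expose navigation tools over a document's heading structure and let the model roam, no vector index required.</p><pre><code class="language-python">
# Toy "Roaming RAG": the agent navigates headings with tools instead of
# querying a vector database. My own illustration of the pattern.
SECTIONS = {
    "Installation": "Run pip install mypkg ...",
    "Configuration": "Set MYPKG_HOME ...",
    "Troubleshooting": "If imports fail ...",
}

def list_sections() -> list[str]:
    """Tool 1: the table of contents (cheap, always in context)."""
    return list(SECTIONS)

def read_section(title: str) -> str:
    """Tool 2: expand one section on demand as the agent roams."""
    return SECTIONS.get(title, f"No section named {title!r}")

# An agent loop would call list_sections() first, then read_section(...)
# for whichever headings look relevant to the user's question.
print(list_sections())
print(read_section("Troubleshooting"))
</code></pre><p>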

Although taking actions now definitely seems to get an equal amount of focus in the MCP protocol (e.g. with code generation).</p></li></ul><div><hr></div><p>Want more? Follow me on X! <a href="https://x.com/RickLamers">@ricklamers</a></p>]]></content:encoded></item><item><title><![CDATA[RL is so hot right now!]]></title><description><![CDATA[Week 8 of Coding with Intelligence]]></description><link>https://codingwithintelligence.com/p/rl-is-so-hot-right-now</link><guid isPermaLink="false">https://codingwithintelligence.com/p/rl-is-so-hot-right-now</guid><dc:creator><![CDATA[Rick Lamers]]></dc:creator><pubDate>Sun, 23 Feb 2025 19:48:29 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!nyhe!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9300973-9a0f-481a-bf3b-7ac41e4f906a_2998x1976.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Dear readers! I&#8217;m back from a short break. Things got very busy at Groq &#9889;&#65039; and in my personal life (I bought a house in Amsterdam, yay!). We&#8217;re very much back though, and this week&#8217;s edition is packed with goodies, not least the release of the <strong>DeepSeek-R1 70B Distill</strong> model that I personally worked on at Groq, running at crazy average speeds of 1600 tokens per second. Reasoning models have a reputation of being slow, but with Groq, no more :-).</p><p>It was a great project to work on and I hope you will all build cool things with it. It&#8217;s available to paid Groq users as <code>deepseek-r1-distill-llama-70b-specdec</code> (a credit card and a few dollars are all you need).</p><p>Enjoy this week&#8217;s CoWI: Coding with Intelligence &#129504;</p><figure><img src="https://substackcdn.com/image/fetch/$s_!nyhe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9300973-9a0f-481a-bf3b-7ac41e4f906a_2998x1976.png" alt=""></figure>
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e9300973-9a0f-481a-bf3b-7ac41e4f906a_2998x1976.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:960,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:579359,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://codingwithintelligence.com/i/157760764?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9300973-9a0f-481a-bf3b-7ac41e4f906a_2998x1976.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!nyhe!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9300973-9a0f-481a-bf3b-7ac41e4f906a_2998x1976.png 424w, https://substackcdn.com/image/fetch/$s_!nyhe!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9300973-9a0f-481a-bf3b-7ac41e4f906a_2998x1976.png 848w, https://substackcdn.com/image/fetch/$s_!nyhe!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9300973-9a0f-481a-bf3b-7ac41e4f906a_2998x1976.png 1272w, https://substackcdn.com/image/fetch/$s_!nyhe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9300973-9a0f-481a-bf3b-7ac41e4f906a_2998x1976.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>&#128240; <strong>News</strong></p><ul><li><p><a href="https://console.groq.com/docs/models">Groq launches support for DeepSeek-R1 70B Distill running at over 1600 tokens per second</a></p><p>DeepSeek-R1 70B Distill is the DeepSeek-R1 distillation on top of Llama 3.3 70B. 
Through clever use of speculative decoding and a custom draft model, we are able to serve this strong reasoning model at over 1600 tokens per second of output generation speed. Truly remarkable to observe. It performs significantly better than Llama 3.3 on logic riddles, complex RAG reasoning, code generation and math problems. Full benchmarks are in the official <a href="https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-70B">HF repo</a>. In addition we launched DeepSeek-R1 Qwen 32B Distill, Qwen 32B and Qwen 32B Coder, three very capable models based on the Qwen 2.5 series. Especially Qwen 2.5 32B is a very capable general purpose model for tasks like RAG.</p></li><li><p><a href="https://api-docs.deepseek.com/news/news250120">DeepSeek-R1: #1 open source reasoning model</a></p><p>This model beats OpenAI's full o1 model (o1-1217) on various tasks like AIME 2024 (a math test), SWE-bench Verified (a software engineering task that requires producing GitHub PRs that implement scoped tasks for open source software like Django), and MATH-500. The techniques used in R1 (reinforcement learning on a strong base model) mainly benefit performance on coding, math and logic problems. An incredible release and gift to the community.<br><br>Noteworthy is the addition of several distilled models, such as Qwen (1.5B up to 32B) and Llama 3.1 8B and 3.3 70B. Those models seem to benefit greatly from being finetuned (no reinforcement learning) on traces generated by the full R1 model.<br><br>As for the model details: R1 is a MoE model with 671B parameters and 37B active parameters, 128K context length and basic support for function calling. Being a reasoning model, it will tend to generate many more output tokens for a given user prompt. The long Chain of Thought is encapsulated in a &lt;think&gt;&lt;/think&gt; XML tag, after which the final answer is provided by the model.</p></li><li><p><a href="https://x.ai/blog/grok-3">Grok-3 released by xAI: #1 in LMSYS leaderboard in all categories</a></p><p>Impressive release by xAI. They reach o3-mini-high, o1, and DeepSeek-R1 levels with their latest Grok 3 reasoning models, and on some benchmarks beat all of those by a decent margin, e.g. LiveCodeBench (v5) and AIME&#8217;25. It was trained on a cluster of 200k H100s, although it's not clear whether all of those GPUs were used for a single run. xAI uses a similar strategy of having both reasoning and non-reasoning models in their latest frontier line-up, and has both a big and a small model variant. API access is not yet available, so benchmark reproductions have not yet happened broadly, although the consensus is that this model seems very good.</p></li></ul><p>&#128230; <strong>Repos</strong></p><ul><li><p><a href="https://github.com/volcengine/verl">verl: Volcano Engine Reinforcement Learning for LLMs</a></p><p>This training stack for large-scale reinforcement learning LLM training by Bytedance is quickly rising in popularity. It's a neat addition to the Hugging Face trainer called <a href="https://huggingface.co/docs/trl">TRL</a>.</p></li><li><p><a href="https://github.com/RulinShao/retrieval-scaling">Scaling Retrieval-Based Language Models with a Trillion-Token Datastore</a></p><p>A very cool large-scale retrieval system that can index and query a very large text corpus. They introduce this system in the context of exploring compute-optimal scaling laws (see paper).
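</p><p>As a toy picture of the setup (my own sketch, not the repo's code): a small generator paired with a huge nearest-neighbor datastore, so knowledge is fetched at inference time rather than memorized in weights.</p><pre><code class="language-python">
# Toy sketch of datastore-augmented inference; my own illustration, not
# the retrieval-scaling codebase. Retrieve from a (here tiny, in the paper
# trillion-token) corpus, then condition generation on it.
import numpy as np

corpus = [
    "Mount Erebus is an active volcano in Antarctica.",
    "The Hague hosts the International Court of Justice (ICJ).",
]

def embed(text: str) -> np.ndarray:
    # Deterministic stand-in for a real embedding model.
    rng = np.random.default_rng(abs(hash(text)) % 2**32)
    return rng.standard_normal(64)

index = np.stack([embed(doc) for doc in corpus])  # the "datastore"

def retrieve(query: str, k: int = 1) -> list[str]:
    scores = index @ embed(query)  # dot-product similarity
    return [corpus[i] for i in np.argsort(-scores)[:k]]

prompt = "Where is the ICJ located?"
context = "\n".join(retrieve(prompt))
# A small LM would now see `context + prompt`, adding knowledge at
# inference time instead of storing it in parameters.
print(context)
</code></pre><p>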
They state that "such that a smaller model augmented with a large datastore outperforms a larger LM-only model on knowledge-intensive tasks" implying it's more efficient to assume and make available large amounts of data at inference time to dynamically add knowledge instead of using purely pre-training.<br><br>Interestingly, this is aligned with earlier remarks from Wen-mei Hwu (of Programming Massively Parallel Processors book fame) that he shared at a GPU-MODE (kernel programming community) meetup where he said he expects models to "remember" less factual data and retrieve more factual data at inference time. As a path to total system optimization (traditional LLMs have a lot of memory bandwidth pressure because of their size). In a way this is "just RAG" but much more widely than just retrieving a small amount of relevant documents.</p></li><li><p><a href="https://github.com/willccbb/verifiers">Verifiers: a set of tools for reinforcement learning with LLMs in verifiable environments</a></p><p>This is an awesome project showing how to reproduce DeepSeek-R1's proposed method of performing reinforcement learning on strong base models using the GRPO algorithm. The unique value of this repo is on offering an e2e training script to run and allowing you to define reward signals for the GRPO algorithm to use. The author, Will Brown - who is an AI researcher at Morgan Stanley, describes this kind of reward signal creation as rubric engineering. Check out <a href="https://youtu.be/D7BzTxVVMuw?t=5762">his excellent talk</a> on this idea and play with the code!<br><br>If you want to get good intuition on the GRPO algorithm itself first check out the GRPO explainer video linked in this newsletter's edition.</p></li><li><p><a href="https://huggingface.co/perplexity-ai/r1-1776">Uncensored DeepSeek-R1 1776 by Perplexity</a></p><p>They say they've "post-trained it to remove Chinese Communist Party censorship". A useful contribution for everyone that wants their AI features powered by DeepSeek to be a little more Western and a little less CCP. This might have implications for how the CCP sees model drops by Chinese companies like the High-Flyer hedge fund backed DeepSeek. Hopefully there's a limited chilling effect as Chinese teams are great at innovating in this space. The team from Perplexity carefully analyzes benchmark scores in order to avoid performance degradation as a result of the uncensoring. It looks like they've been able to maintain full model performance.</p></li><li><p><a href="https://github.com/simplescaling/s1">s1: Simple test-time scaling</a></p><p>A simple baseline reproduction of o1-like scaling by performing strictly SFT (not RL tuning) on a strong instruct model (they use Qwen 2.5 32B). They show that a tiny amount of high quality curated reasoning data (1K samples) can make a big difference.</p></li></ul><p>&#128196; <strong>Papers</strong></p><ul><li><p><a href="https://arxiv.org/abs/2501.17116">Optimizing Large Language Model Training Using FP4 Quantization</a></p><p>Frontier teams, in this case Microsoft, continue to push the efficiency of model training. In this paper they show how to mitigate the quantization errors that appear in naive FP4 LLM training. 
They show BF16- and FP8-level quality in experiments at the 100B-token/13B-parameter scale.</p></li><li><p><a href="https://www.arxiv.org/abs/2502.05171">Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach</a></p><p>So far, most test-time compute implementations have scaled inference in "token space", generating tokens in the final output stream to produce long Chain-of-Thought streams that improve final-answer accuracy. This approach of reasoning in latent space, deeper within the model architecture, "can capture types of reasoning that are not easily represented in words". They show that their 3.5B model can reach the quality of 50B-sized models on certain problem cases.<br><br>If you've heard of the &#128260; shape rotators &#128260; &gt; &#128483;&#65039; wordcels &#128483;&#65039; meme then you'll probably be glad to hear that shape rotating can sometimes indeed be the winning approach.</p></li><li><p><a href="https://arxiv.org/abs/2502.11089">Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention</a></p><p>DeepSeek releases an inference technique that utilizes train-time algorithms to improve long-context inference performance. They mention scaling up to 64K context lengths in the paper; the DeepSeek-R1 model supports a 128K context window, but the official DeepSeek inference API goes up to 64K, most likely because they're using NSA to improve the efficiency with which they host the model. Very kind of them to share this innovation so openly to help others host DeepSeek better &#128588;</p></li><li><p><a href="https://arxiv.org/abs/2412.21139">Training Software Engineering Agents and Verifiers with SWE-Gym</a></p><p>RL gyms are hot again! They were hot back in 2016 (check out <a href="https://openai.com/index/openai-gym-beta/">this throwback banger</a> by OpenAI). This time it's because various labs (OpenAI, DeepSeek) have found that performing RL on a very strong base LLM (crucial) with verifiable rewards (gyms) actually works.</p></li></ul><p>&#128218; <strong>Resources</strong></p><ul><li><p><a href="https://huggingface.co/spaces/nanotron/ultrascale-playbook">The Ultra-Scale Playbook: Training LLMs on GPU Clusters</a></p><p>An incredibly detailed guide from the kind folks at Hugging Face explaining the details of large-scale distributed LLM training. They go in-depth on topics like data parallelism (ZeRO-1|2|3) and advanced topics like FP8 training, Flash Attention, and other kernel optimizations. A great gift to the community. Now all you need is to be GPU-rich &#129299;</p></li><li><p><a href="https://huggingface.co/datasets/facebook/natural_reasoning">NaturalReasoning: Reasoning in the Wild with 2.8M Challenging Questions</a></p><p>Meta open-sources a large reasoning dataset that they've constructed purely by searching pre-training data for reasoning traces and generating relevant matching questions/prompts.
Their full data pipeline is LLM-based and uses their latest Llama 3.3 70B Instruct model.</p></li><li><p><a href="https://docs.agentstack.sh/introduction">AgentStack: start your agent project</a></p><p>A neat scaffolding tool to build agents; it supports the CrewAI, LangGraph, OpenAI Swarm and LlamaIndex frameworks in addition to multiple LLM, tool and observability providers (although, as it's a project from AgentOps, it biases towards that provider for the observability part).</p></li><li><p><a href="https://huggingface.co/blog/unified-tool-use">Tool Use, Unified by Hugging Face</a></p><p>Neat write-up of tool use formatting by Matthew Carrigan from Hugging Face. The post explains how tool definitions and tool call signatures are standardized, but that more work is required on parsing tool call formats. Overall, tool use is still a complex formatting issue, especially for advanced features like constrained decoding and streaming. Luckily my work at Groq aims to make this problem go away magically when you use tools with Groq inference APIs.</p></li><li><p><a href="https://x.com/sgershuni/status/1893690902398611872">Model Context Protocol continues to grow: a talk about what's next by Anthropic</a></p><p>A recording should go up later and I will share it once it does. The TLDR is that MCP seems to be Anthropic's bet on how agent protocols will take shape, and more advanced MCP use cases, like MCP servers acting as both client and server, remote MCP servers, and complex authentication support, are coming soon. I'd recommend checking <a href="https://modelcontextprotocol.io/quickstart">the MCP docs</a> every now and then as it's developing quickly.<br><br>I think going into why MCP could be the basis of agents warrants an entire separate blog post, but the TLDR is that MCP makes it easy to delegate tool execution such that an agent (client program) doesn't need to implement the tools itself (e.g. a tool call to fetch Linear tickets can now be delegated to an MCP server that the actual Linear team maintains).<br><br>It also allows for complex delegation patterns (sub-agents) where an MCP server handling a tool call itself delegates to other MCP servers and/or makes use of LLM inference in order to satisfy the tool call. This leads to tool calls themselves becoming more "high-level": instead of pure code-style function calls (e.g. fetch JSON weather data for city X), they look more like "fetch my Uber order's status" (perhaps using a delegate <a href="https://docs.stagehand.dev/integrations/mcp-server">browser agent implemented as an MCP server</a>). I like to think of MCP servers as a "GraphQL"-like layer that is optimized on the frontend (what the LLM gets) for natural language, but in the backend can call regular REST APIs to perform a requested tool call.</p></li><li><p><a href="https://gr.inc/">General Reasoning releases large-scale reasoning dataset</a></p><p>123k outputs from popular reasoning models like R1, R1-Zero, LIMO, DeepHermes, OpenThoughts, R1-Distil-L70B and DeepScaleR. They have comparison answers from o3-mini and gemini-flash-thinking.</p></li><li><p><a href="https://www.youtube.com/watch?v=bAWV_yrqx4w">[GRPO Explained] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models</a></p><p>Excellent Yannic Kilcher video, as always, about GRPO.
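</p><p>The core trick, in one snippet (my own minimal rendering for intuition, not DeepSeek's implementation): sample a group of responses per prompt and use each reward's z-score within the group as its advantage, removing the need for a learned value network.</p><pre><code class="language-python">
# Minimal rendering of GRPO's group-relative advantage (no critic needed).
# My own sketch for intuition, not DeepSeek's implementation.
import torch

# One group: rewards for 5 sampled answers to the same prompt.
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0, 1.0])

adv = (rewards - rewards.mean()) / (rewards.std() + 1e-4)
print(adv)  # above-average answers get positive advantage, below negative

# Each response's token log-probs are then reweighted by its advantage in
# a PPO-style clipped objective, averaged over the group.
</code></pre><p>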
He makes the RL algorithm, which is a derivative of PPO (invented by OpenAI co-founder John Schulman), surprisingly easy to understand.</p></li><li><p><a href="https://www.youtube.com/watch?v=v0gjI__RyCY">Jeff Dean &amp; Noam Shazeer &#8211; 25 years at Google: from PageRank to AGI</a></p><p>Two OG Googlers discuss AGI and the technologies to get there. A fascinating listen.</p></li><li><p><a href="https://supergpqa.github.io/">SuperGPQA: large-scale GPQA-like benchmark</a></p><p>The numbers: 26,529 questions across 13 disciplines, 72 fields, and 285 graduate-level disciplines. Some top model rankings: DeepSeek-R1 (62) &gt; o3-mini (high reasoning effort) (55) &gt;= Doubao-1.5-Pro (55) &gt; Claude-3.5-Sonnet (48).</p></li></ul><div><hr></div><p>Want more? Follow me on X! <a href="https://x.com/RickLamers">@ricklamers</a></p>]]></content:encoded></item><item><title><![CDATA[DeepSeek-V3: the model everyone is talking about]]></title><description><![CDATA[Week 1 of Coding with Intelligence]]></description><link>https://codingwithintelligence.com/p/deepseek-v3-the-model-everyone-is</link><guid isPermaLink="false">https://codingwithintelligence.com/p/deepseek-v3-the-model-everyone-is</guid><dc:creator><![CDATA[Rick Lamers]]></dc:creator><pubDate>Thu, 02 Jan 2025 13:54:55 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/e65ad0f7-6990-4c8f-aff1-eee2466eac19_1680x1200.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Dear readers,<br><br>Happy New Year and welcome to 2025! This week&#8217;s edition is a collection of everything that happened in the final 2 weeks (51/52) of 2024, and BOY did it get busy during that final sprint of the year. If nothing else, I think it signals that 2025 is going to be an incredible year for AI. With democratization of frontier performance (DeepSeek-V3, QwQ, QVQ, Llama 3.3 70B, Qwen 2.5 72B), an incredible installed base of compute clusters (multiple interconnected 100k accelerator clusters, 1M clusters in the works), and new frontier heights (o3) that fully automate most run-of-the-mill software engineering (71.7% on SWE-bench Verified), the pace of progress is bound to be electric.
Strap in and enjoy the ride!<br><br>- Rick Lamers</p><figure><img src="https://substackcdn.com/image/fetch/$s_!hLQU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88f9a986-cfe7-4537-b556-645b7afceaea_1680x1200.png" alt="">
stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">DeepSeek-V3 Perf/Cost chart: a new position on the Pareto front</figcaption></figure></div><p>&#128240; <strong>News</strong></p><ul><li><p><a href="https://api-docs.deepseek.com/news/news1226">DeepSeek-V3: GPT-4o level open source model</a></p><p>671B MoE parameters, 37B active, pre-trained on 14.8T tokens. See Technical Report for more details. The takeaway here is that this launch is _drastically_ commoditizing frontier level models. It is significantly cheaper than GPT-4o (at 9:1 input output token it is 1/9th the cost). It claims to rival Claude Sonnet 3.5 but many evals (e.g. SWE-bench verified) still show Sonnet 3.5 (v2) beating it slightly. Training cost was allegedly around $5.5M (but maybe even cheaper because that's calculated at hourly prices of renting H800s).</p></li><li><p><a href="https://qwenlm.github.io/blog/qwq-32b-preview/">QwQ: 32B reasoning model</a></p><p>o1-mini level performance. First truly open source reasoning model. Read the blog post for details. Apache 2.0 licensed.</p></li><li><p><a href="https://qwenlm.github.io/blog/qvq-72b-preview/">QVQ: 72B vision reasoning model</a></p><p>Gets close to latest o1 (o1 2024-12-17) vision performance as measured by vision evals like MathVista. Uses 'qwen' license.</p></li><li><p><a href="https://arcprize.org/blog/oai-o3-pub-breakthrough">o3 "cracks" ARC-AGI-1</a></p><p>o3 does well on ARC-AGI-1 which many predicted would take a long time. As is tradition in machine learning, once a task has been mastered the goal post is quickly moved, and we're left searching for the next task that isn't solved. The AGI definition of MSFT for OpenAI of "excess of $100B in profits" seems most robust as humans are pretty good at trying to compete away profits by attacking high margin activities.<br><br>I'm looking forward to the next eval folks are targeting. It might be ARC-AGI-2 (on which o3 apparently gets 30%) or Frontier Math by Epoch AI (on which o3 gets about 25% atm).  
The more evals come to resemble real-world (economic) utility, the more the models that beat them actually end up making a difference in practice (getting 71.7% on SWE-bench Verified like o3 does means we can all automate a significant portion of software engineering work).<br><br>Folks have pointed out that o3-mini will be the more cost-effective option (beating full o1 at several tasks), but since neither o3 nor o3-mini is being made available yet, not much attention has gone to it.</p></li><li><p><a href="https://appgen.groqlabs.com/">Groq Appgen: instant web app generation on Groq</a></p><p>I've personally built this application &#128526; (with contributions from my awesome colleagues Jose Menendez and Benjamin Klieger), showcasing the power of Groq speed (with speculative decoding enabled it achieves 2k tokens per second on 70B!) with the Llama 3.3 70B model for strong code generation capabilities. Check out this X video by my colleague Benjamin: <a href="https://x.com/benklieger/status/1870277109601771851">https://x.com/benklieger/status/1870277109601771851</a><br><br>We've also open sourced the entire implementation: <a href="https://github.com/groq/groq-appgen">https://github.com/groq/groq-appgen</a></p></li><li><p><a href="https://epoch.ai/gradient-updates/frontier-language-models-have-become-much-smaller">Epoch AI: Frontier models have likely gotten much smaller</a></p><p>Nice investigative work showing that frontier models are becoming smaller. We're long past the stage of a simple "1T models won't ever go into production". Efficiency = margin in the age of scaling AI to broad-based use, so the investment incentive to make models more efficient is enormous.</p></li><li><p><a href="https://huggingface.co/blog/modernbert">ModernBERT: a better BERT by Answer.AI</a></p><p>A good base model for finetuning your task-specific embedding model; here's a list of finetunes folks have created: <a href="https://huggingface.co/models?other=base_model:finetune:answerdotai%2FModernBERT-base&amp;sort=downloads">https://huggingface.co/models?other=base_model:finetune:answerdotai%2FModernBERT-base&amp;sort=downloads</a></p></li><li><p><a href="https://x.com/bilaltwovec/status/1873791617133621376?s=46">YouTube already licensing video data to model companies?</a></p><p>Haven't seen coverage of this. Neat find by <a href="https://x.com/bilaltwovec">@bilaltwovec</a>.</p></li><li><p><a href="https://ai.google.dev/gemini-api/docs/thinking-mode">Google DeepMind previews Gemini 2.0 Flash Thinking model</a></p><p>Noam Shazeer, original author of Attention Is All You Need, has returned to Google and posted about this new model on X: <a href="https://x.com/NoamShazeer/status/1869789881637200228">https://x.com/NoamShazeer/status/1869789881637200228</a></p></li><li><p><a href="https://www.chinatalk.media/p/deepseek-ceo-interview-with-chinas">Rare DeepSeek CEO interview</a></p><p>Read my full thoughts on the interview in this X &#129525; <a href="https://x.com/RickLamers/status/1874778471907344825">https://x.com/RickLamers/status/1874778471907344825</a></p></li><li><p><a href="https://deepmind.google/technologies/veo/veo-2/">Google DeepMind announces Veo 2</a></p><p>Impressive video generation model by Google DeepMind. Expectations are generally that Google has a compute (TPUs) and data (YouTube) advantage and will really nail video generation. It will be interesting to see how they go from demo &#8594; paid product (APIs on gcloud?
<p>&#128230; <strong>Repos</strong></p><ul><li><p><a href="https://huggingface.co/papers/2412.15115">Qwen2.5 Technical Report</a></p><p>Mainly reveals that Qwen is scaling aggressively (18T pre-training tokens), funded by a big tech incumbent (Alibaba). An amazing gift to the community!</p></li><li><p><a href="https://huggingface.co/blog/smolagents">smolagents: code isolation/agent framework by Hugging Face</a></p><p>Wraps E2B, which you may have seen: it provides isolated code execution APIs.</p></li></ul><p>&#128196; <strong>Papers</strong></p><ul><li><p><a href="https://apollo-lmms.github.io/">Apollo: An Exploration of Video Understanding in Large Multimodal Models</a></p><p>Meta researchers explore video understanding with VLMs. Strong VLMs for video understanding could unlock another large pool of pre-training data by "transcribing" videos made in the real world. Video recording is cheap (smartphones); transcription based on visual understanding could yield important real-time, at-scale data.</p></li><li><p><a href="https://mint-video.github.io/">Mind the Time: Temporally-Controlled Multi-Event Video Generation</a></p><p>Interesting new technique for controlling video generation.</p></li><li><p><a href="https://chenglin-yang.github.io/1.58bit.flux.github.io/">1.58-bit FLUX</a></p><p>Bringing low-bit representations to image generation models. Very impressive efficiency gains by folks from ByteDance/POSTECH (a South Korean research university).</p></li><li><p><a href="https://arxiv.org/abs/2412.09764">Memory Layers at Scale</a></p><p>"On downstream tasks, language models augmented with our improved memory layer outperform dense models with more than twice the computation budget, as well as mixture-of-expert models when matched for both compute and parameters." Meta continuing to advance architecture ideas, and kindly running experiments at scale to aid the GPU poor. Thank you team &amp; Zuck!</p></li></ul><p>&#128736;&#65039; <strong>Products</strong></p><ul><li><p><a href="https://blog.google/products/gemini/google-gemini-deep-research/">Gemini Deep Research</a></p><p>Agentic product feature of Gemini Advanced, tapping the strong long-context performance of Google's Gemini models. Think of it as LLM+web search on steroids. Neat launch!</p></li></ul><p>&#128218; <strong>Resources</strong></p><ul><li><p><a href="https://www.youtube.com/watch?v=oqOBYOx3sHE">Beyond Decoding: Meta-Generation Algorithms for Large Language Models (Remote Talk)</a></p><p>Ideas for inference-time scaling algorithms.</p></li><li><p><a href="https://www.anthropic.com/research/building-effective-agents">Building effective agents by Anthropic</a></p><p>A surprisingly balanced survey of patterns in agentic LLM applications. I think this is very close to the best understanding the leading framework developers have of agentic AI.
The split between Workflows, where control flow is handled in code, and Agents, where control flow is handled dynamically by the LLM, really resonates with me. Furthermore, prompt engineering your tools is an underrated optimization strategy for getting better performance.</p></li><li><p><a href="https://simonwillison.net/2024/Dec/24/qvq/">Simon Willison covers QVQ, the latest Qwen visual reasoning model</a></p><p>"I&#8217;ve tried it out with a bunch of things, with mixed results&#8212;but it&#8217;s really fun seeing how it works through a problem." is the vibe eval, in case you're short on time.</p></li><li><p><a href="https://benediktstroebl.github.io/reasoning-model-evals/">Reasoning Model Evals: evals to show in which domains reasoning models improve results</a></p><p>By Arvind Narayanan et al. from Princeton. A neat attempt to identify where reasoning models help. But with all the action in reasoning models and inference-time compute, it's bound to get outdated quickly.</p></li><li><p><a href="https://www.youtube.com/watch?v=loaTGpqfctI">Byte Latent Transformer: Patches Scale Better Than Tokens (Paper Explained)</a></p><p>By your favorite AI youtuber Yannic Kilcher. This is Meta's paper about getting rid of tokenizers.</p></li><li><p><a href="https://github.com/deepseek-ai/DeepSeek-V3/blob/main/DeepSeek_V3.pdf">DeepSeek-V3 Technical Report</a></p><p>Interesting notes about several optimizations enabling fast FP8 training, the use of Multi-Token Prediction (with ablations!), and, interestingly, distillation from R1 (their reasoning model) for improved reasoning. Additionally they introduce a MoE routing collapse prevention technique they dub "auxiliary-loss-free load-balancing".</p></li><li><p><a href="https://huggingface.co/spaces/HuggingFaceH4/blogpost-scaling-test-time-compute">Scaling Test-Time Compute with Open Models</a></p><p>Awesome exploration of scaling test-time compute with open models by Hugging Face. "Check out this plot where the tiny 1B and 3B Llama Instruct models outperform their much larger 8B and 70B siblings on the challenging MATH-500 benchmark if you give them enough &#8220;time to think&#8221;" &#129327; Very cool result by team HF! See the sketch below this list for the core idea.</p></li></ul>
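<p>To make the core idea concrete, here is an illustrative sketch of weighted best-of-n, the simplest test-time compute strategy the post explores. The <code>generate</code> and <code>score</code> callables are hypothetical stand-ins for an LLM sampler and a reward model; see the HF post for the real setup.</p><pre><code># Illustrative sketch of weighted best-of-n test-time compute.
# generate() and score() are hypothetical stand-ins for an LLM sampler
# and a reward model; this is not the exact HF implementation.
from collections import defaultdict

def weighted_best_of_n(question, generate, score, n=16):
    # Sample n candidate solutions at non-zero temperature;
    # generate() is assumed to return (final_answer, reasoning) pairs.
    candidates = [generate(question) for _ in range(n)]
    # Weighted majority vote: sum reward-model scores per final answer,
    # so answers that are frequent AND high-scoring win.
    totals = defaultdict(float)
    for answer, reasoning in candidates:
        totals[answer] += score(question, reasoning)
    return max(totals, key=totals.get)
</code></pre>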
<div><hr></div><p>Want more? Follow me on X! <a href="https://x.com/RickLamers">@ricklamers</a></p>]]></content:encoded></item><item><title><![CDATA[Gemini 2.0: is Google finally where everyone expected it to be?]]></title><description><![CDATA[Week 49 + 50 of Coding with Intelligence]]></description><link>https://codingwithintelligence.com/p/gemini-20-is-google-finally-where</link><guid isPermaLink="false">https://codingwithintelligence.com/p/gemini-20-is-google-finally-where</guid><dc:creator><![CDATA[Rick Lamers]]></dc:creator><pubDate>Sun, 15 Dec 2024 18:42:45 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/e125ac7a-3f91-4edb-a35b-e19583136928_1680x1200.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This week covers 41 (!) updates about what&#8217;s happening in AI: spoiler - a lot. I&#8217;m sorry I had to skip y&#8217;all for a week, but it was just <em>so</em> busy at Groq (we shipped Llama 3.3 70B + a specdec version - this runs it at over 2000 tok/s on certain queries, <a href="https://x.com/BenjaminKlieger/status/1865094662643753163">check it out!</a>). Enjoy this week&#8217;s update &#9996;&#65039;</p>
stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><p>&#128240; <strong>News</strong></p><ul><li><p><a href="https://deepmind.google/discover/blog/genie-2-a-large-scale-foundation-world-model/">DeepMind releases Genie 2: a foundational world model</a></p><p>Similar to GameNGen by Google Research earlier and Oasis by Decart. It allows a user to provide inputs to have real-time control over a world simulation engine. It seems to suffer from similar issues of earlier attempts like persistence. "we believe Genie 2 is the path to solving a structural problem of training embodied agents safely while achieving the breadth and generality required to progress towards AGI". You can be sure that teams like the folks behind the Tesla Optimus humanoid robot are experimenting with these world simulation techniques.</p></li><li><p><a href="https://blog.google/technology/google-deepmind/google-gemini-ai-update-december-2024/">Gemini 2.0</a></p><p>Google really outdid themselves with this release, not only does the model score exceptionally well on coding benchmarks versus Claude Sonnet 3.5 (new) but also does it ship with advanced agentic features like realtime voice + video input modes in the Live API. Google is now close to or depending on your POV leading in the AI race and it&#8217;s what people expected from the company since the open sourcing of TensorFlow. Especially notable is that they&#8217;re still completely unbeaten on video input and long context (up to 10M and 2M in prod).</p></li><li><p><a href="https://openai.com/12-days/?day=6">OpenAI announces realtime video mode</a></p><p>Similar to Gemini Flash 2.0 realtime mode. Only Google got there first this time :) When access rolls out more broadly evals will likely show which of these does best (under which conditions). Tracking!</p></li><li><p><a href="https://openai.com/index/introducing-chatgpt-pro/">ChatGPT Pro Mode: o1 pro mode</a></p><p>o1 pro mode scores high on reasoning tasks like PhD-Level Science Questions (GPQA Diamond), Competition Code (Codeforces), Competition Math (AIME 2024). It's pretty slow with a loading bar and has been described as "integrated best-of-n sampling" giving you sort of a pass@k level performance where k is something like 5 versus regular o1. Not sure if it warrants the $200 price tag associated to ChatGPT Pro Mode. Others have pointed out that $200 is maybe worth it for the increased o1 (non pro) rate limit increase. 
I guess it remains to be seen how much traction the high-priced ChatGPT plan will get.</p></li><li><p><a href="https://huggingface.co/deepseek-ai/DeepSeek-V2.5-1210">DeepSeek releases update V2.5-1210</a></p><p>Improved math &amp; coding ability wrt the original V2.5: the MATH-500 benchmark increased from 74.8% to 82.8%, and the LiveCodeBench (08.01 - 12.01) benchmark increased from 29.2% to 34.38%.</p></li><li><p><a href="https://arxiv.org/abs/2412.08905">Microsoft releases Phi-4: 14B math buff</a></p><p>"Phi-4 outperforms much larger models, including Gemini Pro 1.5, on math competition problems". The technical report has been lauded for digging into the synthetic data generation strategies used. Announcement: https://techcommunity.microsoft.com/blog/aiplatformblog/introducing-phi-4-microsoft%E2%80%99s-newest-small-language-model-specializing-in-comple/4357090 and weights are unofficially on HF: https://huggingface.co/matteogeniaccio/phi-4.</p></li><li><p><a href="https://www.snowflake.com/engineering-blog/snowflake-arctic-embed-2-multilingual/">Snowflake releases strong permissively licensed multi-lingual embedding models: Arctic Embed 2.0</a></p><p>Very useful! Around the level of OpenAI's text-embedding-3 at 1/3rd of the dimension size. See the sketch below this list for trying them out.</p></li><li><p><a href="https://aws.amazon.com/ai/generative-ai/nova/understanding/">Amazon releases Nova model series</a></p><p>Check out the technical report here: https://www.amazon.science/publications/the-amazon-nova-family-of-models-technical-report-and-model-card. Their best model is approximately at Llama 3.2 90B/Llama 3.1 70B level, but Nova isn't open source. The report also covers image generation and video generation models, showing that these are presumably becoming table-stakes technologies for large tech companies like Amazon/Google/Tencent/Alibaba/Meta.</p></li><li><p><a href="https://kprize.ai/">$1M prize for beating a new SWE-bench-like benchmark</a></p><p>Unreasonably high requirements (90%? open source only?). But an interesting initiative for tracking SWE agent progress regardless.</p></li><li><p><a href="https://github.com/deepseek-ai/DeepSeek-VL2">DeepSeek-VL2: MoE A4.5B vision model competitive with Qwen2-VL 7B/Pixtral 12B</a></p><p>Meaningful performance above Pixtral 12B at a much lower active param budget. Quite impressive.</p></li><li><p><a href="https://x.ai/blog/grok-image-generation-release">Grok image generation model Aurora released</a></p><p>Grok 2 initially relied on FLUX for image generation on the X platform, but now the team seems to be building its own image-gen muscle. Images are decent but not yet surpassing SOTA like Ideogram or FLUX1.1 [pro]. No API release as of yet.</p></li><li><p><a href="https://www.reddit.com/r/LocalLLaMA/comments/1h52qat/nous_distro_distributed_training_framework_update/">Nous DisTrO: distributed model training</a></p><p>Nous has been making nice progress with distributed model training; one of the neat results is the DeMO paper, co-authored with Diederik P. Kingma, author of the original Adam optimizer. DeMO shows controlled divergence can enable large-scale distributed training. The Nous model's MMLU scores can't compete with the slew of open source models available (all trained on fast-interconnect clusters, of course), but it's encouraging to see work in this area, guaranteeing we'll be less dependent on centralized training initiatives if we need to be in the future. In essence, this line of work is a nice hedge against centralization of power. Which likely helps the centralized initiatives behave better as a result ;-)</p></li><li><p><a href="https://x.com/FishAudio/status/1864370933496205728">Fish Audio 1.5 release: SOTA open source TTS</a></p><p>The gap between proprietary (ElevenLabs) and open source is shrinking. Try the playground at https://fish.audio/, it's very impressive. Check out the ranking of TTS models in the helpful HF leaderboard https://huggingface.co/spaces/TTS-AGI/TTS-Arena</p></li></ul>
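<p>As a minimal sketch of trying Arctic Embed 2.0 for retrieval, assuming the sentence-transformers library and the Hugging Face checkpoint id below (check the model card, which may also recommend a dedicated query prompt):</p><pre><code># Sketch: multilingual retrieval with Arctic Embed 2.0.
# Assumes the sentence-transformers package and this HF checkpoint id;
# consult the model card for recommended query prompts.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Snowflake/snowflake-arctic-embed-l-v2.0")

docs = [
    "Groq serves Llama 3.3 70B at over 2000 tokens per second.",
    "Les baleines sont des mammiferes marins.",
]
query = "How fast does Groq run Llama 3.3 70B?"

doc_emb = model.encode(docs, normalize_embeddings=True)
q_emb = model.encode([query], normalize_embeddings=True)

# With normalized vectors, cosine similarity is just a dot product.
scores = (q_emb @ doc_emb.T)[0]
print(sorted(zip(scores.tolist(), docs), reverse=True))
</code></pre>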
<p>&#128230; <strong>Repos</strong></p><ul><li><p><a href="https://github.com/facebookresearch/flow_matching">Flow Matching in PyTorch educational resource by Meta</a></p><p>"Flow matching is a recent framework for generative modeling that has achieved state-of-the-art performance across various domains, including image, video, audio, speech, and biological structures." For example, FLUX.1 is based on it, as are many other SOTA generative models in other multimodal domains.</p></li><li><p><a href="https://github.com/Tencent/HunyuanVideo">HunyuanVideo: strong open source video model</a></p><p>There's also support for distributed inference using the xDiT framework (xDiT: an Inference Engine for Diffusion Transformers (DiTs)). Very cool work bringing SOTA to local inference. A friend of mine is already generating videos with it on his 4090s!</p></li><li><p><a href="https://github.com/unternet-co/web-applets">Web Applets: An open spec &amp; SDK for creating apps that agents can use</a></p><p>Really cool project by Rupert Manfredi, incubated at Mozilla.</p></li></ul><p>&#128196; <strong>Papers</strong></p><ul><li><p><a href="https://arxiv.org/abs/2406.16793">Adam-mini: Use Fewer Learning Rates To Gain More</a></p><p>A reduced-memory-footprint version of the successful Adam optimizer. Awesome work for hobbyist fine-tuners, who typically try to squeeze the most out of their hardware. Support seems to have already landed in LLaMA-Factory https://github.com/hiyouga/LLaMA-Factory</p></li><li><p><a href="https://mammoth-vl.github.io/">&#129443; MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale</a></p><p>"Using only open models, we create a dataset containing 12M instruction-response pairs to cover diverse, reasoning-intensive tasks with detailed and faithful rationales." A nice study showing that post-training for reasoning in the vision domain has lots of potential. They set SOTA scores on vision reasoning tasks like MMMU-Pro (+7%).</p></li><li><p><a href="https://ai.meta.com/research/publications/byte-latent-transformer-patches-scale-better-than-tokens/">Byte Latent Transformer: Patches Scale Better Than Tokens</a></p><p>Meta frees us from the burden of tokenizers, at last. "We present the first flop controlled scaling study of byte-level models up to 8B parameters and 4T training bytes." and "Patches are segmented based on the entropy of the next byte, allocating more compute and model capacity where increased data complexity demands it."

Meta truly GOAT, this is phenomenal.</p></li><li><p><a href="https://arxiv.org/abs/2411.17685">Attamba: Attending To Multi-Token States</a></p><p>Compressing tokens to avoid the "quadratic scaling of compute with sequence length"; a clever approach, and the quality hit seems minimal.</p></li><li><p><a href="https://arxiv.org/abs/2411.16489">O1 Replication Journey -- Part 2: Surpassing O1-preview through Simple Distillation, Big Progress or Bitter Lesson?</a></p><p>They explore fine-tuning on o1 API data (against ToS, but hey, science!) and remark on the importance of not relying (solely) on distillation for achieving SOTA with open models.</p></li><li><p><a href="https://arxiv.org/abs/2412.06769">Training Large Language Models to Reason in a Continuous Latent Space</a></p><p>"By delaying definite decisions and expanding the latent reasoning process, the model pushes its exploration closer to the search tree&#8217;s terminal states, making it easier to distinguish correct nodes from incorrect ones."</p></li><li><p><a href="https://arxiv.org/abs/2411.16679">Do Large Language Models Perform Latent Multi-Hop Reasoning without Exploiting Shortcuts?</a></p><p>Of course the answer is yes, but the paper helpfully highlights that it's only the case for certain types of queries. Shortcut exploitation is a notorious issue with machine learning models and undermines their generalization ability.</p></li><li><p><a href="https://arxiv.org/abs/2412.03555">PaliGemma 2: A Family of Versatile VLMs for Transfer</a></p><p>They highlight how their models can be used for transfer learning to domain-specific tasks like "table structure recognition, molecular structure recognition, music score recognition, as well as long fine-grained captioning and radiography report generation".</p></li><li><p><a href="https://arxiv.org/abs/2411.19722">JetFormer: An Autoregressive Generative Model of Raw Images and Text</a></p><p>"... most of these models still rely on many separately trained components such as modality-specific encoders and decoders. In this work, we further streamline joint generative modeling of images and text. We propose an autoregressive decoder-only transformer - JetFormer - which is trained to directly maximize the likelihood of raw data, without relying on any separately pretrained components, and can understand and generate both text and images." I like approaches that are aligned with the Bitter Lesson https://www.cs.utexas.edu/~eunsol/courses/data/bitter_lesson.pdf</p></li><li><p><a href="https://junhahyung.github.io/STGuidance/">Spatiotemporal Skip Guidance for Enhanced Video Diffusion Sampling</a></p><p>This training-free sampling guidance method meaningfully boosts the quality of the generated videos. Magical.</p></li><li><p><a href="https://arxiv.org/abs/2412.03782">The broader spectrum of in-context learning</a></p><p>This paper by folks from Google DeepMind places the phenomenon of in-context learning in the broader context of research on generalization and meta-learning. Helpful if you're interested in learning more about in-context learning and how it will develop over the coming years of AI research.</p></li></ul><p>&#128736;&#65039; <strong>Products</strong></p><ul><li><p><a href="https://github.com/copilot">GitHub Copilot on the web</a></p><p>GitHub now has a chat feature that supposedly integrates well with GitHub data (through GitHub APIs).
My test queries didn't fare very well; let me know in the comments if you have more success!</p></li><li><p><a href="https://lumalabs.ai/photon">Luma Labs releases image generation model Photon</a></p><p>Competitive with other SOTA. Not open source. Stunning images, but it fails my "a coffee cup upside down on a table" prompt adherence test. Oh well, more work to be done I guess :)</p></li><li><p><a href="https://repoprompt.com/">Repo Prompt: codebase to prompt</a></p><p>A utility to generate prompts from your codebase. Handy for models like o1 Pro Mode that don't have APIs yet; a useful crutch until those models get APIs (at which point AI-first coding IDEs like Cursor/Windsurf will handle what Repo Prompt does).</p></li><li><p><a href="https://serper.dev/">Serper: Google SERP search for LLMs</a></p><p>Came across this while looking at the Agent Arena leaderboard (cool project by Gorilla, see link in this newsletter). https://www.agent-arena.com/leaderboard See the sketch below this list for what wiring SERP results into an LLM app looks like.</p></li></ul>
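<p>A minimal sketch of fetching SERP results for LLM grounding; the endpoint, header, and response shape follow Serper's docs at the time of writing, so treat this as illustrative and verify against their current API reference:</p><pre><code># Sketch: Google SERP results for LLM grounding via Serper.
# Endpoint/header/response keys per Serper's docs at the time of writing;
# verify against the current API reference before relying on this.
import os
import requests

def serp_search(query: str) -> list:
    resp = requests.post(
        "https://google.serper.dev/search",
        headers={"X-API-KEY": os.environ["SERPER_API_KEY"]},
        json={"q": query},
        timeout=10,
    )
    resp.raise_for_status()
    # "organic" holds the regular result entries (title, link, snippet).
    return resp.json().get("organic", [])

for hit in serp_search("Groq LPU inference speed")[:3]:
    print(hit["title"], "-", hit["link"])
</code></pre>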
<p>&#128218; <strong>Resources</strong></p><ul><li><p><a href="https://forum.openai.com/public/videos/the-future-of-math-with-o1-reasoning-2024">The Future of Math with o1 Reasoning with Terence Tao and Mark Chen (SVP of Research @ OpenAI)</a></p><p>As one commenter suggests, the TLDR is "in the short run will help mathematicians develop proofs, but in the long run probably replace humans entirely."</p></li><li><p><a href="https://www.agent-arena.com/">Agent Arena: a leaderboard for agentic tasks</a></p><p>Cool project by the Gorilla team. An LMSYS-style leaderboard on agentic tasks like search, stock/financial data manipulation, research, and automation.</p></li><li><p><a href="https://huggingface.co/blog/infini-attention">A failed experiment: Infini-Attention, and why we should keep trying?</a></p><p>Good to see these kinds of write-ups; they are often more educational than just sharing a successful attempt.</p></li><li><p><a href="https://ailuminate.mlcommons.org/benchmarks/">AI safety benchmark AILuminate by MLCommons</a></p><p>Useful resource to get a quick safety score for models.</p></li><li><p><a href="https://x.com/cyberpengk/status/1867736413481128152">NeurIPS talk by Ilya Sutskever</a></p><p>"Pretraining is dead, long live compute"</p></li><li><p><a href="https://lilianweng.github.io/posts/2024-11-28-reward-hacking/">Reward Hacking in Reinforcement Learning</a></p><p>As mentioned earlier in this newsletter, models are susceptible to taking reasoning shortcuts (as explored in the multi-hop reasoning scenario). Lilian Weng always does a phenomenal job writing up ideas in ML, and this blog post is no exception. It dives mainly into the problem of reward hacking, while a promised future post will go into mitigations. Lilian has actually just recently left OpenAI, so maybe she's a bit more free to discuss these ideas moving forward. Tracking!</p></li><li><p><a href="https://www.youtube.com/watch?v=OY2x0TyKzIQ">Sora review by MKBHD</a></p><p>Surprisingly informed and grounded review of Sora from the perspective of actual users (not academics). Watch this if you don't have Sora access (they stopped accepting signups).</p></li><li><p><a href="https://github.com/irthomasthomas/undecidability/issues?q=is:issue%20is:open%20label:llm-benchmarks">List of interesting LLM benchmarks by Ir. Thomas</a></p><p>For example, it tracks benchmarks where humans still hold the record vs SOTA AI. That list might shrink faster than we'd like haha!</p></li><li><p><a href="https://transluce.org/neuron-descriptions">Scaling Automatic Neuron Description</a></p><p>Cool work scaling mechanistic interpretability by describing the neurons of Llama-3.1-8B-Instruct. They release a dataset with descriptions of every neuron in the model!</p></li><li><p><a href="https://www.youtube.com/watch?v=_Bw5o55SRL8">Inference Time Compute</a></p><p>Tutorial on inference-time compute by a 3rd-year PhD student from UChicago.</p></li></ul><div><hr></div><p>Want more? Follow me on X! <a href="https://x.com/RickLamers">@ricklamers</a></p>]]></content:encoded></item><item><title><![CDATA[Model Context Protocol: making LLMs more useful]]></title><description><![CDATA[Week 48 of Coding with Intelligence]]></description><link>https://codingwithintelligence.com/p/model-context-protocol-making-llms</link><guid isPermaLink="false">https://codingwithintelligence.com/p/model-context-protocol-making-llms</guid><dc:creator><![CDATA[Rick Lamers]]></dc:creator><pubDate>Sun, 01 Dec 2024 20:18:33 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/92f33ebe-1e1e-49f1-910c-3fbf8a684b44_1680x1200.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[
srcset="https://substackcdn.com/image/fetch/$s_!pldc!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdcd94cc-36b3-4d1f-ae8a-aaf425f62315_1680x1200.png 424w, https://substackcdn.com/image/fetch/$s_!pldc!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdcd94cc-36b3-4d1f-ae8a-aaf425f62315_1680x1200.png 848w, https://substackcdn.com/image/fetch/$s_!pldc!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdcd94cc-36b3-4d1f-ae8a-aaf425f62315_1680x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!pldc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdcd94cc-36b3-4d1f-ae8a-aaf425f62315_1680x1200.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Model Context Protocol architecture</figcaption></figure></div><p>&#128240; <strong>News</strong></p><ul><li><p><a href="https://spec.modelcontextprotocol.io/">Anthropic launches Model Context Protocol</a></p><p>"open protocol that enables seamless integration between LLM applications and external data sources and tools". It uses JSON messages to allow for communication between:<br><br><strong>Hosts:</strong> LLM applications that initiate connections<br><strong>Clients:</strong> Connectors within the host application<br><strong>Servers:</strong> Services that provide context and capabilities<br><br>Using MCP folks will be able to use LLMs more directly with the applications they use every day. Because the protocol is open it requires less coordination between walled gardens of your data silos. Anthropic probably identified that the value will be much greater to let this proliferate as an open standard versus making this something proprietary that gets little adoption. Does HTTP ring a bell? I think this is a launch to follow. 
</li><li><p><a href="https://allenai.org/blog/olmo2">Ai2 releases OLMo 2: the awaited SOTA truly Open Source LLM</a></p><p>They announce a model competitive with Llama 3.1 8B built with a truly open approach (they open source the pretraining code, data, and weights under permissive licenses, etc.). A real gift to the community.</p></li><li><p><a href="https://huggingface.co/blog/smolvlm">SmolVLM - small yet mighty Vision Language Model</a></p><p>Not better than Qwen2-VL 2B, but it beats moondream2 and PaliGemma 3B. An impressive contribution by the HF team for small-footprint VLMs. The efficiency can also make this interesting for large-scale data processing, like data extraction.</p></li><li><p><a href="https://x.com/jarrodwattsdev/status/1862299845710757980?s=46">AI Agent prompt hacking competition wins developer $50k</a></p><p>Neat concept: an AI agent controls an account with an initial $3k balance. You can attempt to prompt-convince it to wire you the money (over crypto) through its function calling abilities (approveTransfer). It's instructed, however, to never wire the money. The anonymous developer took home 30% of the prize as a fee, so that was a quick way to earn $14.1k :)</p></li></ul><p>&#128230; <strong>Repos</strong></p><ul><li><p><a href="https://github.com/genmoai/mochi/blob/main/demos/fine_tuner/README.md">Mochi 1 LoRA Fine-tuner</a></p><p>A neat single-script video language model fine-tuning setup that can run on a single H100/A100.</p></li><li><p><a href="https://github.com/punkpeye/awesome-mcp-servers">Awesome MCP Servers</a></p><p>MCP is getting adopted fast: a sprawling collection of projects making models more useful by gluing services together through the Model Context Protocol. "@modelcontextprotocol/server-filesystem &#128199; &#127968; - Direct local file system access." Some are pretty tricky if used maliciously! Use with caution. And of course, if you build any cool MCP servers, put them on the list with a PR!</p></li><li><p><a href="https://github.com/abus-aikorea/voice-pro">voice-pro: zero-shot Voice Cloning &amp; more</a></p><p>A controversial topic with deep fake attacks running rampant, but I think it's good to democratize the technology and make everyone robust against its existence. In addition, it serves as a great learning tool and a cost reducer for folks who have a genuine use case (automate your tutorial recordings :)?).</p></li><li><p><a href="https://github.com/steel-dev/steel-browser">steel-browser: a tool for building browser/agent automations</a></p><p>Very cool concept; AI-powered Selenium/Playwright is taking flight.</p></li><li><p><a href="https://github.com/awslabs/multi-agent-orchestrator">Yet Another Agent Framework (YAAF): Multi-Agent Orchestrator by AWS</a></p><p>It's not actually called YAAF, I made that up, but at this point I wonder how many more agent frameworks we need :)</p></li><li><p><a href="https://github.com/srcbookdev/srcbook">Srcbook: AI-first app development platform</a></p><p>Forget about mobile-first: AI-first app development platforms lean into the strengths/weaknesses of AI code gen to facilitate building apps quickly. Neat idea!
Kind of the natural expansion of Claude Artifacts?</p></li></ul><p>&#128196; <strong>Papers</strong></p><ul><li><p><a href="https://arxiv.org/abs/2411.14982">Large Multi-modal Models Can Interpret Features in Large Multi-modal Models</a></p><p>Mechanistic interpretability work can be very tedious, so the LLMs-Lab Team from NTU did what any reasonable AI-loving engineer/researcher would do: use AI. They've repurposed VLMs (like LLaVA-OV-72B) to identify features to be used for model steering. Looks a lot better than scaling the manual approaches that are prominent in MI work to date.</p></li><li><p><a href="https://arxiv.org/abs/2411.17116">Star Attention: Efficient LLM Inference over Long Sequences</a></p><p>A contribution from NVIDIA that reduces inference time up to 10x with minimal quality loss (95-100% retained). An approximation allows for parallel, low-communication-overhead calculation of attention. There's also a repo: https://github.com/NVIDIA/Star-Attention</p></li><li><p><a href="https://arxiv.org/abs/2410.21272">Arithmetic Without Algorithms: Language Models Solve Math With a Bag of Heuristics</a></p><p>&#8220;our experimental results across several LLMs show that LLMs perform arithmetic using neither robust algorithms nor memorization; rather, they rely on a "bag of heuristics"&#8221;. This really shows that circuitry formation inside LLMs leaves a lot to be desired; it will be an important branch of research for frontier labs.</p></li></ul><p>&#128241; <strong>Demos</strong></p><ul><li><p><a href="https://quark-3d.github.io/">Quark: Real-time, High-resolution, and General Neural View Synthesis</a></p><p>Cool neural rendering result by Google, impressively high quality. A bit of a detour from the usual LLM/VLM-focused programming, but a neat application of neural networks nonetheless.</p></li></ul><p>&#128218; <strong>Resources</strong></p><ul><li><p><a href="https://adamkarvonen.github.io/machine_learning/2024/06/11/sae-intuitions.html">An Intuitive Explanation of Sparse Autoencoders for LLM Interpretability</a></p><p>A good starter post if you want to learn about the techniques behind feature steering results like &#8220;Golden Gate Claude&#8221;.</p></li><li><p><a href="https://epoch.ai/data/ai-benchmarking-dashboard">AI Benchmarking Hub by Epoch AI</a></p><p>Epoch AI has been dropping resources for keeping up with progress in the field of AI, and this benchmark database is a welcome addition for seeing SOTA results at a glance. Ty guys!</p></li><li><p><a href="https://github.com/modelcontextprotocol/specification/blob/b9d368f6a28e1720be00bf5dc7c72ace2ed6948e/schema/schema.ts#L899">MCP cost/speed/intelligence trifecta</a></p><p>This was a neat find by the https://buttondown.com/ainews newsletter. MCP has a built-in way to set preferences for cost, speed, and intelligence. Likely the inputs to a model router system that will hand off to the right model automatically based on the request; a sketch of the shape follows below.</p>
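<p>Roughly what that looks like inside a sampling request: the field names below follow the linked schema.ts, while the priority values (0-1 weights) and the message content are illustrative:</p><pre><code># Sketch of MCP model preferences in a sampling/createMessage request.
# Field names follow the linked schema.ts; the 0-1 priority weights and
# the message content are illustrative.
sampling_request = {
    "jsonrpc": "2.0",
    "id": 3,
    "method": "sampling/createMessage",
    "params": {
        "messages": [
            {"role": "user", "content": {"type": "text", "text": "Summarize this repo."}},
        ],
        "modelPreferences": {
            "costPriority": 0.2,          # cheap is nice...
            "speedPriority": 0.9,         # ...but latency matters most here
            "intelligencePriority": 0.4,
        },
        "maxTokens": 256,
    },
}
</code></pre>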
</li><li><p><a href="https://pytorch.org/blog/quantization-aware-training/">Quantization-Aware Training for Large Language Models with PyTorch</a></p><p>There is growing evidence (https://arxiv.org/abs/2411.04330) that precision during training influences post-training inference optimization techniques like quantization. It's therefore great to see PyTorch prioritizing making techniques like quantization-aware training more accessible, especially for activities performed far beyond the frontier labs, like fine-tuning.</p></li><li><p><a href="https://glama.ai/blog/2024-11-25-model-context-protocol-quickstart">Model Context Protocol (MCP) Quickstart</a></p><p>A neat write-up on Anthropic's newly launched Model Context Protocol (MCP) to get the TLDR.</p></li></ul><div><hr></div><p>Want more? Follow me on X! <a href="https://x.com/RickLamers">@ricklamers</a></p>]]></content:encoded></item><item><title><![CDATA[Open Source o1 has (almost) arrived: DeepSeek R1-Lite-Preview]]></title><description><![CDATA[Week 47 of Coding with Intelligence]]></description><link>https://codingwithintelligence.com/p/open-source-o1-has-almost-arrived</link><guid isPermaLink="false">https://codingwithintelligence.com/p/open-source-o1-has-almost-arrived</guid><dc:creator><![CDATA[Rick Lamers]]></dc:creator><pubDate>Sun, 24 Nov 2024 21:46:23 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/09fa434c-4f61-44eb-bfeb-2d28028df9b6_1680x1200.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[
srcset="https://substackcdn.com/image/fetch/$s_!2yKI!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6a4dacd-e205-4960-a8e7-59fc43c07012_1680x1200.png 424w, https://substackcdn.com/image/fetch/$s_!2yKI!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6a4dacd-e205-4960-a8e7-59fc43c07012_1680x1200.png 848w, https://substackcdn.com/image/fetch/$s_!2yKI!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6a4dacd-e205-4960-a8e7-59fc43c07012_1680x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!2yKI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6a4dacd-e205-4960-a8e7-59fc43c07012_1680x1200.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>&#128240; <strong>News</strong></p><ul><li><p><a href="https://api-docs.deepseek.com/news/news1120">DeepSeek-R1-Lite-Preview: o1-preview-level performance on AIME &amp; MATH benchmarks</a></p><p>Awesome to see that moats aren't very long lived. DeepSeek promised to open source this model too, hasn't dropped yet but I'm sure the community is anxiously waiting.</p></li><li><p><a href="https://blog.mlc.ai/2024/11/22/achieving-efficient-flexible-portable-structured-generation-with-xgrammar">XGrammar: new Structured Generation library by the MLC project</a></p><p>The repo: https://github.com/mlc-ai/xgrammar</p></li><li><p><a href="https://www.marktechpost.com/2024/11/20/chinese-agi-startup-stepfun-developed-step-2-a-new-trillion-parameter-moe-architecture-model-ranking-5th-on-livebench/">step-2-16k-202411: a mysterious 1T model just appeared</a></p><p>Livebench.ai is a neat benchmarking attempt that constantly updates to avoid being saturated by Training On the Test Set. step-2-16k-202411 ranked just below o1-mini and above last week's entrant gemini-exp-1114. Although 1T seems prohibitively expensive to run. 
Reportedly, Claude Opus 3.5 isn't being deployed for exactly this reason: the economics at that model scale don't weigh up against the performance delta versus smaller models like Claude Sonnet 3.5.</p></li><li><p><a href="https://mistral.ai/news/pixtral-large/">Pixtral Large 124B</a></p><p>It performs well; the blog claims it outperforms Claude Sonnet 3.5 (new). Unfortunately, it's only available under a non-commercial research license. Great work though from the folks at Mistral, and kudos for an open weights release!</p></li><li><p><a href="https://qwenlm.github.io/blog/qwen2.5-turbo/">Qwen2.5-Turbo: 1M long-context Qwen</a></p><p>Impressive performance and evals (NIAH, RULER and LV-Eval). Congrats team Qwen! It's not released as a model/inference setup though, API only.</p></li><li><p><a href="https://huggingface.co/AIDC-AI/Marco-o1">Marco-o1: open source o1?</a></p><p>CoT fine-tuning, MCTS for search. First impression: this doesn't look very good. At least on the small model they applied these steps to, it's only marginally better than Qwen2 7B.</p></li><li><p>I also want to make one remark about Google DeepMind folks dropping another model, <strong>gemini-exp-1121</strong>, which purportedly is even better than <strong>gemini-exp-1114</strong>. <a href="https://simonwillison.net/2024/Nov/22/gemini-exp-1121/">YMMV</a>, but whatever the case, I&#8217;m glad labs are making improved models available to end users more quickly!</p></li></ul><p>&#128230; <strong>Repos</strong></p><ul><li><p><a href="https://github.com/Lightricks/LTX-Video">LTX-Video: more open source generative video models</a></p><p>An impressive model by an impressive team. I'd give this a look if you care about generative video models/inference.</p></li><li><p><a href="https://github.com/PKU-YuanGroup/LLaVA-CoT">LLaVA-CoT: first visual language model capable of spontaneous, systematic reasoning, similar to GPT-o1</a></p><p>Reasoning for multimodal models. Like Insight-V from this week's roundup.</p></li><li><p><a href="https://github.com/microsoft/BitNet">1-bit AI Infra: Part 1.1, Fast and Lossless BitNet b1.58 Inference on CPUs</a></p><p>Microsoft drops the official implementation for very-low-bit LLMs, and it's fairly efficient! Paper accompanying the software release <a href="https://arxiv.org/abs/2410.16144">https://arxiv.org/abs/2410.16144</a> and the original BitNet paper <a href="https://arxiv.org/abs/2402.17764">https://arxiv.org/abs/2402.17764</a></p></li><li><p><a href="https://allenai.org/tulu">T&#252;lu 3: open model + code for SOTA post-training</a></p><p>Specifically, their Reinforcement Learning with Verifiable Rewards (RLVR) treatment is worth a look. Awesome work by the folks at Ai2.</p></li><li><p><a href="https://huggingface.co/collections/apple/aimv2-6720fe1558d94c7805f7688c">AIMv2: vision encoders by Apple</a></p><p>Awesome follow-up work by Apple on autoregressive pre-training of large vision encoders, see <a href="https://github.com/apple/ml-aim">https://github.com/apple/ml-aim</a> also.</p></li><li><p><a href="https://github.com/dongyh20/Insight-V">Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models</a></p><p>Paper + code + checkpoints. With o1 being text-only, it's great to see the next frontier being explored so rapidly: multimodal reasoning chains.
Should help a lot with embodied AI like robotics.</p></li></ul><p>&#128196; <strong>Papers</strong></p><ul><li><p><a href="https://balrogai.com/">BALROG: Benchmarking Agentic LLM/VLM Reasoning On Games</a></p><p>The return of games for benchmarking. Pre-ChatGPT/GPT-3, using games to benchmark AI was all the rage, with Lee Sedol's famous Go match, OpenAI beating pro humans in Dota 2, and of course the famous Atari benchmarks. We'll see how things go when using games to evaluate LLMs. Signal or noise? The current leaderboard seems to rank models decently, although in the top ranks it's not clear whether the ordering is really definitive (Llama 3.1 &lt; GPT-4o-mini? Llama-3.1-70B-it &gt; Llama-3.2-90B-it?).</p></li><li><p><a href="https://arxiv.org/abs/2411.07681">What Do Learning Dynamics Reveal About Generalization in LLM Reasoning?</a></p><p>"begin to copy the exact reasoning steps from the training set" is a useful observation that can lead to higher-quality pre-training, by focusing on including general reasoning patterns in the pre-training corpus, or even steering towards these in post-training to eliminate flawed, non-general reasoning patterns.</p></li><li><p><a href="https://arxiv.org/abs/2402.02622">DenseFormer: Enhancing Information Flow in Transformers via Depth Weighted Averaging</a></p><p>"improves the perplexity of the model without increasing its size ... an additional averaging step after each transformer block ... coherent patterns of information flow, revealing the strong and structured
reuse of activations from distant layers". It looks like carrying information forward is important. It will be interesting to see if this "architecture trick" is adopted by open source model providers. Of course, proprietary labs might already know of this and use it (or something similar) in their models.</p></li></ul><p>&#128241; <strong>Demos</strong></p><ul><li><p><a href="https://playground.allenai.org/">Ai2 Playground: T&#252;lu3 70B</a></p><p>A neat chat playground to try out the T&#252;lu3 model.</p></li><li><p><a href="https://huggingface.co/spaces/AI4Editing/MagicQuill">MagicQuill: An Intelligent Interactive Image Editing System</a></p><p>Very cool project; not open source, but they do have a demo up.</p></li><li><p><a href="https://huggingface.co/spaces/ysharma/Text_Behind_Image">Text Behind Image</a></p><p>A neat narrow-focus image model by folks from &#129303;. Does what it says on the tin!</p></li></ul><p>&#128736;&#65039; <strong>Products</strong></p><ul><li><p><a href="https://x.com/morph_labs/status/1858986665957785711">Morph: Infinibranch</a></p><p>An interesting idea: branching agents through snapshotted VMs. Potentially powerful in the new search-oriented inference-time compute scaling that labs are exploring. Private preview only for now, unfortunately, it seems.</p></li></ul><p>&#128218; <strong>Resources</strong></p><ul><li><p><a href="https://www.ben-evans.com/presentations">AI eats the world by Benedict Evans</a></p><p>A thought piece by Benedict Evans, ex-a16z partner, on where AI is going. A nontechnical, more market-focused lens that I think provides a helpful view on what practitioners, even those operating at the detailed technology level, can expect over the next few years. Although the framing of "It will work like every other platform shift" versus "No-one knows" does, more likely than not, leave figuring out the actual answers as an exercise for the reader.</p></li><li><p><a href="https://blog.dottxt.co/say-what-you-mean.html">Say What You Mean: A Response to 'Let Me Speak Freely'</a></p><p>The Outlines authors publish a rebuttal to https://arxiv.org/abs/2408.02442. TLDR: with precise handling, structured output <em>does</em> improve performance.</p></li><li><p><a href="https://x.com/_xjdr/status/1859292093367320641?s=52">Example prompt from Yann LeCun for reasoning models</a></p><p>DeepSeek-R1-Lite-Preview cracks it!</p></li><li><p><a href="https://arxiv.org/abs/2411.12580">Procedural Knowledge in Pretraining Drives Reasoning in Large Language Models</a></p><p>An interesting paper highlighting how reasoning is influenced by pretraining in LLMs.</p></li></ul><div><hr></div><p>Want more? Follow me on X!
<a href="https://x.com/RickLamers">@ricklamers</a></p>]]></content:encoded></item><item><title><![CDATA[Gemini Exp 1114: overfitted to benchmarks or new king?]]></title><description><![CDATA[Week 46 of Coding with Intelligence]]></description><link>https://codingwithintelligence.com/p/gemini-exp-1114-overfitted-to-benchmarks</link><guid isPermaLink="false">https://codingwithintelligence.com/p/gemini-exp-1114-overfitted-to-benchmarks</guid><dc:creator><![CDATA[Rick Lamers]]></dc:creator><pubDate>Sun, 17 Nov 2024 22:59:09 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/cb883b31-0395-4ff6-8540-5a3b46ef72bd_1680x1200.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ApKY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79ad638d-2aea-49e7-99f7-5e4cf7a40845_1680x1200.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ApKY!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79ad638d-2aea-49e7-99f7-5e4cf7a40845_1680x1200.png 424w, https://substackcdn.com/image/fetch/$s_!ApKY!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79ad638d-2aea-49e7-99f7-5e4cf7a40845_1680x1200.png 848w, https://substackcdn.com/image/fetch/$s_!ApKY!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79ad638d-2aea-49e7-99f7-5e4cf7a40845_1680x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!ApKY!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79ad638d-2aea-49e7-99f7-5e4cf7a40845_1680x1200.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ApKY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79ad638d-2aea-49e7-99f7-5e4cf7a40845_1680x1200.png" width="613" height="437.85714285714283" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/79ad638d-2aea-49e7-99f7-5e4cf7a40845_1680x1200.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1040,&quot;width&quot;:1456,&quot;resizeWidth&quot;:613,&quot;bytes&quot;:172773,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ApKY!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79ad638d-2aea-49e7-99f7-5e4cf7a40845_1680x1200.png 424w, https://substackcdn.com/image/fetch/$s_!ApKY!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79ad638d-2aea-49e7-99f7-5e4cf7a40845_1680x1200.png 848w, 
https://substackcdn.com/image/fetch/$s_!ApKY!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79ad638d-2aea-49e7-99f7-5e4cf7a40845_1680x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!ApKY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79ad638d-2aea-49e7-99f7-5e4cf7a40845_1680x1200.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Exp 1114 = Gemini 2?</figcaption></figure></div><p>&#128240; <strong>News</strong></p><ul><li><p><a href="https://nexusflow.ai/blogs/athene-v2">Athene-V2: Advancing Beyond the Limits of Scaling with Targeted Post-training</a></p><p>Fine-tuning for function calling/Agent use cases seems to be getting more attention. With more powerful models to tune from (Qwen 2.5 72B in this case) the feasibility increases, even when operating on a budget.</p></li><li><p><a href="https://cloud.google.com/tpu/docs/v6e-intro">Google's new TPU: Trillium (v6e)</a></p><p>Fun to explore on GCP with JAX.</p></li><li><p><a href="https://supermaven.com/blog/cursor-announcement">Supermaven joins Cursor</a></p><p>Normally I don't feature M&amp;A activity in AI but this one in particular is interesting given the momentum of Cursor for AI-assisted coding. This will strengthen their lead and lead to even more momentum. If you haven't tried Cursor, now is _really_ a good time to start getting familiar.</p></li><li><p><a href="https://lifelong-ml.cc/">4th edition of Conference on Lifelong Learning Agents (CoLLAs)</a></p><p>Scheduled for Aug 11, 2025. Defined as "systems that can continually learn throughout their lifetime".</p></li><li><p><a href="https://aistudio.google.com/prompts/new_chat">Gemini 1114 breaks (almost) all records</a></p><p>It seems to outperform Sonnet 3.5 (Oct) in certain cases, even outperforming o1-preview in some cases when prompted to using CoT. Rumored to be "Gemini 2". 
It didn't pass all of my vibe questions so I'm not yet convinced this model is a clear #1.</p></li><li><p><a href="https://x.com/lmarena_ai/status/1857110672565494098">LMSYS Arena update: Gemini Exp 1114 takes #1 spot overall</a></p><p>More on the Exp 1114 release. It scores well in Arena but we know that isn&#8217;t the full story. Leave a comment on how well it works for you!</p></li></ul><p>&#128230; <strong>Repos</strong></p><ul><li><p><a href="https://huggingface.co/collections/reach-vb/ultravox-audio-language-model-release-67373b602af0a52b2a88ae71">fixie-ai/ultravox-v0_4_1-llama-3_1-70b</a></p><p>A Llama 3.1 70B backbone paired with a Whisper encoder to fuse speech-to-text into the language model itself, reducing the need for orchestration.</p></li></ul><p>&#128196; <strong>Papers</strong></p><ul><li><p><a href="https://ekinakyurek.github.io/papers/ttt.pdf">The Surprising Effectiveness of Test-Time Training for Abstract Reasoning</a></p><p>Test-Time Training for the ARC challenge. Interesting ideas by a team from MIT.</p></li><li><p><a href="https://arxiv.org/abs/2411.04965">BitNet a4.8: 4-bit Activations for 1-bit LLMs</a></p><p>Adaptive quantization and efficient low-precision implementations can move the needle on efficiency. Paired with the Scaling Laws for Precision paper, this presents an interesting push on the frontier of efficient AI systems.</p></li><li><p><a href="https://needle-threading.github.io/">Needle Threading: Can LLMs Follow Threads Through Near-Million-Scale Haystacks?</a></p><p>TLDR "Strikingly, we find that many models are remarkably thread-safe: capable of simultaneously following multiple threads without significant loss in performance. Still, for many models, we find the effective context limit is significantly shorter than the supported context length, with accuracy decreasing as the context window grows."</p></li><li><p><a href="https://arxiv.org/abs/2411.04330">Scaling Laws for Precision</a></p><p>"For inference, we find that the degradation introduced by post-training quantization increases as models are trained on more data, eventually making additional pretraining data actively harmful. For training, our scaling laws allow us to predict the loss of a model with different parts in different precisions, and suggest that training larger models in lower precision may be compute optimal."</p></li></ul><p>&#128241; <strong>Demos</strong></p><ul><li><p><a href="https://byteaigc.github.io/X-Portrait2/">X-Portrait 2: Highly Expressive Portrait Animation</a></p><p>For those keeping track of SOTA performance transfer in video/image-to-video, here's the latest from ByteDance; it does a lot better than Act-One, RunwayML's very recent flagship release, in the examples shown.</p></li><li><p><a href="https://huggingface.co/spaces/briaai/BRIA-RMBG-2.0">RMBG-2.0 for background removal by BRIA AI</a></p><p>Neat open source model for a practical task like BG removal.</p></li></ul><p>&#128736;&#65039; <strong>Products</strong></p><ul><li><p><a href="https://bolt.new/">Bolt.new: create any Web App using agent prompting</a></p><p>Claude Artifacts on steroids! Instant deploy too. Neat product. No affiliation.</p></li></ul><p>&#128218; <strong>Resources</strong></p><ul><li><p><a href="https://www.youtube.com/watch?v=a42key59cZQ">Dwarkesh Patel interviews NLP researcher Gwern</a></p></li><li><p><a href="https://www.letta.com/blog/letta-v0-5-2-release">Letta introduces tool use with constraints</a></p><p>"TerminalToolRule(tool_name=...) 
- If the tool is called, the agent ends execution.<br>InitToolRule(tool_name=...) - The tool must be called first when an agent is run.<br>ToolRule(tool_name=..., children=[...]) - If the tool is called, it must be followed by one of the tools specified in children"<br><br>Interesting ideas for steerability! See the sketch after this list.</p></li><li><p><a href="https://llmstxt.org/">llms.txt spec by Answer.AI: plaintext docs for AI</a></p><p>Anthropic already implemented it: <a href="https://docs.anthropic.com/llms-full.txt">https://docs.anthropic.com/llms-full.txt</a> / <a href="https://docs.anthropic.com/llms.txt">https://docs.anthropic.com/llms.txt</a>. I wonder how folks will deal with prompt injection issues with this. I guess it comes down to trusting authorities/domain names.</p></li><li><p><a href="https://www.reddit.com/r/LocalLLaMA/comments/1gsyp7q/humaneval_benchmark_of_exl2_quants_of_popular/">Effect of quantization on various LLMs including Qwen2.5-Coder-32B-Instruct</a></p><p>Spoiler: Qwen2.5-Coder-32B-Instruct degrades surprisingly little when quantized to lower-bit representations. Interesting in light of the 'Scaling Laws for Precision' paper in this week's Papers section.</p></li><li><p><a href="https://epoch.ai/blog/can-ai-scaling-continue-through-2030">Can AI Scaling Continue Through 2030?</a></p><p>An investigative report speculating on the bottlenecks to continued scaling of the key components of modern AI. "We identify electric power, chip manufacturing, data and latency as constraints."</p></li><li><p><a href="https://stripe.dev/blog/adding-payments-to-your-agentic-workflows">Stripe creates APIs specifically for agent financial actions</a></p><p>Virtual Credit Cards for your agents? Awesome AI-forward features shipped by Stripe.</p></li><li><p><a href="https://www.youtube.com/watch?v=6PEJ96k1kiw">Speculations on Test-Time Scaling (o1) by Sasha Rush</a></p><p>Sasha is a professor at Cornell Tech and works at Hugging Face.</p></li></ul>
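<p>To make the quoted rules concrete, here is a minimal sketch of wiring them into an agent. Only the three rule class names come from Letta's post; the import paths, create_agent signature, and tool names are assumptions for illustration:</p><pre><code># Hedged sketch of Letta's tool rules; consult their docs for the real API.
# Tool names (fetch_ticket, summarize_ticket, send_reply) are hypothetical.
from letta import create_client
from letta.schemas.tool_rule import InitToolRule, TerminalToolRule, ToolRule

client = create_client()
agent = client.create_agent(
    name="constrained-agent",
    tool_rules=[
        InitToolRule(tool_name="fetch_ticket"),      # must be called first
        ToolRule(tool_name="fetch_ticket",           # may only be followed by
                 children=["summarize_ticket"]),     # the tools in children
        TerminalToolRule(tool_name="send_reply"),    # calling this ends the run
    ],
)
</code></pre><div><hr></div><p>Want more? Follow me on X! 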
<a href="https://x.com/RickLamers">@ricklamers</a></p>]]></content:encoded></item><item><title><![CDATA[Chinese labs drop some incredible models]]></title><description><![CDATA[Week 45 of Coding with Intelligence]]></description><link>https://codingwithintelligence.com/p/chinese-labs-drop-some-incredible</link><guid isPermaLink="false">https://codingwithintelligence.com/p/chinese-labs-drop-some-incredible</guid><dc:creator><![CDATA[Rick Lamers]]></dc:creator><pubDate>Sun, 10 Nov 2024 22:25:30 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/986df829-2c0c-4aef-ad5d-f7c9fea6f86d_1680x1200.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!yzc5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ef0c01b-a64d-4fd3-97c7-7498472ef2ab_1680x1200.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!yzc5!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ef0c01b-a64d-4fd3-97c7-7498472ef2ab_1680x1200.png 424w, https://substackcdn.com/image/fetch/$s_!yzc5!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ef0c01b-a64d-4fd3-97c7-7498472ef2ab_1680x1200.png 848w, https://substackcdn.com/image/fetch/$s_!yzc5!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ef0c01b-a64d-4fd3-97c7-7498472ef2ab_1680x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!yzc5!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ef0c01b-a64d-4fd3-97c7-7498472ef2ab_1680x1200.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!yzc5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ef0c01b-a64d-4fd3-97c7-7498472ef2ab_1680x1200.png" width="1456" height="1040" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3ef0c01b-a64d-4fd3-97c7-7498472ef2ab_1680x1200.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1040,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:224125,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!yzc5!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ef0c01b-a64d-4fd3-97c7-7498472ef2ab_1680x1200.png 424w, https://substackcdn.com/image/fetch/$s_!yzc5!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ef0c01b-a64d-4fd3-97c7-7498472ef2ab_1680x1200.png 848w, 
https://substackcdn.com/image/fetch/$s_!yzc5!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ef0c01b-a64d-4fd3-97c7-7498472ef2ab_1680x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!yzc5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ef0c01b-a64d-4fd3-97c7-7498472ef2ab_1680x1200.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">The 389B A52B Instruct model performance</figcaption></figure></div><p>&#128240; <strong>News</strong></p><ul><li><p><a href="https://opencoder-llm.github.io/">OpenCoder 8B: The Open Cookbook for Top-Tier Code Large Language Models</a></p><p>Reproducible LLMs from a Chinese lab, very awesome contribution and actually impressive performance, outperforming the very strong Qwen 2.5 Coder 7B.</p></li><li><p><a href="https://huggingface.co/tencent/Tencent-Hunyuan-Large">Hunyuan-Large: A52B MoE by Tencent</a></p><p>This supposedly beats Llama 3.1 405B on a number of benchmarks, very impressive. License is non EU and sub 100M users.</p></li><li><p><a href="https://huggingface.co/tencent/Hunyuan3D-1">Hunyuan3D-1 an open source Text-to-3D and Image-to-3D model by Tencent</a></p><p>Impressive object generation! Game devs/AR/VR enthusiasts will love this. Homepage <a href="https://3d.hunyuan.tencent.com/">https://3d.hunyuan.tencent.com/</a></p></li><li><p><a href="https://mistral.ai/news/mistral-moderation/">Mistral launches Moderation API</a></p><p>Fairly high accuracy across all categories. $0.1 per 1M input tokens.</p></li><li><p><a href="https://epochai.org/frontiermath/the-benchmark">FrontierMath: A math benchmark testing the limits of AI</a></p><p>Hard evals that LLMs can't crack have a tendency to accelerate the field. 
This new math eval by Epoch AI is a banger release that puzzles even o1-preview, Claude 3.5 Sonnet (10-22), and Gemini 1.5 Pro (002), which all score 1-2%.</p></li><li><p><a href="https://blackforestlabs.ai/flux-1-1-ultra/">FLUX1.1 [pro] Ultra and Raw Modes</a></p><p>4MP in under 10 seconds is very impressive for an image model of this quality. Go team Black Forest Labs! Raw Mode is a nice alternative to the Midjourney-vibe we've been getting from most text-to-image models.</p></li><li><p><a href="https://platform.openai.com/docs/guides/predicted-outputs">OpenAI launches predicted outputs feature</a></p><p>"Predicted Outputs enable you to speed up API responses from Chat Completions when many of the output tokens are known ahead of time. This is most common when you are regenerating a text or code file with minor modifications." Edit-style prompts can be 2x faster, as a rule of thumb. See the sketch after this list.</p></li><li><p><a href="https://www.anthropic.com/claude/haiku">Claude Haiku 3.5 released</a></p><p>It costs about a third of 3.5 Sonnet but is meaningfully worse. It's a bit faster than 3.5 Sonnet, but only by about 16%. I don't think this is Anthropic's best launch.</p></li><li><p><a href="https://googleprojectzero.blogspot.com/2024/10/from-naptime-to-big-sleep.html">From Naptime to Big Sleep: Using Large Language Models To Catch Vulnerabilities In Real-World Code (found SQLite buffer underflow vulnerability)</a></p><p>Google's security effort under Project Zero has successfully used LLMs to find code vulnerabilities. Awesome and highly technical writeup.</p></li><li><p><a href="https://si.inc/hertz-dev/">hertz-dev: open-source base model for conversational audio generation</a></p><p>Permissively licensed, impressive samples, great write-up.</p></li></ul>
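<p>For the Predicted Outputs launch above, here is a minimal sketch of how it is invoked. The model name and the file being edited are illustrative; the prediction parameter shape follows OpenAI's guide:</p><pre><code># Hedged sketch of OpenAI's Predicted Outputs: pass the text you expect the
# model to largely reproduce; matching output spans decode much faster.
from openai import OpenAI

client = OpenAI()
existing_code = open("app.py").read()  # hypothetical file being lightly edited

response = client.chat.completions.create(
    model="gpt-4o",  # illustrative; check which models support predictions
    messages=[{
        "role": "user",
        "content": "Rename the function run to main in this file:\n" + existing_code,
    }],
    # Most output tokens are known ahead of time: the unchanged file contents.
    prediction={"type": "content", "content": existing_code},
)
print(response.choices[0].message.content)
</code></pre>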
<p>&#128230; <strong>Repos</strong></p><ul><li><p><a href="https://huggingface.co/amd/AMD-OLMo">AMD-OLMo 1B</a></p><p>AllenAI's OLMo model training code ported to the AMD stack. Neat contribution to accelerating the AMD &lt;&gt; NVIDIA race.</p></li><li><p><a href="https://huggingface.co/migtissera/Tess-R1-Limerick-Llama-3.1-70B">Tess-R1 Limerick (Llama-3.1-70B)</a></p><p>Another Llama 3.1 70B finetune with built-in test-time compute that claims improved benchmark performance. Explore it and leave a comment on how well you think it does. Here's a HF Space <a href="https://huggingface.co/spaces/poscye/chat-with-tess">https://huggingface.co/spaces/poscye/chat-with-tess</a></p></li><li><p><a href="https://www.microsoft.com/en-us/research/articles/magentic-one-a-generalist-multi-agent-system-for-solving-complex-tasks/">Magentic-One: A Generalist Multi-Agent System for Solving Complex Tasks</a></p><p>Interesting agent stack by Microsoft - equipping LLMs with browsing, terminal, and filesystem capabilities for open-ended goal accomplishment.</p></li><li><p><a href="https://github.com/stanford-aimi/ravl">RaVL: Discovering and Mitigating Spurious Correlations in Fine-Tuned Vision-Language Models</a></p><p>This work from a team at Stanford finds a way to identify and mitigate spurious correlations in Vision-Language Models.</p></li></ul><p>&#128196; <strong>Papers</strong></p><ul><li><p><a href="https://arxiv.org/abs/2411.00715v1">B-cosification: Transforming Deep Neural Networks to be Inherently Interpretable</a></p><p>Must read if you're interested in interpretability work for deep neural networks. This paper focuses mainly on the image domain. By folks from Max Planck, Kyutai &amp; more.</p></li><li><p><a href="https://research.nvidia.com/labs/dir/cosmos-tokenizer/">Cosmos Tokenizer: A suite of image and video neural tokenizers</a></p><p>Important work in tokenization of multimodal content; tokenizers can form a glass ceiling for performance because of information lost in the process of tokenizing content. Amazing work by NVIDIA, published with a very high level of detail!</p></li><li><p><a href="https://agoraprotocol.org/">A Scalable Communication Protocol for Networks of Large Language Models</a></p><p>This approach to letting agents communicate for agent-swarm-like applications looks promising. If you've been put off by the verbiage usually used in "distributed agent" work, then this will be a fresh, well-thought-out piece of research.</p></li><li><p><a href="https://arxiv.org/abs/2411.02853">ADOPT: Modified Adam Can Converge with Any &#946;2 with the Optimal Rate</a></p><p>Various folks on X have been touting this as strictly better than Adam, meaning they always got better performance from it than from running Adam, and that would be a large feat. Better optimizers matter because efficiency gains translate into thousands, tens of thousands, or even more dollars and energy saved at large-scale pre- and finetuning.</p></li></ul><p>&#128241; <strong>Demos</strong></p><ul><li><p><a href="https://generative-value-learning.github.io/">Vision Language Models are In-Context Value Learners</a></p><p>Try the interactive illustration; it impressively demonstrates how complementary robotics and vision language models are. Expect acceleration of robotics, especially in control tasks like bimanual manipulation. By DeepMind, UPenn, and Stanford.</p></li></ul><p>&#128218; <strong>Resources</strong></p><ul><li><p><a href="https://github.com/simonw/llm/issues/607">Simon Willison exploring tool use feature for his `llm` CLI project</a></p><p>Cool GitHub issue!</p></li></ul><div><hr></div><p>Want more? Follow me on X! 
<a href="https://x.com/RickLamers">@ricklamers</a></p>]]></content:encoded></item><item><title><![CDATA[Full World simulation just had its ImageNet moment]]></title><description><![CDATA[Week 44 of Coding with Intelligence]]></description><link>https://codingwithintelligence.com/p/full-world-simulation-just-had-its</link><guid isPermaLink="false">https://codingwithintelligence.com/p/full-world-simulation-just-had-its</guid><dc:creator><![CDATA[Rick Lamers]]></dc:creator><pubDate>Sun, 03 Nov 2024 19:57:41 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/96051e79-c7a8-40e3-a090-be098d36cbb9_1680x1200.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!FVa4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d881521-1c52-4a90-bab2-46e313ceb40a_1680x1200.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!FVa4!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d881521-1c52-4a90-bab2-46e313ceb40a_1680x1200.png 424w, https://substackcdn.com/image/fetch/$s_!FVa4!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d881521-1c52-4a90-bab2-46e313ceb40a_1680x1200.png 848w, https://substackcdn.com/image/fetch/$s_!FVa4!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d881521-1c52-4a90-bab2-46e313ceb40a_1680x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!FVa4!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d881521-1c52-4a90-bab2-46e313ceb40a_1680x1200.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!FVa4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d881521-1c52-4a90-bab2-46e313ceb40a_1680x1200.png" width="1456" height="1040" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6d881521-1c52-4a90-bab2-46e313ceb40a_1680x1200.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1040,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:701836,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!FVa4!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d881521-1c52-4a90-bab2-46e313ceb40a_1680x1200.png 424w, https://substackcdn.com/image/fetch/$s_!FVa4!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d881521-1c52-4a90-bab2-46e313ceb40a_1680x1200.png 848w, 
https://substackcdn.com/image/fetch/$s_!FVa4!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d881521-1c52-4a90-bab2-46e313ceb40a_1680x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!FVa4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d881521-1c52-4a90-bab2-46e313ceb40a_1680x1200.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>I don&#8217;t believe folks are grasping the implications of this achievement yet. But the ability to simulate full world environments at increasingly higher levels of fidelity will usher in an era of robotics and real-world reasoning the consequences of which are hard to fully comprehend. Incredible work from the folks at <a href="https://www.decart.ai/">Decart</a>.</p><p>&#128240; <strong>News</strong></p><ul><li><p><a href="https://openai.com/index/introducing-simpleqa/">OpenAI introduces SimpleQA benchmark</a></p><p>In an attempt to curate unsaturated benchmarks. Note, OpenAI has still done significantly more for open source than Anthropic. Something to ponder about :) OpenAI's o1-preview gets ~42% and interestingly also refuses to answer (instead of just hallucinating).</p></li><li><p><a href="https://www.recraft.ai/blog/recraft-introduces-a-revolutionary-ai-model-that-thinks-in-design-language">Recraft v3: most powerful (closed) image generation model</a></p><p>They also launch with an API. It's number one on Artificial Analysis arena leaderboard. 
<a href="https://artificialanalysis.ai/text-to-image/arena?tab=Leaderboard">https://artificialanalysis.ai/text-to-image/arena?tab=Leaderboard</a></p></li></ul><p>&#128230; <strong>Repos</strong></p><ul><li><p><a href="https://github.com/michaelfeil/infinity/">Embedding inference engine</a></p><p>Very cool project by Michael Feil from Gradient AI.</p></li><li><p><a href="https://github.com/trotsky1997/MathBlackBox">Progress on o1 repro: MCTSr: Mathematic as a Blackbox for LLM</a></p><p>Early stage of a project attempting to reproduce o1, good source of raw ideas if you're working on this yourself.</p></li><li><p><a href="https://huggingface.co/HuggingFaceTB/SmolLM2-1.7B-Instruct">SmolLM2: powerful 1.7B SLM (small language model)</a></p><p>Great model by Loubna Ben Allal from Hugging Face. Beats Qwen2.5-1.5B in multiple categories.</p></li><li><p><a href="https://github.com/browserbase/stagehand">Stagehand: an AI web browsing framework by Browserbase</a></p></li></ul><p>&#128196; <strong>Papers</strong></p><ul><li><p><a href="https://arxiv.org/abs/2410.20771">MrT5: Dynamic Token Merging for Efficient Byte-level Language Models</a></p><p>I've been very interested in a tokenization-free approach to LLMs and this paper from Stanford nails it. Check this out if you think tokenizers are bottlenecking LLMs too!</p></li><li><p><a href="https://arxiv.org/abs/2410.16531">Bayesian scaling laws for in-context learning</a></p><p>Interesting approach to modeling scaling laws for In-context Learning ability of LLMs.</p></li><li><p><a href="https://arxiv.org/abs/2410.22179">Very Attentive Tacotron: Robust and Unbounded Length Generalization in Autoregressive Transformer-Based Text-to-Speech</a></p><p>"When applied to text-to-speech (TTS), these models (AR Transformers) tend to drop or repeat words or produce erratic output, especially for longer utterances. In this paper, we introduce enhancements aimed at AR Transformer-based encoder-decoder TTS systems that address these robustness and length generalization issues."</p></li><li><p><a href="https://ai.meta.com/research/publications/transfusion-predict-the-next-token-and-diffuse-images-with-one-multi-modal-model/">Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model</a></p><p>By Meta AI.</p></li><li><p><a href="https://arxiv.org/abs/2410.22456">Image2Struct: Benchmarking Structure Extraction for Vision-Language Models</a></p><p>Useful new benchmarks for VLMs. VLMs are often used for structured extraction in practice, so this benchmarks is not very academic but well aligned with applied quality needs. By Percy Liang's group at Stanford.</p></li><li><p><a href="https://arxiv.org/abs/2410.17434">LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding</a></p><p>"reduces the number of video tokens while preserving visual details of long videos" neat! By folks from Meta AI, a video content powerhouse.</p></li><li><p><a href="https://openreview.net/forum?id=BGnm7Lo8oW">Anon ICLR submission: Towards Learning to Reason at Pre-Training Scale</a></p><p>Interesting idea! 
"given the first tokens from a large pre-training corpus, the model generates a CoT and receives a reward based on how well the CoT helps predict the following tokens"</p></li><li><p><a href="https://arxiv.org/abs/2410.13863v1">Fluid: Scaling Autoregressive Text-to-image Generative Models with Continuous Tokens</a></p><p>By Google DeepMind.</p></li><li><p><a href="https://arxiv.org/abs/2410.23168">TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters</a></p><p>Researchers might not have tons of compute, but luckily they are smart. This paper solves the problem "When architectural modifications (e.g., channel dimensions) are introduced, the entire model typically requires retraining from scratch". This computational efficiency gain can spur faster iteration of architectural ideas, Neural-Architecture-Search let's go!</p></li><li><p><a href="https://maskgct.github.io/">MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer</a></p><p>Very strong open source non-autoregressive TTS model. Demo space on Hugging Face <a href="https://huggingface.co/spaces/amphion/maskgct">https://huggingface.co/spaces/amphion/maskgct</a></p></li></ul><p>&#128241; <strong>Demos</strong></p><ul><li><p><a href="https://v2.scrimba.com/s08johf0et">OmniParser running in the browser with Transformer.js</a></p><p>Very impressive and useful demo showing how to run OmniParser in the browser directly. As others remarked on X, this has potential to be a core building block for browser extensions.</p></li><li><p><a href="https://www.decart.ai/">Decart launches Oasis: playable simulated Minecraft</a></p><p>This is a phenomenal achievement. It sets the stage for full world simulation. Remember what the first generated images/videos looked like. What makes this launch even more remarkable is that both weights and an interactive web demo with a limited queue are available. Just WOW.</p></li><li><p><a href="https://aligneval.com/">AlignEval: a game/tool to help you build and optimize LLM-evaluator</a></p><p>Very cool project! Source is on GitHub. And this <a href="https://x.com/eugeneyan/status/1851654159692616020">X thread</a> </p></li></ul><p>&#128218; <strong>Resources</strong></p><ul><li><p><a href="https://waymo.com/blog/2024/10/introducing-emma/">Waymo introduces Emma</a></p><p>It's built on top of Gemini's multimodal capabilities.</p></li><li><p><a href="https://hamel.dev/blog/posts/llm-judge/">LLM as a judge for business value by Hamel Husain</a></p><p>You won't find better applied AI findings than this.</p></li><li><p><a href="https://deepmind.google/discover/blog/pushing-the-frontiers-of-audio-generation/">NotebookLM's TTS system explained by Google DeepMind</a></p></li><li><p><a href="https://platform.openai.com/docs/guides/audio">OpenAI Audio generation endpoint</a></p><p>Note this is separate from real-time audio. It allows these combinations:
<p>&#128241; <strong>Demos</strong></p><ul><li><p><a href="https://v2.scrimba.com/s08johf0et">OmniParser running in the browser with Transformers.js</a></p><p>Very impressive and useful demo showing how to run OmniParser in the browser directly. As others remarked on X, this has potential to be a core building block for browser extensions.</p></li><li><p><a href="https://www.decart.ai/">Decart launches Oasis: playable simulated Minecraft</a></p><p>This is a phenomenal achievement. It sets the stage for full world simulation. Remember what the first generated images/videos looked like. What makes this launch even more remarkable is that both weights and an interactive web demo with a limited queue are available. Just WOW.</p></li><li><p><a href="https://aligneval.com/">AlignEval: a game/tool to help you build and optimize LLM evaluators</a></p><p>Very cool project! Source is on GitHub. Also see this <a href="https://x.com/eugeneyan/status/1851654159692616020">X thread</a>.</p></li></ul><p>&#128218; <strong>Resources</strong></p><ul><li><p><a href="https://waymo.com/blog/2024/10/introducing-emma/">Waymo introduces Emma</a></p><p>It's built on top of Gemini's multimodal capabilities.</p></li><li><p><a href="https://hamel.dev/blog/posts/llm-judge/">LLM as a judge for business value by Hamel Husain</a></p><p>You won't find better applied AI findings than this.</p></li><li><p><a href="https://deepmind.google/discover/blog/pushing-the-frontiers-of-audio-generation/">NotebookLM's TTS system explained by Google DeepMind</a></p></li><li><p><a href="https://platform.openai.com/docs/guides/audio">OpenAI Audio generation endpoint</a></p><p>Note this is separate from real-time audio. It allows these combinations:<br>text in &#8594; text + audio out<br>
audio in &#8594; text + audio out<br>
audio in &#8594; text out<br>
text + audio in &#8594; text + audio out<br>
text + audio in &#8594; text out</p></li><li><p><a href="https://www.youtube.com/watch?v=YR9EztOF0R8">[Video] Learning to Reason, Insights from Language Modeling</a></p><p>By Noah D. Goodman a researcher from Stanford.</p></li></ul><div><hr></div><p>Want more? Follow me on X! <a href="https://x.com/RickLamers">@ricklamers</a></p>]]></content:encoded></item><item><title><![CDATA[Claude Computer Use: RPA on steroids]]></title><description><![CDATA[Week 43 of Coding with Intelligence]]></description><link>https://codingwithintelligence.com/p/claude-computer-use-rpa-on-steroids</link><guid isPermaLink="false">https://codingwithintelligence.com/p/claude-computer-use-rpa-on-steroids</guid><dc:creator><![CDATA[Rick Lamers]]></dc:creator><pubDate>Sun, 27 Oct 2024 15:16:17 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/145f589b-6ce3-4f0e-8c4b-89bf00b71eaf_1680x1200.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>What a BUSY week! Both for me personally (Sunday newsletter day, yay!) and in AI at large. I think everyone saw the Claude Computer Use release and tried answering the question: how ready is this? See this week&#8217;s resources for a hint of the risks currently involved and play with it on your device through Agent.exe - proceed with caution!</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!NIMK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38b28ccf-1a0c-4208-be5f-e1e2f6acb80d_1680x1200.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!NIMK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38b28ccf-1a0c-4208-be5f-e1e2f6acb80d_1680x1200.png 424w, https://substackcdn.com/image/fetch/$s_!NIMK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38b28ccf-1a0c-4208-be5f-e1e2f6acb80d_1680x1200.png 848w, https://substackcdn.com/image/fetch/$s_!NIMK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38b28ccf-1a0c-4208-be5f-e1e2f6acb80d_1680x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!NIMK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38b28ccf-1a0c-4208-be5f-e1e2f6acb80d_1680x1200.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!NIMK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38b28ccf-1a0c-4208-be5f-e1e2f6acb80d_1680x1200.png" width="1456" height="1040" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/38b28ccf-1a0c-4208-be5f-e1e2f6acb80d_1680x1200.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1040,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:367474,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!NIMK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38b28ccf-1a0c-4208-be5f-e1e2f6acb80d_1680x1200.png 424w, https://substackcdn.com/image/fetch/$s_!NIMK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38b28ccf-1a0c-4208-be5f-e1e2f6acb80d_1680x1200.png 848w, https://substackcdn.com/image/fetch/$s_!NIMK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38b28ccf-1a0c-4208-be5f-e1e2f6acb80d_1680x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!NIMK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38b28ccf-1a0c-4208-be5f-e1e2f6acb80d_1680x1200.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>&#128240; <strong>News</strong></p><ul><li><p><a href="https://aider.chat/docs/leaderboards/">Claude Sonnet 3.5 20241022 releases tops Aider leaderboard</a></p><p>For more about upgraded Claude 3.5 Sonnet and  Claude 3.5 Haiku see Anthropic's blog post <a href="https://www.anthropic.com/news/3-5-models-and-computer-use">https://www.anthropic.com/news/3-5-models-and-computer-use</a></p></li><li><p><a href="https://huggingface.co/genmo/mochi-1-preview">Mochi 1 Preview: open source text-to-video model</a></p><p>Samples are impressive. 
It's great to see a push for open source text-to-video. More people can experiment, learn about mainstream approaches to text-to-video, and of course generate tons of cool clips at compute cost.</p></li><li><p><a href="https://runwayml.com/research/introducing-act-one">Runway releases Act-One: expressive character performance</a></p><p>Works by transferring character performance from a source video to a target generated character. Interestingly, it still depends on a convincing human performance.</p></li><li><p><a href="https://about.ideogram.ai/canvas">Ideogram releases Dingboard-like feature called "Canvas"</a></p><p>Companies in AI move fast and aren't afraid to ~steal~borrow good ideas from each other.</p></li><li><p><a href="https://docs.x.ai/docs">xAI releases API</a></p></li><li><p><a href="https://www.ibm.com/new/ibm-granite-3-0-open-state-of-the-art-enterprise-models">IBM introduces Granite 3.0 models</a></p><p>The 8B model is close to Llama 3.1 8B. Released under Apache 2.0.</p></li><li><p><a href="https://huggingface.co/microsoft/OmniParser">Microsoft releases OmniParser: a UI vision extraction model</a></p><p>"OmniParser is a general screen parsing tool, which interprets/converts UI screenshot to structured format, to improve existing LLM based UI agent."</p></li></ul><p>&#128230; <strong>Repos</strong></p><ul><li><p><a href="https://github.com/e2b-dev/desktop">E2B releases Desktop Sandbox feature beta</a></p><p>Jumping on the Claude Computer Use bandwagon. Awesome to see them move fast.</p></li><li><p><a href="https://github.com/mit-han-lab/duo-attention">DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads</a></p><p>Implementation of the identically titled paper. Awesome efficiency work by the MIT HAN Lab.</p></li><li><p><a href="https://github.com/anthropics/anthropic-quickstarts/tree/main/computer-use-demo">Anthropic Computer Use reference implementation</a></p><p>A containerized Linux desktop for isolated LLM-powered computer use. Fun experiment, though evals are missing. Related docs: https://docs.anthropic.com/en/docs/build-with-claude/computer-use</p></li><li><p><a href="https://github.com/DAMO-NLP-SG/Inf-CLIP">Inf-CLIP: Breaking the Memory Barrier: Near Infinite Batch Size Scaling for Contrastive Loss</a></p><p>What is very exciting about this result is that most step-change performance unlocks come from discovering computational efficiencies that allow for further scaling. Scale Is All You Need!</p></li><li><p><a href="https://github.com/microsoft/BitNet">bitnet.cpp: the official inference framework for 1-bit LLMs</a></p><p>E.g., BitNet b1.58, by Microsoft. Acceleration is still suboptimal because of the lack of direct hardware support.</p></li><li><p><a href="https://github.com/usefulsensors/moonshine">Moonshine: new open source ASR</a></p><p>Licensed under MIT; claims to outperform Whisper: "better than similarly-sized Whisper models from OpenAI".</p></li><li><p><a href="https://github.com/dottxt-ai/outlines-core">outlines-core</a></p><p>Structured generation in Rust.</p></li></ul><p>&#128196; <strong>Papers</strong></p><ul><li><p><a href="https://llava-vl.github.io/blog/2024-09-30-llava-video/">LLaVA-Video-(7B|72B) by LLaVA team</a></p><p>Impressive adaptation of open source for video understanding. 
Includes Hugging Face checkpoints, the dataset, and training code.</p></li><li><p><a href="https://www.salesforceairesearch.com/opensource/xGen-MM-Vid/index.html">xGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video Even in VLMs</a></p><p>Interesting work by Salesforce on video encoding.</p></li><li><p><a href="https://arxiv.org/abs/2404.16710">LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding</a></p><p>Interesting work to achieve higher inference performance; it's not clear which tasks suffer dramatically from the layer skipping described in the paper.</p></li><li><p><a href="https://arxiv.org/abs/2410.16713">Collapse or Thrive? Perils and Promises of Synthetic Data in a Self-Generating World</a></p><p>The results suggest that collapse due to the presence of synthetic data depends on a lack of real data: as long as real data is present, synthetic data in the mix doesn't seem to result in a performance collapse.</p></li><li><p><a href="https://arxiv.org/abs/2305.05084">Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition</a></p></li><li><p><a href="https://arxiv.org/abs/2410.00907">Addition is All You Need for Energy-efficient Language Models</a></p><p>Interesting ideas for making LLMs more efficient, but it doesn't definitively show whether performance holds up at the highest end of the performance spectrum (e.g. Llama 405B on complex reasoning).</p></li><li><p><a href="https://arxiv.org/abs/2410.02703">Selective Attention Improves Transformer</a></p><p>Interesting idea that intuitively makes sense: not everything in the context window matters for predicting the next token. The authors present an approach to selectively applying attention to the context window.</p></li><li><p><a href="https://arxiv.org/abs/2410.18252">Asynchronous RLHF: Faster and More Efficient Off-Policy RL for Language Models</a></p><p>"Inspired by classical deep RL literature, we propose separating generation and learning in RLHF. This enables asynchronous generation of new samples while simultaneously training on old samples, leading to faster training and more compute-optimal scaling." and "Finally, we verify the scalability of asynchronous RLHF by training LLaMA 3.1 8B on an instruction-following task 40% faster than a synchronous run while matching final performance." Awesome work in a deep cross-collaboration between Mila, the Allen Institute for AI, and DeepMind.</p></li></ul><p>&#128241; <strong>Demos</strong></p><ul><li><p><a href="https://github.com/corbt/agent.exe">Claude Computer Use on your actual computer</a></p><p>Needless to say, this is very risky to run. But it's pretty cool!</p></li></ul><p>&#128218; <strong>Resources</strong></p><ul><li><p><a href="https://openai.com/index/simplifying-stabilizing-and-scaling-continuous-time-consistency-models/">OpenAI introduces sCM: breakthrough in image generation speeds</a></p></li><li><p><a href="https://hailuoai.video/">Hailuo SOTA Text|Image-to-Video model</a></p><p>Very impressive results.</p><div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;440057b2-b578-4f8e-9f64-0bdd163b7452&quot;,&quot;duration&quot;:null}"></div></li><li><p><a href="https://x.com/_philschmid/status/1847565606964646077">o1 mechanism rumors could be useful for replication</a></p><p>By Philipp Schmid from Hugging Face. 
He draws a comparison to the Stream of Search paper released by Stanford researchers in April 2024.</p></li><li><p><a href="https://epochai.org/data/machine-learning-hardware">Epoch AI shares a Machine Learning Hardware Database</a></p><p>AMD MI325X and GB200 unsurprisingly take the performance crown. Cool dataset!</p></li><li><p><a href="https://github.com/vectara/hallucination-leaderboard">Vectara Hallucination leaderboard</a></p><p>Zhipu AI GLM-4-9B-Chat is a surprising #1 scoring model.</p></li><li><p><a href="https://simonwillison.net/2024/Oct/24/claude-analysis-tool/">Claude ships Analysis tool to Claude web using JS sandboxing</a></p><p>Short exploration by Simon Willison.</p></li><li><p><a href="https://embracethered.com/blog/posts/2024/claude-computer-use-c2-the-zombais-are-coming/">ZombAIs: From Prompt Injection to C2 with Claude Computer Use</a></p><p>Great example of Computer Use vulnerabilities. It's great that Anthropic and E2B are pushing computer-using LLMs forward, but we need clear demonstrations of how they are vulnerable to security risks. The author of Embrace The Red does a great job. They show an E2E example of getting an agent to run malware on the target VM (still sandboxed).</p></li></ul><div><hr></div><p>Want more? Follow me on X! <a href="https://x.com/RickLamers">@ricklamers</a></p>]]></content:encoded></item><item><title><![CDATA[Edge AI makes waves: Qwen 2.5 Code Interpreter in your browser]]></title><description><![CDATA[Week 42 of Coding with Intelligence]]></description><link>https://codingwithintelligence.com/p/edge-ai-makes-waves-qwen-25-code</link><guid isPermaLink="false">https://codingwithintelligence.com/p/edge-ai-makes-waves-qwen-25-code</guid><dc:creator><![CDATA[Rick Lamers]]></dc:creator><pubDate>Sun, 20 Oct 2024 19:28:48 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/a5c943b6-c0d2-4634-b2d3-3409d4bebe71_1680x1200.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!3xF5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41c5fe5c-0c68-4a4a-b372-7eb1bbb5007d_1680x1200.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3xF5!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41c5fe5c-0c68-4a4a-b372-7eb1bbb5007d_1680x1200.png 424w, https://substackcdn.com/image/fetch/$s_!3xF5!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41c5fe5c-0c68-4a4a-b372-7eb1bbb5007d_1680x1200.png 848w, https://substackcdn.com/image/fetch/$s_!3xF5!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41c5fe5c-0c68-4a4a-b372-7eb1bbb5007d_1680x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!3xF5!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41c5fe5c-0c68-4a4a-b372-7eb1bbb5007d_1680x1200.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!3xF5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41c5fe5c-0c68-4a4a-b372-7eb1bbb5007d_1680x1200.png" width="1456" height="1040" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/41c5fe5c-0c68-4a4a-b372-7eb1bbb5007d_1680x1200.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1040,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:88220,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!3xF5!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41c5fe5c-0c68-4a4a-b372-7eb1bbb5007d_1680x1200.png 424w, https://substackcdn.com/image/fetch/$s_!3xF5!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41c5fe5c-0c68-4a4a-b372-7eb1bbb5007d_1680x1200.png 848w, https://substackcdn.com/image/fetch/$s_!3xF5!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41c5fe5c-0c68-4a4a-b372-7eb1bbb5007d_1680x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!3xF5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41c5fe5c-0c68-4a4a-b372-7eb1bbb5007d_1680x1200.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Today&#8217;s demo is the most awesome Edge AI application I've seen, check it out. 
Python generated &amp; executed fully in your browser.</p><p>&#128241; <strong>Demos</strong></p><ul><li><p><a href="https://huggingface.co/spaces/cfahlgren1/qwen-2.5-code-interpreter">Qwen-2.5-Coder 1.5B with access to an in-browser code interpreter</a></p></li></ul><p>&#128240; <strong>News</strong></p><ul><li><p><a href="https://mistral.ai/news/ministraux/">Mistral releases 2 Ministral edge focused models</a></p><p>They release a 3B and 8B for mainly embedded use cases. It seems to compare favorably against Llama 3.2 3B on various benchmarks and against Llama 3.1 8B respectively.</p></li></ul><p>&#128230; <strong>Repos</strong></p><ul><li><p><a href="https://github.com/openreasoner/openr">OpenR: An Open Source Framework for Advanced Reasoning with Large Language Models</a></p></li><li><p><a href="https://github.com/openai/swarm">Swarm by OpenAI</a></p><p>A multi-agent orchestration framework.</p></li><li><p><a href="https://www.answer.ai/posts/2024-10-15-how-to-synthesize-data.html">Answer.ai releases fastdata: a synthetic data generation library</a></p><p>The repo: <a href="https://github.com/AnswerDotAI/fastdata">https://github.com/AnswerDotAI/fastdata</a></p></li><li><p><a href="https://huggingface.co/collections/nvidia/llama-31-nemotron-70b-670e93cd366feea16abc13d8">Llama 3.1 Nemotron: finetune by NVIDIA </a></p><p>The mention a slew of high benchmark scores: "This model reaches Arena Hard of 85.0, AlpacaEval 2 LC of 57.6 and GPT-4-Turbo MT-Bench of 8.98, which are known to be predictive of LMSys Chatbot Arena Elo" not sure how well that translates to real-word performance. Run your own evals &#128588; More interestingly, they detail the alignment procedure they used to create the model.</p></li><li><p><a href="https://github.com/facebookresearch/lingua">Meta Lingua: a hackable training and inference library</a></p><p>Looks very easy to modify and play with for trying out ideas.</p></li><li><p><a href="https://github.com/deepseek-ai/Janus">Janus-1.3B, a multimodal understand + generation model by DeepSeek</a></p><p>"Janus is a novel autoregressive framework that unifies multimodal understanding and generation"</p></li></ul><p>&#128196; <strong>Papers</strong></p><ul><li><p><a href="https://arxiv.org/abs/2408.02666">Meta releases Self-Taught Evaluators: iterative improvement through synthetic data</a></p><p>Code is available on GitHub <a href="https://github.com/facebookresearch/RAM/tree/main/projects/self_taught_evaluator">https://github.com/facebookresearch/RAM/tree/main/projects/self_taught_evaluator</a> and weights on Hugging Face <a href="https://huggingface.co/facebook/Self-taught-evaluator-llama3.1-70B">https://huggingface.co/facebook/Self-taught-evaluator-llama3.1-70B</a></p></li><li><p><a href="https://arxiv.org/abs/2407.03181">Fine-Tuning with Divergent Chains of Thought Boosts Reasoning Through Self-Correction in Language Models</a></p><p>Divergent CoT (DCoT) in short: "further improving performance by requiring models to compare multiple reasoning chains before generating a solution in a single inference step". Interesting idea!</p></li><li><p><a href="https://arxiv.org/abs/2310.04444">What's the Magic Word? 
<p>&#128196; <strong>Papers</strong></p><ul><li><p><a href="https://arxiv.org/abs/2408.02666">Meta releases Self-Taught Evaluators: iterative improvement through synthetic data</a></p><p>Code is available on GitHub <a href="https://github.com/facebookresearch/RAM/tree/main/projects/self_taught_evaluator">https://github.com/facebookresearch/RAM/tree/main/projects/self_taught_evaluator</a> and weights on Hugging Face <a href="https://huggingface.co/facebook/Self-taught-evaluator-llama3.1-70B">https://huggingface.co/facebook/Self-taught-evaluator-llama3.1-70B</a></p></li><li><p><a href="https://arxiv.org/abs/2407.03181">Fine-Tuning with Divergent Chains of Thought Boosts Reasoning Through Self-Correction in Language Models</a></p><p>Divergent CoT (DCoT) in short: "further improving performance by requiring models to compare multiple reasoning chains before generating a solution in a single inference step". Interesting idea!</p></li><li><p><a href="https://arxiv.org/abs/2310.04444">What's the Magic Word? A Control Theory of LLM Prompting</a></p><p>I like this exploration of a more principled approach to prompting.</p></li><li><p><a href="https://arxiv.org/abs/2410.10630">Thinking LLMs: General Instruction Following with Thought Generation</a></p><p>"We propose a training method for equipping existing LLMs with such thinking abilities for general instruction following without use of additional human data. We achieve this by an iterative search and optimization procedure that explores the space of possible thought generations, allowing the model to learn how to think without direct supervision."</p></li><li><p><a href="https://speechbot.github.io/spiritlm/">Spirit LM: Interleaved Spoken and Written Language Model</a></p><p>By Meta, a foundation multimodal language model that freely mixes text and speech.</p></li><li><p><a href="https://arxiv.org/abs/2409.12737">MEXMA: Token-level objectives improve sentence
<p>&#128218; <strong>Resources</strong></p><ul><li><p><a href="https://x.com/mlstreettalk/status/1845223735990813055?s=52">Fran&#231;ois Chollet keynote talk about ARC</a></p></li><li><p><a href="https://www.youtube.com/watch?v=c3N2H3Z5S3I">Chris Manning - Meaning and Intelligence in Language Models (COLM 2024)</a></p><p>Christopher Manning is the director of the Stanford Artificial Intelligence Laboratory (SAIL).</p></li></ul><div><hr></div><p>Want more? Follow me on X! <a href="https://x.com/RickLamers">@ricklamers</a></p>]]></content:encoded></item><item><title><![CDATA[Aria MoE A3.9B a new open source multimodal LLM]]></title><description><![CDATA[Week 41 of Coding with Intelligence]]></description><link>https://codingwithintelligence.com/p/aria-moe-a39b-a-new-open-source-multimodal</link><guid isPermaLink="false">https://codingwithintelligence.com/p/aria-moe-a39b-a-new-open-source-multimodal</guid><dc:creator><![CDATA[Rick Lamers]]></dc:creator><pubDate>Sun, 13 Oct 2024 13:39:11 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Z-mz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b5ae600-9096-42ea-8751-ca2f5ad46f46_1680x1200.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!Z-mz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b5ae600-9096-42ea-8751-ca2f5ad46f46_1680x1200.png" width="1456" height="1040" alt=""/></figure></div>
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3b5ae600-9096-42ea-8751-ca2f5ad46f46_1680x1200.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1040,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1081148,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Z-mz!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b5ae600-9096-42ea-8751-ca2f5ad46f46_1680x1200.png 424w, https://substackcdn.com/image/fetch/$s_!Z-mz!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b5ae600-9096-42ea-8751-ca2f5ad46f46_1680x1200.png 848w, https://substackcdn.com/image/fetch/$s_!Z-mz!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b5ae600-9096-42ea-8751-ca2f5ad46f46_1680x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!Z-mz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b5ae600-9096-42ea-8751-ca2f5ad46f46_1680x1200.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>&#128240; <strong>News</strong></p><ul><li><p><a href="https://www.rhymes.ai/blog-details/aria-first-open-multimodal-native-moe-model">Aria MoE A3.9B new SOTA open source multimodal LLM</a></p><p>Performance looks competitive with Pixtral, Llama 3.2 11B and in some cases even with GPT4o/GPT4o-mini.</p></li><li><p><a href="https://www.swebench.com/multimodal">SWE-bench extended with multimodal tasks</a></p></li><li><p><a href="https://x.com/victortaelin/status/1844886809005687270?s=46">$10k (o1) reasoning challenge by Victor Taelin</a></p><p>A challenge to see if 
<p>&#128230; <strong>Repos</strong></p><ul><li><p><a href="https://github.com/yangling0818/itercomp">IterComp: Iterative Composition-Aware Feedback Learning from Model Gallery for Text-to-Image Generation</a></p><p>Retry is all you need.</p></li><li><p><a href="https://huggingface.co/SWivid/F5-TTS">F5-TTS open source model</a></p><p>TTS with very good emotive quality: "non-autoregressive text-to-speech system based on flow matching with Diffusion Transformer (DiT)", see the paper and repo for more details.</p></li></ul>
<p>&#128196; <strong>Papers</strong></p><ul><li><p><a href="https://arxiv.org/abs/2410.05258">Differential Transformer</a></p><p>An interesting architecture modification that shows better scaling properties than vanilla Transformers. It will be interesting to see if large open source/frontier groups adopt this.</p></li><li><p><a href="https://arxiv.org/abs/2410.00907">Addition is All You Need for Energy-efficient Language Models</a></p><p>This paper proposes replacing multiplications with additions and shows some convincing data that there's merit to the idea. As with all architecture modifications, the jury is still out until the ideas are scaled up.</p></li><li><p><a href="https://arxiv.org/abs/2410.02536">Intelligence at the Edge of Chaos</a></p><p>In this paper they pretrain GPT-2 on cellular automata and show that pretraining on more complex automata increases downstream performance on tasks like chess and abstract reasoning. A fascinating result that seems to reveal something fundamental about transfer learning.</p></li><li><p><a href="https://static.synthlabs.ai/preprints/Generative_Reward_Models.pdf">Generative Reward Models - A Unified Approach to RLHF and RLAIF</a></p><p>Interesting survey of alignment methods and a proposed technique that allows for combining expensive human preference data with synthetically generated preference data. They emphasize OOD (out-of-distribution) performance, which is sometimes not highlighted enough when comparing alignment techniques. A co-author of the paper invented the DPO method at Stanford.</p></li><li><p><a href="https://arxiv.org/abs/2410.02707">LLMs Know More Than They Show: On the Intrinsic Representation of LLM Hallucinations</a></p><p>"In this work, we show that the internal representations of LLMs encode much more information about truthfulness than previously recognized." The challenge is getting models to reliably use the correct information they contain.</p></li><li><p><a href="https://arxiv.org/abs/2403.17031">The N+ Implementation Details of RLHF with PPO: A Case Study on TL;DR Summarization</a></p><p>Reproducing Reinforcement Learning implementations is notoriously fraught with gotchas, each of which can individually cause failure if handled incorrectly. A massive contribution by folks from Mila, Hugging Face and others.</p></li><li><p><a href="https://adityayedetore.github.io/assets/pdf/emnlp_2024_semantic_cues_to_hierarchy.pdf">Semantic Training Signals Promote Hierarchical Syntactic Generalization in Transformers</a></p><p>Interesting exploration of the effects of hierarchical biases in the Transformer architecture.</p></li><li><p><a href="https://arxiv.org/abs/2407.01687">Deciphering the Factors Influencing the Efficacy of Chain-of-Thought: Probability, Memorization, and Noisy Reasoning</a></p><p>Analyzing CoT behavior in LLMs using a specific task (decoding shift ciphers). The conclusion is positive: "Overall, we conclude that CoT prompting performance reflects both memorization and a probabilistic version of genuine reasoning."</p></li><li><p><a href="https://oooolga.github.io/JEDi.github.io/">Beyond FVD: Enhanced Evaluation Metrics for Video Generation Quality</a></p><p>Interesting work from Mila that aims to improve on metrics that quantify video generation quality. Since that's so hard to quantify, I think this work has a lot of potential to help researchers discover which techniques actually make a meaningful difference. I'm sure the text-to-video players (Luma Labs, Runway, OpenAI, Kling, etc.) are all over this.</p></li><li><p><a href="https://arxiv.org/abs/2410.07073">Pixtral 12B technical report</a></p><p>Some welcome details on the new Pixtral multimodal model.</p></li><li><p><a href="https://arxiv.org/abs/2410.02902">Better Instruction-Following Through Minimum Bayes Risk</a></p><p>A minimal MBR decoding sketch follows after this list.</p></li><li><p><a href="https://arxiv.org/abs/2410.01131">nGPT: Normalized Transformer with Representation Learning on the Hypersphere</a></p></li></ul>
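<p>Since Minimum Bayes Risk decoding is easy to miss from the title alone, here is a minimal sketch of the general recipe: sample several candidates, score each against the others with a utility function, and return the candidate with the highest expected utility. The token-overlap F1 utility below is a simple stand-in of my choosing, not necessarily the paper's.</p><pre><code># Minimal MBR decoding sketch: pick the candidate that agrees most with
# the other sampled candidates under a utility function.
from collections import Counter

def f1(a, b):
    # Token-overlap F1 as a toy utility function.
    ta, tb = Counter(a.split()), Counter(b.split())
    overlap = sum((ta &amp; tb).values())
    if overlap == 0:
        return 0.0
    p, r = overlap / sum(ta.values()), overlap / sum(tb.values())
    return 2 * p * r / (p + r)

def mbr_select(candidates):
    # Expected utility of candidate i against all other candidates.
    def expected_utility(i):
        return sum(f1(candidates[i], candidates[j])
                   for j in range(len(candidates)) if j != i)
    return candidates[max(range(len(candidates)), key=expected_utility)]

# In practice, candidates come from N sampled generations for one prompt.
print(mbr_select(["the cat sat", "a cat sat down", "the cat sat down"]))
</code></pre>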
<p>&#128241; <strong>Demos</strong></p><ul><li><p><a href="https://huggingface.co/spaces/jbilcke-hf/FacePoke">FacePoke: impressive training-free face image manipulation</a></p></li></ul><p>&#128218; <strong>Resources</strong></p><ul><li><p><a href="https://cursor.directory/">Cursor Directory</a></p><p>A list of crowdsourced .cursorrules. No need to write your own prompts for language-specific Cursor rules.</p></li><li><p><a href="https://youtu.be/vJ8aEO6ggOs">[Video] AMD MI325X introduction keynote</a></p></li><li><p><a href="https://darioamodei.com/machines-of-loving-grace">Machines of Loving Grace - long read by Anthropic founder Dario Amodei</a></p><p>Subtitled "How AI Could Transform the World for the Better", a positive and grounded essay about the impact of AI from one of the best, if not the best, AI labs in the world.</p></li><li><p><a href="https://notes.haroldbenoit.com/ml/llms/inference/sampling/entropy-based-sampling">Explainer of entropy-based sampling</a></p><p>See the toy sketch after this list for the core idea.</p></li><li><p><a href="https://klingai.com/community/skit">Kling AI community short films</a></p><p>Check out SOTA generative AI projects.</p></li><li><p><a href="https://huggingface.co/spaces/LLM360/TxT360">TxT360 - open source 15T corpus and processing pipeline</a></p><p>"We demonstrate a simple but effective upsampling recipe that creates a 15+ trillion-token corpus, outperforming FineWeb 15T on several key metrics."</p></li><li><p><a href="https://www.youtube.com/watch?app=desktop&amp;v=zmnzW0r_g8k&amp;list=PLjOo65uEP4cYhV7c2whkhDfWy58XFj7yL&amp;index=8">Forge by Nous Research @ Nouscon 2024</a></p></li><li><p><a href="https://github.com/GAIR-NLP/O1-Journey">O1 replication journey by GAIR (Shanghai Jiaotong University)</a></p></li><li><p><a href="https://www.youtube.com/watch?v=9VFHuGaq7SU&amp;list=PL_lsbAsL_o2B_znuvm-pDtV_cRhpqZb8l">[Video] PyTorch 2024 SF playlist</a></p></li></ul>
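<p>The gist of entropy-based sampling fits in a few lines: measure the entropy of the next-token distribution and adapt the sampling strategy to it. The thresholds and fallback policy below are illustrative toy choices of mine, not what the linked explainer or entropix prescribes exactly.</p><pre><code># Toy entropy-gated sampler: greedy when confident, exploratory when not.
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def entropy_gated_sample(logits, low=0.5, high=3.0, rng=np.random.default_rng()):
    p = softmax(logits)
    h = -(p * np.log(p + 1e-12)).sum()   # Shannon entropy in nats
    if h &lt; low:
        return int(p.argmax())           # confident: take the argmax
    if h &gt; high:
        p = softmax(logits / 1.5)        # very uncertain: flatten with temperature
    return int(rng.choice(len(p), p=p))  # otherwise: plain ancestral sampling

print(entropy_gated_sample(np.random.randn(32)))
</code></pre>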
<div><hr></div><p>Want more? Follow me on X! <a href="https://x.com/RickLamers">@ricklamers</a></p>]]></content:encoded></item><item><title><![CDATA[A new kind of Foundation Models: LFMs]]></title><description><![CDATA[Week 40 of Coding with Intelligence]]></description><link>https://codingwithintelligence.com/p/a-new-kind-of-foundation-models-lfms</link><guid isPermaLink="false">https://codingwithintelligence.com/p/a-new-kind-of-foundation-models-lfms</guid><dc:creator><![CDATA[Rick Lamers]]></dc:creator><pubDate>Sun, 06 Oct 2024 23:45:35 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/3b343cb4-4057-4009-8bf7-f30e40795b70_1680x1200.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!gC7S!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5755b1d0-b3c2-4d20-91bd-0395373c8595_1680x1200.png" width="1456" height="1040" alt=""/><figcaption class="image-caption">Liquid AI&#8217;s new 1B, 3B and 40B LFMs</figcaption></figure></div>
<p>&#128240; <strong>News</strong></p><ul><li><p><a href="https://github.com/openai/whisper/pull/2361/files">New Turbo Whisper model by OpenAI</a></p></li><li><p><a href="https://www.liquid.ai/liquid-foundation-models">Liquid AI launches Liquid Foundation Models</a></p><p>They release 1B, 3B, and 40B LFMs. The main improvement seems to be their efficiency on long-context inputs, beyond Transformer and even SSM models. Their work isn't open source (yet), so the claims will be hard to verify until it is. Some more historical context in <a href="https://www.liquid.ai/blog/liquid-neural-networks-research">https://www.liquid.ai/blog/liquid-neural-networks-research</a>. Try them in the <a href="https://playground.liquid.ai/login">Playground</a>.</p></li><li><p><a href="https://ai.meta.com/research/movie-gen/">Meta AI showcases high quality text-to-video model</a></p><p>No model or product release at this time, unfortunately. Quality looks as good as or better than OpenAI's Sora.</p></li><li><p><a href="https://openai.com/index/introducing-canvas/">OpenAI releases Canvas</a></p><p>The feature is similar to Anthropic's Artifacts and the Cursor AI IDE. Responses have been mixed; use it yourself and leave your thoughts in the comments! I'd love to read them.</p></li></ul><p>&#128230; <strong>Repos</strong></p><ul><li><p><a href="https://github.com/princeton-nlp/HELMET">HELMET: How to Evaluate Long-context Language Models Effectively and Thoroughly</a></p><p>A long-context eval from the Princeton NLP group.</p></li><li><p><a href="https://github.com/xjdr-alt/entropix">xjdr of X fame open sources a sampling technique for emulating o1-style LLM completions</a></p></li><li><p><a href="https://github.com/KellerJordan/modded-nanogpt">Modded-NanoGPT: Keller Jordan adapts NanoGPT to show an improved optimizer</a></p><p>"The proposed optimizer has the following properties: half the memory usage of Adam, 1.43x faster training, &lt;7% wallclock overhead."</p></li></ul><p>&#128196; <strong>Papers</strong></p><ul><li><p><a href="https://arxiv.org/abs/2410.02525">Contextual Document Embeddings</a></p><p>The method optimizes contrastive learning via adversarial batch construction and utilizes a two-stage encoder &#981;(d; D) to incorporate corpus statistics, enabling superior cross-domain generalization.</p></li><li><p><a href="https://arxiv.org/abs/2310.17086">Transformers Learn Higher-Order Optimization Methods for In-Context Learning: A Study with Linear Models</a></p></li><li><p><a href="https://arxiv.org/abs/2410.02660">How to Train Long-Context Language Models (Effectively)</a></p><p>The Princeton NLP group proposes the ProLong model and reaches an effective context window size of 512k.</p></li><li><p><a href="https://arxiv.org/abs/2410.02089">RLEF: Grounding Code LLMs in Execution Feedback with Reinforcement Learning</a></p><p>By Meta AI. Very interesting new post-training approach; a schematic of the execution-feedback loop follows after this list.</p></li></ul>
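<p>RLEF's core ingredient is turning execution results into a training signal. Below is a schematic sketch of that reward computation, under the simplifying assumption that each candidate defines a function <code>f</code> scored against input/output tests; the paper wraps a loop like this in full RL training (PPO-style), which this sketch does not show.</p><pre><code># Schematic execution-feedback reward: run a candidate program against
# tests and map the outcome to a scalar reward (simplified illustration).
def reward(program, tests):
    env = {}
    try:
        exec(program, env)  # candidate is expected to define f(x)
        passed = sum(1 for x, y in tests if env["f"](x) == y)
        return passed / len(tests)
    except Exception:
        return -1.0  # crashing or malformed programs get a penalty

candidate = "def f(x):\n    return x * 2"
print(reward(candidate, [(1, 2), (3, 6)]))  # 1.0, i.e. maximal reward
</code></pre>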
<p>&#128241; <strong>Demos</strong></p><ul><li><p><a href="https://xnought.github.io/vae-explainer/">VAE explained interactively in your browser</a></p></li></ul><p>&#128218; <strong>Resources</strong></p><ul><li><p><a href="https://x.com/sedielem/status/1840170796876214427?s=46">Intuition of diffusion models in an X post</a></p><p>By research scientist Sander Dieleman at Google DeepMind.</p></li><li><p><a href="https://simonwillison.net/2024/Oct/2/not-digital-god/">Impressions of OpenAI Dev Day (10/1/24)</a></p><p>By your (my?) favorite LLM sommelier Simon Willison.</p></li><li><p><a href="https://chus.space/blog/2024/ssm_1_context/">State Space Models (1): introduction to SSMs</a></p><p>Neat 3-part blog post series (all three parts available) on the intuition behind SSMs.</p></li><li><p><a href="https://aider.chat/docs/more/infinite-output.html">Infinite output through prefilling (in Aider)</a></p></li><li><p><a href="https://www.youtube.com/watch?v=OPXslklVBZc">Aider architect/editor feature</a></p></li><li><p><a href="https://symflower.com/en/company/blog/2024/dev-quality-eval-v0.6-o1-preview-is-the-king-of-code-generation-but-is-super-slow-and-expensive/">Code generation deep dive: tons of detail</a></p></li><li><p><a href="https://www.interconnects.ai/p/reverse-engineering-openai-o1">Reverse engineering OpenAI&#8217;s o1</a></p><p>By Nathan Lambert from Allen AI.</p></li><li><p><a href="https://www.youtube.com/watch?v=AfAmwIP2ntY">[Video] Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters</a></p><p>Yannic Kilcher covers a recent paper from Google DeepMind about scaling test-time compute.</p></li></ul><div><hr></div><p>Want more? Follow me on X! <a href="https://x.com/RickLamers">@ricklamers</a></p>]]></content:encoded></item></channel></rss>