New frontier? Continuous learning and RL environment scaling
Week 36 of Coding with Intelligence
📰 News
Trust me bro, just one more RL scale up, this one is the real one with the good envs
Chief scientist Ryan Greenblatt at Redwood Research saying the quiet part out loud. There's been a lot of attention on RL environments as a way to improve models' capabilities. The basic idea is to create realistic simulation environments in which language models run as agents (capable of using tools to interact with said environment) and to reward the behaviors/trajectories that lead to a good result in the environment, like passing unit tests or reaching an objective.
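To make that concrete, here is a toy sketch of what such an environment boils down to (hypothetical names, loosely following the common reset/step convention; the reward is a programmatic check standing in for "the unit tests pass"):

```python
import os
import subprocess
import sys
import tempfile

class CodingEnv:
    """Toy RL environment: the agent submits Python code, reward = tests pass.

    A hypothetical sketch, not any lab's actual environment.
    """

    def __init__(self, task_prompt: str, test_code: str):
        self.task_prompt = task_prompt
        self.test_code = test_code

    def reset(self) -> str:
        # The observation the agent (an LLM with tool access) starts from.
        return self.task_prompt

    def step(self, submitted_code: str) -> tuple[str, float, bool]:
        # Run the agent's code together with the hidden tests.
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(submitted_code + "\n\n" + self.test_code)
            path = f.name
        try:
            proc = subprocess.run([sys.executable, path], capture_output=True, timeout=10)
            passed = proc.returncode == 0
        except subprocess.TimeoutExpired:
            passed = False
        finally:
            os.unlink(path)
        # Verifiable reward: 1.0 if the tests pass, 0.0 otherwise.
        return ("tests passed" if passed else "tests failed"), float(passed), True

env = CodingEnv(
    task_prompt="Write a function add(a, b) that returns a + b.",
    test_code="assert add(2, 3) == 5\nassert add(-1, 1) == 0",
)
obs = env.reset()
# During RL training an LLM agent would produce this action; hardcoded here.
_, reward, done = env.step("def add(a, b):\n    return a + b")
print(reward)  # 1.0
```

Scaling this up then becomes a question of making the environments richer (real repos, browsers, databases) and keeping the reward checks trustworthy, which is exactly the verification bottleneck Ryan discusses.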
Ryan rightfully questions the narrative that we're close to a very large acceleration due to AI labs scaling up on these kinds of environments. His key argument is that, to a decent extent, these kinds of environments (e.g. basic coding environments with access to a terminal/compiler/unit-test runner) have likely already been used by top labs (scaled-up RL was discussed at the Grok 4 launch, for example) and that verification of non-coding tasks could remain a bottleneck for creating useful and diverse RL environments.
I especially appreciate the Q&A section where he entertains the strongest counterpoints people might have. Excellent format for defending his points. Overall though, he agrees that there's still a lot of potential in scaling up the number/diversity/quality of RL environments. What I appreciate in particular are the points on there being evidence for better verification techniques in non-trivially verifiable domains, with the IMO gold medal as a likely example (verifying a written math proof is not trivial, especially without Lean-like solvers during inference). And the insight that we might be entering an acceleration loop of RL environment creation, since SWE agents are speeding up coding so much (as evidenced by Cursor/Codex/SWE-bench results) and that work is (potentially) largely automatable with human scientist/engineer oversight.
OpenAI puts the "Open" back in OpenAI with the launch of gpt-oss 20B & 120B models
Awesome open source model drop by OpenAI: these MoE models with high expert counts (32 and 128 experts respectively) perform remarkably well, especially when combined with function calling.
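As a rough sketch of what "combined with function calling" looks like in practice: the models speak the standard OpenAI-style tools interface, so any compatible client works. The base URL below is Groq's OpenAI-compatible endpoint; treat the exact model id as an assumption and substitute whatever your provider exposes.

```python
# Minimal function-calling sketch against an OpenAI-compatible endpoint.
# Assumptions: Groq's base_url and the "openai/gpt-oss-120b" model id.
import json
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.groq.com/openai/v1",
    api_key=os.environ["GROQ_API_KEY"],
)

tools = [{
    "type": "function",
    "function": {
        "name": "set_cell",
        "description": "Write a value into a spreadsheet cell.",
        "parameters": {
            "type": "object",
            "properties": {
                "cell": {"type": "string", "description": "e.g. A1"},
                "value": {"type": "string"},
            },
            "required": ["cell", "value"],
        },
    },
}]

resp = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "Put 'Q3 revenue' in cell A1."}],
    tools=tools,
)

# The model responds with structured tool calls your app can execute.
for call in resp.choices[0].message.tool_calls or []:
    print(call.function.name, json.loads(call.function.arguments))
```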
We're hosting them at Groq and I've built a demo of these models powering an AI chat that can interface directly with a spreadsheet (all in your browser!): https://autosheet.groqlabs.com/ (that project itself is also open source, hack the source!)
MCPMark: a comprehensive suite for evaluating the agentic ability of frontier models
A modern agentic eval that measures the performance of various models on e2e tasks against MCP servers for Filesystem, Notion, Playwright, GitHub, and Postgres. GPT-5 takes an interesting lead, although it's on par with Claude 4.1 Opus on the Notion tasks.
Gemini ships SOTA image editing model
Gemini 2.5 Flash Image aka nano-banana has been floating in the ether for a while now with early access on Yupp.ai and LMArena. The model is incredibly powerful for stable image editing where only targeted modifications are applied. It beats both OpenAI's gpt-image-1 and Gemini's previous model by a wide margin, as evidenced by the Arena scoring gap. It is so good that on Reddit people are posting comparisons of old paid Photoshop work they've done and how nano-banana zero-shots these tasks, often with higher quality (like getting reflections right). If you were looking for specific and real examples of work displacement, then look no further. Incredible release putting the Gemini chat app in the lead on image editing/generation capabilities.
Grok enters AI coding with Grok Code Fast 1
The model is quite fast, but quality seems to be close to, or slightly worse than, Claude Sonnet 4 in practice. If they can keep the speed and boost the quality to GPT-5 level they would have a real killer combo on their hands. They subsidized the launch heavily by giving away a lot of tokens to the key coding agents (GitHub Copilot, Cline, opencode, Cursor, Kilo Code, Roo Code, Windsurf) and OpenRouter, leading to increased adoption and to them leading OpenRouter's coding token consumption (momentum seems to have kept up even as the free period has ended). I suspect the speed will be copied by others as the qualitative effect of "staying in flow" is really valuable while coding. <groq-shill>If anyone wants to achieve that using a novel chip architecture (LPUs), hit me up!</groq-shill>
GPT-5: Key characteristics, pricing and model card
GPT-5 has been out now for close to a month, and I think overall the routing mode has not been received very well (in ChatGPT the Auto mode often selects the fast models for questions it should really route to the thinking model, and gets them wrong as a consequence). The GPT-5 Thinking model is quite good but takes a long time. They seem to have pushed themselves out of the attractive "high quality, instant, but non-reasoning" model market for the moment. Direct non-thinking responses from Opus 4.1 feel in a league of their own compared to GPT-5 Instant, as an example.
This article by Simon Willison does a great job capturing all the nuance of the release.
I'd add that in my anecdotal use of GPT-5 inside Cursor the model performs really, really well, although for trickier problems I tend to switch to Opus 4.1, which is way too expensive to use for everything (at about 10x the cost of GPT-5). I suspect GPT-5 is good in Cursor because one of the areas it shines in is steerability. I think Cursor adds a lot of sensible context in their (tool) prompts, and I like being quite specific about how the model implements things, as I'm generally not vibe coding but doing "assisted software engineering".
Gemini teases Genie 3: a high definition world model simulator that can be manipulated
The example of the paint brush painting a wall with persistence (even when looking away) is quite compelling. No playable demo unfortunately. It will be interesting to see what the value of world model simulators like this will be moving forward. Will they be the basis for embodied-simulation RL training or more of a standalone product experience like promptable game engines for games/interactive experiences?
Qwen ships Qwen-Image-Edit
Strong image editing capabilities and a best-in-class open source alternative to nano-banana/Gemini 2.5 Flash Image, with capabilities similar to Flux Kontext (as one Reddit user puts it: "I'd say Qwen Image editing is slightly superior to Kontext in prompt following when editing image").
Gemini 2.5 Pro Deep Think ships in Gemini app
Ultra subs only. No API access. But it's pretty wild to have such strong intelligence available pretty much self-serve.
Fast Reasoning on GPT-OSS with Speculative Decoding and Arctic Inference
Interesting work from Snowflake accelerating open source inference by using the hidden state of the model to predict more tokens ahead as a form of speculative decoding.
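For readers less familiar with speculative decoding, here is a stripped-down, greedy draft-and-verify loop (a toy with stand-in "models"; Arctic Inference drafts from the target model's own hidden states rather than a separate draft model, and production implementations use rejection sampling to preserve the target distribution):

```python
# Toy greedy speculative decoding: a cheap draft model proposes k tokens,
# the expensive target model verifies them and keeps the longest agreeing prefix.
# Stand-in "models" are simple next-token functions over a list of ints.

def draft_next(tokens):   # cheap, sometimes wrong
    return (tokens[-1] + 1) % 10

def target_next(tokens):  # expensive ground truth we want to match
    return (tokens[-1] + 1) % 10 if tokens[-1] != 4 else 7

def speculative_decode(prompt, steps=8, k=4):
    tokens = list(prompt)
    while len(tokens) < len(prompt) + steps:
        # 1) Draft k tokens autoregressively with the cheap model.
        draft, ctx = [], list(tokens)
        for _ in range(k):
            t = draft_next(ctx)
            draft.append(t)
            ctx.append(t)
        # 2) Verify: the target checks each drafted position; in a real system
        #    this is one batched forward pass, which is where the speedup comes from.
        accepted, ctx = [], list(tokens)
        for t in draft:
            expected = target_next(ctx)
            if t == expected:
                accepted.append(t)
                ctx.append(t)
            else:
                accepted.append(expected)  # take the target's token and stop
                break
        tokens.extend(accepted)
    return tokens

print(speculative_decode([0]))  # e.g. [0, 1, 2, 3, 4, 7, 8, 9, ...]
```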
LongCat-Flash-Chat: a 560B-A18.6B∼31.3B MoE
Another foundational model release from China that pushes the frontier with a novel dynamic activation parameter count that is context window specific. It outperforms DeepSeek V3.1 on certain benchmarks, truly pulling its weight in the frontier bucket, and reaches parity with Claude 4 Sonnet on many coding tasks. An interesting artifact to study and a good model to run. Perhaps more importantly, it indicates fierce open source competition in China, potentially leading to further acceleration of open source model development.
GLM-4.5V: a very strong vision language model by Zhipu AI
Based on GLM-4.5 Air (108B), it significantly outperforms Gemma 3 27B.
Prime Intellect announces Environments Hub
A cool project to facilitate a push for open source RL environment development. See the "Trust me bro, just one more RL scale up" post for more context on the need for these kinds of RL environments. Will be interesting to see how many environments get created and what experimental RL training runs show in terms of gains from these environments on downstream tasks. At the time of writing about 125 environments seem to have been created.
📦 Repos
Cartridges: Storing long contexts in tiny caches with self-study
An exploration of continuous learning by the Stanford Hazy Research lab: they train a KV cache, labeled a "cartridge", per corpus (a collection of documents) and use that trained KV cache during inference as an alternative to loading the entire set of documents into the context window.
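This is not the repo's code or its exact objective (the paper distills the model's own behavior on self-generated questions about the corpus), but here is a tiny runnable toy of the underlying idea: optimize a small trainable KV cache so that attending to it approximates attending to a much larger one.

```python
# Toy illustration of the cartridge idea: learn a small set of "virtual" key/value
# vectors so that attention over [small cache] mimics attention over [full corpus].
# Loosely prefix-tuning-by-distillation; not the hazyresearch/cartridges API.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d, corpus_len, cartridge_len, query_len = 32, 512, 16, 8

# Frozen "model": a single attention read with a fixed query projection.
Wq = torch.randn(d, d) / d**0.5
def attend(queries, keys, values):
    scores = (queries @ Wq) @ keys.T / d**0.5
    return F.softmax(scores, dim=-1) @ values

corpus_k = torch.randn(corpus_len, d)   # stand-in for the prefilled corpus KV cache
corpus_v = torch.randn(corpus_len, d)

# The "cartridge": a much smaller trainable KV cache.
cart_k = torch.nn.Parameter(torch.randn(cartridge_len, d) * 0.1)
cart_v = torch.nn.Parameter(torch.randn(cartridge_len, d) * 0.1)
opt = torch.optim.Adam([cart_k, cart_v], lr=1e-2)

# "Self-study": distill the full-corpus behavior on synthetic queries.
for step in range(2000):
    q = torch.randn(query_len, d)             # synthetic questions about the corpus
    target = attend(q, corpus_k, corpus_v)    # what the model reads with full context
    pred = attend(q, cart_k, cart_v)          # what it reads with only the cartridge
    loss = F.mse_loss(pred, target)
    opt.zero_grad()
    loss.backward()
    opt.step()

print(f"distillation loss: {loss.item():.4f}, "
      f"cache size: {corpus_len} -> {cartridge_len} entries")
```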
Radical idea with significant performance implications: using 38.6× less memory and enabling 26.4× higher throughput. It's still more costly than In-Context Learning (ICL, i.e. just stuffing documents into the context window) in terms of training time (30m for a Cartridge on Llama 3 8B) vs. prefill time, but there are ideas for speeding this up.
OpenPipe: e2e example of RL training a Deep Research agent on an open source model for email search
Noteworthy: OpenPipe was just acquired by CoreWeave, likely to assist with training open source models for specific use cases using SFT/RL.
The implementation is based on the open source ART learning framework by OpenPipe, which itself is based on torchtune.
Open-dLLM: Open Diffusion Large Language Models
No other diffusion LLM project had open sourced both data and training code; so far only inference code, evaluation, and weights were available. This project changes that with total openness across all of those dimensions. Helpful for studying diffusion-based LLMs e2e! Performance on coding tasks like HumanEval lags quite a bit behind other open weight models like Dream 7B (20.8 vs 56.7), so there might be some suboptimal choices in architecture, training, or data.
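For intuition on how these models generate text (at a very high level): instead of decoding left-to-right, masked-diffusion LLMs start from a fully masked sequence and iteratively unmask the positions they're most confident about. A toy sketch of that schedule, with a random stand-in playing the role of the denoiser:

```python
# Toy sketch of masked-diffusion ("unmasking") decoding. A real dLLM runs the
# transformer over the whole sequence each step and produces per-position token
# distributions; here a random stand-in plays that role.
import random

random.seed(0)
VOCAB = list("abcdefgh")
SEQ_LEN, STEPS = 12, 4
MASK = "_"

def fake_denoiser(seq):
    """Stand-in for the model: a (token, confidence) guess per masked position."""
    return {i: (random.choice(VOCAB), random.random())
            for i, tok in enumerate(seq) if tok == MASK}

seq = [MASK] * SEQ_LEN
per_step = SEQ_LEN // STEPS
for step in range(STEPS):
    preds = fake_denoiser(seq)
    # Unmask the positions the model is most confident about this step.
    for i, (tok, _) in sorted(preds.items(), key=lambda kv: -kv[1][1])[:per_step]:
        seq[i] = tok
    print(f"step {step + 1}: {''.join(seq)}")
```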
OpenVision 2: simplified pretraining and stronger vision encoder performance
Continued impressive work on vision encoders by the team at UCSC (and folks from Apple/Berkeley). Especially useful due to the full transparency on method, training code and data. OpenVision 2 boasts 2x reduction in training time and memory use while achieving higher quality than the original OpenVision encoders.
📄 Papers
Jointly Reinforcing Diversity and Quality in Language Model Generations
A new paper from the Meta FAIR team says the usual push for accuracy and helpfulness tends to squeeze out response diversity, which hurts creativity. They show you can optimize for both diversity and quality at the same time, with strong results on verifiable math and creative writing. The good news: there seems to be no inherent trade-off. Potentially less AI slop in the future!
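The general recipe (a sketch of the idea, not the paper's exact objective) is to score each sampled response both on quality and on how different it is from the other samples in its group, then optimize the combined signal:

```python
# Toy combined quality+diversity reward over a group of sampled responses.
# Quality is a stand-in scorer; diversity is 1 - max token-overlap (Jaccard)
# with the other responses in the group. Not the paper's exact formulation.

def jaccard(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta | tb), 1)

def combined_rewards(responses, quality_fn, lam=0.5):
    rewards = []
    for i, r in enumerate(responses):
        quality = quality_fn(r)
        overlaps = [jaccard(r, o) for j, o in enumerate(responses) if j != i]
        diversity = 1.0 - max(overlaps, default=0.0)
        rewards.append(quality + lam * diversity)
    return rewards

samples = [
    "The moon hung low, a silver coin over the harbor.",
    "The moon hung low, a silver coin over the bay.",
    "Fog swallowed the streetlights one by one.",
]
# With equal quality, the third (most distinct) sample gets the highest reward.
print(combined_rewards(samples, quality_fn=lambda r: 1.0))
```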
VerlTool: a toolkit for tool-integrated reasoning training
They released a paper showing performance gains from using VerlTool versus other approaches to training LLM agents to use tools during reasoning. This is like the GPT-5 Thinking/o3 mode in ChatGPT that can use image manipulation and web search while it's rolling out its Chain-of-Thought reasoning. Their framework VerlTool allows you to train this capability into an existing model using RL training algorithms like GRPO and DAPO.
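At rollout time, tool-integrated reasoning just means the generation loop pauses whenever the model emits a tool call, executes it, appends the result, and resumes; the resulting trajectories plus a final verifiable reward are what GRPO/DAPO then train on. A minimal, model-free sketch of that loop (the <tool> markup and the scripted "model" are invented for illustration; VerlTool defines its own formats):

```python
# Minimal tool-integrated rollout loop. The "model" is a scripted stand-in that
# first asks for a calculator, then answers; the markup is made up for illustration.
import re

def scripted_model(transcript: str) -> str:
    if "<result>" not in transcript:
        return "I need to compute this. <tool>calc: 17 * 23</tool>"
    return "The answer is 391. <final>391</final>"

def run_tool(call: str) -> str:
    kind, expr = call.split(":", 1)
    assert kind.strip() == "calc"
    return str(eval(expr, {"__builtins__": {}}))  # toy calculator; never eval untrusted input

def rollout(question: str, max_turns: int = 4):
    transcript = question
    for _ in range(max_turns):
        chunk = scripted_model(transcript)
        transcript += "\n" + chunk
        if (m := re.search(r"<tool>(.*?)</tool>", chunk)):
            transcript += f"\n<result>{run_tool(m.group(1))}</result>"
        if (m := re.search(r"<final>(.*?)</final>", chunk)):
            return transcript, m.group(1)
    return transcript, None

traj, answer = rollout("What is 17 * 23?")
reward = 1.0 if answer == "391" else 0.0   # verifiable reward the RL algorithm trains on
print(reward)
```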
📚 Resources
xAI releases the Grok 2 weights
Interesting artifact to study, but probably not very useful as better models have since been released. Some details on the architecture are in this discussion: https://huggingface.co/xai-org/grok-2/discussions/24 (270B total, 115B activated MoE with 8 experts). Based on the Grok 2 launch blog post it reaches (a little below) the level of Claude Sonnet 3.5 on tasks like GPQA and HumanEval.
The Hidden Drivers of HRM's Performance on ARC-AGI
A breakdown of the HRM model's performance on ARC-AGI-1 (41%) with only 27M parameters (small!).
Automatically Jailbreaking Frontier Language Models with Investigator Agents
"We train investigator agents using reinforcement learning to generate natural language jailbreaks for 48 high-risk tasks involving CBRN materials, explosives, and illegal drugs. Our results show success against models including GPT-5-main (78%), Claude Sonnet 4 (92%), and Gemini 2.5 Pro (90%). We find that small open-weight investigator models can successfully attack frontier target models, demonstrating an approach to cost-effective red-teaming." couldn't have said it better. Interesting find + deep dive on how jailbreaking isn't solved, even at the frontier.
[Video playlist] GPU MODE meetup talk recordings
GPU MODE usually brings out very competent operators, and this time MoE, long context, and quantization are discussed (among other topics).
Slides by Denny Zhou, Gemini reasoning lead at Google DeepMind
The slides show that reasoning was built on principles like Self-Consistency and that more work on verification is needed to make progress on real-world tasks.
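Self-Consistency, for reference, is the simple idea of sampling several reasoning chains and majority-voting over their final answers; a minimal sketch (with a noisy stand-in in place of a real model call):

```python
# Self-consistency in a nutshell: sample k chains-of-thought, majority-vote the answers.
# `sample_answer` stands in for "run the model once at temperature > 0 and parse
# the final answer out of its reasoning chain".
import random
from collections import Counter

random.seed(0)

def sample_answer(question: str) -> str:
    # A noisy solver: right 70% of the time, wrong otherwise.
    return "42" if random.random() < 0.7 else random.choice(["41", "43"])

def self_consistency(question: str, k: int = 16) -> str:
    votes = Counter(sample_answer(question) for _ in range(k))
    return votes.most_common(1)[0][0]

print(self_consistency("What is 6 * 7?"))  # the majority vote is very likely "42"
```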
They find that there's a large gap between pass@1 and pass@any, which indicates there might be a way to get to a much higher pass@1 through further optimization (i.e. since pass@any shows there is some way for current agents to perform the task in the browser, there might be a way to elicit the right behaviors every time, rather than a need for a fundamental capability unlock to achieve these browser tasks). The best model+scaffold combination scores 42.3% on the Online Mind2Web benchmark while pass@any is 88.3%.
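Concretely, with per-task attempt logs both numbers fall out directly; the gap between them is the headroom being pointed at. A quick sketch on toy data:

```python
# pass@1 vs pass@any from a matrix of attempt outcomes (rows = tasks, columns = attempts).
results = [
    [True,  False, True ],   # task solved on attempts 1 and 3
    [False, False, True ],   # only attempt 3 succeeds
    [False, False, False],   # never solved: a genuine capability gap
]

# pass@1 here is simply the first attempt; the standard unbiased estimator
# averages over all attempts instead.
pass_at_1   = sum(r[0] for r in results) / len(results)
pass_at_any = sum(any(r) for r in results) / len(results)   # any attempt counts

print(f"pass@1 = {pass_at_1:.2f}, pass@any = {pass_at_any:.2f}")  # 0.33 vs 0.67
```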
Cursor: 1.5x Faster MoE Training with Custom MXFP8 Kernels
Cursor flexing their model training muscle. With Cursor's phenomenal adoption success it's likely they'll do more & more work to compete with frontier labs on coding (adjacent) models. Interesting deep dive on how to get MXFP8 training to be correct & efficient on Blackwell B200s.
Inside vLLM: Anatomy of a High-Throughput LLM Inference System
Aleksa Gordić has a knack for teaching: this blog post effortlessly runs you through the key components of vLLM, from advanced features like disaggregated prefill/decoding to the basics of single- to multi-GPU deployment. Great visualizations too!
Equipping SWE agents with specialized MCP tools to make them operate more intelligently increases the utility and efficiency of these coding agents. ast-grep-mcp gives agents like Cursor and Claude Code the ability to perform structural code search (as a refresher: you write code patterns with $METAVAR placeholders like $A or $B that match and capture AST nodes, e.g. the pattern print($A) matches any single-argument print call, letting you search for structural code patterns rather than just text).
A nice blog post detailing the gpt-oss models released by OpenAI. From model architecture to the new chat template system (harmony).
Global Software Optimization (GSO) SWE leaderboard
A benchmark of very complex SWE tasks; the best frontier model scores only 8.9% (OpenAI's o3, which scores above GPT-5 Thinking at high reasoning effort). What I appreciate in particular are the sample descriptions given, showing the root of each task and how models fail in solving them. A very helpful new benchmark as labs get closer to saturating benchmarks like SWE-bench Verified.
> Smoothed SignSGD is more computationally and memory efficient relative to Adam variants, pairing perfectly with its sparsity to make it a compelling candidate for both SFT and RFT
New optimizers being shared; it will be interesting to follow further releases from the WhyPhy team.
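I haven't dug into the exact formulation, but read literally, "smoothed SignSGD" suggests taking the sign of a smoothed (EMA) gradient rather than of the raw gradient, which is what makes it so cheap: one momentum buffer and one bit of direction per parameter. A toy sketch under that assumption:

```python
# Toy "smoothed SignSGD" under the assumption that smoothing = EMA of the gradient
# (signum/sign-momentum style). One buffer per parameter, each update is just +/- lr.
import numpy as np

def smoothed_signsgd(grad_fn, x0, lr=0.05, beta=0.9, steps=200):
    x = np.array(x0, dtype=float)
    m = np.zeros_like(x)
    for _ in range(steps):
        g = grad_fn(x)
        m = beta * m + (1 - beta) * g      # smoothing: EMA of the gradient
        x -= lr * np.sign(m)               # sign update: cheap, bounded, sparse-friendly
    return x

# Minimize f(x) = ||x - target||^2 as a sanity check.
target = np.array([3.0, -2.0, 0.5])
grad_fn = lambda x: 2 * (x - target)
print(smoothed_signsgd(grad_fn, x0=[0.0, 0.0, 0.0]))  # close to [3, -2, 0.5]
```

If the actual WhyPhy formulation differs, the broad point stands: sign-based updates need far less optimizer state than Adam's two moment buffers.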
Want more? Follow me on X! @ricklamers