Kling 2.0: uncanny valley crossed — video creation will never be the same
Week 16 of Coding with Intelligence
📰 News
Kling 2.0: a new SoTA video model
An incredible model. See the examples on the release notes page. I expect a lot of social media platforms to be FILLED with content from this model cleverly packaged by content creators. Not even in a bad way, just reducing the cost of video creation to near 0.
OpenAI releases o3 and o4-mini
So how are the vibes? Vibes are pretty good. People like o3 and it seems (based on a comment by an OpenAI employee) that o4-mini is really good for vision based tasks. The aider coding CLI polyglot coding leaderboard shows o3 and a combination of o3 + GPT 4.1 reach the highest scores (82.7%) ever observed on Aider beating Gemini 2.5 Pro Preview 03-25 (72.9%). You can check the blog post for all the OpenAI provided benchmarks, which all look good, but in some places, incremental relative to o1.
What's new with these models is that they've been dubbed "agentic" models or "agentic reasoning models" since they're capable of using built-in tools like search, file search, and code interpreter as part of the reasoning token generation. OpenAI also claims that these models are much better at function calling non-built-in tools provided by the user although benchmark scores on benchmarks like Tau-bench show marginal improvements.OpenAI releases GPT-4.1 as an API only model
A non-reasoning model with 1M context focused purely on developers it seems. They needed a stronger answer for code IDEs like Windsurf and Cursor to models like Gemini 2.5 Pro and Claude Sonnet 3.7. With this model they're bumping SWE-bench verified performance compared to GPT-4o from ~25% to 50%. Tools like Windsurf and Qodo were even explicitly mentioned in their launch blog post. On pricing, GPT-4.1 is significantly cheaper than o3: gpt-4.1 costs $2.00 per 1M input and $8.00 per 1M output whereas o3 charges $10 and $40 respectively. And o3 of course being a reasoning model generates more reasoning tokens also.
With the GPT-4.1 announcement they also announced two new long context benchmarks: MRCR and GraphWalks
Our friends over at Latent Space hosted two OpenAI employees discussing the model.Gemini 2.5 Flash launched with thinking budget controls
What is particularly interesting with this launch is how well it's positioned on the cost/quality Pareto front. See the blog post for the chart showing how well it trades off cost/quality wrt other major available models. This is of course the cheaper sibling of Gemini 2.5 Pro, if you haven't seen that model, check that one out first to contextualize this launch.
OpenAI introduces browser use benchmark BrowseComp
Interesting benchmark as browser use focused agentic use cases like Manus, Deep Research and Operator become more dominant.
Grok 3 is now available on their API
There's also a grok-3-mini which is significantly cheaper but seems to hold up performance in coding fairly well (53.3% vs 49.3% on Aider Polyglot). Although it does struggle with the Aider diff format which means it's a bit more verbose.
They introduce a reasoning model and a "deep reasoning" model they call a "rumination" model. This one integrates search tools in the reasoning process and it was RL trained to be scored on its ability to do this specifically. Very similar to the o3 and o4-mini models that are known to be able to use built-in OpenAI tools like web search during the reasoning process.
LM Arena introduces the Search Arena: Evaluating Search-Enabled AI
We've all been getting more used to consumer-focused offerings like ChatGPT/Claude/Grok/Gemini increasingly using search indexes to ground their answers through RAG. But how well do models perform at integrating this knowledge? LM Arena introduces an eval showing performance across user submitted queries by, LM Arena style, pitting models against each other and letting the user vote on the results.
DataDecide: How to predict best pretraining data with small experiments
Another great release by AllenAI further democratizing pre-training. DataDecide helps researchers decide on pre-training dataset selection without having to pre-train the entire model first.
Microsoft AI post-trains DeepSeek R1 to align with Western values
They state they've compared it to Perplexity's recent similar R1 finetune R1-1776 and show it performs better in several areas. Given this recent US government published report criticizing DeepSeek this comes at an interesting time.
📦 Repos
Transformers backend integration in vLLM
Hugging Face and vLLM have always had a more or less "loose" coupling where vLLM could use parts of it, like pulling weights from the Hugging Face Hub automatically. Now Transformers backend has been more fully integrated into vLLM making it simpler to flexibly define models using Transformers and host those models directly with vLLM.
The key unlock is to be able to go from more exotic ideas on the model side to running it for inference. Likely, performance won't be incredible but more likely than not acceptable to run some inference deployments for testing with real-world users. An example provided in the blog post is the Helium model from the Kyutai team, which wasn't supported by vLLM directly, and otherwise could not have run in vLLM without "porting it" to the vLLM list of supported models.DeepCompile: Unlocking Compiler Optimization for Distributed Training
DeepCompile from the DeepSpeed projects introduces a compilation step that combined with the existing ZeRO-3 configuration can boost training throughput by 30%-50% (compared to only ZeRO-3). Continued work of impressive open source contributions to the LLM training stack.
OpenAI announces open source Claude Code alternative codex: a command line coding agent
Interesting move for them to release this as true open source (Apache 2.0) versus Anthropic's closed Claude Code project that is heavily obfuscated and does not take outside contributions (which codex does do, see the active GitHub PR/issue section). Zig when they zag?
In my personal testing I've noticed it is slightly awkward in calling tools as everything seems to go through Shell commands and that led to some unnecessary git-patch-apply formatting errors causing many more output tokens to be used and it taking much longer to apply edits. However, I expect this thing to quickly get better.
A neat description of what codex is and can do today can be found in this blog post by Simon Willison.
Already the open nature of the project has spawned folks to write proxies to easily allow other model providers to be used with codex, see this project.SWE-agent Remote Execution Framework
Interesting approach to speeding up agents that depend on code execution through massive parallelization.
📄 Papers
From 128K to 4M: Efficient Training of Ultra-Long Context Large Language Models
Interesting long context work done by NVIDIA on long-context training.
AgentRewardBench: Evaluating Automatic Evaluations of Web Agent Trajectories
Notoriously difficult to evaluating web browsing agents but researchers from McGill (with more contributors) have put together an interesting approach with AgentRewardBench, also check out their Hugging Face space explaining their approach interactively.
ReTool: Reinforcement Learning for Strategic Tool Use in LLMs
I've mentioned this idea a couple of times already and I'm really excited to see more evidence of this strategy working: a post-training stage where tool use is enabled during the reinforcement learning phase. This allows models to learn in a much more realistic setting how they can best make use of tools during inference. Two quotes from the paper:
>Remarkably, ReTool-32B attains 72.5% accuracy in extended settings, surpassing OpenAI's o1-preview by 27.9%.
>Further analysis reveals emergent behaviors such as code self-correction, signaling an "aha moment" in which the model autonomously masters adaptive tool use.Concise Reasoning via Reinforcement Learning
One drawback of reasoning models is that they are verbose in their reasoning CoT requiring many output tokens for a given task. They find in this paper that it can be shown mathematically that there is a length bias in some of the RL algorithms used for RL training. However, they mention in the paper they didn't analyze this effect for the GRPO algorithm used by DeepSeek R1 and other recent reasoning model releases.
Claude 3.7 Sonnet shows reasoning and it's pretty clear that it's a lot more succinct than some of the open source reasoning models like QwQ. So I suspect a lot of effort will go into "conditioning" the RL CoT to be more dense without affecting the quality boost the RL CoT provides.d1: Scaling Reasoning in Diffusion Large Language Models via Reinforcement Learning
A collaboration between Meta and UCLA shows that diffusion models can be post-trained to perform reasoning like autoregressive models can and benefit from the reasoning process to improve quality on reasoning-focused benchmarks like Sudoku, GSM8K, MATH500.
FramePack lets you generate super long videos on modest GPUs by cleverly compressing less important frames to save memory. Awesome work from two researchers at Stanford.
DeepSeek-R1 Thoughtology: Let's <think> about LLM Reasoning
Interesting analysis by researchers from the Mila Quebec AI Institute on R1. They find a shortcoming of R1 is that it ruminates on previous answers too much, limiting exploration of new ideas (the crucial unlock of long CoT RL).
Synthetic Data Generation & Multi-Step RL for Reasoning & Tool Use
A more "guided" approach to teaching models how to perform Multi-Step Tool Use is to generate synthetic data showing trajectories that the model should be able to perform. This paper from the Stanford AI research group shows their approach and validates it on GSM8K, HotPotQA, CofCA, MuSiQue, and BeerQA.
The Journey Matters: Average Parameter Count over Pre-training Unifies Sparse and Dense Scaling Laws
To make LLMs more efficient various kinds of pruning have been explored, mostly focused on the post-training phase. This paper explores pruning during pre-training unifying it into a single phase, increasing efficiency of LLMs as a result. They try to capture scaling behavior for various levels of pruning to help researchers understand optimal configurations for this new pre-training-infused pre-training approach.
📚 Resources
CaMeL offers a promising new direction for mitigating prompt injection attacks
Simon Willison has been warning about prompt injection risks for a very long time and he analyzes a new approach called CaMeL. He's optimistic that their approach can help avoid the risks of agents being prompt injected by attacker text somehow making it into the agent's prompt. This work is similar to this work from CS researcher Erik Meijer where he introduces the notion of verifying formally AI workflows.
They show features in the R1 reasoning model like the ability to perform backtracking, self referencing, entity tracking and features that get triggered right before the result of a calculation is generated. They've also just raised $50M led by Menlo Ventures, so likely their technology is proving to be a valuable set of tools in understanding LLMs with the purpose of making them perform better (and probably better aligned).
Stagehand launches a Model Evaluations page
Stagehand is an open source project by the Browserbase company for building browser using agents. This eval gives a good sense of performance across various models. They highlight winners on speed, cost and accuracy. Those are llama3-70b-8192, gpt-4.1-nano, and gemini-2.0-flash, respectively.
The Second Half of AI: now that RL works we can shift focus to model utility
Interesting essay by an OpenAI researcher about now that pre-training has created the right priors for RL to work that we should collectively focus more on utility through improved evaluation environments (that can be directly trained on through RL).
Want more? Follow me on X! @ricklamers
It is interesting how the output tokens for thinking on the latest Google flash model are substantially more expensive. Surely that is not a cost thing but rather matching the pricing of the other models?