Dear readers,
Happy New Year and welcome to 2025! This week's edition is a collection of everything that happened in the final two weeks (51/52) of 2024, and BOY did it get busy during that final sprint of the year. If nothing else, I think it signals that 2025 is going to be an incredible year for AI. With democratization of frontier performance (DeepSeek-V3, QwQ, QVQ, Llama 3.3 70B, Qwen 2.5 72B), an incredible installed base of compute clusters (multiple interconnected 100k-accelerator clusters, with 1M-accelerator clusters in the works), and new frontier heights (o3) that fully automate most run-of-the-mill software engineering (71.7% on SWE-bench Verified), the pace of progress is bound to be electric. Strap in and enjoy the ride!
- Rick Lamers
📰 News
DeepSeek-V3: GPT-4o level open source model
671B MoE parameters (37B active), pre-trained on 14.8T tokens. See the Technical Report for more details. The takeaway here is that this launch is _drastically_ commoditizing frontier-level models. It is significantly cheaper than GPT-4o (at a 9:1 input:output token ratio it is roughly 1/9th the cost). It claims to rival Claude Sonnet 3.5, but many evals (e.g. SWE-bench Verified) still show Sonnet 3.5 (v2) beating it slightly. Training cost was allegedly around $5.5M (and possibly even lower, since that figure is calculated at hourly rental prices for H800s).
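A quick back-of-the-envelope check on that 1/9th figure. The per-million-token prices below are my assumptions based on list pricing at the time of writing; verify current rates before relying on them:

```python
# Back-of-the-envelope cost comparison at a 9:1 input:output token mix.
# Prices are ASSUMED per-million-token list prices; check current pricing.
GPT4O_IN, GPT4O_OUT = 2.50, 10.00   # $/M tokens (assumption)
DSV3_IN, DSV3_OUT = 0.27, 1.10      # $/M tokens (assumption)

def cost(m_in: float, m_out: float, p_in: float, p_out: float) -> float:
    """Dollar cost for m_in million input and m_out million output tokens."""
    return m_in * p_in + m_out * p_out

gpt4o = cost(9, 1, GPT4O_IN, GPT4O_OUT)   # $32.50
dsv3 = cost(9, 1, DSV3_IN, DSV3_OUT)      # $3.53
print(f"GPT-4o ${gpt4o:.2f} vs DeepSeek-V3 ${dsv3:.2f} -> {gpt4o / dsv3:.1f}x")
# -> roughly a 9x ratio, matching the ~1/9th cost claim
```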
QwQ: open source reasoning model by Qwen
o1-mini level performance, and the first truly open source reasoning model. Read the blog post for details. Apache 2.0 licensed.
QVQ: 72B vision reasoning model
Gets close to the latest o1 (o1-2024-12-17) in vision performance, as measured by vision evals like MathVista. Uses the 'qwen' license.
OpenAI announces o3
o3 does well on ARC-AGI-1, which many predicted would take a long time. As is tradition in machine learning, once a task has been mastered the goalpost is quickly moved, and we're left searching for the next task that isn't solved. The AGI definition in the Microsoft-OpenAI agreement ("in excess of $100B in profits") seems most robust, as humans are pretty good at competing away profits by attacking high-margin activities.
I'm looking forward to the next eval folks are targeting. It might be ARC-AGI-2 (on which o3 apparently gets 30%) or FrontierMath by Epoch AI (on which o3 gets about 25% at the moment). The more evals resemble real (economic) utility, the more the models that beat them end up making a difference in practice (getting 71.7% on SWE-bench Verified, as o3 does, means we can all automate a significant portion of software engineering work).
Folks have pointed out that o3-mini will be the more cost-effective option (beating full o1 at several tasks), but since neither o3 nor o3-mini is available yet, not much attention has gone to it.
Groq Appgen: instant web app generation on Groq
I've personally built this application (with contributions from my awesome colleagues Jose Menendez and Benjamin Klieger) showcasing Groq's speed (with speculative decoding enabled it achieves 2k tokens per second on a 70B model!) paired with the Llama 3.3 70B model for strong code generation capabilities. Check out this X video by my colleague Benjamin: https://x.com/benklieger/status/1870277109601771851
We've also open sourced the entire implementation: https://github.com/groq/groq-appgen
Epoch AI: Frontier models have likely gotten much smaller
Nice investigative work showing that frontier models are becoming smaller. We're long past the stage of a simple "1T models won't ever go into production". Efficiency = margin in the age of scaling AI to broad-based use, so the investment incentive to make models more efficient is enormous.
ModernBERT: a better BERT by Answer.AI
Good base model for finetuning your task-specific embedding model; here's a list of finetunes folks have created: https://huggingface.co/models?other=base_model:finetune:answerdotai%2FModernBERT-base&sort=downloads
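For a sense of how easy it is to pick up, here's a minimal sketch of loading the base model with Hugging Face transformers (assumes a transformers release recent enough to include ModernBERT support):

```python
# Minimal sketch: load ModernBERT-base as a starting point for finetuning.
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "answerdotai/ModernBERT-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# Sanity check the masked-language-model head before swapping it out
# for your task-specific head and finetuning.
inputs = tokenizer("The capital of France is [MASK].", return_tensors="pt")
outputs = model(**inputs)
```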
YouTube already licensing video data to model companies?
Haven't seen coverage of this. Neat find by @bilaltwovec.
Google DeepMind previews Gemini 2.0 Flash Thinking model
Noam Shazeer, an original author of Attention Is All You Need, has returned to Google and posted about this new model on X: https://x.com/NoamShazeer/status/1869789881637200228
Read my full thoughts on the interview in this X 🧵: https://x.com/RickLamers/status/1874778471907344825
Google DeepMind announces Veo 2
Impressive video generation model by Google DeepMind. Expectations are generally that Google has a compute (TPUs) and data (YouTube) advantage and will really nail video generation. Will be interesting to see how they go from demo → paid product (APIs on gcloud? YT creator features?).
Nebius Cloud team contributes SWE-bench agent using strictly open source models
Impressive 40.6% resolved rate on SWE-bench Verified. They use Qwen-2.5-72B Instruct and Llama 3.1 70B Base.
o1 models available through API
Interestingly, they come with a control parameter called "reasoning_effort" that can be set to "low", "medium", or "high".
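A minimal sketch of what that looks like with the OpenAI Python SDK (assumes you have o1 API access and OPENAI_API_KEY set in your environment):

```python
# Minimal sketch of setting reasoning effort on an o1 chat completion.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="o1",
    reasoning_effort="high",  # "low" | "medium" | "high": trades cost/latency for more thinking
    messages=[{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
)
print(response.choices[0].message.content)
```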
📦 Repos
Qwen 2.5 Technical Report
Mainly reveals that Qwen is scaling aggressively (18T tokens of pre-training), funded by a big tech incumbent (Alibaba). An amazing gift to the community!
smolagents: code isolation/agent framework by Hugging Face
Wraps E2B, which, as you may have seen, provides isolated code execution APIs.
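For flavor, a minimal usage sketch (class names reflect the initial release and may have changed in later versions):

```python
# Minimal sketch of a smolagents code agent (pip install smolagents).
from smolagents import CodeAgent, HfApiModel

# The agent writes and executes Python to answer the query; pair it with
# an isolated executor such as E2B if you don't want code running locally.
agent = CodeAgent(tools=[], model=HfApiModel())
agent.run("How many seconds are there in a leap year?")
```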
📄 Papers
Apollo: An Exploration of Video Understanding in Large Multimodal Models
Meta researchers explore video understanding with VLMs. Strong VLMs for video understanding could unlock another large pool of pre-training data by "transcribing" videos made in the real world. Video recording is cheap (smartphones), and visual-understanding-based transcription could yield important data in real time and at scale.
Mind the Time: Temporally-Controlled Multi-Event Video Generation
Interesting new technique for controlling video generation.
1.58-bit FLUX
Bringing low-bit representations to image generation models. Very impressive efficiency gains by folks from ByteDance and POSTECH (a South Korean research university).
"On downstream tasks, language models augmented with our improved memory layer outperform dense models with more than twice the computation budget, as well as mixture-of-expert models when matched for both compute and parameters." Meta continuing to advance architecture ideas, and kindly running experiments at scale to aid the GPU poor. Thank you team & Zuck!
🛠️ Products
Deep Research by Google
An agentic feature of Gemini Advanced that taps the strong long-context performance of Google's Gemini models. Think of it as LLM + web search on steroids. Neat launch!
📚 Resources
Beyond Decoding: Meta-Generation Algorithms for Large Language Models (Remote Talk)
Ideas for inference-time scaling algorithms.
Building effective agents by Anthropic
A surprisingly balanced survey of patterns in agentic LLM applications. I think this is very close to the best understanding the leading framework developers have of agentic AI. The split between Workflows, where control flow is handled in code, and Agents, where control flow is dynamically handled by the LLM, really resonates with me (see the sketch below). Furthermore, prompt engineering your tools is an underrated optimization strategy for getting better performance.
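To make the Workflow vs. Agent distinction concrete, here's an illustrative sketch (not Anthropic's code; `call_llm` is a hypothetical stand-in for your model provider of choice):

```python
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your model provider here")

# Workflow: control flow is fixed in code; the LLM fills in each step.
def summarize_then_translate(text: str) -> str:
    summary = call_llm(f"Summarize concisely:\n{text}")
    return call_llm(f"Translate to French:\n{summary}")

# Agent: the LLM decides the next action at every turn.
def agent_loop(task: str, tools: dict, max_steps: int = 10) -> str:
    transcript = task
    for _ in range(max_steps):
        step = call_llm(f"{transcript}\nReply 'TOOL <name> <arg>' or 'FINISH <answer>'.")
        if step.startswith("FINISH"):
            return step.removeprefix("FINISH").strip()
        _, name, arg = step.split(" ", 2)
        result = tools.get(name, lambda a: "unknown tool")(arg)
        transcript += f"\n{step}\n-> {result}"
    return "step budget exhausted"
```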
Simon Willison covers QVQ the latest Qwen visual reasoning model
"Iโve tried it out with a bunch of things, with mixed resultsโbut itโs really fun seeing how it works through a problem." is the vibe eval in case you're short on time.
Reasoning Model Evals: evals to show in which domains reasoning models improve results
By Arvind Narayanan et al. from Princeton. A neat attempt to identify where reasoning models help. But with all the action in reasoning models and inference-time compute, it's bound to get outdated quickly.
Byte Latent Transformer: Patches Scale Better Than Tokens (Paper Explained)
By your favorite AI YouTuber, Yannic Kilcher. This is Meta's paper about getting rid of tokens.
DeepSeek-V3 Technical Report
Interesting notes on several optimizations enabling fast FP8 training, the use of Multi-Token Prediction (with ablations!), and, notably, distillation from R1 (their reasoning model) for improved reasoning. Additionally, they introduce a MoE routing-collapse prevention technique they dub "auxiliary-loss-free load-balancing".
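Here's a hedged sketch of the bias-based routing idea as I understand it from the report: a per-expert bias is added to the router's affinity scores for top-k selection only, and nudged after each batch so no expert stays overloaded. Details are simplified, and gating weights would still be computed from the unbiased scores:

```python
import torch

# Hedged sketch of "auxiliary-loss-free load-balancing": steer expert
# selection with a bias term instead of adding a balancing loss term.
def route_and_rebalance(scores: torch.Tensor, bias: torch.Tensor,
                        k: int = 8, gamma: float = 1e-3):
    # scores: [tokens, n_experts] router affinities; bias: [n_experts]
    topk = torch.topk(scores + bias, k, dim=-1).indices  # biased selection only
    load = torch.bincount(topk.flatten(), minlength=scores.shape[1]).float()
    # Overloaded experts get pushed down, underloaded ones up.
    bias = bias - gamma * torch.sign(load - load.mean())
    return topk, bias
```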
Scaling Test-Time Compute with Open Models
Awesome exploration of scaling test-time compute with open models by Hugging Face. "Check out this plot where the tiny 1B and 3B Llama Instruct models outperform their much larger 8B and 70B siblings on the challenging MATH-500 benchmark if you give them enough 'time to think' 🤯." Very cool result by team HF!
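The simplest strategy in that family is best-of-N sampling. A hedged sketch, where `generate` and `score` are hypothetical stand-ins for an LLM and a (process) reward model:

```python
# Best-of-N: sample N candidate solutions, score each with a reward
# model, and keep the best — spending more compute buys better answers.
def best_of_n(problem: str, generate, score, n: int = 16) -> str:
    candidates = [generate(problem) for _ in range(n)]
    return max(candidates, key=lambda c: score(problem, c))
```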
Want more? Follow me on X! @ricklamers