📰 News
Athene-V2: Advancing Beyond the Limits of Scaling with Targeted Post-training
Fine-tuning for function-calling/agent use cases seems to be getting more attention. With more powerful base models to tune from (Qwen 2.5 72B in this case), feasibility increases even when operating on a budget.
Google's new TPU: Trillium (v6e)
Fun to explore on GCP with JAX.
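If you want to kick the tires, here's a minimal sketch, assuming a TPU VM with the `jax[tpu]` wheel installed (the matmul size is just illustrative):

```python
import jax
import jax.numpy as jnp

# On a v6e TPU VM this should list TpuDevice entries;
# on a plain machine it falls back to CPU.
print(jax.devices())

# Simple sanity check: a jitted matmul to exercise the chips.
x = jnp.ones((8192, 8192))
y = jax.jit(lambda a: a @ a.T)(x)
print(y.shape, y.dtype)
```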
Normally I don't feature M&A activity in AI, but this one is particularly interesting given Cursor's momentum in AI-assisted coding. This will strengthen their lead and generate even more momentum. If you haven't tried Cursor, now is _really_ a good time to start getting familiar.
4th edition of Conference on Lifelong Learning Agents (CoLLAs)
Scheduled for Aug 11, 2025. Lifelong learning agents are defined as "systems that can continually learn throughout their lifetime".
Gemini-Exp-1114 breaks (almost) all records
It seems to outperform Sonnet 3.5 (Oct) in certain cases, even outperforming o1-preview in some cases when prompted to use CoT. Rumored to be "Gemini 2". It didn't pass all of my vibe questions, so I'm not yet convinced this model is a clear #1.
LMSYS Arena update: Gemini Exp 1114 takes #1 spot overall
More on the Exp 1114 release. It scores well in Arena but we know that isn't the full story. Let me know in the comments how well it works for you!
📦 Repos
fixie-ai/ultravox-v0_4_1-llama-3_1-70b
Fuses a Whisper speech encoder with a Llama 3.1 70B backbone, building speech understanding into the language model itself and reducing the need for separate ASR orchestration.
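A rough sketch of trying the checkpoint via transformers' `trust_remote_code` pipeline; the input dict follows the pattern from earlier Ultravox model cards, so treat the exact call format as an assumption:

```python
import numpy as np
import transformers

# Model code ships with the repo, hence trust_remote_code.
pipe = transformers.pipeline(
    model="fixie-ai/ultravox-v0_4_1-llama-3_1-70b",
    trust_remote_code=True,
)

sr = 16_000
audio = np.zeros(sr, dtype=np.float32)  # 1 second of silence as a stand-in
turns = [{"role": "system", "content": "You are a helpful assistant."}]

# Input format assumed from earlier Ultravox releases.
out = pipe({"audio": audio, "turns": turns, "sampling_rate": sr}, max_new_tokens=30)
print(out)
```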
📄 Papers
The Surprising Effectiveness of Test-Time Training for Abstract Reasoning
Test-time training applied to the ARC challenge. Interesting ideas by a team from MIT.
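For flavor, a conceptual sketch of the recipe as I read it: fine-tune a fresh per-task adapter on leave-one-out splits of the task's own demonstration pairs, then predict. `make_lora_copy`, `augment`, and `finetune` are hypothetical helpers, not the authors' code:

```python
# Conceptual sketch only (not the authors' code).
def solve_task(base_model, demos, test_input):
    tuned = make_lora_copy(base_model)  # cheap, task-specific LoRA adapter
    for i in range(len(demos)):
        # Leave-one-out split: hold out one demo pair, train on the rest.
        context, held_out = demos[:i] + demos[i + 1:], demos[i]
        # Geometric augmentations (e.g. rotations/flips) multiply the data.
        for ctx, target in augment(context, held_out):
            finetune(tuned, ctx, target)
    return tuned.predict(demos, test_input)
```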
BitNet a4.8: 4-bit Activations for 1-bit LLMs
Adaptive quantization and efficient low-precision implementations can move the needle on efficiency. Paired with the Scaling Laws for Precision paper, this presents an interesting push on the frontier of efficient AI systems.
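To make the activation side concrete, here's a generic absmax int4 round trip. Note the paper's actual scheme is more sophisticated (hybrid quantization/sparsification for outlier channels), so this is only an illustration of the basic idea:

```python
import numpy as np

def quantize_a4(x: np.ndarray):
    """Per-tensor absmax quantization onto the int4 grid [-8, 7]."""
    scale = np.max(np.abs(x)) / 7.0 + 1e-8
    q = np.clip(np.round(x / scale), -8, 7)
    return q.astype(np.int8), scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

x = np.random.randn(16).astype(np.float32)
q, s = quantize_a4(x)
print(np.abs(dequantize(q, s) - x).max())  # worst-case quantization error
```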
Needle Threading: Can LLMs Follow Threads Through Near-Million-Scale Haystacks?
TLDR "Strikingly, we find that many models are remarkably thread-safe: capable of simultaneously following multiple threads without significant loss in performance. Still, for many models, we find the effective context limit is significantly shorter than the supported context length, with accuracy decreasing as the context window grows."
"For inference, we find that the degradation introduced by post-training quantization increases as models are trained on more data, eventually making additional pretraining data actively harmful. For training, our scaling laws allow us to predict the loss of a model with different parts in different precisions, and suggest that training larger models in lower precision may be compute optimal."
📱 Demos
X-Portrait 2: Highly Expressive Portrait Animation
For those keeping track of SOTA performance transfer in video/image-to-video, here's the latest from ByteDance. In the examples shown, it does a lot better than Act-One, Runway's very recent flagship release.
RMBG-2.0 for background removal by BRIA AI
Neat open-weights model for a practical task like background removal.
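A hedged sketch of how you'd likely run it, following the BiRefNet-style pattern common on the Hub; the exact preprocessing and output shape are assumptions based on similar model cards:

```python
import torch
from PIL import Image
from torchvision import transforms
from transformers import AutoModelForImageSegmentation

model = AutoModelForImageSegmentation.from_pretrained(
    "briaai/RMBG-2.0", trust_remote_code=True
).eval()

# Standard ImageNet normalization at 1024x1024 (assumed from similar models).
preprocess = transforms.Compose([
    transforms.Resize((1024, 1024)),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

image = Image.open("photo.jpg").convert("RGB")
batch = preprocess(image).unsqueeze(0)
with torch.no_grad():
    # Output indexing/shape assumed; verify against the model card.
    mask = model(batch)[-1].sigmoid().cpu()[0].squeeze()
image.putalpha(transforms.ToPILImage()(mask).resize(image.size))
image.save("photo_no_bg.png")
```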
🛠️ Products
Bolt.new: create any web app using agent prompting
Claude Artifacts on steroids! Instant deploy too. Neat product. No affiliation.
📚 Resources
Letta introduces tool use with constraints
"TerminalToolRule(tool_name=...) - If the tool is called, the agent ends execution, InitToolRule(tool_name=...) - The tool must be called first when an agent is run. ToolRule(tool_name=..., children=[...]) - If the tool is called, it must be followed by one of the tools specified in children"
llms.txt spec by Answer.AI: plaintext docs for AI
Anthropic already implemented it: https://docs.anthropic.com/llms-full.txt / https://docs.anthropic.com/llms.txt. I wonder how folks will handle prompt injection issues with this. I guess it comes down to trusting authorities/domain names.
Effect of quantization on various LLMs including Qwen2.5-Coder-32B-Instruct
Spoiler: Qwen2.5-Coder-32B-Instruct degrades surprisingly little when quantized to lower bit representation. Interesting in light of the 'Scaling Laws for Precision' paper in this week's Papers section.
Can AI Scaling Continue Through 2030?
An investigative report speculating on the bottlenecks to continued scaling of the key components of modern AI. "We identify electric power, chip manufacturing, data and latency as constraints."
Stripe creates APIs specifically for agent financial actions
Virtual credit cards for your agents? Awesome AI-forward features shipped by Stripe.
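For a feel of the primitive underneath, here's a sketch using Stripe's pre-existing Issuing API to mint a spend-limited virtual card; whether the new agent endpoints wrap Issuing like this is my assumption:

```python
import stripe

stripe.api_key = "sk_test_..."  # your secret key

# Issuing has supported virtual cards for a while; placeholder ids throughout.
card = stripe.issuing.Card.create(
    cardholder="ich_...",  # existing Issuing cardholder id
    currency="usd",
    type="virtual",
    spending_controls={
        # Cap what the agent can spend: amounts are in cents ($50/day here).
        "spending_limits": [{"amount": 5000, "interval": "daily"}],
    },
)
print(card.id, card.last4)
```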
Speculations on Test-Time Scaling (o1) by Sasha Rush
Sasha is a professor at Cornell Tech and works at Hugging Face.
Want more? Follow me on X! @ricklamers