📰 News
Jamba 1.5: hybrid SSM+Transformer Efficient Long Context Models
Diffusion Models Are Real-Time Game Engines
This is a crazy result, I’m curious to see what this implies in the limit. Yann, is this our world model?
Mozilla/Justine Tunney releases 1.58 bit LLM support in llamafile (CPU inference)
📦 Repos
Liger-Kernel Medusa head training
Created by the LinkedIn engineering team.
Salesforce releases more "Large Action Models" on HF
MoE, 32k context. Why these are not called function calling models beats me. Still doesn't allow commercial use.
📄 Papers
BAM! Just Like That: Simple and Efficient Parameter Upcycling for Mixture of Experts
Power Scheduler: A Batch Size and Token Number Agnostic Learning Rate Scheduler
Generative Verifiers: Reward Modeling as Next-Token Prediction
I like the idea of simplifying reward models.
EAGLE: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders
Not One Vision Encoder to Rule Them All I guess! Work by NVIDIA et al.
Fire-Flyer AI-HPC: A Cost-Effective Software-Hardware Co-Design for Deep Learning
DeepSeek strikes again! Stacked with GPU-poor-ish innovations to scale up large-scale model training.
The Mamba in the Llama: Distilling and Accelerating Hybrid Models
Two SSM distillation papers this week: MOHAWK from CMU, and this one from Cornell/ex-Stanford/Princeton and the University of Geneva.
Nous Research: DisTrO (Distributed Training Over-The-Internet)
Interesting attempt at “folding at home style” distributed model training. Code is on the way, I’ve heard.
Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model
📚 Resources
Ape - your first prompt engineer
Anthropic is leaning into prompt generation support in their console, this might be the open source version of that.
Berkeley Function Calling Leaderboard V2
Looks promising! I’m taking a closer look next week, more to follow…
Cross-Architecture Distillation Part I - The MOHAWK Framework (Transformer->SSM)
[Video] Interview with Jürgen Schmidhuber – the father of generative AI
Evaluating the Effectiveness of LLM-Evaluators (aka LLM-as-Judge)
Anthropic adds system prompts to docs
Love how they’re like “you can extract them with a prompt jailbreak anyway, let’s own that they are public” instead of pretending they’re invisible to users.
OpenDevin rebrands to OpenHands
The Devin competitor (the OSS team is now a company too).
Llama 3.1 405b bf16 base model
Some AI sommeliers apparently really dig what you can get a bf16 precision Llama 3.1 405b base model to do. If you find cool examples, please share them (as a GitHub gist) and I'll promote! Hosting courtesy of Hyperbolic.
By the excellent Trelis Research
Want more? Follow me on X! @ricklamers