This week is DOUBLY packed since I couldn’t get last week’s release out in time due to a crunch period. I hope Llama 3.1 native tool use at Groq speeds (🏎️) makes up for it.
📰 News
Native tool use/function calling for Llama 3.1 lands on Groq API
I've worked on this personally, so let me know if you have any thoughts (@ricklamers on X)! It’s available for all model IDs with 3.1 in them. I even found a small trick to allow parallel tool calling, even though that’s not natively supported by the Llama 3.1 spec (you can disable it if you want using the OpenAI ‘parallel_tool_calls’ boolean in the chat completion payload).
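For reference, here’s a minimal sketch of what a tool-use request against the Groq API can look like (assuming the groq Python SDK; the get_weather tool, its schema, and the exact model ID are illustrative placeholders, not part of the announcement):

```python
import json
from groq import Groq  # pip install groq

client = Groq()  # reads GROQ_API_KEY from the environment

# Illustrative tool schema; the function name and parameters are placeholders.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="llama-3.1-70b-versatile",  # any 3.1 model ID should work per the announcement
    messages=[{"role": "user", "content": "What's the weather in Amsterdam and Paris?"}],
    tools=tools,
    tool_choice="auto",
    parallel_tool_calls=False,  # OpenAI-style flag to disable the parallel tool calling trick
)

# Print whatever tool calls the model decided to make.
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, json.loads(call.function.arguments))
```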
Answer.AI releases answerai-colbert-small-v1
Beats bge-small-en
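If you want to kick the tires, here’s a small sketch assuming the RAGatouille library and the answerdotai/answerai-colbert-small-v1 checkpoint on the Hugging Face Hub (the documents and query are made up):

```python
from ragatouille import RAGPretrainedModel  # pip install ragatouille

# Load the late-interaction (ColBERT-style) retriever from the Hub.
RAG = RAGPretrainedModel.from_pretrained("answerdotai/answerai-colbert-small-v1")

docs = [
    "Groq raised a $640M Series D to accelerate AI inference.",
    "Anthropic launched prompt caching for Claude.",
]

# Rerank a handful of candidate documents for a query (no index needed).
results = RAG.rerank(query="Who raised a Series D?", documents=docs, k=2)
print(results)
```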
Dream Machine 1.5 released
They claim improved image-to-video performance, better text support and better prompt adherence.
Groq raises $640M in funding (Series D) to accelerate AI inference
Let’s go TEAM!
xAI releases Grok-2
Grok != Groq; an impressive model on paper, but with a limited rollout as of yet.
Anthropic launches prompt caching for Claude
Not too dissimilar from Gemini’s recent launch of context caching, although the details differ (Claude’s cache has a 5-minute TTL).
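A rough sketch of what this looks like with the anthropic Python SDK, assuming the launch-time beta header; the model ID and the long system document are stand-ins:

```python
import anthropic  # pip install anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY

long_context = "..." * 1000  # stand-in for a large document you reuse across requests

response = client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=512,
    extra_headers={"anthropic-beta": "prompt-caching-2024-07-31"},  # beta opt-in at launch
    system=[
        {
            "type": "text",
            "text": long_context,
            # Marks this block as cacheable; cached prefixes expire after ~5 minutes.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Summarize the document above."}],
)
print(response.content[0].text)
```

Subsequent calls that reuse the same cached prefix within the TTL should be cheaper and faster than resending the full context.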
Sakana AI releases The AI Scientist
Most arguments against the ability of LLMs to create net new knowledge dismiss an important direction: programs guided by LLMs (agents) collecting primary data to execute hypothesis testing at scale. I’m very bullish on the direction Sakana is presenting here.
MultiOn claims breakthrough on agentic performance with Agent Q
The blog post links to this paper. As MultiOn is not open sourcing the work directly, I would take the results with healthy skepticism and evaluate them for yourself. Notably, one of the authors of the Stanford DPO paper (Rafael Rafailov) contributed to the Agent Q project.
Ideogram 2.0 released
Very impressive-looking images and text, plus an increased level of control over things like color palette or style (realism).
📦 Repos
Flash Linear Attention in Triton
E.g., Mamba2 and RWKV6.
📄 Papers
By Contextual AI; an alternative to KTO and DPO.
Scaling Laws for Reward Model Overoptimization in Direct Alignment Algorithms
To Code, or Not To Code? Exploring Impact of Code in Pre-training
Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU Clusters
📱 Demos
Runs in-browser using WASM, neat!
📚 Resources
FlexAttention: The Flexibility of PyTorch with the Performance of FlashAttention
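The core idea is a score_mod callback that rewrites each attention score before softmax, which torch.compile can fuse into a single FlashAttention-style kernel. A tiny sketch, assuming a PyTorch build that ships torch.nn.attention.flex_attention (the sizes and bias constant are arbitrary):

```python
import torch
from torch.nn.attention.flex_attention import flex_attention

device = "cuda" if torch.cuda.is_available() else "cpu"
B, H, S, D = 2, 4, 256, 64  # arbitrary sizes for the sketch
q, k, v = (torch.randn(B, H, S, D, device=device) for _ in range(3))

# score_mod is called per attention score; here: a simple distance penalty
# (ALiBi-flavored) combined with causal masking.
def score_mod(score, b, h, q_idx, kv_idx):
    score = score - 0.1 * (q_idx - kv_idx)                       # relative-position bias
    return torch.where(q_idx >= kv_idx, score, -float("inf"))    # causal mask

out = flex_attention(q, k, v, score_mod=score_mod)
print(out.shape)  # torch.Size([2, 4, 256, 64])

# For the fused, FlashAttention-speed version, wrap it once:
# flex_attention = torch.compile(flex_attention)
```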
Cool optimization scale invariance result contextualized by Simo Ryu
Here’s the related paper he cites.
Databricks compares long-context performance across various SOTA models
GPT-4o seems to rule on unseen long-context Q&A.
Deep learning for dummies. All the practical details and useful utilities that go into working with real models.
Want more? Follow me on X! @ricklamers