📰 News
Agentless: Demystifying LLM-based Software Engineering Agents
“Our results on the popular SWE-bench Lite benchmark show that surprisingly the simplistic Agentless is able to achieve both the highest performance (27.33%) and lowest cost ($0.34) compared with all existing open-source software agents!”
This paper opens up an interesting discussion around how we should quantify more rigorously what agentic loops actually contribute to performance, and in which cases they simply increase inference duration, inference cost and code complexity without measurably improving quality.
Kyutai demos and launches Moshi: a GPT-4o-like speech model
The demo is up, but isn't quite ready for prime time yet
Intel Shows OCI Optical I/O Chiplet Co-packaged with CPU at OFC2024, Enabling Explosive AI Scaling
📦 Repos
Apple releases 4M: Massively Multimodal Masked Modeling
Check out the video and HF demo! Developed in collaboration with EPFL.
Mélange: Cost Efficient Large Language Model Serving by Exploiting GPU Heterogeneity
GraphRAG: graph structured RAG
The basic idea is to extract structured information from unstructured data and to use knowledge-graph querying techniques at LLM inference time to populate the context window with relevant information to answer the user's query (rough sketch below).
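For intuition, here's a minimal sketch of that retrieval step, using networkx as a stand-in graph store. The example triples and the llm_complete helper are made up, and the actual GraphRAG implementation is considerably more involved (entity extraction, community summaries, etc.):

```python
# Toy sketch: answer a query by pulling a local neighborhood out of a
# knowledge graph and pasting it into the LLM's context window.
# The graph contents and llm_complete() are hypothetical placeholders.
import networkx as nx

# Knowledge graph previously extracted from unstructured documents:
# (entity, relation, entity) triples stored as labeled edges.
kg = nx.DiGraph()
kg.add_edge("ACME Corp", "Jane Doe", relation="founded_by")
kg.add_edge("ACME Corp", "widgets", relation="produces")
kg.add_edge("Jane Doe", "Berlin", relation="based_in")

def retrieve_facts(graph: nx.DiGraph, entity: str, hops: int = 1) -> list[str]:
    """Collect all triples within `hops` of the queried entity."""
    neighborhood = nx.ego_graph(graph.to_undirected(as_view=True), entity, radius=hops)
    facts = []
    for u, v, data in graph.edges(data=True):
        if u in neighborhood and v in neighborhood:
            facts.append(f"{u} --{data['relation']}--> {v}")
    return facts

def answer(query: str, entity: str) -> str:
    # Populate the context window with graph-derived facts, then ask the LLM.
    context = "\n".join(retrieve_facts(kg, entity))
    prompt = f"Facts:\n{context}\n\nQuestion: {query}\nAnswer:"
    return llm_complete(prompt)  # hypothetical LLM call

# answer("Who founded ACME Corp?", entity="ACME Corp")
```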
📄 Papers
MUMU: Bootstrapping Multimodal Image Generation from Text-to-Image Data
Interesting context for comparing xLSTM vs Mamba vs Transformers for vision sequence modeling tasks.
An Investigation of Incorporating Mamba for Speech Enhancement
State Space Models are finding more applications in the audio domain. Compare the noisy sample with the cleaned-up sample.
CELLO: Causal Evaluation of Large Vision-Language Models
Yann LeCun has often argued that LLMs or VLMs don’t “really” understand the world and hence fail to apply even basic physics principles. Perhaps this benchmark will help us measure to what extent scaling, architecture innovations and data quality improvements help with causal reasoning in the visual domain.
MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention
By evaluating on a wide range of downstream tasks, including InfiniteBench, RULER, PG-19, and Needle In A Haystack, and models including LLaMA-3-1M, Yi-200K, GLM-4-1M, Phi-3-128K, and Qwen2-128K, we demonstrate that MInference effectively reduces inference latency by up to 10x for pre-filling on an A100, while maintaining accuracy.
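As a rough mental model (not MInference's actual algorithm, which identifies per-head sparse patterns and builds the sparse indices dynamically), here's what a fixed sparse pre-fill attention pattern looks like in PyTorch. Note this dense implementation only masks scores; real kernels skip the masked blocks entirely, which is where the speedup comes from:

```python
# Dense illustration of a sparse pre-fill attention pattern:
# a causal local window plus a few "sink" tokens at the start.
# The window/sink pattern is an assumption for illustration only.
import torch

def sparse_prefill_attention(q, k, v, window: int = 128, n_sink: int = 4):
    T = q.shape[0]
    scores = (q @ k.T) / q.shape[-1] ** 0.5            # (T, T), computed densely here
    idx = torch.arange(T)
    local = (idx[:, None] - idx[None, :] < window)      # within the local window
    sink = idx[None, :] < n_sink                         # always attend to first tokens
    causal = idx[:, None] >= idx[None, :]
    mask = (local | sink) & causal
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

# q = k = v = torch.randn(1024, 64); out = sparse_prefill_attention(q, k, v)
```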
“Train only task-relevant experts for LLM customization, reduces storage by up to 90% and training time by up to 30%. Customizes LLMs efficiently, nearing Full-Parameter Fine-Tuning (FFT) performance (50.2 vs 51.0), retains high performance in Math and Code tasks (39.8 vs 40.5) compared to FFT (31.5) and LoRA (28.5).”
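In other words: in a Mixture-of-Experts model, freeze everything except the experts that matter for the target task. Here's a hedged sketch of just the freezing step in PyTorch; the parameter naming scheme is an assumption, and the expert-selection criterion (the interesting part) is not reproduced here:

```python
# Sketch: disable gradients everywhere, then re-enable them only on the
# experts deemed relevant for the target task. The "experts.<idx>." naming
# convention is an assumption about the MoE module layout.
import re
import torch.nn as nn

def freeze_all_but_experts(model: nn.Module, relevant_expert_ids: set[int]) -> None:
    expert_pattern = re.compile(r"\bexperts\.(\d+)\.")  # assumed naming scheme
    for name, param in model.named_parameters():
        match = expert_pattern.search(name)
        param.requires_grad = match is not None and int(match.group(1)) in relevant_expert_ids

# freeze_all_but_experts(moe_model, relevant_expert_ids={3, 7, 12}); then fine-tune as usual
```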
Ctrl-G: Adaptable Logical Control for Large Language Models
Tight control during generation allows smaller models to reach results competitive with larger, slower, more expensive models. There's also a GitHub repo, and a UI demo of its capabilities is in the works.
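To make "tight control" concrete: the simplest form of constrained decoding masks out disallowed tokens before sampling. This is not Ctrl-G's actual method (which pairs the LLM with an HMM to enforce logical constraints), just a minimal illustration of steering generation:

```python
# Generic constrained-decoding step (NOT Ctrl-G's HMM-based approach):
# tokens that would violate the constraint are masked out before sampling.
# `allowed_token_ids` is a hypothetical hook supplied by the constraint.
import torch

def constrained_step(logits: torch.Tensor, allowed_token_ids: list[int]) -> int:
    mask = torch.full_like(logits, float("-inf"))
    mask[allowed_token_ids] = 0.0
    probs = torch.softmax(logits + mask, dim=-1)
    return int(torch.multinomial(probs, num_samples=1))

# next_token = constrained_step(model_logits, allowed_token_ids=ids_satisfying_constraint)
```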
📱 Demos
Florence 2 VLM running in the browser
Very cool! Which edge use cases could be built with this?
📚 Resources
Meta just dropped weights for “Better & Faster Large Language Models via Multi-token Prediction”
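The core idea: attach several output heads to a shared trunk so the model is trained to predict the next n tokens at once, with the extra heads also usable for self-speculative decoding at inference. A minimal, assumption-heavy PyTorch sketch of the head layout (if I recall correctly the paper's heads are transformer layers with a shared unembedding; plain linear heads are used here for brevity):

```python
# Minimal sketch of multi-token prediction heads: a shared trunk followed by
# n_future independent output heads predicting tokens at offsets +1..+n_future.
# Dimensions and names are assumptions for illustration.
import torch
import torch.nn as nn

class MultiTokenHeads(nn.Module):
    def __init__(self, d_model: int, vocab_size: int, n_future: int = 4):
        super().__init__()
        self.heads = nn.ModuleList(
            [nn.Linear(d_model, vocab_size) for _ in range(n_future)]
        )

    def forward(self, hidden: torch.Tensor) -> list[torch.Tensor]:
        # hidden: (batch, seq, d_model) from the shared transformer trunk;
        # returns one logits tensor per future-token offset.
        return [head(hidden) for head in self.heads]
```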
Gradually, then Suddenly: Upon the Threshold
Very nice opinion piece by Ethan Mollick. He outlines, in my opinion, one of the most useful frameworks for thinking about progress in generative AI: progress in capabilities can be viewed as breaking through discrete capability boundaries that, once crossed, allow an entire subclass of tasks to be delegated to AI as “completely solved.”
GPT4All: local LLMs powered by llama.cpp + integrated embeddings/local RAG
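If you'd rather poke at it from Python than the desktop app, something like this should work, assuming the gpt4all bindings (pip install gpt4all) still expose the GPT4All and Embed4All classes as documented; the model filename is just an example:

```python
# Hedged usage sketch of the gpt4all Python bindings: local generation
# (via llama.cpp) plus local embeddings for a simple RAG setup.
from gpt4all import GPT4All, Embed4All

model = GPT4All("Meta-Llama-3-8B-Instruct.Q4_0.gguf")  # example model file, runs locally
with model.chat_session():
    print(model.generate("Summarize what RAG is in one sentence.", max_tokens=128))

embedder = Embed4All()  # local embedding model
vector = embedder.embed("GraphRAG populates the context window with graph facts.")
print(len(vector))      # embedding dimensionality
```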
Want more? Follow me on X! @ricklamers