Medusa successor Hydra boosts decoding performance by 1.31x & Reka AI lab releases multimodal model Reka Flash
Week 7 of Coding with Intelligence
📰 News
Reka Flash: An Efficient and Capable Multimodal Language Model
A new proprietary multimodal language model from the Reka lab. The company was started by researchers from DeepMind, Google, Baidu, and Meta, with funding from Snowflake, former GitHub CEO Nat Friedman, and the venture capital firms DST Global Partners and Radical Ventures. It's competitive with GPT-3.5 / Gemini Pro based on its GPQA score, and also handles visual tasks. A playground is available with image and video upload/chat.
MetaVoice 1B: hyper-realistic TTS model
It supports voice cloning too!
📦 Repos
An implementation of DeepMind's SELF-DISCOVER prompting technique
The author runs the technique against 'miqu', the Mistral Medium leak.
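For intuition, here's a minimal sketch of the SELF-DISCOVER stages (SELECT, ADAPT, IMPLEMENT, then solve), assuming an OpenAI-compatible chat API; the prompts, module list, and model name are illustrative placeholders, not DeepMind's or the repo's exact code:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",  # placeholder; any capable chat model
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

# Heavily abridged stand-in for the paper's seed reasoning modules
MODULES = "1. Break into sub-problems  2. Critical thinking  3. Step-by-step planning"

def self_discover(task: str) -> str:
    # Stage 1a: SELECT the reasoning modules useful for this task
    selected = ask(f"Task: {task}\nWhich of these reasoning modules help?\n{MODULES}")
    # Stage 1b: ADAPT the selected modules to the task at hand
    adapted = ask(f"Adapt these modules to the task.\nTask: {task}\nModules: {selected}")
    # Stage 1c: IMPLEMENT an explicit step-by-step reasoning structure (JSON)
    structure = ask(f"Turn the adapted modules into a JSON reasoning plan:\n{adapted}")
    # Stage 2: solve the instance by following the discovered structure
    return ask(f"Follow this reasoning structure to solve the task.\n{structure}\nTask: {task}")
```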
A Python library for making rate-limited, async batch requests to the OpenAI API.
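The core pattern is roughly this (a minimal sketch using the official openai client and an asyncio.Semaphore; the linked library presumably layers retries and token-based budgets on top):

```python
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()           # assumes OPENAI_API_KEY is set
limiter = asyncio.Semaphore(10)  # at most 10 requests in flight at once

async def complete(prompt: str) -> str:
    async with limiter:  # waits here when the concurrency budget is exhausted
        resp = await client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content

async def batch(prompts: list[str]) -> list[str]:
    return await asyncio.gather(*(complete(p) for p in prompts))

results = asyncio.run(batch(["What is 2+2?", "Name a prime larger than 100."]))
```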
lm-evaluation-harness by EleutherAI
It has become the industry standard and powers the HF Open LLM Leaderboard.
BoCoEL: Bayesian Optimization as a Coverage Tool for Evaluating Large Language Models
From their repo: "LLMs are expensive and slow behemoths, and evaluating them on gigantic modern datasets only makes it worse. If only there is a way to just select a meaningful (and small) subset of the corpus and obtain a highly accurate evaluation … Wait, sounds like Bayesian Optimization!"
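A toy sketch of the selection idea (my own illustration, not BoCoEL's actual API): fit a Gaussian process that maps example embeddings to observed scores, then spend the evaluation budget wherever the GP is most uncertain:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.default_rng(0)
emb = rng.normal(size=(1000, 16))          # stand-in for corpus embeddings

def llm_score(i: int) -> float:            # stand-in for an expensive LLM eval
    return float(np.tanh(emb[i, 0]))       # synthetic; replace with a real run

evaluated, scores = [0], [llm_score(0)]    # seed with one real evaluation
for _ in range(20):                        # tiny evaluation budget
    gp = GaussianProcessRegressor().fit(emb[evaluated], scores)
    _, std = gp.predict(emb, return_std=True)
    std[evaluated] = -np.inf               # never re-pick known points
    nxt = int(np.argmax(std))              # acquisition: maximum uncertainty
    evaluated.append(nxt)
    scores.append(llm_score(nxt))
print(gp.predict(emb).mean())              # corpus-level estimate from 21 evals
```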
Function calling model: functionary
Repo includes a bunch of useful resources like sample packing for efficient fine-tuning.
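Functionary targets the OpenAI function-calling format and the repo provides an OpenAI-compatible server, so usage looks like a standard tools request. The base_url, model name, and get_weather tool below are assumptions for a local deployment:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool, for illustration only
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]
resp = client.chat.completions.create(
    model="meetkai/functionary-small-v2.2",  # check the repo for current names
    messages=[{"role": "user", "content": "Weather in Paris?"}],
    tools=tools,
)
print(resp.choices[0].message.tool_calls)  # model emits a structured call
```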
📄 Papers
Hydra: Sequentially-Dependent Draft Heads for Medusa Decoding
Significantly faster decoding: Hydra achieves up to a 1.31x throughput improvement over Medusa and up to 2.71x over baseline autoregressive decoding. Paper + repo available.
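Conceptually (my sketch, not the paper's code), the difference from Medusa is that each Hydra draft head also conditions on the tokens drafted so far, instead of guessing every future position independently from the same hidden state:

```python
import torch
import torch.nn as nn

d, vocab, k = 512, 32000, 3
embed = nn.Embedding(vocab, d)
medusa_heads = nn.ModuleList(nn.Linear(d, vocab) for _ in range(k))
hydra_heads = nn.ModuleList(nn.Linear(2 * d, vocab) for _ in range(k))

def draft_medusa(h):                      # h: base model's last hidden state
    return [head(h).argmax(-1) for head in medusa_heads]  # independent guesses

def draft_hydra(h):
    drafted, ctx = [], h
    for head in hydra_heads:              # each head sees what came before
        tok = head(torch.cat([h, ctx], -1)).argmax(-1)
        drafted.append(tok)
        ctx = embed(tok)                  # feed drafted token to the next head
    return drafted

h = torch.randn(1, d)
print(draft_medusa(h), draft_hydra(h))
```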
Suppressing Pink Elephants with Direct Principle Feedback
Neat application of RLAIF, adapted from Anthropic's Constitutional AI. If you really want your chatbot to adhere to certain instructions, definitely take a look at this paper.
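As I read it, the DPF recipe boils down to: critique-and-revise a response against a principle, then run DPO directly on the (revised, original) pair. A placeholder sketch, not the authors' code:

```python
PRINCIPLE = "Do not mention the Pink Elephant; redirect to the Grey Elephant."

def build_dpo_pairs(prompts, generate, revise):
    """generate(prompt) -> str and revise(prompt, response, principle) -> str
    are caller-supplied model calls (placeholders, not the authors' code)."""
    pairs = []
    for p in prompts:
        original = generate(p)                     # policy model's first draft
        revision = revise(p, original, PRINCIPLE)  # single AI-feedback pass
        pairs.append({"prompt": p,
                      "chosen": revision,          # principle-following
                      "rejected": original})       # principle-violating
    return pairs  # feed into a standard DPO trainer
```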
RoSA: Accurate Parameter-Efficient Fine-Tuning via Robust Adaptation
Seems to outperform LoRA quite significantly and approaches full fine-tuning (FFT) performance in some cases. Compatible with quantized base weights.
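The core idea is that the weight delta is a low-rank term plus a sparse term on top of a frozen base. A toy illustration, not the authors' implementation (in the real method the sparse support is chosen more carefully than at random):

```python
import torch
import torch.nn as nn

class RoSALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank=8, sparsity=0.01):
        super().__init__()
        self.base = base.requires_grad_(False)          # frozen base weights
        out_f, in_f = base.weight.shape
        self.A = nn.Parameter(torch.zeros(rank, in_f))  # low-rank factors
        self.B = nn.Parameter(torch.randn(out_f, rank) * 0.01)
        mask = torch.rand(out_f, in_f) < sparsity       # fixed sparse support
        self.register_buffer("mask", mask)
        self.S = nn.Parameter(torch.zeros(out_f, in_f)) # sparse adapter

    def forward(self, x):
        delta = self.B @ self.A + self.S * self.mask    # low-rank + sparse
        return self.base(x) + x @ delta.T

layer = RoSALinear(nn.Linear(512, 512))
print(layer(torch.randn(2, 512)).shape)
```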
World Model on Million-Length Video and Language with RingAttention
A strategy for scaling to 1M tokens of context. The UC Berkeley authors open-source two new 7B models: LWM-Text and LWM-Text-Chat. Fascinating work with immense potential for modeling video and text jointly.
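The heart of RingAttention is blockwise attention with an online softmax: each host keeps its query block while key/value blocks rotate around a ring of devices, so the full attention matrix never materializes. A single-process simulation of that accumulation (a sketch, not the paper's distributed code):

```python
import numpy as np

def ring_attention(q_blocks, k_blocks, v_blocks):
    outs = []
    for q in q_blocks:                         # one loop iteration per "host"
        m = np.full(q.shape[0], -np.inf)       # running row-max
        den = np.zeros(q.shape[0])             # running softmax denominator
        num = np.zeros_like(q)                 # running weighted-value sum
        for k, v in zip(k_blocks, v_blocks):   # KV blocks arriving in a ring
            s = q @ k.T / np.sqrt(q.shape[1])  # local score block
            m_new = np.maximum(m, s.max(1))
            scale = np.exp(m - m_new)          # rescale old accumulators
            p = np.exp(s - m_new[:, None])
            den = den * scale + p.sum(1)
            num = num * scale[:, None] + p @ v
            m = m_new
        outs.append(num / den[:, None])
    return np.concatenate(outs)

x = np.random.randn(8, 16)
blocks = np.split(x, 4)
out = ring_attention(blocks, blocks, blocks)   # matches full softmax attention
```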
Inducing Systematicity in Transformers by Attending to Structurally Quantized Embeddings
Interesting paper that proposes a solution to the problem of overfitting to low-complexity training data. Generalization is all you need ;-)
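My best-effort reading of the mechanism, as a rough sketch: vector-quantize the token embeddings onto a small codebook so that structurally similar tokens share discrete codes, which nudges the model toward systematic generalization. Everything below is illustrative only:

```python
import torch

def quantize(emb, codebook):
    # nearest-codebook-entry assignment
    dists = torch.cdist(emb, codebook)        # (tokens, codes)
    codes = dists.argmin(dim=-1)              # discrete structural codes
    quantized = codebook[codes]               # swap emb for its code vector
    # straight-through estimator so gradients still reach the embeddings
    return emb + (quantized - emb).detach(), codes

emb = torch.randn(10, 64)                     # token embeddings
codebook = torch.randn(16, 64)                # 16 structural codes
q, codes = quantize(emb, codebook)
```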
SPIN: Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models
“SPIN can even outperform models trained through direct preference optimization (DPO) supplemented with extra GPT-4 preference data”
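The self-play loop, as described in the paper, roughly looks like this (placeholder interfaces, not the authors' code): the model plays against its previous self, with the human-written SFT answer always preferred over its own generation.

```python
def spin(model, sft_data, dpo_update, iterations=3):
    """model: anything with .generate(prompt) -> str (placeholder interface).
    dpo_update: caller-supplied DPO-style trainer (e.g. wrapping trl)."""
    for _ in range(iterations):
        pairs = []
        for prompt, human_answer in sft_data:
            self_answer = model.generate(prompt)    # opponent = the current self
            pairs.append({"prompt": prompt,
                          "chosen": human_answer,   # ground-truth SFT answer wins
                          "rejected": self_answer}) # self-play sample loses
        model = dpo_update(model, pairs)            # update against the pairs
    return model
```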
📱 Demos
BUD-E: very low latency voice enabled chat by LAION (open source)
At around 500 ms of latency it feels very interactive. Check out the video, and try running it locally if you're feeling adventurous! The video shows BUD-E running on a system with an NVIDIA RTX 4090 GPU. The article outlines various low-hanging-fruit improvements that could reduce latency and improve quality even further.
🛠️ Products
Stable Cascade: Stability AI releases new text-to-image model
Non-commercial use only. Based on the Würstchen architecture.
📚 Resources
Matryoshka embeddings: faster OpenAI vector search using Adaptive Retrieval
Interesting article from Supabase showing how to take advantage of OpenAI's new Matryoshka-style embeddings. The two-stage approach lets you trade off speed against accuracy: a first pass ranks using only a prefix of each embedding's components (e.g. 256 dimensions), and a second pass re-ranks the shortlist using the full embedding size (3072 dimensions).
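A minimal numpy sketch of the two-stage search (illustrative only; the article itself implements the same idea on the database side):

```python
import numpy as np

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

docs = np.random.randn(10_000, 3072)            # stand-in for stored embeddings
query = np.random.randn(3072)

# Pass 1: cheap scan over truncated, re-normalized 256-dim prefixes
sub_scores = normalize(docs[:, :256]) @ normalize(query[:256])
shortlist = np.argsort(sub_scores)[-200:]       # keep the top 200 candidates

# Pass 2: exact re-ranking of the shortlist at full dimensionality
full_scores = normalize(docs[shortlist]) @ normalize(query)
top10 = shortlist[np.argsort(full_scores)[-10:][::-1]]
```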
Want more? Follow me on Twitter! @ricklamers