Mixtral 8x22B Instruct: the race for efficiency is on

Apr 17, 2024

📰 News

WizardLM 2: WizardLM-2 8x22B, WizardLM-2 70B, and WizardLM-2 7B
Outperforms GPT-4-0314, nears Claude Sonnet/GPT-4-1106-Preview.
Grok-1.5 Vision model
Idefics2: A Powerful 8B Vision-Language Model for the community by Hugging Face
Competitive even with LLaVa-NeXT-34B on most VLM benchmarks. It's a Mistral 7B derivative.
Mistral releases Mixtral 8x22B Instruct model
Native function calling, MMLU 77.75%, strong reasoning ARC. Impressive position on the Pareto front 👇

Meta shares some details on their next-gen inference chip: MTIA v2
Seems unusable outside of Meta though 🤷‍♂️
Qwen releases CodeGen models 1.5B and 7B size
7B scores 83.5 on HumanEval and 78.7 on HumanEval+. It scores above DeepSeek-Coder's 7B model, which was considered the best open model for the weight class.
Reka Core: multi-modal LLM
Claimed to outperform Claude Opus on multi-modal tasks. Technical report, and summary.

tl;dr Building your own GPTs but need more features & control? Readers of CoWI get early access by signing up here.

📦 Repos

STORM: Synthesis of Topic Outlines through Retrieval and Multi-perspective Question Asking
Nice demonstration of how one would generate a wikipedia-style article using web search RAG. Authored by Stanford researchers.
🦍 GoEx: A Runtime for Autonomous LLM Applications
Based on this core insight "We argue that in many cases, "post-facto validation"—verifying the correctness of a proposed action after seeing the output—is much easier than the aforementioned "pre-facto validation" setting. The core concept behind enabling a post-facto validation system is the integration of an intuitive undo feature, and establishing a damage confinement for the LLM-generated actions as effective strategies to mitigate the associated risks."

📄 Papers

Best Practices and Lessons Learned on Synthetic Data for Language Models
Google DeepMind on the use of synthetic data for LLMs
SEO for LLM powered search engines: strategic text sequence (STS)
Perplexity can start playing cat & mouse already.
Diffusion Models for Video Generation
By Lillian Weng who’s at OpenAI.
TransformerFAM: Feedback attention is working memory
Explicit working memory for infinite context Transformers. By Google DeepMind.
Megalodon: Efficient LLM Pretraining and Inference with Unlimited Context Length
Paper by Meta. Mainly seeks to address the quality shortcomings of other proposed nonquadratic complexity sequence models like state space models. They achieve better performance than the 7B Llama 2 Transformer based architecture under equal data and parameter budget.
Compression Represents Intelligence Linearly
Compression level ~= benchmark scores ~= perplexity. We can conclude, these models are learning something. Throwback to this lecture from Ilya Sutskever.
Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention
“We demonstrate the effectiveness of our approach on long-context language modeling benchmarks, 1M sequence length passkey context block retrieval and 500K length book summarization tasks with 1B and 8B LLMs.”

📱 Demos

🛠️ Products

Cohere Compass Beta
New multi-dimensional embedding search. Interesting retrieval approach.
Stability launches Stability Diffusion 3 Turbo API
Generate all the images.

📚 Resources

Walkthrough: Vertex AI Agent Builder
tl;dv it's similar to OpenAI's GPT Builder.
LLM Transparency tool by Meta
Very helpful for debugging how Transformer-based models reach their token probabilities.
OpenAI releases Batch API for 50% discount async inference
Speech-to-Text leaderboard by Artificial Analysis
Universal-1 by AssemblyAI has the lowest WER, but Whisper at fal.ai is fastest at reasonable WER.

Want more? Follow me on X! @ricklamers

Coding with Intelligence