Era of 1-bit LLMs: LLaMA 3B performance with just 28% of the weight memory
Week 9 of Coding with Intelligence
📰 News
Mistral Large: behind GPT-4 and Gemini Ultra but slightly cheaper
As expected, it’s not open source: it’s only available through their API endpoints and through Azure.
It’s priced at precisely 80% of GPT-4 Turbo.
Some of the benchmarks are a bit underwhelming: HumanEval comes in at 45% while GPT-4 Turbo is at 82% (while not a perfect benchmark, it’s a first-order approximation of a model’s coding ability).
In terms of capabilities, they seem to be ahead of Anthropic’s Claude 2.
It's nice to see the CEO commit to open source and explain that their decision making is mostly a result of their limited GPU capacity (supposedly just 1.5k H100s got us Mistral Large). We're truly living in a compute-bound world.
EMO: Alibaba's latest Audio2Video is mind-blowingly good
Shockingly good emotive expression in faces with very convincing lip sync. Check out the videos!
Berkeley Function-Calling Leaderboard
Alongside the leaderboard comes the release of a new model: OpenFunctions-v2 (based on DeepSeek-Coder 7B), which seems to outperform FireFunction V1.
FireFunction V1: Open Source function calling model
Performs similarly to GPT-4 on some aspects of function calling. Open weights, commercial use allowed; the weights are on Hugging Face.
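For readers new to the task these leaderboards measure: the model is handed a machine-readable tool schema plus a user request and has to emit a structured call. The sketch below is a generic illustration with a made-up `get_weather` tool; it is not the specific prompt or output format used by FireFunction or OpenFunctions.

```python
# What function-calling benchmarks test, in miniature: given a tool schema
# and a user request, the model must produce a structured call.
# The schema and tool below are invented for illustration only.

tool_spec = {
    "name": "get_weather",
    "description": "Get the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string"},
            "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
        },
        "required": ["city"],
    },
}

user_message = "What's the weather like in Amsterdam, in celsius?"

# A correct model response is a call the application can execute directly:
expected_call = {"name": "get_weather", "arguments": {"city": "Amsterdam", "unit": "celsius"}}
```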
tl;dr Building your own GPTs but need more features & control? Readers of CoWI get early access by signing up here.
📦 Repos
ChartVLM: multimodal LLM optimized for charts
Contains a new benchmark and outperforms GPT-4 Vision for some chart types.
📄 Papers
Do Llamas Work in English? On the Latent Language of Multilingual Transformers
Interesting interpretability work on the Transformer's latent space (the representations of data inside the model). tl;dr: the abstract "concept space" lies closer to English than to other languages; so, yes, sort of.
The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
Crazy result: lower perplexity and better task performance (ARC-e, ARC-c, HellaSwag, BoolQ, OpenBookQA, PIQA, WinoGrande) with only ternary (-1, 0, 1) values for the model's weights (compared to LLaMA 3B).
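The ternary weights come from the paper's "absmean" quantization function: scale the weight matrix by its mean absolute value, then round and clip to {-1, 0, +1}. A minimal PyTorch sketch of that step (in the paper this runs inside quantization-aware training, not as a post-hoc conversion; variable names are mine):

```python
import torch

def absmean_ternary_quantize(w: torch.Tensor, eps: float = 1e-5):
    """Quantize a weight matrix to ternary {-1, 0, +1} values: scale by the
    mean absolute value (gamma), then round and clip to [-1, 1]."""
    gamma = w.abs().mean()                     # per-tensor scale
    w_ternary = torch.clamp((w / (gamma + eps)).round(), -1, 1)
    return w_ternary, gamma                    # keep gamma to rescale outputs

# toy usage
w = torch.randn(4, 4)
w_q, scale = absmean_ternary_quantize(w)
print(w_q)  # entries are only -1.0, 0.0, or 1.0
```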
DoRA: Weight-Decomposed Low-Rank Adaptation
More attempts to close the gap between full fine-tuning and PEFT (parameter efficient fine-tuning).
"DoRA consistently outperforms LoRA on fine-tuning LLaMA, LLaVA, and VL-BART on various downstream tasks, such as commonsense reasoning, visual instruction tuning, and image/video-text understanding"Kosmos-2: Grounding Multimodal Large Language Models to the World
Kosmos-2: Grounding Multimodal Large Language Models to the World
MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs
They train a 175B LLM on 300B tokens in under 2 days, reaching 1.34x the MFU (Model FLOPs Utilization) of Megatron-LM.
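MFU is simply the fraction of the hardware's theoretical peak FLOP/s that actually goes into model math, with the model cost commonly estimated at ~6 FLOPs per parameter per token (forward + backward). A back-of-the-envelope sketch; the throughput and hardware numbers are my placeholder assumptions, not figures from the paper:

```python
# MFU (Model FLOPs Utilization) in one formula. All numbers are illustrative
# assumptions, not values reported in the MegaScale paper.

n_params       = 175e9     # model size
tokens_per_sec = 1.6e6     # assumed aggregate training throughput
n_gpus         = 10_000
peak_flops_gpu = 312e12    # assumed per-GPU BF16 peak (A100-class)

model_flops_per_token = 6 * n_params                      # ~6N rule of thumb (fwd + bwd)
achieved_flops = model_flops_per_token * tokens_per_sec   # FLOP/s spent on the model
mfu = achieved_flops / (n_gpus * peak_flops_gpu)
print(f"MFU ≈ {mfu:.1%}")  # ~54% under these assumptions
```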
📱 Demos
Neat demo showing the UX possibilities of search and discovery powered by LLMs and traditional retrieval
🛠️ Products
📚 Resources
OpenAI employee on the power of Compute Multipliers
He makes an excellent point about the value of compute multipliers in the context of scaling up. Great precursors to this article are Chinchilla compute-optimal training and The Bitter Lesson by Rich Sutton.
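If you haven't read the Chinchilla paper, the rule of thumb it popularized is easy to sketch: training compute is roughly C ≈ 6·N·D (N parameters, D tokens) and the compute-optimal ratio is roughly D ≈ 20·N. A tiny worked example under those commonly cited approximations (the constants are rules of thumb, not exact values from the paper):

```python
# Chinchilla-style compute-optimal sizing under the rough rules
# C = 6 * N * D and D = 20 * N. Budget below is illustrative.

def chinchilla_optimal(compute_flops: float):
    """Return (params, tokens) that roughly balance a compute budget."""
    n_params = (compute_flops / 120) ** 0.5   # solve 6 * N * (20 * N) = C
    n_tokens = 20 * n_params
    return n_params, n_tokens

N, D = chinchilla_optimal(1e24)  # a ~1e24 FLOP budget (illustrative)
print(f"~{N/1e9:.0f}B params trained on ~{D/1e9:.0f}B tokens")
```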
The killer app of Gemini Pro 1.5 is video
By Simon Willison
Variable naming for tensors by Noam Shazeer
He’s the cofounder of Character.ai and a coauthor of the "Attention Is All You Need" paper.
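The gist of the post is to suffix each tensor variable name with single letters naming its dimensions, declared once at the top of the file, so shapes are readable at every call site. A tiny sketch of the idea (the specific letters and tensors here are my own example, not code from the post):

```python
import torch

# Dimension letters, declared once: Batch, sequence Length, model Dim, Heads, head dim K
B, L, D, H = 2, 16, 64, 8
K = D // H

x_BLD = torch.randn(B, L, D)        # activations: [batch, length, dim]
w_DHK = torch.randn(D, H, K)        # projection:  [dim, heads, head_dim]
q_BLHK = torch.einsum("bld,dhk->blhk", x_BLD, w_DHK)  # shapes readable at a glance
```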
Maxime Labonne releases AlphaMonarch-7B
AlphaMonarch-7B is a new DPO merge that retains all the reasoning abilities of the very best merges and significantly improves its conversational abilities. Kind of the best of both worlds in a 7B model.
It benchmarks way above vanilla Mistral 7B Instruct, but is this just overfitting on benchmarks through model merging or is it really a better model? I'd say: run it for your use-case and check your evals!
Want more? Follow me on Twitter! @ricklamers