📰 News
Groq LPUs run Mixtral at 500+ (!) tokens per second
Mind-blowing performance. Groq was founded by Jonathan Ross, who began Google's TPU effort as a 20% project.
OpenAI Sora: text-2-video to build a world model
Their blog post has some interesting notes on capabilities that emerge as they scale up their text-2-video pipeline.
Google releases Gemini 1.5: 10M text tokens
By feeding an entire book about a language's grammar and background into the context window, it can learn to speak an entirely new language. Gemini Pro 1.5 will have a 128K-token context window, similar to OpenAI's GPT-4 Turbo.
Google releases 7B Open Source LLM called Gemma
Interestingly, it's released on Kaggle in addition to Hugging Face. Both pre-trained and instruction-tuned variants are available, and it also comes in a 2B flavor. It seems to significantly outperform Llama 2 (Gemma / Llama 2): MMLU (64.3/45.3), HumanEval (32.3/12.8), HellaSwag (81.2/77.2). See the blog post for more details.
V-JEPA: The next step toward Yann LeCun’s vision of advanced machine intelligence (AMI)
Learns a world model by masking out patches of video frames and predicting their representations. Code & weights are open-sourced on GitHub. Much more sample-efficient than prior work. Very interesting!
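A minimal NumPy sketch of the masked-prediction setup, just to make the idea concrete (patch size, mask ratio, and shapes are illustrative assumptions, not V-JEPA's actual configuration):

```python
import numpy as np

def mask_patches(clip, patch=16, mask_ratio=0.5, seed=0):
    """Zero out a random subset of spatial patches in every frame.

    V-JEPA-style training asks a predictor to infer representations of the
    masked regions from the visible ones; here we only build the masked
    input and the boolean patch mask.
    """
    rng = np.random.default_rng(seed)
    f, h, w, c = clip.shape
    gh, gw = h // patch, w // patch                  # patch grid
    idx = rng.choice(gh * gw, size=int(gh * gw * mask_ratio), replace=False)
    mask = np.zeros((gh, gw), dtype=bool)
    mask[np.unravel_index(idx, (gh, gw))] = True
    masked = clip.copy()
    for i, j in zip(*np.nonzero(mask)):              # same mask for all frames
        masked[:, i * patch:(i + 1) * patch, j * patch:(j + 1) * patch, :] = 0.0
    return masked, mask

video = np.random.rand(16, 224, 224, 3).astype(np.float32)  # (frames, H, W, C)
masked_clip, mask = mask_patches(video)
print(masked_clip.shape, int(mask.sum()), "patches masked per frame")
```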
Mistral Next is available on LMSYS Chat
Initial vibe check shows it outperforms Gemini Ultra.
$300M for TII - the research institution that developed Falcon 180B
Could be a promising source of open models moving forward. It's always good to have competition.
tl;dr Building your own GPTs but need more features & control? Readers of CoWI get early access by signing up here.
📦 Repos
Magika: AI powered fast and efficient file type identification
A tiny neural net for file type identification that runs in milliseconds on CPU. Repo here.
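A hedged usage sketch, assuming the Python API shape shown in the repo README at launch (the `Magika` class, `identify_bytes`, and the result fields may differ in newer releases):

```python
# pip install magika
# API names follow the repo README at the time of writing; treat them as
# assumptions that may have changed in newer releases.
from magika import Magika

m = Magika()
res = m.identify_bytes(b"# Heading\nSome *markdown* text.\n")
print(res.output.ct_label, res.output.score)  # e.g. "markdown" plus a confidence
```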
EasyKV: intelligent cache eviction for efficient LLM inference
EasyKV supports various cache eviction policies and, crucially, includes RoCo (robust cache omission policy), a policy proposed by the authors that performs best across various text summarization benchmarks.
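RoCo's scoring details are in the paper; below is a generic, hypothetical eviction step just to show where a policy plugs in (this is not EasyKV's API):

```python
import numpy as np

def evict(kv_scores, budget):
    """Keep the `budget` cache entries with the highest importance scores.

    `kv_scores` could be, e.g., accumulated attention weights per cached
    token; RoCo and other policies differ mainly in how this score is
    computed and maintained. Returns kept indices in original order.
    """
    keep = np.argsort(kv_scores)[-budget:]
    return np.sort(keep)

scores = np.array([0.9, 0.1, 0.4, 0.8, 0.05, 0.6])  # toy per-token scores
print(evict(scores, budget=4))                       # -> [0 2 3 5]
```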
OpenRLHF: DPO, KTO, Mixtral support
Fine-tune at cluster scale.
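For reference, the core of DPO fits in a few lines; a minimal sketch following the loss from the DPO paper (variable names are mine, not OpenRLHF's):

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected,
             beta=0.1):
    """DPO loss: logp_* are summed log-probs of each response under the
    policy; ref_logp_* are the same under the frozen reference model."""
    policy_margin = logp_chosen - logp_rejected
    ref_margin = ref_logp_chosen - ref_logp_rejected
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()

# Toy batch of two preference pairs
loss = dpo_loss(torch.tensor([-10.0, -8.0]), torch.tensor([-12.0, -9.0]),
                torch.tensor([-10.5, -8.2]), torch.tensor([-11.5, -9.1]))
print(loss.item())
```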
Cursor’s prompting library Priompt
It’s on GitHub. tl;dr: use priorities to intelligently render a final prompt within a token budget.
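Priompt itself is TypeScript/JSX, but the core idea is easy to sketch in Python (a hypothetical analogue, not Priompt's API): drop the lowest-priority pieces until the rest fits the budget.

```python
def render(pieces, token_budget, count_tokens=lambda s: len(s.split())):
    """`pieces` is a list of (priority, text) with higher priority surviving
    longer; whole pieces are dropped lowest-priority-first until it fits.
    The whitespace token counter is a stand-in for a real tokenizer."""
    by_importance = sorted(range(len(pieces)), key=lambda i: -pieces[i][0])
    kept, used = set(), 0
    for i in by_importance:
        cost = count_tokens(pieces[i][1])
        if used + cost <= token_budget:
            kept.add(i)
            used += cost
    # Render survivors in their original document order.
    return "\n".join(pieces[i][1] for i in range(len(pieces)) if i in kept)

print(render([(10, "System: be concise."),
              (1, "Example 1 ..."), (2, "Example 2 ..."),
              (9, "User question: ...")], token_budget=8))
```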
📄 Papers
KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization
From the abstract: "Our method enables serving the LLaMA-7B model with a context length of up to 1 million on a single A100-80GB GPU and up to 10 million on an 8-GPU system."
They achieve this by intelligently quantizing the KV cache. The KV cache avoids recomputing the keys and values of previously processed tokens. Simple KV cache explainer.
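If the KV cache is new to you, a toy single-head decoding step makes the reuse obvious (shapes and weights are illustrative):

```python
import numpy as np

d, cache_k, cache_v = 64, [], []

def decode_step(x_t, Wq, Wk, Wv):
    """One decoding step of single-head attention with a KV cache.

    Only the new token's K/V are computed; everything else is reused from
    the cache. KVQuant's contribution is storing cache_k / cache_v in low
    precision so million-token contexts fit in GPU memory.
    """
    cache_k.append(x_t @ Wk)
    cache_v.append(x_t @ Wv)
    K, V = np.stack(cache_k), np.stack(cache_v)       # (t, d)
    w = np.exp((x_t @ Wq) @ K.T / np.sqrt(d))         # unnormalized attention
    return (w / w.sum()) @ V                          # attention output, (d,)

rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.05 for _ in range(3))
for _ in range(5):                                    # decode 5 tokens
    out = decode_step(rng.standard_normal(d), Wq, Wk, Wv)
print(len(cache_k), out.shape)                        # 5 cached keys, (64,)
```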
Automated Unit Test Improvement using Large Language Models at Meta
Interesting nugget from the paper: in the Nov '23 test-writing hackathon at Meta, the LLM placed 6th in the contest of who could add the most tests, outperforming some human FAANG engineers in value added. Coders beware.
📱 Demos
SDXL Lightning: super fast text-2-image by ByteDance
Try the demo in this Hugging Face Space. It’s really cool!
🛠️ Products
LlamaIndex launches commercial offering LlamaCloud
Retrieval as a service. LlamaParse is an interesting product for ingesting documents that contain structured data.
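A hedged usage sketch, assuming the `llama_parse` Python package API shown in the launch materials (method and argument names may change):

```python
# pip install llama-parse
# Method and argument names follow the launch materials; treat them as
# assumptions that may change.
from llama_parse import LlamaParse

parser = LlamaParse(api_key="llx-...", result_type="markdown")
docs = parser.load_data("./quarterly_report.pdf")  # structured data as markdown
print(docs[0].text[:500])
```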
📚 Resources
NeuralFlow: visual debugger for Mistral 7B
You can see patterns in the activations that hint that your fine-tune has gone bad. Cool idea!
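The general recipe is easy to reproduce for any Hugging Face causal LM: collect per-layer hidden states and render them as an image. A sketch under that assumption (this is not NeuralFlow's code; the model name is just an example):

```python
# pip install transformers accelerate matplotlib
import matplotlib.pyplot as plt
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "mistralai/Mistral-7B-v0.1"   # example model; any causal LM works
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(
    name, torch_dtype=torch.float16, device_map="auto")

inputs = tok("The quick brown fox", return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# Layers on the y-axis, hidden dimensions on the x-axis, magnitude of the
# last token's activation as intensity; odd bands can hint at a broken tune.
acts = torch.stack([h[0, -1].float().abs().cpu() for h in out.hidden_states])
plt.imshow(acts.numpy(), aspect="auto", cmap="magma")
plt.xlabel("hidden dim"); plt.ylabel("layer")
plt.savefig("activations.png")
```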
Gemini 1.5 performance on a real-world 120K token code understanding task vs GPT-4
Spoiler: it doesn’t look good for GPT-4.
Karpathy explains tokenization
I’ve noticed OpenAI’s tokenization and complex string handling (whitespace, foreign characters, Unicode, emoji) are excellent. This masterclass will go a long way in bringing you up to speed on best practices. There are lots of gotchas in tokenizers that can invalidate costly training runs or put a hidden ceiling on your inference quality. A nice complementary tokenizer playground: https://huggingface.co/spaces/Xenova/the-tokenizer-playground
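To build intuition quickly, poke at a real tokenizer with OpenAI's tiktoken and watch how whitespace and emoji change the token stream:

```python
# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # GPT-4 / GPT-3.5 tokenizer
for s in ["hello world", "hello  world",     # a doubled space changes tokens
          " hello", "👋", "žluťoučký"]:       # leading space, emoji, accents
    ids = enc.encode(s)
    # Decoding token-by-token shows another gotcha: multi-byte characters
    # can split across tokens and decode to replacement characters alone.
    print(repr(s), "->", ids, [enc.decode([i]) for i in ids])
```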
Want more? Follow me on Twitter! @ricklamers