$113M in Seed funding for Mistral AI to create Open Source LLMs
Week 25 of Coding with Intelligence
This memo provides an interesting glimpse into how the LLM landscape is likely to evolve. A few things stand out to me: they’re not planning to build their own cluster, which could be a massive competitive disadvantage compared to Google (TPU pods) and OpenAI (Microsoft owns the clusters, giving OpenAI a structural cost advantage).
OpenAI has talked about open sourcing some of its non-best-in-class models. If that happens, it effectively eliminates a key hiring argument for Mistral, since they too plan to keep their best model(s) proprietary.
Content deals for proprietary training data are troublesome when combined with open weights: hackers have shown they can circumvent alignment measures, which likely means all that content can be easily extracted from the open models.
The core premise of having a leg up with EU enterprises for data privacy reasons might stand the test of time, as data increasingly becomes the differentiating factor in model performance, as evidenced by papers like “Textbooks Are All You Need”.
Nevertheless, great to see that all the LLM players are keeping each other on their toes!
Authors (UC Berkeley) claim “vLLM outperforms HuggingFace Transformers (HF) by up to 24x and Text Generation Inference (TGI) by up to 3.5x”. Check this out if you’re doing inference on OSS models!
Given that LLaMA likely outperforms Falcon, this model might be the best OSS model available for commercial use at the moment. A key limitation, however, is that it can't be used for code generation due to whitespace handling issues.
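To see why whitespace handling matters so much for code generation, here is a toy illustration (a hypothetical lossy tokenizer, not the model's actual one): any tokenizer that collapses runs of spaces cannot round-trip Python source, because indentation is significant.

```python
import re

# Toy stand-in for a tokenizer that collapses runs of spaces into one.
# Real tokenizers with this behavior destroy Python indentation.
def lossy_tokenize(text: str) -> list[str]:
    return re.sub(r" +", " ", text).split(" ")

code = "def f():\n    return 1"
roundtrip = " ".join(lossy_tokenize(code))
print(roundtrip)  # the four-space indent collapses to one space -> invalid Python
```

Detokenizing gives `def f():\n return 1`, which no longer parses as the original function body.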
Step 1: have a model write a regex that matches the output you desire. Step 2: use ReLLM to force the model, at prediction time, to generate only allowable tokens. Super cool idea; I'm curious whether this "pre-generation filtering" approach hurts output quality too much. It would be good to see benchmark evals with and without ReLLM.
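The core idea can be sketched in a few lines (everything here is hypothetical: fixed token scores instead of real logits, and a hand-written "prefix pattern"; ReLLM itself compiles the regex into an automaton and masks the model's actual logits): at each step, only accept tokens that keep the output a viable prefix of a full regex match.

```python
import re

# Toy vocabulary and a fake "model" that proposes tokens best-first.
# In ReLLM the ranking would come from the LLM's logits.
VOCAB = ["hello", "4", "2", "cat", "7", "!"]

def ranked_tokens(prefix: str) -> list[str]:
    return VOCAB  # pretend the model always prefers this order

def constrained_generate(target_len: int) -> str:
    # Target regex: exactly three digits, [0-9]{3}.
    # Hand-written "prefix pattern" accepting every prefix of its language:
    prefix_pattern = re.compile(r"[0-9]{0,3}")
    out = ""
    while len(out) < target_len:
        for token in ranked_tokens(out):  # greedy: best-scoring token first
            if prefix_pattern.fullmatch(out + token):
                out += token  # token kept: output can still grow into a match
                break
    return out

print(constrained_generate(3))  # prints "444": only digits survive the filter
```

Note the filtering happens before sampling, so invalid continuations are never even candidates, which is why it's "pre-generation" rather than reject-and-retry.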
This is especially interesting because code generation has been particularly costly to do on metered proprietary APIs.
Hosting LLMs can be a bit tricky. BentoML has a track record of providing excellent and intuitive APIs for machine learning serving. I would check this one out.
I hear a lot of folks are using this successfully. It probably won’t get you to the same scale that dedicated vector databases like Weaviate and Pinecone can handle, though.
A new gradient descent algorithm that supposedly drastically outperforms current best-in-class optimizers like AdamW and NAdamW. They specifically evaluate on language models.
Other LLMs that score above 50% on HumanEval require at least 100x the amount of data and 10x the model size.
Distributed training of LLMs is the norm, but how can one scale it to ever more nodes? This note from Eric Zelikman et al. discusses tricks that can be used to reduce the amount of data sharing required.
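One illustrative example of such a trick (local SGD, a well-known communication-reduction scheme; not necessarily the note's exact method): let each worker take k optimizer steps on its own shard and only synchronize by averaging parameters once per round, cutting communication roughly by a factor of k.

```python
import random

def grad(w: float, data_point: float) -> float:
    # Gradient of a simple quadratic loss (w - data_point)^2 / 2.
    return w - data_point

def local_sgd(worker_data, w0=0.0, lr=0.1, local_steps=5, rounds=20):
    weights = [w0] * len(worker_data)
    for _ in range(rounds):
        for i, shard in enumerate(worker_data):
            for _ in range(local_steps):           # k local steps, no comms
                weights[i] -= lr * grad(weights[i], random.choice(shard))
        avg = sum(weights) / len(weights)          # one sync per round
        weights = [avg] * len(weights)
    return weights[0]

random.seed(0)
w = local_sgd([[1.0, 1.2], [2.8, 3.0]])
print(round(w, 2))  # converges near the global data mean (~2.0)
```

With syncing every step you'd communicate `local_steps` times more often for roughly the same fixed point; the trade-off is that stale local updates can slow or destabilize convergence, which is where the cleverer tricks come in.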
With the quality of text-to-speech increasing to human levels it's easy to get fooled. Great to see vendors of text-to-speech engines helping in the fight to discern real from fake audio clips!
Good evidence that unbiased evaluations are still difficult to come by. Francis includes code to reproduce the MMLU evaluation score which is nice to see.
An approximate guide for navigating the open source language model landscape.
Want more? Follow me on Twitter! @ricklamers