📚 Resources
Extract full logprobs from OpenAI API
Neat trick!
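As a minimal sketch of what requesting logprobs looks like (the linked trick may go further, e.g. recovering probabilities for arbitrary tokens; model name and parameter values below are just example choices):

```python
# Minimal sketch: requesting token logprobs from the OpenAI Chat Completions API.
# The linked trick may go further (e.g. extracting logprobs beyond the top-k);
# this only shows the documented logprobs/top_logprobs parameters.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

resp = client.chat.completions.create(
    model="gpt-3.5-turbo",          # example model, swap as needed
    messages=[{"role": "user", "content": "Say hello"}],
    logprobs=True,
    top_logprobs=5,                 # alternatives returned per generated token
    max_tokens=5,
)

for token_info in resp.choices[0].logprobs.content:
    print(token_info.token, token_info.logprob)
    for alt in token_info.top_logprobs:
        print("  alt:", alt.token, alt.logprob)
```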
The blog post contains some interesting details about the performance of various SSM-based models (e.g. Mamba).
Well-deserved recipient of an a16z Open Source grant. Follow what he's working on; much of it is very interesting. In particular, this repo that summarizes innovations for the transformer architecture is very cool: https://github.com/lucidrains/x-transformers
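As a taste, a minimal usage sketch based on the repo's README (hyperparameters are arbitrary; the library exposes many more architectural flags than shown here):

```python
# Minimal sketch of x-transformers usage; hyperparameters are arbitrary and
# the library exposes many more innovations as simple constructor flags.
import torch
from x_transformers import TransformerWrapper, Decoder

model = TransformerWrapper(
    num_tokens=20000,
    max_seq_len=1024,
    attn_layers=Decoder(
        dim=512,
        depth=6,
        heads=8,
        rotary_pos_emb=True,   # one of many innovations toggled via flags
    ),
)

tokens = torch.randint(0, 20000, (1, 1024))
logits = model(tokens)  # (1, 1024, 20000)
```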
Experiment on the influence of different chat formats on Mixtral
Seems to jailbreak out of alignment, suggesting that alignment techniques are probably brittle.
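Chat format matters more than people expect. A quick way to inspect what a given template actually produces is Hugging Face transformers' apply_chat_template (this shows the template bundled with the Mixtral tokenizer; the experiment compares several formats, not necessarily this one):

```python
# Sketch: inspect the prompt string a chat template produces for Mixtral.
# The linked experiment compares several formats; this just prints the
# template shipped with the tokenizer via transformers' apply_chat_template.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-Instruct-v0.1")

messages = [
    {"role": "user", "content": "What is the capital of France?"},
]

prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)  # e.g. "<s>[INST] What is the capital of France? [/INST]"
```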
YouTube: 15min history of Reinforcement Learning from Human Feedback (RLHF)
By Nathan Lambert
Another architecture from Hazy Research: Based
The excellent Zoology blog post series by Chris Ré's lab at Stanford continues.
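For context, Based combines linear attention (with a Taylor approximation of softmax as the feature map) with small sliding-window attention. Below is a generic, non-causal linear-attention sketch in the spirit of those posts, not Based's actual implementation:

```python
# Generic (non-causal) linear attention sketch: softmax(QK^T)V is replaced by
# phi(Q) (phi(K)^T V), which is linear in sequence length. Based uses a Taylor
# feature map plus sliding-window attention; this only shows the basic idea.
import torch
import torch.nn.functional as F

def linear_attention(q, k, v):
    # q, k, v: (batch, seq_len, dim)
    phi_q = F.elu(q) + 1          # simple positive feature map
    phi_k = F.elu(k) + 1
    kv = torch.einsum("bsd,bse->bde", phi_k, v)               # (batch, dim, dim_v)
    z = torch.einsum("bsd,bd->bs", phi_q, phi_k.sum(dim=1))   # normalizer
    out = torch.einsum("bsd,bde->bse", phi_q, kv) / z.unsqueeze(-1)
    return out

q = torch.randn(2, 128, 64)
k = torch.randn(2, 128, 64)
v = torch.randn(2, 128, 64)
print(linear_attention(q, k, v).shape)  # torch.Size([2, 128, 64])
```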
JSON Mode and Function calling on top of Open Source LLMs by Anyscale
Extremely useful addition to "just" hosting Llama/Mistral models. This is a huge unlock for building on Open Source LLMs.
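Anyscale's endpoints are OpenAI-compatible, so using JSON mode looks roughly like the sketch below; the base URL, model name, and the exact response_format options supported are assumptions here, so check the Anyscale docs for specifics:

```python
# Sketch: JSON mode against an OpenAI-compatible endpoint such as Anyscale's.
# Base URL, model name, and exact response_format support are assumptions;
# consult the Anyscale docs for the parameters they actually accept.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.endpoints.anyscale.com/v1",  # assumed endpoint
    api_key="YOUR_ANYSCALE_API_KEY",
)

resp = client.chat.completions.create(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",       # example hosted model
    messages=[
        {"role": "system", "content": "Reply in JSON with keys 'city' and 'country'."},
        {"role": "user", "content": "Where is the Eiffel Tower?"},
    ],
    response_format={"type": "json_object"},
)

print(resp.choices[0].message.content)
```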
📰 News
📦 Repos
LLMLingua: Input compression scheme for cheaper & faster inference
By Microsoft. As the quote often attributed to Mark Twain goes: if I had more time, I would have written a shorter letter.
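A minimal sketch of the llmlingua package (method and argument names as I recall them from the project's README; check the repo for the current signature):

```python
# Sketch: compressing a long prompt with LLMLingua before sending it to an LLM.
# API per the project's README as far as I recall; verify against the repo.
from llmlingua import PromptCompressor

compressor = PromptCompressor()  # downloads a small LM used to score tokens

long_prompt = "...many thousands of tokens of context..."
result = compressor.compress_prompt(
    long_prompt,
    instruction="Answer the question based on the context.",
    question="What were the key findings?",
    target_token=300,          # rough budget for the compressed prompt
)

print(result["compressed_prompt"])
print(result["origin_tokens"], "->", result["compressed_tokens"])
```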
Axolotl: fine-tuning framework
Podcast interview with the author: https://www.latent.space/p/axolotl. The author also received an a16z grant.
PowerInfer: Llama.cpp competitor claims 11x speedup
There's also a paper: https://ipads.se.sjtu.edu.cn/_media/publications/powerinfer-20231219.pdf
📄 Papers
HyperRouter: Towards Efficient Training and Inference of Sparse Mixture of Experts via HyperNetwork
Improving language models by retrieving from trillions of tokens
I think it's interesting to revisit more exotic retrieval ideas to see how we can make models better through tighter synergy between training-time and inference-time retrieval. This is an older paper, but it's worth reviewing.
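To make the idea concrete, here is a toy sketch of chunk-level nearest-neighbor retrieval; RETRO itself retrieves neighbor chunks with frozen BERT embeddings and feeds them into chunked cross-attention, while this only shows the lookup step, with sentence-transformers as a stand-in encoder:

```python
# Toy sketch of the retrieval step in a RETRO-style setup: embed fixed-size
# chunks, then fetch nearest neighbors for a query chunk. RETRO uses frozen
# BERT embeddings and chunked cross-attention; this is only the lookup,
# with sentence-transformers as a stand-in encoder.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

chunks = [
    "Paris is the capital of France.",
    "The mitochondria is the powerhouse of the cell.",
    "Transformers use attention to mix information across tokens.",
]
chunk_emb = encoder.encode(chunks, normalize_embeddings=True)

query = "What is the capital of France?"
query_emb = encoder.encode([query], normalize_embeddings=True)

scores = chunk_emb @ query_emb[0]          # cosine similarity
top_k = np.argsort(-scores)[:2]
for i in top_k:
    print(scores[i], chunks[i])
```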
SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention
Promises Transformer-level performance at a fraction of the attention compute and memory cost. Because this is much more similar to the original Transformer architecture, adoption could be much faster than for SSM- or Hyena/Monarch-based architectures. One of the authors is Jürgen Schmidhuber, controversial yet widely regarded as a deep expert in neural networks.
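The underlying mechanism is mixture-of-experts routing applied inside the attention layer (to the value/output projections). Here is a generic top-1 MoE routing sketch to convey the idea; the paper's gating and exact placement differ from this:

```python
# Generic top-1 mixture-of-experts routing over linear projections, to give a
# feel for the mechanism; SwitchHead applies MoE routing inside attention,
# with gating details that differ from this sketch.
import torch
import torch.nn as nn

class Top1MoELinear(nn.Module):
    def __init__(self, dim_in, dim_out, num_experts=4):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(dim_in, dim_out) for _ in range(num_experts))
        self.router = nn.Linear(dim_in, num_experts)

    def forward(self, x):                                  # x: (batch, seq, dim_in)
        gates = self.router(x).softmax(dim=-1)             # (batch, seq, num_experts)
        weight, idx = gates.max(dim=-1)                    # pick one expert per token
        out = torch.zeros(*x.shape[:-1], self.experts[0].out_features, device=x.device)
        for e, expert in enumerate(self.experts):
            mask = idx == e                                # tokens routed to expert e
            if mask.any():
                out[mask] = expert(x[mask]) * weight[mask].unsqueeze(-1)
        return out

layer = Top1MoELinear(64, 64)
print(layer(torch.randn(2, 16, 64)).shape)  # torch.Size([2, 16, 64])
```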
Apple R&D team experiments with faster on-device LLM inference through efficient flash memory access
In some cases this results in a 20-25x speedup for GPU inference compared to naive loading.
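The core trick is to keep only the predicted-active slice of the FFN weights in DRAM and stream the rest from flash on demand. A crude illustration of the "load only the rows you need" part using a memory-mapped array (the paper adds sparsity prediction, windowing, and row-column bundling on top, none of which is shown here; the file name and shapes are made up):

```python
# Crude illustration of loading only the needed weight rows from storage, the
# basic idea behind the paper's flash offloading. The paper adds sparsity
# prediction, windowing, and row-column bundling; none of that is shown here.
import numpy as np

# Pretend this FFN weight matrix lives on flash as a memory-mapped file.
w = np.memmap("ffn_weights.bin", dtype=np.float16, mode="w+", shape=(4096, 1024))
w[:] = np.random.randn(4096, 1024).astype(np.float16)
w.flush()

# At inference time, a predictor says only these neurons will be active...
active_rows = np.array([3, 17, 512, 4095])

# ...so we pull only those rows into RAM instead of the full matrix.
w_read = np.memmap("ffn_weights.bin", dtype=np.float16, mode="r", shape=(4096, 1024))
active_weights = np.asarray(w_read[active_rows])   # copies just 4 rows
print(active_weights.shape)                         # (4, 1024)
```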
Transformer Memory as a Differentiable Search Index
Mentioned by the Cursor founder as a way to perform retrieval on codebases.
📱 Demos
🛠️ Products
Want more? Follow me on Twitter! @ricklamers