📚 Resources
Extract full logprobs from OpenAI API
Neat trick!
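As a minimal sketch of what requesting logprobs looks like (the linked trick may go further, e.g. recovering probabilities for arbitrary tokens; model name and parameter values below are just example choices):

```python
# Minimal sketch: requesting token logprobs from the OpenAI Chat Completions API.
# The linked trick may go further (e.g. extracting logprobs beyond the top-k);
# this only shows the documented logprobs/top_logprobs parameters.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

resp = client.chat.completions.create(
    model="gpt-3.5-turbo",          # example model, swap as needed
    messages=[{"role": "user", "content": "Say hello"}],
    logprobs=True,
    top_logprobs=5,                 # alternatives returned per generated token
    max_tokens=5,
)

for token_info in resp.choices[0].logprobs.content:
    print(token_info.token, token_info.logprob)
    for alt in token_info.top_logprobs:
        print("  alt:", alt.token, alt.logprob)
```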
The blog post contains some interesting details about the performance of various SSM-based models (e.g. Mamba).
Well-deserved recipient of an a16z Open Source grant. Follow what he's working on; much of it is very interesting. In particular, this repo that summarizes innovations for the transformer architecture is very cool: https://github.com/lucidrains/x-transformers
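As a taste, a minimal usage sketch based on the repo's README (hyperparameters are arbitrary; the library exposes many more architectural flags than shown here):

```python
# Minimal sketch of x-transformers usage; hyperparameters are arbitrary and
# the library exposes many more innovations as simple constructor flags.
import torch
from x_transformers import TransformerWrapper, Decoder

model = TransformerWrapper(
    num_tokens=20000,
    max_seq_len=1024,
    attn_layers=Decoder(
        dim=512,
        depth=6,
        heads=8,
        rotary_pos_emb=True,   # one of many innovations toggled via flags
    ),
)

tokens = torch.randint(0, 20000, (1, 1024))
logits = model(tokens)  # (1, 1024, 20000)
```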
Experiment on the influence of different chat formats on Mixtral
Seems to jailbreak out of alignment, suggesting that alignment techniques are probably brittle.
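Chat format matters more than people expect. A quick way to inspect what a given template actually produces is Hugging Face transformers' apply_chat_template (this shows the template bundled with the Mixtral tokenizer; the experiment compares several formats, not necessarily this one):

```python
# Sketch: inspect the prompt string a chat template produces for Mixtral.
# The linked experiment compares several formats; this just prints the
# template shipped with the tokenizer via transformers' apply_chat_template.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-Instruct-v0.1")

messages = [
    {"role": "user", "content": "What is the capital of France?"},
]

prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)  # e.g. "<s>[INST] What is the capital of France? [/INST]"
```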
YouTube: 15min history of Reinforcement Learning from Human Feedback (RLHF)
By Nathan Lambert
Another architecture from Hazy Research: Based
The excellent Zoology blog post series by Chris Ré's lab at Stanford continues.
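For context, Based combines linear attention (with a Taylor approximation of softmax as the feature map) with small sliding-window attention. Below is a generic, non-causal linear-attention sketch in the spirit of those posts, not Based's actual implementation:

```python
# Generic (non-causal) linear attention sketch: softmax(QK^T)V is replaced by
# phi(Q) (phi(K)^T V), which is linear in sequence length. Based uses a Taylor
# feature map plus sliding-window attention; this only shows the basic idea.
import torch
import torch.nn.functional as F

def linear_attention(q, k, v):
    # q, k, v: (batch, seq_len, dim)
    phi_q = F.elu(q) + 1          # simple positive feature map
    phi_k = F.elu(k) + 1
    kv = torch.einsum("bsd,bse->bde", phi_k, v)               # (batch, dim, dim_v)
    z = torch.einsum("bsd,bd->bs", phi_q, phi_k.sum(dim=1))   # normalizer
    out = torch.einsum("bsd,bde->bse", phi_q, kv) / z.unsqueeze(-1)
    return out

q = torch.randn(2, 128, 64)
k = torch.randn(2, 128, 64)
v = torch.randn(2, 128, 64)
print(linear_attention(q, k, v).shape)  # torch.Size([2, 128, 64])
```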
JSON Mode and Function calling on top of Open Source LLMs by Anyscale
Extremely useful addition to "just" hosting Llama/Mistral models. This is a huge unlock for building on Open Source LLMs.
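Anyscale's endpoints are OpenAI-compatible, so using JSON mode looks roughly like the sketch below; the base URL, model name, and the exact response_format options supported are assumptions here, so check the Anyscale docs for specifics:

```python
# Sketch: JSON mode against an OpenAI-compatible endpoint such as Anyscale's.
# Base URL, model name, and exact response_format support are assumptions;
# consult the Anyscale docs for the parameters they actually accept.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.endpoints.anyscale.com/v1",  # assumed endpoint
    api_key="YOUR_ANYSCALE_API_KEY",
)

resp = client.chat.completions.create(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",       # example hosted model
    messages=[
        {"role": "system", "content": "Reply in JSON with keys 'city' and 'country'."},
        {"role": "user", "content": "Where is the Eiffel Tower?"},
    ],
    response_format={"type": "json_object"},
)

print(resp.choices[0].message.content)
```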
📰 News
📦 Repos
LLMLingua: Input compression scheme for cheaper & faster inference
By Microsoft. As the quote often attributed to Mark Twain goes: if I had more time, I would have written a shorter letter.
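A minimal sketch of the llmlingua package (method and argument names as I recall them from the project's README; check the repo for the current signature):

```python
# Sketch: compressing a long prompt with LLMLingua before sending it to an LLM.
# API per the project's README as far as I recall; verify against the repo.
from llmlingua import PromptCompressor

compressor = PromptCompressor()  # downloads a small LM used to score tokens

long_prompt = "...many thousands of tokens of context..."
result = compressor.compress_prompt(
    long_prompt,
    instruction="Answer the question based on the context.",
    question="What were the key findings?",
    target_token=300,          # rough budget for the compressed prompt
)

print(result["compressed_prompt"])
print(result["origin_tokens"], "->", result["compressed_tokens"])
```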
Axolotl: fine-tuning framework
Podcast interview with the author: https://www.latent.space/p/axolotl. The author also received an a16z grant.
PowerInfer: Llama.cpp competitor claims 11x speedup
There's also a paper: https://ipads.se.sjtu.edu.cn/_media/publications/powerinfer-20231219.pdf
📄 Papers
HyperRouter: Towards Efficient Training and Inference of Sparse Mixture of Experts via HyperNetwork
Improving language models by retrieving from trillions of tokens
I think it's interesting to revisit more exotic retrieval ideas to see how we can make models better through tighter synergy between training-time and inference-time retrieval. This is an older paper, but it's worth reviewing.
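To make the idea concrete, here is a toy sketch of chunk-level nearest-neighbor retrieval; RETRO itself retrieves neighbor chunks with frozen BERT embeddings and feeds them into chunked cross-attention, while this only shows the lookup step, with sentence-transformers as a stand-in encoder:

```python
# Toy sketch of the retrieval step in a RETRO-style setup: embed fixed-size
# chunks, then fetch nearest neighbors for a query chunk. RETRO uses frozen
# BERT embeddings and chunked cross-attention; this is only the lookup,
# with sentence-transformers as a stand-in encoder.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

chunks = [
    "Paris is the capital of France.",
    "The mitochondria is the powerhouse of the cell.",
    "Transformers use attention to mix information across tokens.",
]
chunk_emb = encoder.encode(chunks, normalize_embeddings=True)

query = "What is the capital of France?"
query_emb = encoder.encode([query], normalize_embeddings=True)

scores = chunk_emb @ query_emb[0]          # cosine similarity
top_k = np.argsort(-scores)[:2]
for i in top_k:
    print(scores[i], chunks[i])
```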
SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention
Promises Transformer-level performance at a fraction of the attention compute and memory cost. Because this is much more similar to the original Transformer architecture, adoption could be much faster than for SSM- or Hyena/Monarch-based architectures. One of the authors is Jürgen Schmidhuber, controversial yet widely regarded as a deep expert in neural networks.
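The underlying mechanism is mixture-of-experts routing applied inside the attention layer (to the value/output projections). Here is a generic top-1 MoE routing sketch to convey the idea; the paper's gating and exact placement differ from this:

```python
# Generic top-1 mixture-of-experts routing over linear projections, to give a
# feel for the mechanism; SwitchHead applies MoE routing inside attention,
# with gating details that differ from this sketch.
import torch
import torch.nn as nn

class Top1MoELinear(nn.Module):
    def __init__(self, dim_in, dim_out, num_experts=4):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(dim_in, dim_out) for _ in range(num_experts))
        self.router = nn.Linear(dim_in, num_experts)

    def forward(self, x):                                  # x: (batch, seq, dim_in)
        gates = self.router(x).softmax(dim=-1)             # (batch, seq, num_experts)
        weight, idx = gates.max(dim=-1)                    # pick one expert per token
        out = torch.zeros(*x.shape[:-1], self.experts[0].out_features, device=x.device)
        for e, expert in enumerate(self.experts):
            mask = idx == e                                # tokens routed to expert e
            if mask.any():
                out[mask] = expert(x[mask]) * weight[mask].unsqueeze(-1)
        return out

layer = Top1MoELinear(64, 64)
print(layer(torch.randn(2, 16, 64)).shape)  # torch.Size([2, 16, 64])
```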
Apple R&D team experiments with faster on-device LLM inference through efficient flash memory access
In some cases this results in a 20-25x speedup for GPU inference compared to naive loading.
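The core trick is to keep only the predicted-active slice of the FFN weights in DRAM and stream the rest from flash on demand. A crude illustration of the "load only the rows you need" part using a memory-mapped array (the paper adds sparsity prediction, windowing, and row-column bundling on top, none of which is shown here; the file name and shapes are made up):

```python
# Crude illustration of loading only the needed weight rows from storage, the
# basic idea behind the paper's flash offloading. The paper adds sparsity
# prediction, windowing, and row-column bundling; none of that is shown here.
import numpy as np

# Pretend this FFN weight matrix lives on flash as a memory-mapped file.
w = np.memmap("ffn_weights.bin", dtype=np.float16, mode="w+", shape=(4096, 1024))
w[:] = np.random.randn(4096, 1024).astype(np.float16)
w.flush()

# At inference time, a predictor says only these neurons will be active...
active_rows = np.array([3, 17, 512, 4095])

# ...so we pull only those rows into RAM instead of the full matrix.
w_read = np.memmap("ffn_weights.bin", dtype=np.float16, mode="r", shape=(4096, 1024))
active_weights = np.asarray(w_read[active_rows])   # copies just 4 rows
print(active_weights.shape)                         # (4, 1024)
```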
Transformer Memory as a Differentiable Search Index
Mentioned by the Cursor founder as a way to perform retrieval on codebases.
📱 Demos
🛠️ Products
Want more? Follow me on Twitter! @ricklamers