Did State Space Models find their niche? 135ms latency for text-to-speech
Week 23 of Coding with Intelligence
📰 News
Mistral announces MNPL license; no commercial use
It's inevitable that companies training large-scale models need to recoup their costs somehow; this looks like a healthy compromise between openness and the necessity to monetize their best creations.
Key Hyperscalers And Chip Makers Gang Up On Nvidia's NVSwitch Interconnect
The Ultra Accelerator Link consortium is forming to challenge Nvidia's dominance in GPU interconnects, just as the Ultra Ethernet Consortium did with InfiniBand.
Claims to be competitive with Llama 3 8B and GPT-4 Vision.
Yuan 2.0-M32: Mixture of Experts with Attention Router
"Yuan 2.0-M32 is trained with 2000B tokens from scratch, and the training computation consumption is only 9.25% of a dense model at the same parameter scale. Yuan 2.0-M32 demonstrates competitive capability on coding, math, and various domains of expertise, with only 3.7B active parameters of 40B in total, and 7.4 GFlops forward computation per token, both of which are only 1/19 of Llama3-70B. Yuan 2.0-M32 surpass Llama3-70B on MATH and ARC-Challenge benchmark, with accuracy of 55.89 and 95.8 respectively."
📦 Repos
📄 Papers
Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality (Mamba-2)
The main point of the Mamba-2 paper is what the authors call structured state space duality (SSD): a new kind of layer akin to an Attention layer (but read the blog post for more detail).
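A minimal sketch of the duality idea (my own illustration, not the paper's exact formulation): the same linear state-space recurrence can be evaluated sequentially like an RNN or as a lower-triangular, attention-like matrix multiply, and the two agree.

```python
# Illustrative sketch: one scalar input channel, per-step scalar decay (simplifying assumptions).
import torch

T, N = 6, 4                      # sequence length, state size
a = torch.rand(T) * 0.9          # per-step scalar decay A_t
B = torch.randn(T, N)            # input projections B_t
C = torch.randn(T, N)            # output projections C_t
x = torch.randn(T)               # scalar input channel

# 1) Recurrent (linear-time) form: h_t = a_t * h_{t-1} + B_t * x_t,  y_t = <C_t, h_t>
h = torch.zeros(N)
y_rec = []
for t in range(T):
    h = a[t] * h + B[t] * x[t]
    y_rec.append(C[t] @ h)
y_rec = torch.stack(y_rec)

# 2) "Attention-like" (quadratic) form: y = M x with
#    M[t, s] = <C_t, B_s> * prod_{r=s+1..t} a_r for s <= t, else 0.
cum = torch.cumsum(torch.log(a), dim=0)
decay = torch.exp(cum[:, None] - cum[None, :])      # prod_{r=s+1..t} a_r for s <= t
M = torch.tril((C @ B.t()) * decay)                 # causal mask
y_mat = M @ x

print(torch.allclose(y_rec, y_mat, atol=1e-5))      # True: both forms give the same output
```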
Diffusion On Syntax Trees For Program Synthesis
TLDR: They teach neural models to *edit* programs, informed by the execution output.
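A hedged sketch of the generic "edit, execute, compare" loop that TLDR describes; the learned model is replaced with a hypothetical random-mutation stub, and the paper's actual diffusion-on-syntax-trees machinery is not reproduced here.

```python
# Execution-guided program editing, reduced to its simplest loop (illustrative only).
import random

def run(program: str):
    """Execute a candidate program and return its `result` variable (None on failure)."""
    try:
        scope = {}
        exec(program, scope)
        return scope.get("result")
    except Exception:
        return None

def propose_edit(program: str) -> str:
    # Hypothetical stand-in for a learned edit model: mutate one digit at random.
    digits = [i for i, ch in enumerate(program) if ch.isdigit()]
    if not digits:
        return program
    i = random.choice(digits)
    return program[:i] + str(random.randint(0, 9)) + program[i + 1:]

target = 12
candidate = "result = 3 * 7"                   # wrong program: evaluates to 21

def distance(program: str) -> float:
    out = run(program)
    return float("inf") if out is None else abs(out - target)

best = distance(candidate)
for _ in range(500):
    if best == 0:
        break
    edited = propose_edit(candidate)
    d = distance(edited)
    if d <= best:                              # execution output guides which edits to keep
        candidate, best = edited, d

print(candidate, "->", run(candidate))         # e.g. "result = 3 * 4 -> 12"
```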
Enhancing Large Vision Language Models with Self-Training on Image Comprehension
Conifer: Improving Complex Constrained Instruction-Following Ability of Large Language Models
🛠️ Products
Cartesia AI Sonic: 135ms latency TTS API
Voices are pretty impressive, approaching ElevenLabs quality. It uses a State Space Model (like Mamba).
Perplexity Pages: a new kind of blogging?
Will be interesting to see the hallucination rates and the general search & indexability of these pages.
📚 Resources
Kullback-Leibler is All You Need
TLDR: Kullback-Leibler (KL) divergence minimization is the core objective underlying many modern machine learning methods, providing a universal recipe to rederive the objectives of VAEs, Bayesian inference, and more. Associated talk: https://slideslive.com/39014672/information-theory-for-representation-learning
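One concrete instance of that recipe (the standard VAE derivation, stated here for illustration rather than taken from the talk): minimizing the KL from the approximate posterior to the true posterior is equivalent to maximizing the ELBO.

```latex
% Minimizing KL(q || posterior)  <=>  maximizing the ELBO
\begin{aligned}
\mathrm{KL}\!\left(q_\phi(z\mid x)\,\Vert\,p_\theta(z\mid x)\right)
  &= \mathbb{E}_{q_\phi(z\mid x)}\!\left[\log q_\phi(z\mid x) - \log p_\theta(z\mid x)\right] \\
  &= \log p_\theta(x)
     - \underbrace{\left(\mathbb{E}_{q_\phi(z\mid x)}\!\left[\log p_\theta(x\mid z)\right]
       - \mathrm{KL}\!\left(q_\phi(z\mid x)\,\Vert\,p(z)\right)\right)}_{\text{ELBO}}
\end{aligned}
```

Since log p_theta(x) does not depend on q, minimizing the left-hand side is the same as maximizing the ELBO, which is exactly the VAE training objective.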
Speed-of-light based theoretical limits for LLMs
Interesting napkin math to "help validate the quality of implementations and predict the impact of architectural changes".
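In the same napkin-math spirit (with assumed numbers, not figures from the article): single-stream decoding of a dense model is roughly memory-bandwidth bound, because the weights must be streamed from memory once per generated token, which yields a simple upper bound on tokens per second.

```python
# Illustrative speed-of-light estimate; the model size and bandwidth below are assumptions.
params = 8e9                 # e.g. an 8B-parameter dense model
bytes_per_param = 2          # fp16/bf16 weights
bandwidth = 3.35e12          # assumed GPU HBM bandwidth in bytes/s

bytes_per_token = params * bytes_per_param          # weights read once per decoded token
t_token = bytes_per_token / bandwidth               # lower bound on time per token
print(f"~{t_token*1e3:.1f} ms/token, ~{1/t_token:.0f} tokens/s upper bound")
# ~4.8 ms/token, ~209 tokens/s upper bound (ignoring KV cache and activation traffic)
```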
What We Learned from a Year of Building with LLMs (Part II)
Warning: this one is pretty text-heavy.
Want more? Follow me on X! @ricklamers