📰 News
Databricks releases DBRX, a 132B MoE that outperforms Mixtral
36B active parameters in the forward pass, 32K context, trained on 12T tokens. The license is unfortunately restrictive; check it out here, along with the Hugging Face Collection containing the model weights.
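If you're wondering how only 36B of the 132B parameters can be "active": MoE layers route each token to a small top-k subset of expert FFNs, so only those experts' weights participate in that token's forward pass. A toy top-k routing sketch, purely illustrative and not DBRX's actual architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    """Toy mixture-of-experts FFN: each token only runs through its top-k experts."""
    def __init__(self, d_model=64, d_ff=128, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):  # x: (tokens, d_model)
        scores = self.gate(x)                              # (tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)     # pick k experts per token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                   # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

x = torch.randn(4, 64)
print(TinyMoE()(x).shape)  # torch.Size([4, 64]); only 2 of 8 experts ran per token
```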
Gemini 1.5 Pro with 1M context access expanding
Try multi-modal inputs like video! It's pretty incredible.
UXL Foundation: a consortium to beat NVIDIA CUDA stronghold
Looks like Intel is opening up a set of 7 oneAPI projects that combine to make it easier to write CUDA-like code for non-NVIDIA accelerator hardware. AMD is a surprising absentee from the list. Remarkably clear overview from this technical reporter at Forbes.
tl;dr Building your own GPTs but need more features & control? Readers of CoWI get early access by signing up here.
📦 Repos
Faster implementation of the Marigold depth estimation model using LCM.
Levanter: LLM distributed training library by Stanford research lab
A unique feature seems to be its ability to swap to a different accelerator (TPU -> GPU) mid-SGD. Oh, and named tensors through the Haliax lib.
TorchTune: A Native-PyTorch Library for LLM Fine-tuning
By the PyTorch team at Meta.
Leaping: debug Python instantly with an LLM debugger
Very cool project!
Lightning Thunder: source-to-source compiler for PyTorch
From the repo: "We achieve a 40% speedup in training throughput compared to eager code on H100 using a combination of executors including nvFuser, torch.compile, cuDNN, and TransformerEngine FP8."
As of v0.5, both vLLM and NVIDIA's TensorRT-LLM are supported as backends.
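For a feel of how Thunder is used: the repo's README (at the time of writing) centers on a `thunder.jit` entry point that traces a function or module and hands the trace to its executors. A minimal smoke test, assuming that API holds (the big speedups are about the GPU executors, but CPU is enough to see it run):

```python
import torch
import thunder  # pip install lightning-thunder (package name at time of writing)

def mlp(x, w1, w2):
    return torch.relu(x @ w1) @ w2

# Trace and compile the function; Thunder dispatches pieces of the trace to
# executors such as nvFuser, torch.compile, cuDNN, and TransformerEngine.
jmlp = thunder.jit(mlp)

x, w1, w2 = torch.randn(8, 16), torch.randn(16, 32), torch.randn(32, 4)
print(torch.allclose(jmlp(x, w1, w2), mlp(x, w1, w2), atol=1e-5))  # same numbers, faster path
```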
📄 Papers
LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models
DenseFormer: Enhancing Information Flow in Transformers via Depth Weighted Averaging
Interesting architectural modification. Uncertain how it scales to larger architectures.
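My reading of the depth-weighted-averaging idea: after each block, the model feeds the next block a learned weighted average over all earlier representations (including the embeddings) instead of only the latest output. A rough sketch of that wiring, not the authors' code:

```python
import torch
import torch.nn as nn

class DWATransformer(nn.Module):
    """Sketch of DenseFormer-style depth-weighted averaging between blocks."""
    def __init__(self, d_model=64, n_layers=4, n_heads=4):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            for _ in range(n_layers)
        )
        # One weight per (current depth, earlier representation), incl. the embeddings.
        self.dwa = nn.ParameterList(
            nn.Parameter(torch.zeros(i + 2)) for i in range(n_layers)
        )
        for w in self.dwa:
            w.data[-1] = 1.0  # initialize as identity: only the freshest output is used

    def forward(self, x):          # x: (batch, seq, d_model) token embeddings
        history = [x]
        for block, w in zip(self.blocks, self.dwa):
            history.append(block(history[-1]))
            stacked = torch.stack(history, dim=0)           # (depth+1, batch, seq, d)
            mixed = (w.view(-1, 1, 1, 1) * stacked).sum(0)  # depth-weighted average
            history[-1] = mixed                             # next block sees the mix
        return history[-1]

out = DWATransformer()(torch.randn(2, 10, 64))
print(out.shape)  # torch.Size([2, 10, 64])
```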
GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection
GaLore effectively enables performance comparable to full-parameter training at up to 65% reduced optimizer-state memory usage. Hugging Face Transformers also added support for it 🤗
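The core trick, as I understand it: project each weight's gradient into a low-rank subspace (periodically refreshed from the gradient's top singular vectors), run the Adam-style update there so the optimizer state lives in the small space, then project back. A toy single-step illustration of that idea, not the actual galore-torch implementation:

```python
import torch

def galore_style_step(weight, grad, proj, exp_avg, exp_avg_sq, lr=1e-3,
                      beta1=0.9, beta2=0.999, eps=1e-8):
    """One toy update: Adam statistics live only in the low-rank space."""
    low_rank_grad = proj.T @ grad                 # (r, n) <- project the (m, n) gradient
    exp_avg.mul_(beta1).add_(low_rank_grad, alpha=1 - beta1)
    exp_avg_sq.mul_(beta2).addcmul_(low_rank_grad, low_rank_grad, value=1 - beta2)
    update = exp_avg / (exp_avg_sq.sqrt() + eps)  # Adam-style step, in the small space
    weight -= lr * (proj @ update)                # project back to full size

m, n, r = 512, 512, 8                             # full weight vs. rank-8 subspace
weight, grad = torch.randn(m, n), torch.randn(m, n)
# Refresh the projector occasionally from the gradient's top singular vectors.
U, _, _ = torch.linalg.svd(grad, full_matrices=False)
proj = U[:, :r]                                   # (m, r)
exp_avg, exp_avg_sq = torch.zeros(r, n), torch.zeros(r, n)
galore_style_step(weight, grad, proj, exp_avg, exp_avg_sq)
print(exp_avg.shape)  # optimizer state is (8, 512) instead of (512, 512)
```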
FollowIR: Evaluating and Teaching Information Retrieval Models to Follow Instructions
Instruction-following retrieval models are a huge boon for crafting better context windows. Check out FollowIR.
Deep Neural Networks Tend To Extrapolate Predictably
Interesting investigation of out-of-sample generalization, including for transformer architectures.
Moirai: A Time Series Foundation Model for Universal Forecasting
Code is also available on GitHub.
🛠️ Products
Passage: A Wafer-Scale, Programmable Photonic Interconnect by Lightmatter
This Boston/Mountain View-headquartered startup is shipping photonics-based interconnects for high-speed chip-to-chip connectivity. Their long-term vision seems to be moving more ML/AI workloads into the photonic domain with their Envise chip.
📚 Resources
Interesting idea from the Cursor founder: Very Large Prefix Caching (VLPC) as an alternative to fine-tuning
Taking advantage of long context windows and in-context learning ability to cache 2M prefix tokens. Haha, I came up with VLPC; it's not an accepted term (yet).
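Mechanically this is KV-cache reuse: run the long shared prefix through the model once, keep the resulting key/value cache, and only pay for the short per-request suffix afterwards. A minimal sketch with Hugging Face transformers, with GPT-2 standing in for a long-context model (in a real system the cache would live server-side and the prefix would be millions of tokens):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

# Stand-in for the huge prefix you would cache in practice.
prefix_ids = tok("You are a helpful assistant. " * 50, return_tensors="pt").input_ids

with torch.no_grad():
    prefix_cache = model(prefix_ids, use_cache=True).past_key_values  # pay for this once

# Per-request suffix: only these tokens need a fresh forward pass.
suffix_ids = tok(" The capital of France is", return_tensors="pt").input_ids
with torch.no_grad():
    out = model(suffix_ids, past_key_values=prefix_cache, use_cache=True)

print(tok.decode(out.logits[0, -1].argmax()))  # greedy next token given prefix + suffix
```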
Principled model merging to build a Japanese LLM
By Sakana AI, a lab founded by authors of the original Attention Is All You Need paper.
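For context on what "model merging" means at the lowest level: combining checkpoints directly in weight space. The crudest version is plain interpolation of state dicts; Sakana's contribution is evolutionary search over much richer merge recipes (per-layer coefficients, data-flow merges), so treat this only as the baseline idea, not their method:

```python
import torch

def lerp_state_dicts(sd_a, sd_b, alpha=0.5):
    """Naive weight-space merge: element-wise interpolation of two checkpoints
    with identical architectures. Evolutionary merging searches over per-layer
    coefficients and more instead of one global alpha."""
    assert sd_a.keys() == sd_b.keys()
    return {k: (1 - alpha) * sd_a[k] + alpha * sd_b[k] for k in sd_a}

# Toy demo with two randomly initialized copies of the same tiny model.
a, b = torch.nn.Linear(4, 4), torch.nn.Linear(4, 4)
merged = torch.nn.Linear(4, 4)
merged.load_state_dict(lerp_state_dicts(a.state_dict(), b.state_dict(), alpha=0.3))
print(merged.weight[0])
```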
Awesome talk by an ex-Meta AI researcher, who just quit yesterday! The DSPy framework's author believes Meijer is building something similar to DSPy.
Is ChatGPT getting unbundled by vertically focused apps?
Interesting exploration by Swyx of the Latent Space pod/newsletter. Check out the infographic to see the main vertical domains of unbundling.
Fireside Chat w/ Mistral CEO, Arthur Mensch at Figma
Takeaways in a single Tweet.
Making ChatGPT Code Interpreter implement vector similarity as a native SQLite extension
By your favorite LLM sommelier's favorite LLM sommelier, Simon Willison. Oh yeah, he did it in 1 hour, on his phone, while making coffee.
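Simon's version is a native C extension written by Code Interpreter; if you just want to play with the idea from Python, sqlite3 lets you register a scalar function and call it from SQL. A rough, unoptimized sketch (not his extension):

```python
import json
import math
import sqlite3

def cosine_similarity(a_json: str, b_json: str) -> float:
    """Cosine similarity between two vectors stored as JSON arrays."""
    a, b = json.loads(a_json), json.loads(b_json)
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

db = sqlite3.connect(":memory:")
db.create_function("cosine_similarity", 2, cosine_similarity)
db.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, embedding TEXT)")
db.executemany("INSERT INTO docs (embedding) VALUES (?)",
               [(json.dumps(v),) for v in ([1.0, 0.0], [0.7, 0.7], [0.0, 1.0])])

query = json.dumps([1.0, 0.1])
for row in db.execute(
    "SELECT id, cosine_similarity(embedding, ?) AS sim FROM docs ORDER BY sim DESC",
    (query,),
):
    print(row)  # doc id and its similarity to the query vector
```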
Hone your intuition for how beam search for LLM decoding works! Post on X.
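If you'd rather poke at it in code, here's a tiny, self-contained beam search over a toy next-token table, keeping the top-k partial sequences by cumulative log-probability. Purely illustrative and not tied to any particular LLM API:

```python
import math

# Toy "language model": fixed next-token distributions, just for illustration.
def next_token_probs(sequence):
    table = {
        "":  {"a": 0.6, "b": 0.4},
        "a": {"a": 0.1, "b": 0.6, "<eos>": 0.3},
        "b": {"a": 0.5, "b": 0.2, "<eos>": 0.3},
    }
    return table.get(sequence[-1:] if sequence else "", {"<eos>": 1.0})

def beam_search(beam_width=2, max_len=4):
    beams = [("", 0.0)]                              # (sequence, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq.endswith("<eos>"):
                candidates.append((seq, score))      # finished beams carry over
                continue
            for tok, p in next_token_probs(seq).items():
                candidates.append((seq + tok, score + math.log(p)))
        # Keep only the top-k partial sequences; this pruning is the whole trick.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams

for seq, score in beam_search():
    print(f"{seq!r}  log-prob={score:.3f}")
```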
Reference implementation for Fine-Tuning Mistral 7B Base v0.2
By Mistral themselves, released as part of their hackathon.
Using LLMs to fuzz a C GIF library
Spoiler: it found 4 memory safety bugs and one hang!
Want more? Follow me on X! @ricklamers
Just a clarification on the license: it seems pretty open to me except for large tech companies. The model can be used commercially, just not to improve other models (that's the main caveat I see).