📰 News
Canadian government committing $1.75B USD to GenAI initiatives
I think we can officially say AI is a national concern now.
Llama 3 preview release slated for next week
LeCun still saying JEPA is better than generative AI. Time will tell, but it's exciting to see people trying other ideas.
CodeGemma and RecurrentGemma released by Google
CodeGemma is still somewhat below DeepSeek-Coder; RecurrentGemma (based on the Griffin architecture, see paper) is much faster as input token length increases (e.g. 5x Gemma's tokens/s on an 8K input sequence).
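If you want to poke at RecurrentGemma yourself, here's a minimal sketch of loading it through Hugging Face transformers. The checkpoint id and the availability of the integration in your transformers version are assumptions on my part.

```python
# Sketch: loading RecurrentGemma via Hugging Face transformers.
# Assumes the "google/recurrentgemma-2b" checkpoint id and a transformers
# version that includes the RecurrentGemma integration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/recurrentgemma-2b"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

inputs = tokenizer("The Griffin architecture mixes", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```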
Groq lands support for function calling/tool use
I've led the implementation - let me know what you think! We're working on some exciting new features related to this that you haven't yet seen at any other LLM API provider. Stay tuned.
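For the curious, here's a rough sketch of what tool use looks like against Groq's OpenAI-style chat completions API. The groq Python SDK mirrors the OpenAI client shape; the model id and the get_weather function are placeholders I made up.

```python
# Sketch of tool use against Groq's OpenAI-style chat completions API.
# The model id and get_weather tool are illustrative assumptions.
import json
from groq import Groq

client = Groq()  # reads GROQ_API_KEY from the environment

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical local function
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="mixtral-8x7b-32768",  # assumed tool-use-capable model on Groq
    messages=[{"role": "user", "content": "What's the weather in Toronto?"}],
    tools=tools,
    tool_choice="auto",
)

# If the model decided to call the tool, inspect the structured call.
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, json.loads(call.function.arguments))
```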
Seems like it's not instruction tuned, so no chat yet.
$10K A::B LLM challenge solved
A::B is an example of a formal system of string rewriting. See problem definition here. It's cool to see that problem-solving ability on problems like these can be unlocked by prompting.
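To make the task concrete, here's a small reference solver for an A::B-style rewriting system. The specific rules below are my paraphrase of the challenge (two token pairs annihilate, two swap places), so treat them as an assumption and check the original problem definition.

```python
# Reference solver for an A::B-style rewriting system (rules are assumed).
RULES = {
    ("A#", "#A"): [],            # annihilate
    ("B#", "#B"): [],            # annihilate
    ("A#", "#B"): ["#B", "A#"],  # swap
    ("B#", "#A"): ["#A", "B#"],  # swap
}

def normalize(tokens: list[str]) -> list[str]:
    """Apply rewrite rules until no adjacent pair matches (the normal form)."""
    changed = True
    while changed:
        changed = False
        out, i = [], 0
        while i < len(tokens):
            pair = tuple(tokens[i:i + 2])
            if pair in RULES:
                out.extend(RULES[pair])
                i += 2
                changed = True
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return tokens

print(normalize(["B#", "A#", "#B", "#A", "B#"]))  # ['B#']
```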
DeepMind files patent for sequence to sequence models (LLMs) paired with Monte Carlo tree search
Discussion on Reddit: is this Q*?
Hugging Face TGI inference server reverts to Apache 2.0
Permissive licensing is back for TGI. Makes the choice between vLLM and TGI less obvious.
Intel Gaudi 3 looks competitive to NVIDIA H100s for LLM training
"Apply edits" and "reasoning about the entire codebase" seem like killer features. Curious to see Cursor's response. Extension looks very barebones to be honest.
104B, 128k context window, good citation support, tool use, multi-lingual, about half GPT-4 Turbo pricing, try on LMSYS Chat.
Gemini 1.5 Pro now widely available
1M context-window, text, image, audio & video support. Also, new embedding model with good MTEB performance.
Combines GPT-4 Vision with Function Calling, not all reviews are positive.
tl;dr Building your own GPTs but need more features & control? Readers of CoWI get early access by signing up here.
📦 Repos
llm.c: a minimal C/CUDA based training library by Andrej Karpathy
Educational value = high. Up next: direct CUDA, CPU version, Llama2/Gemma arch.
JetMoE: 2B active params MoE, outperforms 7B Llama-2 (<$100k)
Trained on public data only. MoE architecture variant.
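For context on what "2B active params" means mechanically, here's an illustrative top-k expert routing layer in PyTorch: each token only runs through k of the N expert MLPs, so per-token compute touches a fraction of the total parameters. This is a generic sketch, not JetMoE's actual architecture.

```python
# Generic top-k mixture-of-experts routing sketch (not the JetMoE code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])
        self.k = k

    def forward(self, x):                        # x: (tokens, d_model)
        logits = self.router(x)                  # (tokens, n_experts)
        weights, idx = logits.topk(self.k, dim=-1)
        weights = F.softmax(weights, dim=-1)     # normalize over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):               # send each token to its k experts
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

moe = TopKMoE()
print(moe(torch.randn(4, 512)).shape)  # torch.Size([4, 512])
```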
📄 Papers
Or as Gary Marcus put it: "Breaking news: Scaling will never get us to AGI". I don't think the paper's result ("multimodal models require exponentially more data to achieve linear improvements in downstream 'zero-shot' performance") comes as a surprise. Most folks know LLMs approximately require a doubling of data for a linear increase in accuracy metrics; a quick numerical illustration of that log-linear relationship is below.
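Back-of-the-envelope sketch of the claim; the coefficients are made up, only the shape of the relationship matters.

```python
# If accuracy grows roughly linearly in log2(data), every extra point costs
# another doubling of data. Coefficients below are purely illustrative.
import math

def accuracy(tokens, base=50.0, points_per_doubling=2.0, ref=1e9):
    return base + points_per_doubling * math.log2(tokens / ref)

for tokens in [1e9, 2e9, 4e9, 8e9, 16e9]:
    print(f"{tokens:>8.0e} tokens -> {accuracy(tokens):.1f}% (hypothetical)")
```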
Striped Attention: Faster Ring Attention for Causal Transformers
Training LLMs over Neurally Compressed Text
Interesting idea to reduce sequence lengths.
Advancing LLM Reasoning Generalists with Preference Trees
Paper. Interesting application of the KTO fine-tuning algorithm.
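If you want to try KTO yourself, Hugging Face TRL ships a KTOTrainer. The sketch below is hedged: the exact argument names and the ("prompt", "completion", boolean "label") dataset columns reflect my understanding of TRL around v0.8, so double-check the TRL docs before copying.

```python
# Rough sketch of KTO fine-tuning with TRL's KTOTrainer (argument names and
# dataset columns are assumptions based on TRL ~0.8; consult the docs).
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import KTOConfig, KTOTrainer

model_name = "gpt2"  # stand-in model for the sketch
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# KTO only needs per-example thumbs up/down, not paired preferences.
train_dataset = Dataset.from_dict({
    "prompt": ["What is 2 + 2?", "What is 2 + 2?"],
    "completion": ["4", "22"],
    "label": [True, False],
})

trainer = KTOTrainer(
    model=model,
    args=KTOConfig(output_dir="kto-sketch", per_device_train_batch_size=2),
    train_dataset=train_dataset,
    tokenizer=tokenizer,
)
trainer.train()
```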
Physics of Language Models: Part 3.3, Knowledge Capacity Scaling Laws
Interesting: "Prepending training data with domain names (e.g., wikipedia.org) significantly increases a model's knowledge capacity."
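Concretely, the trick is just tagging each training document with its source before tokenization; a tiny illustration (field names made up):

```python
# Prepend a source/domain tag to each training document so the model can
# learn which sources are knowledge-dense. Field names are illustrative.
docs = [
    {"domain": "wikipedia.org", "text": "The Eiffel Tower is in Paris."},
    {"domain": "randomblog.example", "text": "My cat did something funny today."},
]

train_texts = [f"{d['domain']}\n{d['text']}" for d in docs]
for t in train_texts:
    print(t, end="\n\n")
```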
Visual AutoRegressive Modeling (VAR): Scalable Image Generation via Next-Scale Prediction
Higher quality image generation with 20x faster inference. Feels nearly too good to be true, but the demo delivers. Very impressive work by Bytedance.
Long-context LLMs Struggle with Long In-context Learning
Important shift in emphasis: not finding the needle in the haystack (context window) but actually utilizing it (in-context learning).
Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs
Apple getting more into AI: "Ferret-UI excels not only beyond most open-source UI MLLMs, but also surpasses GPT-4V on all the elementary UI tasks." Interesting implications for agent products that deal with UIs.
Linear Attention Sequence Parallelism
The authors remark on capabilities beyond the sequence parallelism found in the Megatron-LM and DeepSpeed libraries: mainly that it is independent of attention head partitioning, enabling it to support varying numbers and styles of attention heads, such as multi-head, multi-query, and grouped-query attention.
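As a refresher on what makes linear attention amenable to sequence parallelism in the first place, here's a minimal (deliberately sequential) causal linear attention in PyTorch: with a feature map phi, attention reduces to a running state of phi(k) v outer products instead of an N x N score matrix. Illustrative only, not the paper's implementation.

```python
# Minimal causal linear attention with a running KV state (illustrative).
import torch

def causal_linear_attention(q, k, v, eps=1e-6):
    # q, k, v: (seq_len, d). Feature map elu(x) + 1 keeps values positive.
    phi_q = torch.nn.functional.elu(q) + 1
    phi_k = torch.nn.functional.elu(k) + 1
    d = q.shape[-1]
    kv_state = torch.zeros(d, v.shape[-1])   # running sum of phi(k_i) v_i^T
    k_state = torch.zeros(d)                 # running sum of phi(k_i)
    out = torch.zeros_like(v)
    for t in range(q.shape[0]):
        kv_state += torch.outer(phi_k[t], v[t])
        k_state += phi_k[t]
        out[t] = (phi_q[t] @ kv_state) / (phi_q[t] @ k_state + eps)
    return out

q, k, v = (torch.randn(16, 8) for _ in range(3))
print(causal_linear_attention(q, k, v).shape)  # torch.Size([16, 8])
```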
ReALM: Reference Resolution As Language Modeling
Apple Research does original LLM research on reference resolution and in doing so outperforms GPT-4 on this task.
Mixture-of-Depths: Dynamically allocating compute in transformer-based language models
By Google DeepMind
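As I understand the paper, the core idea is a per-block router that gives only the top-k tokens the full attention+MLP treatment while the rest ride the residual stream unchanged. A simplified sketch (the capacity fraction, stand-in block, and sigmoid gating are my simplifications, not their code):

```python
# Simplified Mixture-of-Depths-style routing: only top-k tokens get compute.
import torch
import torch.nn as nn

class MoDBlock(nn.Module):
    def __init__(self, d_model=256, capacity=0.25):
        super().__init__()
        self.router = nn.Linear(d_model, 1)
        self.block = nn.Sequential(  # stand-in for attention + MLP
            nn.LayerNorm(d_model), nn.Linear(d_model, d_model), nn.GELU(),
        )
        self.capacity = capacity

    def forward(self, x):                          # x: (seq_len, d_model)
        scores = self.router(x).squeeze(-1)        # (seq_len,) router scores
        k = max(1, int(self.capacity * x.shape[0]))
        top = scores.topk(k).indices               # tokens that get compute
        out = x.clone()                            # everyone keeps the residual
        # Scale by the router score so routing stays differentiable.
        out[top] = x[top] + torch.sigmoid(scores[top]).unsqueeze(-1) * self.block(x[top])
        return out

block = MoDBlock()
print(block(torch.randn(32, 256)).shape)  # torch.Size([32, 256])
```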
🛠️ Products
Krea: generative AI image studio
Doodle to image on steroids.
📚 Resources
Attention in Transformers explained by 3Blue1Brown
Great video to build more intuition, even if you've seen the math already.
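The core computation from the video fits in a few lines of NumPy, if you want to play with it directly:

```python
# Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # (n_q, n_k) similarity scores
    return softmax(scores) @ V                # weighted average of values

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 4)), rng.normal(size=(5, 4)), rng.normal(size=(5, 4))
print(attention(Q, K, V).shape)  # (3, 4)
```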
Ring Attention by Andreas Köpf
Hypothesized to be how Gemini gets to 1M/10M tokens in its context window.
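The piece that makes it work is blockwise attention with an online softmax: each query block folds in one key/value block at a time while keeping a running max and denominator, so no device ever materializes the full attention matrix. A single-process sketch of that accumulation (in Ring Attention proper the KV blocks hop between devices arranged in a ring; here they just come from a list):

```python
# Blockwise attention with online-softmax accumulation (single process).
import numpy as np

def blockwise_attention(q, kv_blocks):
    # q: (nq, d); kv_blocks: list of (k_block, v_block) pairs.
    d = q.shape[-1]
    m = np.full(q.shape[0], -np.inf)          # running max of scores
    denom = np.zeros(q.shape[0])              # running softmax denominator
    out = np.zeros_like(q)                    # running unnormalized output
    for k, v in kv_blocks:
        s = q @ k.T / np.sqrt(d)              # scores against this block
        m_new = np.maximum(m, s.max(axis=-1))
        scale = np.exp(m - m_new)             # rescale previous accumulators
        p = np.exp(s - m_new[:, None])
        denom = denom * scale + p.sum(axis=-1)
        out = out * scale[:, None] + p @ v
        m = m_new
    return out / denom[:, None]

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))
kv = [(rng.normal(size=(16, 8)), rng.normal(size=(16, 8))) for _ in range(3)]
print(blockwise_attention(q, kv).shape)  # (4, 8)
```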
Mamba architecture explained in a blog post
Or a YouTube video if that's more your style (different source, same topic).
Unsloth: automatic RoPE scaling for fine-tuning Mistral v2 7B (16-bit LoRA or 4-bit QLoRA)
Uses cleaned Alpaca data.
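A hedged sketch of what a QLoRA fine-tune with Unsloth looks like; the checkpoint id and some argument values are assumptions on my part, so check the Unsloth README before copying.

```python
# Sketch of a QLoRA setup with Unsloth (checkpoint id is an assumption).
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/mistral-7b-v0.2-bnb-4bit",  # assumed Mistral v2 7B checkpoint
    max_seq_length=8192,      # RoPE scaling handled automatically per the release notes
    load_in_4bit=True,        # QLoRA path
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
# From here, train with your usual SFT loop (e.g. TRL's SFTTrainer) on the
# cleaned Alpaca data mentioned above.
```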
Want more? Follow me on X! @ricklamers