DeepSeek-Coder-V2 dropped: GPT-4/Opus level open source coding model
Week 25 of Coding with Intelligence
📰 News
90.2 on HumanEval and high scores across various other code-oriented benchmarks. Stellar release!
Stable Diffusion 3 Medium weights on HF
Non-commercial use only, unfortunately.
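For those who want to try it, a minimal sketch of loading the weights through diffusers, assuming the stabilityai/stable-diffusion-3-medium-diffusers checkpoint and that you've accepted the license and logged in on HF:

```python
# Minimal sketch: loading SD3 Medium from Hugging Face with diffusers.
# Assumes the "stabilityai/stable-diffusion-3-medium-diffusers" repo id and that
# you have accepted the model license (e.g. via `huggingface-cli login`).
import torch
from diffusers import StableDiffusion3Pipeline

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

image = pipe(
    "a photo of a corgi surfing a wave, golden hour",
    num_inference_steps=28,
    guidance_scale=7.0,
).images[0]
image.save("sd3_medium_sample.png")
```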
NVIDIA releases Nemotron-4-340B
A Llama 3 70B-class dense model
DeepMind teases Video-to-Audio model
The drummer example is impressively in sync!
Meta releases a number of AI projects including Chameleon
Chameleon 7B and 34B multimodal models, multi-token prediction model & more.
Lamini introduces "Memory Tuning"
A fine-tuning approach for adding new facts to LLMs. It's a proprietary innovation so replication will be 'left as an exercise to the reader'.
It's a new way to guide an LLM to produce desired outputs by influencing "features" of the model. Remember Golden Gate Claude?
Luma Labs releases Dream Machine: Sora level text-to-video
Available for actual use, not just some demo videos!
vLLM now supports FP8 quantization, enabling faster and more memory-efficient LLM inference
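A minimal sketch of what enabling it looks like, assuming a recent vLLM build where the fp8 quantization option is available and FP8-capable hardware (e.g. H100); the model id is just an example:

```python
# Minimal sketch: serving a model with FP8 quantization in vLLM.
# Assumes a recent vLLM release with FP8 support and an FP8-capable GPU.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct", quantization="fp8")
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Explain FP8 quantization in one paragraph."], params)
print(outputs[0].outputs[0].text)
```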
Microsoft releases Florence 2 Small VLM
Very smol model, impressive performance (as good as models 10-50x its size).
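Rough usage sketch following the pattern from the model card, assuming the microsoft/Florence-2-base checkpoint and its trust_remote_code implementation:

```python
# Minimal sketch: image captioning with Florence-2 via transformers.
# Assumes the "microsoft/Florence-2-base" checkpoint; "<CAPTION>" follows the
# model card's task-prompt convention.
import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Florence-2-base"
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(text="<CAPTION>", images=image, return_tensors="pt")
generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=64,
)
print(processor.batch_decode(generated_ids, skip_special_tokens=False)[0])
```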
DeepSeek-Coder-V2 beats GPT-4o and Opus on the Aider code editing benchmark
Gemini 1.5 Pro and 1.5 Flash API updates
Fine-tuning support, JSON Schema mode, rate limit increases.
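A minimal sketch of the JSON mode through the google-generativeai Python SDK; assumes a GOOGLE_API_KEY in the environment and an SDK version exposing response_mime_type (1.5 Pro additionally accepts a full response_schema):

```python
# Minimal sketch: requesting JSON output from the Gemini API.
# Assumes the google-generativeai SDK and a GOOGLE_API_KEY env variable.
import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

model = genai.GenerativeModel("gemini-1.5-flash")
response = model.generate_content(
    "List three popular open source coding models with their parameter counts.",
    generation_config=genai.GenerationConfig(
        # Force JSON output; on 1.5 Pro a response_schema can constrain the shape further.
        response_mime_type="application/json",
    ),
)
print(response.text)  # JSON string
```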
Tool-Augmented VIsion (TAVI) Workshop at CVPR 2024
Interesting new paradigm of tools making it to VLMs.
📦 Repos
Apple releases deep learning library AXLearn
"It supports the training of models with up to hundreds of billions of parameters across thousands of accelerators at high utilization."
DiffusionKit: Stable Diffusion 3 on Apple Silicon
Another banger by argmax, inc.
📄 Papers
Folks are equating this to Q* or, colloquially, test-time search.
Comprehensive RAG Benchmark by Meta AI
A RAG benchmark for factual question-answering consisting of 4,409 question-answer pairs with mock APIs to simulate retrieval for evaluating LLMs.
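To make the setup concrete, a hypothetical sketch of how such a benchmark can be scored; every name here (pairs, search_api, ask_llm) is illustrative, not CRAG's actual interface:

```python
# Hypothetical sketch of scoring a RAG benchmark that ships mock retrieval APIs.
# All names and the pair format are illustrative, not the benchmark's real API.

def evaluate(pairs, search_api, ask_llm):
    """pairs: list of {"question": str, "answer": str} dicts."""
    correct = 0
    for pair in pairs:
        # Retrieval is simulated by the mock API, so every system sees the same
        # evidence and only the LLM + prompting strategy is being measured.
        evidence = search_api.search(pair["question"], top_k=5)
        prompt = (
            "Answer the question using only the evidence below.\n\n"
            + "\n".join(evidence)
            + "\n\nQuestion: " + pair["question"]
        )
        prediction = ask_llm(prompt)
        correct += int(pair["answer"].lower() in prediction.lower())
    return correct / len(pairs)
```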
Autoregressive Model Beats Diffusion: 🦙 Llama for Scalable Image Generation
Diffusion isn't needed for image generation, and they brought receipts!
Transformers meet Neural Algorithmic Reasoners
"we propose a novel approach that combines the Transformer's language understanding with the robustness of graph neural network (GNN)-based neural algorithmic reasoners (NARs)", by DeepMind, looks promising.
Improve Mathematical Reasoning in Language Models by Automated Process Supervision
From Explicit CoT to Implicit CoT: Learning to Internalize CoT Step by Step
Faster Causal Attention Over Large Sequences Through Sparse Flash Attention
Implemented in JAX.
Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling
"We scale Samba up to 3.8B parameters with 3.2T training tokens and show that Samba substantially outperforms the state-of-the-art models based on pure attention or SSMs on a wide range of benchmarks" by Microsoft. Impressive benchmark scores on eg GSM8K, HumanEval, significantly outperforming Mistral 7B.
AsyncDiff: Parallelizing Diffusion Models by Asynchronous Denoising
"2.8x Faster on SDXL with 4 devices. Top: 50 step original (13.81s). Bottom: 50 step AsyncDiff (4.98s)"
Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models
By Anthropic
BERTs are Generative In-Context Learners
Masked models (where only a single word is masked and the words to the left and right of it can be used to predict it) turn out to be pretty capable in-context learners. Nice paper that shows intuition can mislead (intuitively, causal models should work better because they can't 'cheat'). It reminds me of the fact that humans are also capable of learning theories and knowledge from "full examples" that contain/spoil the answer.
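For a concrete feel of the masked setup described above, a tiny sketch using the standard transformers fill-mask pipeline (bert-base-uncased as an illustrative checkpoint; the paper itself studies larger BERT-family models):

```python
# Minimal sketch of masked prediction: the model sees context on both sides of the blank.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")
for candidate in fill("The capital of France is [MASK]."):
    print(candidate["token_str"], round(candidate["score"], 3))
```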
An Empirical Study of Mamba-based Language Models
“A 8B-3.5T hybrid SSM model gets better accuracy than an 8B-3.5T transformer trained on the same dataset: 7% attention, the rest is Mamba2, MMLU jumps from 50 to 53.6%, training efficiency is the same, inference cost is much less”
📱 Demos
📚 Resources
It supports multiple graph formats, including those used by JAX, PyTorch, TensorFlow and TensorFlow Lite.
RecurrentGemma 9B model released
Similar to Mistral 7B in performance but significantly better inference characteristics on long inputs.
BigCodeBench: another code benchmark
Neat and organized leaderboard; spoiler alert: DeepSeek-Coder-V2 only loses to GPT-4o, showing how good a model it is.
PowerInfer-2: Fast Large Language Model Inference on a Smartphone
Cool work on edge LLM inference.
LiveBench: A Challenging, Contamination-Free LLM Benchmark
From Yann LeCun's lab at NYU.
Mechanistic Interpretability explained simply
As Chris Olah himself describes it: "If you're familiar with the content, it's very fun to watch. If you're not familiar, it's a very nice way to dip a toe in."
Tutorial by Maxime Labonne: uncensoring open source LLMs
Dubbed 'abliteration'.
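The core trick is linear-algebraic: estimate a "refusal direction" from activations and project it out. A purely illustrative numpy sketch of that step (random data stands in for real model activations; the tutorial works on actual hidden states and weights):

```python
# Illustrative sketch of the projection step behind 'abliteration': estimate a
# single refusal direction and remove it from activations. Random data only.
import numpy as np

rng = np.random.default_rng(0)
acts_refused = rng.normal(size=(128, 4096))   # stand-in: activations on prompts the model refuses
acts_answered = rng.normal(size=(128, 4096))  # stand-in: activations on prompts it answers

# Refusal direction = normalized difference of mean activations.
direction = acts_refused.mean(axis=0) - acts_answered.mean(axis=0)
direction /= np.linalg.norm(direction)

def ablate(x, d):
    # Remove the component of x along d: x - (x @ d) * d
    return x - np.outer(x @ d, d)

cleaned = ablate(acts_refused, direction)
print(np.abs(cleaned @ direction).max())  # ~0: the refusal direction has been removed
```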
Want more? Follow me on X! @ricklamers