What tradeoffs is OpenAI's GPT-4o making to achieve its speed?
Week 20 of Coding with Intelligence
📰 News
OpenAI is making some interesting tradeoffs with GPT-4o
Long-context performance seems to fall off a cliff compared to GPT-4 Turbo on the BABILong eval. TLDR: more multimodal (new abilities), lower latency, faster tokens/s, slightly reduced (complex) coding ability. Some surprising things: it can generate 3D objects, generate a consistent set of images (e.g. comic book consistency), and sing.
Google Cloud introduces TPU v6: Trillium
Double the HBM bandwidth and capacity, 67% more energy-efficient, and double the ICI (inter-chip interconnect) bandwidth versus TPU v5e. Up to 256 TPUs can be configured in a single pod.
Gemini 1.5 Pro's context window increases to 2M tokens; Gemini 1.5 Flash is a new lightweight model
Google AI teases Veo, its answer to OpenAI's Sora text-to-video model
Consistency Large Language Models: A Family of Efficient Parallel Decoders
Parallelizing the decoding process to speed up inference by 3.5x. Currently limited to greedy autoregressive decoding. Other sampling strategies are in the works.
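For intuition, here's a minimal sketch of Jacobi-style parallel decoding, the mechanism CLLMs accelerate: guess a block of future tokens, re-predict the whole block in one forward pass, and repeat until it stops changing. `model_logits` is a hypothetical stand-in for the real model.

```python
import numpy as np

def model_logits(tokens):
    """Hypothetical stand-in: one forward pass returning next-token
    logits for every position of `tokens` (shape: [len(tokens), vocab])."""
    raise NotImplementedError

def jacobi_decode(prompt, n_draft=16, max_iters=50):
    draft = [0] * n_draft  # arbitrary initial guess for the next n_draft tokens
    for _ in range(max_iters):
        logits = model_logits(prompt + draft)
        # Re-predict every draft position in parallel; position i is
        # conditioned on prompt + draft[:i] from the previous iteration.
        new_draft = [int(np.argmax(logits[len(prompt) + i - 1]))
                     for i in range(n_draft)]
        if new_draft == draft:
            break  # fixed point reached
        draft = new_draft
    return draft
```

The fixed point coincides with what sequential greedy decoding would produce, which preserves output quality but is also why the method is currently tied to greedy sampling.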
OpenAI publishes the Model Spec
Describing desired behaviors to distinguish bugs from intended behavior.
Google AI releases PaliGemma and teases Gemma 2
PaliGemma is about as good as LLaVA-NeXT (Llama 3 8B base), and Gemma 2 (27B) is about as good as Llama 3 70B (so at less than half the parameter budget). Gemma 2 is not out yet ("it will be released in the coming weeks"); PaliGemma is already on Hugging Face.
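If you want to poke at PaliGemma, here's a minimal sketch using Hugging Face transformers (assumes a recent release with PaliGemma support; the checkpoint id and the "caption en" task prefix follow the model card):

```python
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "google/paligemma-3b-mix-224"
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Mix checkpoints are steered with task prefixes like "caption en" or "detect car"
inputs = processor(text="caption en", images=image, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(output[0], skip_special_tokens=True))
```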
📦 Repos
ThunderKittens (TK), a simple DSL embedded within CUDA
"It makes it easy to express key technical ideas for building AI kernels." First achievement: 30% faster Flash Attention with fewer lines of code.
audio-diffusion-pytorch: a fully featured audio diffusion library, for PyTorch
LLM Comparator: an interactive visualization tool for analyzing side-by-side LLM evaluation results.
Mirage: A Multi-level Superoptimizer for Tensor Algebra
It outputs Triton programs that in some cases beat manually created kernels.
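For context, Triton programs are ordinary Python functions compiled by the triton JIT. The minimal hand-written elementwise kernel below shows the shape of such programs; it's my illustration, not actual Mirage output:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements  # guard the ragged tail block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

x = torch.randn(4096, device="cuda")
y = torch.randn(4096, device="cuda")
out = torch.empty_like(x)
grid = (triton.cdiv(x.numel(), 1024),)
add_kernel[grid](x, y, out, x.numel(), BLOCK_SIZE=1024)
assert torch.allclose(out, x + y)
```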
OpenDevin reaches 21% on SWE-bench
The blog post dives into CodeAct 1.0, a crucial component of the system.
LLM UI: render partially generated Markdown robustly
Cool project with high utility for LLM inference where tokens/sec is slow.
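The underlying trick is to repair still-open Markdown constructs before each re-render of the stream. A minimal sketch of the idea (my illustration; llm-ui's actual implementation and API differ):

```python
FENCE = "`" * 3  # a triple-backtick code fence, built indirectly so this sample nests cleanly

def repair_partial_markdown(text: str) -> str:
    """Close dangling constructs so a half-streamed Markdown chunk renders safely."""
    if text.count(FENCE) % 2 == 1:  # odd fence count = a code block is still open
        text += "\n" + FENCE
    if text.count("**") % 2 == 1:   # dangling bold marker would restyle the rest of the page
        text += "**"
    return text

# Re-render after every token; the synthetic closing fence keeps the UI stable.
print(repair_partial_markdown("Here is code:\n" + FENCE + "python\nprint('hi'"))
```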
InternLM-XComposer2 vision-language large model (VLLM)
"matches or even surpasses GPT-4V and Gemini Pro in 6 benchmarks"
Llama 3 8B Web: building agents that can browse the web by following instructions
Outperforms GPT-4V on the WebLINX benchmark.
📄 Papers
Fishing for Magikarp: Automatically Detecting Under-trained Tokens in Large Language Models
Alternative to the Transformer architecture by Meta/FAIR
Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time
> Deja Vu (175 billion parameters) produced a sequence of 128 tokens in 20 milliseconds, while an Nvidia implementation of OPT of the same size needed 40 milliseconds and a Hugging Face implementation of OPT of the same size needed 105 milliseconds. Moreover, Deja Vu achieved these speedups without reducing accuracy.
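The core idea, contextual sparsity, is that for any given input only a small input-dependent subset of MLP neurons matters, so a cheap predictor can select them and the rest are skipped. A minimal numpy sketch (illustrative only: Deja Vu trains small lookahead predictors, whereas this cheats by scoring neurons with the exact pre-activations):

```python
import numpy as np

def sparse_mlp(x, W_in, W_out, keep=0.1):
    """ReLU MLP forward pass computing only the predicted-active neurons.
    x: (d,), W_in: (d, h), W_out: (h, d)."""
    scores = x @ W_in                       # (h,) pre-activations
    k = max(1, int(keep * scores.size))
    active = np.argsort(-scores)[:k]        # top-k neurons for THIS input
    hidden = np.maximum(scores[active], 0)  # ReLU on the active subset only
    return hidden @ W_out[active]           # skip ~90% of the output rows

rng = np.random.default_rng(0)
d, h = 64, 256
x = rng.normal(size=d)
W_in, W_out = rng.normal(size=(d, h)), rng.normal(size=(h, d))
dense = np.maximum(x @ W_in, 0) @ W_out
rel_err = np.linalg.norm(sparse_mlp(x, W_in, W_out) - dense) / np.linalg.norm(dense)
print(f"relative error at 10% density: {rel_err:.3f}")
```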
"ZeTT frees language models from their tokenizer, allowing you to use any model with any tokenizer, with little or no extra training."
The Platonic Representation Hypothesis
TLDR: it argues that different AI models increasingly represent data in similar ways, converging toward a unified model of reality, akin to Plato's ideal forms. The trend strengthens as models grow larger and more versatile, with evidence of alignment across architectures and data modalities.
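Cross-model alignment of this kind is commonly quantified with kernel similarity metrics. Below is a sketch of linear CKA over two models' feature matrices for the same inputs; note the paper itself uses a mutual nearest-neighbor metric, so this is an illustrative stand-in:

```python
import numpy as np

def linear_cka(X, Y):
    """Linear Centered Kernel Alignment between representation matrices
    X (n, d1) and Y (n, d2) of the same n inputs. Close to 1 when the
    representations match up to rotation and scale."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(X.T @ Y, "fro") ** 2
    return hsic / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro"))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 32))     # "model A" embeddings of 100 inputs
Y = X @ rng.normal(size=(32, 64))  # "model B": a linear transform of A -> high CKA
Z = rng.normal(size=(100, 64))     # unrelated representation -> low CKA
print(linear_cka(X, Y), linear_cka(X, Z))
```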
You Only Cache Once: Decoder-Decoder Architectures for Language Models by Microsoft Research
At a 512K context window it delivers roughly 10x better throughput than a vanilla Transformer.
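Back-of-the-envelope on why caching KV only once pays off at long context, with made-up but typical dimensions (this treats YOCO as keeping a single shared KV set; it also keeps a small constant-size cache for its self-decoder):

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # K and V tensors per layer, fp16/bf16 = 2 bytes per element
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

seq = 512 * 1024  # 512K-token context
vanilla = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, seq_len=seq)
cache_once = kv_cache_bytes(layers=1, kv_heads=8, head_dim=128, seq_len=seq)
print(f"vanilla: {vanilla / 2**30:.0f} GiB, cache-once: {cache_once / 2**30:.0f} GiB")
# vanilla: 64 GiB, cache-once: 2 GiB -- the gap grows linearly with depth
```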
📚 Resources
GPUs Go Brrr by Stanford's Hazy Research
I think this takeaway is really the core: "In fact, more broadly we believe we should really reorient our ideas of AI around what maps well onto the hardware. How big should a recurrent state be? As big as can fit onto an SM. How dense should the compute be? No less so than what the hardware demands. An important future direction of this work for us is to use our learnings about the hardware to help us design the AI to match."
Llama 3 performance analysis based on Arena data
Brought to you by LMSYS.
Anthropic Console ships prompt generator
Might be useful as a baseline to evaluate your handwritten prompts against.
Want more? Follow me on X! @ricklamers