Beyond vanilla Transformers: Multi-Tokens, xLSTM, KANs & more
Week 19 of Coding with Intelligence
📰 News
NVIDIA releases a fine-tuned Llama 3 70B called Llama3-ChatQA-1.5, optimized for RAG-based QA
JetBrains ships single-line code completion with a local 100M-parameter model with a 1536-token context window
Beats the MMLU score of Llama 3 70B; very impressive.
The Thorn in a HaizeStack Test for Long-Context Adversarial Robustness
"👀tldr => a jailbreak text ("Thorn") embedded in a wall of distractor text ("HaizeStack") easily circumvents GPT-4's (and other) safeguards."
Karpathy shares progress report on his CUDA training project llm.c
📦 Repos
Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models
Based on Mixtral
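To get a feel for how a judge model like this is used, here is a minimal sketch of rubric-based grading via Hugging Face transformers; the model id and the prompt layout are my assumptions, and the exact grading template Prometheus 2 was trained on lives in the prometheus-eval repo.

```python
# Sketch: using Prometheus 2 as an LLM judge via Hugging Face transformers.
# The model id and prompt wording below are assumptions; check the
# prometheus-eval repo for the exact rubric format the model was trained on.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "prometheus-eval/prometheus-8x7b-v2.0"  # assumed HF id (Mixtral-based)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

rubric_prompt = """You are a fair judge. Given an instruction, a response,
and a score rubric, write brief feedback and then a score from 1 to 5.

### Instruction: Explain what a hash map is to a beginner.
### Response: A hash map stores key-value pairs and uses a hash function
to find where each key's value lives, giving roughly O(1) lookups.
### Rubric: Is the explanation correct, complete, and beginner-friendly?
### Feedback:"""

inputs = tokenizer(rubric_prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```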
Kolmogorov-Arnold Networks (KANs) by MIT
“Kolmogorov-Arnold Networks (KANs) are promising alternatives of Multi-Layer Perceptrons (MLPs). KANs have strong mathematical foundations just like MLPs: MLPs are based on the universal approximation theorem, while KANs are based on Kolmogorov-Arnold representation theorem. KANs and MLPs are dual: KANs have activation functions on edges, while MLPs have activation functions on nodes. This simple change makes KANs better (sometimes much better!) than MLPs in terms of both model accuracy and interpretability.”
Great overview in the AI News newsletter; there's a claim that KANs can be rewritten as MLPs, which would make the two equivalent.
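To make the "activations on edges" idea concrete, here is a toy KAN-style layer in PyTorch; the Gaussian-RBF basis and all names are my simplifications, and the official implementation uses learnable B-splines plus a base activation.

```python
# Toy illustration of the KAN idea: every edge (i, j) has its own learnable
# 1-D function phi_ij, and the output is y_j = sum_i phi_ij(x_i).
# Here phi_ij is a linear combination of fixed Gaussian RBF basis functions.
import torch
import torch.nn as nn


class ToyKANLayer(nn.Module):
    def __init__(self, in_dim: int, out_dim: int, num_basis: int = 8):
        super().__init__()
        # Fixed RBF centers spread over an assumed input range [-2, 2].
        self.register_buffer("centers", torch.linspace(-2.0, 2.0, num_basis))
        self.inv_width = num_basis / 4.0
        # One coefficient vector per edge: shape (out_dim, in_dim, num_basis).
        self.coef = nn.Parameter(torch.randn(out_dim, in_dim, num_basis) * 0.1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, in_dim) -> RBF features: (batch, in_dim, num_basis)
        basis = torch.exp(-((x.unsqueeze(-1) - self.centers) * self.inv_width) ** 2)
        # Apply each edge's univariate function and sum over the inputs.
        return torch.einsum("bik,oik->bo", basis, self.coef)


# Contrast with an MLP layer, where the nonlinearity sits on the node instead:
mlp_layer = nn.Sequential(nn.Linear(3, 5), nn.SiLU())
kan_layer = ToyKANLayer(3, 5)
x = torch.randn(4, 3)
print(mlp_layer(x).shape, kan_layer(x).shape)  # both (4, 5)
```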
📄 Papers
Apple ML paper: How Far Are We from Intelligent Visual Deductive Reasoning
The benchmarks in this paper only test the longer context window to a limited extent, but some learnings around continued pretraining for long context might generalize to other context-window extension techniques.
xLSTM: Extended Long Short-Term Memory
The paper contains a neat comparison of several architectures, such as Mamba, Llama-style Transformers, and RWKV, trained on 15B tokens of SlimPajama. Some remarks on unfair comparisons can be found on X.
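For intuition on the mLSTM variant, here is a stripped-down sketch of its matrix-memory recurrence as I read it from the paper; the output gate, the exponential-gating stabilization, and the sLSTM branch are all omitted, and the function names are mine.

```python
# Simplified sketch of the mLSTM matrix-memory recurrence (my reading of the
# xLSTM paper; output gate and gate stabilization omitted):
#   C_t = f_t * C_{t-1} + i_t * v_t k_t^T      (matrix memory)
#   n_t = f_t * n_{t-1} + i_t * k_t            (normalizer state)
#   h_t = C_t q_t / max(|n_t^T q_t|, 1)        (readout)
import torch


def mlstm_scan(q, k, v, i_gate, f_gate):
    """q, k, v: (seq, dim); i_gate, f_gate: (seq,) gate values in (0, 1]."""
    seq, dim = q.shape
    C = torch.zeros(dim, dim)
    n = torch.zeros(dim)
    outputs = []
    for t in range(seq):
        C = f_gate[t] * C + i_gate[t] * torch.outer(v[t], k[t])
        n = f_gate[t] * n + i_gate[t] * k[t]
        denom = torch.clamp(torch.abs(n @ q[t]), min=1.0)
        outputs.append(C @ q[t] / denom)
    return torch.stack(outputs)


seq, dim = 6, 4
out = mlstm_scan(torch.randn(seq, dim), torch.randn(seq, dim), torch.randn(seq, dim),
                 torch.sigmoid(torch.randn(seq)), torch.sigmoid(torch.randn(seq)))
print(out.shape)  # (6, 4)
```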
vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention
”We show that vAttention reduces software complexity while improving portability and performance.” and “vAttention also generates tokens up to 1.97x faster than vLLM, while processing input prompts up to 3.92x and 1.45x faster than the PagedAttention variants of FlashAttention and FlashInfer.”
Self-Play Preference Optimization for Language Model Alignment
In-Context Learning with Long-Context Models: An In-Depth Exploration
By a team of researchers from CMU and Tel Aviv University.
Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing
Tencent paper on techniques purportedly related to the Q* algorithm for LLMs
A Survey on Vision Mamba: Models, Applications and Challenges
tl;dr: in a head-to-head comparison on the MS COCO mini-val set, Vision Mamba outperforms Transformer- and ConvNet-based models.
Better & Faster Large Language Models via Multi-token Prediction
“we suggest that training language models to predict multiple future tokens at once results in higher sample efficiency” and “Our 13B parameter models solves 12 % more problems on HumanEval and 17 % more on MBPP”
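A toy sketch of the quoted idea, not the paper's exact architecture: a shared trunk feeds several output heads, where head j predicts the token j steps ahead and the per-head losses are summed. All module choices below are mine.

```python
# Toy sketch of multi-token prediction: a shared trunk plus n output heads,
# where head j is trained to predict the token j steps ahead. This mirrors
# the idea in the paper, not its exact architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab, d_model, n_future = 100, 32, 4

embed = nn.Embedding(vocab, d_model)
trunk = nn.GRU(d_model, d_model, batch_first=True)  # stand-in for a causal transformer trunk
heads = nn.ModuleList([nn.Linear(d_model, vocab) for _ in range(n_future)])

tokens = torch.randint(0, vocab, (2, 16))            # (batch, seq)
hidden, _ = trunk(embed(tokens))                     # (batch, seq, d_model)

loss = 0.0
for j, head in enumerate(heads, start=1):
    logits = head(hidden[:, :-j])                    # predict token t + j
    targets = tokens[:, j:]
    loss = loss + F.cross_entropy(logits.reshape(-1, vocab), targets.reshape(-1))
print(float(loss) / n_future)
```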
Promising path to getting models to reason more explicitly at the level of ideas instead of tokens.
A Careful Examination of Large Language Model Performance on Grade School Arithmetic
Nice exploration of the "training on the test set" problem of LLMs: benchmarks become less useful as test data leaks into pre-training. This paper shows that Claude, GPT, and Mistral Large regress the least on this "unseen" test data.
📚 Resources
Vision Arena-style leaderboard by Allen Institute for AI
Features GPT-4 Vision, Reka, Claude, Gemini, Yi, Llava, DeepSeek, Qwen
Modern Advances in Prompt Engineering
Great long read to improve your prompts
I think focusing on failure modes of LLMs is the most interesting. The multimodal domain is rich with examples where results are subpar. Following breadcrumbs here is likely to yield ideas for improvements to the various aspects that determine performance (data, model, inference, etc.).
Lessons learned from difficult-to-track errors during large-scale pretraining, by Adept
When not even GPU ECC protection saves you.
Want more? Follow me on X! @ricklamers