📰 News
Aria, a MoE with 3.9B active parameters: new SOTA open-source multimodal LLM
Performance looks competitive with Pixtral and Llama 3.2 11B, and in some cases even with GPT-4o/GPT-4o mini.
$10k (o1) reasoning challenge by Victor Taelin
A challenge to see if frontier LLMs can reason in a way that generalizes. The task is to invert a perfect binary tree, but he adds three criteria that, with high likelihood, make it novel enough to fall outside the pretraining corpus.
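For reference, the base task (minus Taelin's extra criteria, which are what make the challenge hard) is a classic; a minimal Python sketch:

```python
# Plain sketch of the base task: inverting (mirroring) a perfect binary
# tree. Taelin's three additional criteria are not captured here.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    value: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def invert(node: Optional[Node]) -> Optional[Node]:
    # Recursively swap the left and right subtrees.
    if node is None:
        return None
    return Node(node.value, invert(node.right), invert(node.left))

print(invert(Node(1, Node(2), Node(3))))
# Node(value=1, left=Node(value=3, ...), right=Node(value=2, ...))
```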
MatMamba: An Elastic and Efficient Neural Network Architecture
"Combining the speed of State Space Models (SSMs) like Mamba2 with the adaptability of Matryoshka-style learning." by Scaled Foundations a startup with MIT co-founders working on autonomous robotics. Implementation available on GitHub https://github.com/scaledfoundations/matmamba
OpenAI Leaders Say Microsoft Isn't Moving Fast Enough to Supply Servers
Paywalled, unfortunately. But the key message is that OpenAI is rumored to be becoming more independent from Microsoft at the datacenter level. I guess they want to move faster at the infrastructure level than Microsoft allows.
📦 Repos
Retry is all you need.
TTS with very good emotive quality: a "non-autoregressive text-to-speech system based on flow matching with Diffusion Transformer (DiT)". See the paper and repo for more details.
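For background, a minimal sketch of the conditional flow-matching objective such systems train on (generic form with placeholder shapes, not this repo's code):

```python
# Flow matching in a nutshell: interpolate between noise and data, then
# regress a network onto the constant velocity of that interpolation path.
import torch
import torch.nn as nn
import torch.nn.functional as F

v_theta = nn.Sequential(nn.Linear(81, 128), nn.ReLU(), nn.Linear(128, 80))

x1 = torch.randn(16, 80)   # data sample, e.g. a mel-spectrogram frame
x0 = torch.randn_like(x1)  # noise sample
t = torch.rand(16, 1)      # random interpolation time in [0, 1]

xt = (1 - t) * x0 + t * x1  # point on the straight path from noise to data
target_velocity = x1 - x0   # the velocity field the model should predict

pred = v_theta(torch.cat([xt, t], dim=-1))
loss = F.mse_loss(pred, target_velocity)
loss.backward()
```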
📄 Papers
An interesting architecture modification that shows better scaling properties than vanilla Transformers. It will be interesting to see whether large open-source/frontier groups adopt it.
Addition is All You Need for Energy-efficient Language Models
This paper proposes replacing multiplications with additions and shows some convincing data that there's merit to the idea. As with all architecture modifications, the jury is still out until the ideas are scaled up.
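To give the flavor of the idea, here is the classic bit-trick that approximates a float multiplication with a single integer addition (a Mitchell-style approximation, not the paper's exact L-Mul algorithm):

```python
# Adding the raw IEEE-754 bit patterns of two positive floats approximates
# their product: the exponent fields add exactly, and the mantissa fields
# approximately add in log space.
import struct

def float_to_bits(x: float) -> int:
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_to_float(b: int) -> float:
    return struct.unpack("<f", struct.pack("<I", b & 0xFFFFFFFF))[0]

BIAS = 127 << 23  # float32 exponent bias, shifted into its bit field

def approx_mul(a: float, b: float) -> float:
    # One integer addition stands in for a floating-point multiplication.
    return bits_to_float(float_to_bits(a) + float_to_bits(b) - BIAS)

print(approx_mul(1.5, 2.0), 1.5 * 2.0)  # 3.0 vs 3.0 (exact here)
print(approx_mul(1.5, 1.5), 1.5 * 1.5)  # 2.0 vs 2.25 (approximation error)
```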
Intelligence at the Edge of Chaos
In this paper they pretrain GPT-2 on cellular automata and show that pretraining on more complex automata increases downstream performance on tasks like chess and abstract reasoning. A fascinating result that seems to reveal something fundamental about transfer learning.
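A sketch of how that kind of pretraining data can be generated, assuming an elementary cellular automaton such as Rule 110 (the paper's exact rules and token encoding may differ):

```python
# Generate a trace of an elementary cellular automaton and flatten it into
# a 0/1 token sequence suitable for language-model pretraining.
import numpy as np

def eca_step(state: np.ndarray, rule: int) -> np.ndarray:
    # Each cell's next value is the rule bit indexed by its 3-cell neighborhood.
    left, right = np.roll(state, 1), np.roll(state, -1)
    neighborhood = (left << 2) | (state << 1) | right
    return np.right_shift(rule, neighborhood) & 1

rng = np.random.default_rng(0)
state = rng.integers(0, 2, size=64)
trace = [state]
for _ in range(32):
    state = eca_step(state, rule=110)  # Rule 110: the "complex" regime
    trace.append(state)

tokens = np.concatenate(trace)
print(tokens[:16])
```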
Generative Reward Models - A Unified Approach to RLHF and RLAIF
Interesting survey of alignment methods and a proposed technique for combining expensive human preference data with synthetically generated preference data. They emphasize OOD (out-of-distribution) performance, which is sometimes not highlighted enough when comparing alignment techniques. A co-author of the paper invented the DPO method at Stanford.
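For context, a minimal sketch of the DPO objective mentioned above, in its standard published form (values here are illustrative):

```python
# DPO: push the policy's log-probability ratio (vs. a frozen reference
# model) to favor the chosen answer over the rejected one.
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    chosen_ratio = logp_chosen - ref_logp_chosen
    rejected_ratio = logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

loss = dpo_loss(torch.tensor([-4.2]), torch.tensor([-5.0]),
                torch.tensor([-4.5]), torch.tensor([-4.9]))
print(loss)  # scalar loss over the preference pair
```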
LLMs Know More Than They Show: On the Intrinsic Representation of LLM Hallucinations
"In this work, we show that the internal representations of LLMs encode much more information about truthfulness than previously recognized." the challenge is getting them to reliably utilize the correct information they contain.
The N+ Implementation Details of RLHF with PPO: A Case Study on TL;DR Summarization
Reproducing reinforcement learning implementations is notoriously fraught: many gotchas can individually cause failure if handled incorrectly. A massive contribution by folks from Mila, Hugging Face, and others.
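One example of the kind of detail that matters: the PPO clipped objective with per-batch advantage normalization, a well-known gotcha (generic form, not the paper's implementation):

```python
import torch

def ppo_policy_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    # Per-batch advantage normalization: a small detail that can make or
    # break a run if handled inconsistently.
    advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()

loss = ppo_policy_loss(torch.randn(32), torch.randn(32), torch.randn(32))
```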
Semantic Training Signals Promote Hierarchical Syntactic Generalization in Transformers
Interesting exploration of how semantic training signals promote hierarchical syntactic biases in Transformers.
Analyzing CoT behavior in LLMs using a specific task (decoding shift ciphers). The conclusion is positive: "Overall, we conclude that CoT prompting performance reflects both memorization and a probabilistic version of genuine reasoning."
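For reference, the probed task itself is trivial to state in code; rot13 (shift 13) is the variant most common in web text, which plausibly drives the memorization component:

```python
# Decode a shift (Caesar) cipher by shifting each letter back.
def shift_decode(text: str, shift: int) -> str:
    return "".join(
        chr((ord(c) - ord("a") - shift) % 26 + ord("a")) if c.isalpha() else c
        for c in text.lower()
    )

print(shift_decode("uryyb jbeyq", 13))  # rot13 -> "hello world"
```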
Beyond FVD: Enhanced Evaluation Metrics for Video Generation Quality
Interesting work from Mila that aims to improve the metrics used to quantify video generation quality. Since quality is so hard to quantify, I think this work has a lot of potential to help researchers discover which techniques actually make a meaningful difference. I'm sure all the text-to-video players are all over this (Luma Labs, Runway, OpenAI, Kling, etc.).
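For background, FVD (the baseline these metrics aim to improve on) is a Fréchet distance between feature distributions of real and generated videos; a minimal sketch with placeholder features:

```python
# Fréchet distance between two Gaussian fits of video feature sets:
# ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 * sqrt(S1 @ S2)).
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    mu1, mu2 = feats_real.mean(0), feats_gen.mean(0)
    sigma1 = np.cov(feats_real, rowvar=False)
    sigma2 = np.cov(feats_gen, rowvar=False)
    covmean = sqrtm(sigma1 @ sigma2).real  # matrix square root
    return float(((mu1 - mu2) ** 2).sum() + np.trace(sigma1 + sigma2 - 2 * covmean))

rng = np.random.default_rng(0)
print(frechet_distance(rng.normal(size=(256, 16)), rng.normal(size=(256, 16))))
```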
Some welcome details on the new Pixtral multimodal model.
nGPT: Normalized Transformer with Representation Learning on the Hypersphere
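Going by the title, the core move is keeping representations on the unit hypersphere; a hedged sketch of that normalization step (the paper's actual parameterization is more involved):

```python
# After every residual-style update, project the hidden state back onto
# the unit sphere. `alpha` stands in for a learnable step size.
import torch
import torch.nn.functional as F

h = F.normalize(torch.randn(4, 512), dim=-1)  # start on the sphere
update = torch.randn(4, 512)                  # e.g., an attention/MLP output

alpha = 0.1
h = F.normalize(h + alpha * update, dim=-1)   # move, then re-normalize
print(h.norm(dim=-1))                         # all ones
```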
📱 Demos
📚 Resources
A list of crowdsourced .cursorrules files. No need to write your own prompts for language-specific Cursor rules.
Machines of Loving Grace - long read by Anthropic founder Dario Amodei
Subtitled "How AI Could Transform the World for the Better", a positive and grounded essay about the impact of AI from on of the, if not the, best AI labs in the world.
Kling AI community short films
Check out SOTA generative AI short films from the community.
TxT360 - open source 15T corpus and processing pipeline
"We demonstrate a simple but effective upsampling recipe that creates a 15+ trillion-token corpus, outperforming FineWeb 15T on several key metrics."
O1 replication journey by GAIR (Shanghai Jiao Tong University)
Want more? Follow me on X! @ricklamers