Pixtral 12B dropped: Mistral's first Vision Language Model

Sep 11, 2024

📰 News

Ask Photos: A new way to search your photos with Gemini
"Talk to your Photo Album" is a neat feature. Definitely a step up from semantic image search in your albums.
Act I: Exploring emergent behavior from multi-AI, multi-human interaction
Apparently $32k comes from Marc Andreessen
Mistral drops Pixtral, a 12B Vision Language Model based on Mistral Nemo
Image/text interleaving support, a 1024x1024-pixel image encoder, and a 16x16 patch size. Supposedly based on EVA-CLIP: https://github.com/baaivision/EVA/tree/master/EVA-CLIP
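For a sense of scale, here is a quick back-of-the-envelope sketch of what those encoder numbers imply. It assumes non-overlapping square patches on a full-resolution square image, which the release does not confirm; `patch_grid` is a hypothetical helper, not part of any Mistral code.

```python
def patch_grid(image_size: int = 1024, patch_size: int = 16) -> tuple[int, int]:
    """Return (patches per side, total patches) for a square image.

    Assumes non-overlapping square patches, as in a standard ViT-style encoder.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    per_side = image_size // patch_size
    return per_side, per_side * per_side


per_side, total = patch_grid()
print(per_side, total)  # 64 4096
```

So under these assumptions, a single full-size image would become a 64x64 grid, i.e. 4096 patches, before any text tokens are added.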

📦 Repos

FluxMusic: Text-to-Music Generation with Rectified Flow Transformers
Udio/Suno-like tech is being open sourced. Implication: will proprietary labs only be one or two quarters ahead in AI? Or maybe algorithms will be the perpetual commodity, with only data & compute offering moats.
ell: A language model programming framework.

📄 Papers

Loopy: Taming Audio-Driven Portrait Avatar with Long-Term Motion Dependency
"lifelike and high-quality results across various scenarios": you can say that again. Check out the Einstein example. China is behind in AI? No way.
PlanBench: An Extensible Benchmark for Evaluating Large Language Models on Planning and Reasoning about Change
Can LLMs Generate Novel Research Ideas? A Large-Scale Human Study with 100+ NLP Researchers
By researchers at Stanford. Contradicts Betteridge's Law of Headlines so worth a read :)
Planning In Natural Language Improves LLM Search For Code Generation
"Using PLANSEARCH on top of Claude 3.5 Sonnet achieves a state-of-the-art pass@200 of 77.0% on LiveCodeBench, outperforming both the best score achieved without search (pass@1 = 41.4%) and using standard repeated sampling (pass@200 = 60.6%)."
Imitating Language via Scalable Inverse Reinforcement Learning by researchers from Google DeepMind
From the abstract: "We find clear advantages for IRL-based imitation, in particular for retaining diversity while maximizing task performance, rendering IRL a strong alternative on fixed SFT datasets even without online data generation." In theory, that should help with robustness.
Improving Pretraining Data Using Perplexity Correlations

📱 Demos

Read their lips
This tool takes mouth movement and predicts what is being said. I tried it on a webcam recording of myself (with the audio removed using ffmpeg), thinking it wouldn't work, but it totally does :O

Created by a University of Waterloo graduate.
Illuminate: Transform your content into engaging AI-generated audio discussions
Very cool! There's even a podcast episode on the Attention is All You Need paper!
Gemma Scope: LLM explainability by Google

📚 Resources

Want more? Follow me on X! @ricklamers

Coding with Intelligence