Pixtral 12B dropped: Mistral's first Vision Language Model
Week 37 of Coding with Intelligence
News
Ask Photos: A new way to search your photos with Gemini
"Talk to your Photo Album" is a neat feature. Definitely a step up from semantic image search in your albums.
Act I: Exploring emergent behavior from multi-AI, multi-human interaction
Apparently $32k comes from Marc Andreessen
Mistral drops Pixtral, a 12B Vision Language Model based on Mistral Nemo
Supports interleaved image/text input, with a 1024x1024-pixel image encoder and a 16x16 patch size. Supposedly based on EVA-CLIP: https://github.com/baaivision/EVA/tree/master/EVA-CLIP
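A quick back-of-the-envelope on what those Pixtral numbers imply: a 1024x1024 image cut into 16x16 patches gives a 64x64 grid, so up to 4096 patches per image. Here is a minimal sketch of that arithmetic. The function names are hypothetical, and rounding partial patches up while ignoring any special tokens is my assumption, not Mistral's documented preprocessing.

```python
import math


def patch_grid(width: int, height: int, patch: int = 16) -> tuple[int, int]:
    """Return (columns, rows) of patches for an image.

    Assumption: partial patches at the right/bottom edges are rounded up.
    """
    return math.ceil(width / patch), math.ceil(height / patch)


def patch_count(width: int, height: int, patch: int = 16) -> int:
    """Total number of patches, ignoring any special or separator tokens."""
    cols, rows = patch_grid(width, height, patch)
    return cols * rows


print(patch_grid(1024, 1024))   # (64, 64)
print(patch_count(1024, 1024))  # 4096
print(patch_count(512, 256))    # 512
```

The useful takeaway is the scaling: halving both sides of an image cuts the patch count by 4x, which matters a lot once those patches turn into sequence length.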
Repos
FluxMusic: Text-to-Music Generation with Rectified Flow Transformers
Udio/Suno-like tech is being open sourced. Implication: will proprietary labs only be 1 or 2 quarters ahead in AI? Or maybe algorithms will be the perpetual commodity, with only data & compute offering moats.
Papers
Loopy: Taming Audio-Driven Portrait Avatar with Long-Term Motion Dependency
"lifelike and high-quality results across various scenarios": you can say that again. Check out the Einstein example. China is behind in AI? No way.
Can LLMs Generate Novel Research Ideas? A Large-Scale Human Study with 100+ NLP Researchers
By researchers at Stanford. Contradicts Betteridge's Law of Headlines so worth a read :)
Planning In Natural Language Improves LLM Search For Code Generation
"Using PLANSEARCH on top of Claude 3.5 Sonnet achieves a state-of-the-art pass@200 of 77.0% on LiveCodeBench, outperforming both the best score achieved without search (pass@1 = 41.4%) and using standard repeated sampling (pass@200 = 60.6%)."
Imitating Language via Scalable Inverse Reinforcement Learning by researchers from Google DeepMind
From the abstract: "We find clear advantages for IRL-based imitation, in particular for retaining diversity while maximizing task performance, rendering IRL a strong alternative on fixed SFT datasets even without online data generation." Which in theory should help with robustness.
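For readers less familiar with the pass@1 and pass@200 numbers quoted for PLANSEARCH above: pass@k is the probability that at least one of k sampled solutions passes the tests. Below is a minimal sketch of the unbiased estimator popularized by the Codex paper (Chen et al., 2021), given n samples of which c are correct. I'm assuming the PLANSEARCH authors use the same definition; the function name is mine.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k).

    n: total samples generated for a problem
    c: number of those samples that pass all tests
    k: sampling budget being evaluated (1 <= k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # comb(n - c, k) is 0 when fewer than k samples are incorrect,
    # so every size-k draw then contains at least one correct sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 10 samples, 3 of them correct.
print(round(pass_at_k(10, 3, 1), 4))  # 0.3
print(round(pass_at_k(10, 3, 5), 4))  # 0.9167
```

A benchmark score is this value averaged over all problems. It also shows why diversity matters for search: pass@k only climbs with k if the extra samples are actually different from each other.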
Demos
This tool takes mouth movement and predicts what is being said. I tried it on a webcam recording of myself (audio removed with ffmpeg), expecting it not to work, but it totally does :O
Created by a University of Waterloo alumnus.
Illuminate: Transform your content into engaging AI-generated audio discussions
Very cool! There's even a podcast episode on the Attention is All You Need paper!
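If you want to repeat the silent-video test from the lip-reading demo above, here is a minimal sketch of stripping the audio track with ffmpeg from Python. The `-an` flag drops audio and `-c:v copy` keeps the video stream as-is without re-encoding. The function names and file names are hypothetical, and it assumes ffmpeg is installed and on your PATH.

```python
import subprocess


def build_strip_audio_cmd(src: str, dst: str) -> list[str]:
    """Build an ffmpeg command that copies the video stream and drops audio."""
    return ["ffmpeg", "-i", src, "-an", "-c:v", "copy", dst]


def strip_audio(src: str, dst: str) -> None:
    """Run ffmpeg; raises CalledProcessError if ffmpeg exits with an error."""
    subprocess.run(build_strip_audio_cmd(src, dst), check=True)


# Hypothetical file names, shown as a command line for inspection:
print(" ".join(build_strip_audio_cmd("webcam.mp4", "webcam_silent.mp4")))
# ffmpeg -i webcam.mp4 -an -c:v copy webcam_silent.mp4
```

Calling `strip_audio("webcam.mp4", "webcam_silent.mp4")` would then write the silent copy next to the original.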
Resources
LLM inference at scale with TGI
Good overview of what's involved in deploying TGI to production. By the ML folks at Adyen.
Want more? Follow me on X! @ricklamers


