📰 News
Paris-based Holistic AI exits stealth with $220M Seed
Founded by a Stanford dropout and four ex-DeepMind employees. From their materials: "frontier action models to boost the productivity of workers". Not unlike Imbue and Adept.
Cursor announces 1000 tok/s for specialized Llama 3 70B code edit model
Likely they're using some form of constrained + speculative decoding, which is known to increase tokens per second: a cheap draft model proposes several tokens at once and the full model verifies them in a single forward pass, so far fewer expensive sequential decode steps of the full model are needed.
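To make the speculative part concrete, here is a minimal greedy-acceptance sketch (Cursor hasn't published details, so the draft/target split, the toy stand-in models, and k=4 are all assumptions; production systems typically use rejection sampling so the output distribution exactly matches the target model):

```python
import random

VOCAB_SIZE = 100

def draft_next(prefix):
    # toy "small" model: cheap, deterministic next-token guess
    return (sum(prefix) * 7 + 3) % VOCAB_SIZE

def target_next(prefix):
    # toy "large" model: usually agrees with the draft, sometimes disagrees
    guess = (sum(prefix) * 7 + 3) % VOCAB_SIZE
    return guess if random.random() < 0.8 else (guess + 1) % VOCAB_SIZE

def speculative_step(prefix, k=4):
    # 1) the draft model proposes k tokens autoregressively (cheap)
    ctx = list(prefix)
    proposal = []
    for _ in range(k):
        t = draft_next(ctx)
        proposal.append(t)
        ctx.append(t)

    # 2) the target model checks every proposed position; in a real system
    #    this is a single batched forward pass over the whole proposed block
    ctx = list(prefix)
    accepted = []
    for t in proposal:
        target_t = target_next(ctx)
        if target_t == t:
            accepted.append(t)
            ctx.append(t)
        else:
            # rejected: keep the target's own token for this position and stop
            accepted.append(target_t)
            break
    return accepted

tokens = [1, 2, 3]
for _ in range(5):
    step = speculative_step(tokens)
    tokens.extend(step)
    print(f"emitted {len(step)} token(s) in one verification pass")
```

The speedup comes from the accepted-length being greater than one on average, so the expensive model is invoked far fewer times per generated token.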
Beats GPT-4V/Gemini Pro on TextVQA, DocVQA, and ChartQA by a decent margin; 19B params, Llama 3 8B (Instruct) text backbone, 8K context length, 1344×1344 resolution supported, commercial use allowed.
📦 Repos
📄 Papers
LoRA Learns Less and Forgets Less
By MosaicML/Databricks researchers. Essentially, LoRA is a tradeoff: it forgets less of the pre-trained data at the cost of fitting the new data less well. To be expected, but nice to see it investigated.
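As a reminder of why that tradeoff arises, a LoRA linear layer looks roughly like the sketch below (illustrative only, not the paper's code; rank and scaling values are arbitrary). The frozen base weight is why it forgets less; the rank-limited update is why it learns less.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_features, out_features, rank=8, alpha=16.0):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)  # frozen pre-trained weight
        self.A = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_features, rank))  # zero init: starts identical to base
        self.scale = alpha / rank

    def forward(self, x):
        # y = x W^T + scale * x A^T B^T  (only A and B receive gradients)
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(1024, 1024, rank=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable params: {trainable} / {total}")  # ~16K of ~1.06M
```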
What matters when building vision-language models?
Details about the Idefics2 model and general design considerations when developing vision-language models.
Meta's multi-modal LLM, Chameleon: Mixed-Modal Early-Fusion Foundation Models
Supports image generation: "performs non-trivial image generation" and "exceeds the performance of much larger models, including Gemini Pro and GPT-4V".
"we can view an LM as deriving new conclusions by aggregating indirect reasoning paths seen at pre-training time"
Introduces meta tokens to alleviate patch information redundancy, achieving a 1.7× inference speedup. Repo.
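The mechanism isn't spelled out here, so the following is only a generic Perceiver-style "learned query" compressor that captures the general idea: a handful of meta tokens cross-attend to the full set of patch tokens and stand in for them downstream, so later layers process far fewer tokens. The paper's actual design may differ; all names and sizes below are made up for illustration.

```python
import torch
import torch.nn as nn

class MetaTokenCompressor(nn.Module):
    def __init__(self, dim=768, num_meta_tokens=16, num_heads=8):
        super().__init__()
        # learnable meta tokens shared across all inputs
        self.meta = nn.Parameter(torch.randn(1, num_meta_tokens, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, patch_tokens):
        # patch_tokens: (batch, num_patches, dim), typically hundreds of patches
        meta = self.meta.expand(patch_tokens.size(0), -1, -1)
        # meta tokens query the patches; output keeps only num_meta_tokens tokens
        compressed, _ = self.attn(meta, patch_tokens, patch_tokens)
        return compressed

x = torch.randn(2, 576, 768)           # e.g. 24x24 patches from a ViT
out = MetaTokenCompressor()(x)
print(out.shape)                        # torch.Size([2, 16, 768])
```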
FIFO-Diffusion: Generating Infinite Videos from Text without Training
Pretty mind-blowing results for an approach that doesn't require training.
Layer-Condensed KV Cache for Efficient Inference of Large Language Models
📚 Resources
What are Diffusion Models? (2021)
Lilian Weng is terrific as always.
Cody (Sourcegraph coding assistant) releases OpenCtx
Standardizing rich context information for coding assistants.
Mapping the Mind of a Large Language Model
New interpretability work from Anthropic.
OpenAI's GPT-4o does really well on this long-context task, which is surprising given some of the other reported results like (link)
To InfiniBand or to Ethernet; to cluster makers that's the question
153 pages of details about the latest Gemini models. Covers both Gemini Pro and Gemini Flash.
PaliGemma fine-tuning notebook
JAX- and big_vision-based.
Want more? Follow me on X! @ricklamers