📰 News
GPT-5 purportedly previewed to CEOs & good at agentic loops
This, combined with the release of the Devin SWE agent, suggests real progress is being made toward agents that work end-to-end.
Claude 3 Haiku API now available: GPT-3.5 Turbo competitor
It's positioned as a GPT-3.5 Turbo competitor and seems to perform exceptionally well on input token throughput (21K input tokens per second for prompts under 32K tokens). It has a surprisingly high HumanEval score of 75.9%. Could be great for coding devtools.
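For reference, calling it from Python looks roughly like this (Messages API via the official anthropic SDK; the model id is the one current at the time of writing):

```python
# Minimal Claude 3 Haiku call via Anthropic's Python SDK.
# Assumes `pip install anthropic` and ANTHROPIC_API_KEY set in the environment.
import anthropic

client = anthropic.Anthropic()
message = client.messages.create(
    model="claude-3-haiku-20240307",  # Haiku model id at the time of writing
    max_tokens=512,
    messages=[
        {"role": "user", "content": "Write a Python function that checks if a string is a palindrome."}
    ],
)
print(message.content[0].text)
```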
Grok-1 weights released by xAI
314B total params, 86B active params in the forward pass, 8 experts. Given its size/performance ratio it looks dead-on-arrival. See this Reddit thread. Probably best to just stick with Mixtral and wait for Llama 3.
tl;dr Building your own GPTs but need more features & control? Readers of CoWI get early access by signing up here.
📦 Repos
LLM4Decompile: Decompiling Binary Code with Large Language Models
Very cool use case for LLMs. The basic formula is very powerful: use LLMs for tasks humans currently have to do manually where pattern recognition is the main skill being exercised.
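Under the hood it's essentially a code LLM fine-tuned to map assembly back to C, so usage is roughly a plain generate call; treat the model id and prompt format below as placeholders and check the repo's README for the released checkpoints.

```python
# Rough sketch only: the exact checkpoint names and prompt template live in the repo's README.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "llm4decompile-6.7b"  # placeholder, not a verified Hugging Face id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

asm = open("func.s").read()  # assembly dumped with e.g. `objdump -d`
prompt = f"# Assembly:\n{asm}\n# Equivalent C source:\n"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```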
rerankers: making it dead simple to rerank
Are you reranking in your RAG pipeline? Probably not. Start today!
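Dropping it into a RAG pipeline is meant to take only a few lines; the sketch below is my recollection of the README, so check the repo for the exact API and default models.

```python
# Sketch of reranking retrieved chunks before stuffing them into the LLM context.
# API from memory of the rerankers README; exact defaults may differ.
from rerankers import Reranker

ranker = Reranker("cross-encoder")  # loads a default cross-encoder reranker

query = "How do I rotate an API key?"
docs = [
    "To rotate a key, create a new key in the dashboard and revoke the old one.",
    "Our pricing tiers are Free, Pro, and Enterprise.",
    "API keys are scoped per project and can be revoked at any time.",
]

results = ranker.rank(query=query, docs=docs)
print(results.top_k(2))  # the two chunks most relevant to the query
```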
There's also a paper. This visual language model takes the interesting approach of extracting features using more traditional computer vision techniques like OCR. Although it outperforms multi-modal LLMs (both proprietary and open source), end-to-end approaches like those used in LLaVA may well have better scaling behavior.
The focus is "maximum single-GPU single-batch hardware utilization". Mixtral runs at 137 tok/s on a single 4090.
📄 Papers
Apple's first large VLM: MM1
Claims Gemini 1.0 level performance
Mechanics of Next Token Prediction with Self-Attention
Language models are automatons! This paper tries to answer "What does a single self-attention layer learn from next-token prediction?"
It identifies two distinct steps for next-token prediction: hard retrieval and soft composition. Read the paper for more details; it contains mathematical proofs formalizing their results.
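To make the setting concrete, here's a toy single-layer attention next-token predictor in PyTorch; this is just my minimal version of the setup for intuition, not the paper's formal construction.

```python
import torch
import torch.nn as nn

class OneLayerAttentionLM(nn.Module):
    """Toy model: one causal self-attention layer trained for next-token prediction."""

    def __init__(self, vocab_size: int, d_model: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.attn = nn.MultiheadAttention(d_model, num_heads=1, batch_first=True)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        x = self.embed(tokens)  # (batch, seq, d_model)
        seq_len = tokens.shape[1]
        # Causal mask: position t can only attend to positions <= t.
        mask = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=tokens.device), diagonal=1
        )
        h, _ = self.attn(x, x, x, attn_mask=mask)
        return self.out(h)  # next-token logits at every position

# Train with cross-entropy between logits[:, :-1] and tokens[:, 1:].
```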
Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference
Accelerates long-context inference in Llama 2 by up to 3.7x by compressing the KV cache dynamically.
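As a rough intuition for what compressing the KV cache buys you, here's a toy 2:1 pooling of cached keys/values. This fixed merge is a stand-in only; DMC learns per head whether to append a new entry or merge it into an existing one.

```python
import torch

def compress_kv(keys: torch.Tensor, values: torch.Tensor, ratio: int = 2):
    """Toy KV-cache compression: average each group of `ratio` adjacent entries.
    keys/values: (batch, heads, seq_len, head_dim). Not the paper's learned policy."""
    b, h, t, d = keys.shape
    t_trunc = (t // ratio) * ratio  # drop the ragged tail for simplicity
    k = keys[:, :, :t_trunc].reshape(b, h, t_trunc // ratio, ratio, d).mean(dim=3)
    v = values[:, :, :t_trunc].reshape(b, h, t_trunc // ratio, ratio, d).mean(dim=3)
    return k, v  # ~ratio x less KV memory and attention work per decoded token

keys, values = torch.randn(1, 8, 4096, 128), torch.randn(1, 8, 4096, 128)
k_small, v_small = compress_kv(keys, values)
print(k_small.shape)  # torch.Size([1, 8, 2048, 128])
```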
Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking
Implementation: https://github.com/ezelikman/quiet-star
📚 Resources
Devin AI software engineer 27m walkthrough
I like his remark "imagine this with a reasoning capability bump", which is expected to come from frontier model developers this year, "and a 10X improvement in inference speed" (Groq). The impact on how we use computers will be enormous.
Under The Hood: How OpenAI's Sora Model Works
By an ex-OpenAI employee. Reasonable information density. Worth scanning if you're into text-to-video and want to get a sense of the high-level approach taken and the compute considerations.
Interesting exposé on how memory bandwidth bottlenecks LLM inference speed. The author also wrote the local LLM inference library calm (which achieves impressive performance!).
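The core point is easy to sanity-check with back-of-the-envelope numbers (mine, not the article's): at batch size 1, every generated token has to stream all the weights from VRAM, so memory bandwidth sets a hard ceiling on tokens per second.

```python
# Rough upper bound on single-batch decoding speed for a dense model.
bandwidth_gb_s = 1008        # RTX 4090 peak memory bandwidth (~1 TB/s)
params_billion = 7           # e.g. a 7B-parameter model
bytes_per_param = 2          # fp16/bf16 weights
weights_gb = params_billion * bytes_per_param  # ~14 GB of weights read per token
print(f"<= {bandwidth_gb_s / weights_gb:.0f} tok/s")  # ~72 tok/s ceiling at batch size 1
```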
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
Cool project by a team from Berkeley, MIT and Cornell. It keeps the coding benchmark fresh by pulling new problems from LeetCode, AtCoder and Codeforces. Claude 3 Opus, GPT-4 and Mistral Large trade places for the top spot if you move the evaluation window around a bit. Phind-V2 (a CodeLlama fine-tune) seems to do surprisingly well. Either it's optimized for these kinds of LeetCode-esque code problems or it's genuinely good at coding. Alas, benchmarks are still imperfect.
Input parsing isn't like sequential decoding; why treat it the same?
Google AI: Cappy - boosting large multi-task language models with a small scorer
A common pattern is emerging around generating synthetic instruction data for fine-tuning. Related to Nous Research's recent release of Genstruct (https://huggingface.co/NousResearch/Genstruct-7B). Unfortunately, no model was released by Google AI.
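Since no Cappy checkpoint is available, here's the general shape of the pattern, a small scorer re-ranking candidates from a bigger model, using an off-the-shelf cross-encoder purely as a stand-in.

```python
# Stand-in for the Cappy pattern: a small scorer picks the best of several candidate
# answers produced by a larger LLM. This generic cross-encoder is NOT Cappy.
from sentence_transformers import CrossEncoder

scorer = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

instruction = "In one sentence, explain what mitochondria do."
candidates = [
    "Mitochondria generate most of a cell's chemical energy in the form of ATP.",
    "Mitochondria store the cell's genetic material.",
    "Mitochondria are the outer walls of plant cells.",
]

# Score each (instruction, candidate) pair and keep the highest-scoring answer.
scores = scorer.predict([(instruction, c) for c in candidates])
print(candidates[int(scores.argmax())])
```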
Want more? Follow me on X! @ricklamers