Anthropic's Claude 2: 4X cheaper than GPT-4-32k and up-to-date world knowledge
Week 28 of Coding with Intelligence
📰 News
71.2% on HumanEval, a 100k context window, and an available API: this is starting to look competitive with GPT-4. Check out Jim Fan's take in this tweet.
Rectification: replit-code-instruct-glaive HumanEval score is inflated
Last week I opened the newsletter with replit-code-instruct-glaive's eye-popping HumanEval score of 63.5%. I warned "if there is no contamination", and alas, it turns out the training data was contaminated: HumanEval leaked into it, so the model was overfitting to the benchmark. Now, who is going to run the due diligence that verifies data leakage for Anthropic's Claude 2 HumanEval score of 71.2% ...
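The usual first pass at this kind of due diligence is an n-gram overlap check between the benchmark and the training corpus. A minimal sketch (my own hypothetical helper, not any lab's actual methodology):

```python
# Flag benchmark problems whose n-grams also appear in the training corpus.
# Hypothetical sketch: real contamination checks normalize whitespace,
# tokenize properly, and scan corpora far too large to hold in one string.

def ngrams(text: str, n: int = 8) -> set:
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_rate(benchmark: list, corpus: str, n: int = 8) -> float:
    """Fraction of benchmark items sharing at least one n-gram with the corpus."""
    corpus_ngrams = ngrams(corpus, n)
    hits = sum(1 for item in benchmark if ngrams(item, n) & corpus_ngrams)
    return hits / len(benchmark) if benchmark else 0.0
```

Any nonzero rate on verbatim 8-grams would be a red flag worth investigating before trusting a headline score.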
OpenAI creates a dedicated team for alignment
They are hiring additional ML engineers and committing 20% of their compute resources. Headed by Ilya Sutskever.
📱 Demos
OpenRouter: route LLM calls to whatever model you prefer, dynamically
Check out their Streamlit app to build intuition for the idea. Don't forget to check the docs if you want to learn more.
📦 Repos
Smolex - A code retrieval ChatGPT Plugin
Since GitHub Copilot Chat is in private preview, we need a smarter way to provide IDE context to ChatGPT's GPT-4 so we can discuss our project code. Unless you like copy-pasting a lot! Smolex uses a neat feature of ChatGPT plugin development: locally hosted plugins. It uses an AST-based indexer and a vector database for retrieval. One thing I dislike is the required roundtrip of ChatGPT plugins, which makes it notably slower than asking a direct question. Maybe a ChatGPT plugin design flaw? I can see how direct plugin execution could provide value through speedier prompt construction compared to LLM-based "tool invocation", although the latter is more general and flexible. Maybe when LLM inference no longer takes multiple seconds...
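The AST-indexing idea is simple enough to sketch. Roughly (this is my own illustration, not Smolex's actual code): walk a Python module's AST, collect each function's source segment, and feed those into a vector store for retrieval.

```python
# Sketch of AST-based code indexing: map function names to their source,
# ready to be embedded and stored in a vector database for retrieval.
import ast

def index_functions(source: str) -> dict:
    """Map function name -> its source segment, for later embedding/retrieval."""
    tree = ast.parse(source)
    return {
        node.name: ast.get_source_segment(source, node)
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    }
```

The payoff over naive file chunking is that each retrieved unit is a complete, syntactically meaningful function rather than an arbitrary slice of text.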
Semantic Kernel: LangChain alternative by Microsoft
Choice is very welcome, but I'm not sure I'm happy with the DX so far: the docs default to C#, favor videos over code snippets, and it's all a bit scattered. This planner example is interesting: https://learn.microsoft.com/en-us/semantic-kernel/get-started/quick-start-guide/using-the-planner
gpt-prompt-engineer: let the LLM optimize the prompt
This repo is interesting to me because LLMs aren't magic; they're "just" models that have compressed the training data in a useful way. Deep learning models typically don't do well on out-of-distribution inputs. Can prompt optimization work by aligning prompts with the data distribution the model was trained on? I think this is a very interesting area to explore further! Consider this a call for papers. Found anything? Please let me know.
📄 Papers
LongNet: Scaling Transformers to 1,000,000,000 Tokens
Microsoft Research introduces dilated attention to achieve computational cost that scales linearly with the length of the input sequence. An interesting scan for ideas for model developers, but the benchmarks are limited and no trained model is released. The trade-off is invariably an attention mechanism less complete than the standard, hard-to-scale O(N^2) method in which every input token attends to every other input token. How good their heuristic is will have to be shown through meaningful (ahem, perplexity) performance benchmarks.
Mixture-of-Experts Meets Instruction Tuning: A Winning Combination for Large Language Models
This paper is likely an exploration of what GPT-4 did. The main result is that MoE can give better performance at a third of the FLOPs, which makes MoE interesting mainly from a compute-efficiency perspective. They don't show data for an MoE model larger than 32B.
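The compute-efficiency argument is easiest to see in a toy top-1 router (illustrative only, far simpler than the paper's models): each token runs through just the one expert its gate scores highest, so per-token FLOPs stay constant no matter how many experts you add.

```python
# Toy top-1 mixture-of-experts forward pass: pick the expert with the
# highest gate score and run only that one, so total parameters grow with
# the expert count while per-token compute does not.
def moe_forward(x: float, gates: list, experts: list) -> float:
    top = max(range(len(gates)), key=lambda i: gates[i])  # argmax over gates
    return experts[top](x)  # only one expert's FLOPs are spent

experts = [lambda v: v + 1, lambda v: v * 2, lambda v: -v]
```

In a real model the gate is a learned layer and the experts are feed-forward blocks, but the routing logic is this same argmax-and-dispatch.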
🛠️ Products
Easiest way to run Open Source GPT models on your local device.
Cody: IDE coding assistant for VS Code and IntelliJ
Supports chat and completions. Explore the variety of "brushes": a lot of interesting ideas. It makes heavy use of symbolic scanning of the repo plus embeddings to make the assistant understand your code. Noteworthy is their support for multi-repo codebases.
📚 Resources
symbex: search Python code for functions and classes, then pipe them into a LLM
This is a very useful tool for giving the model the relevant context in a much smarter way than copying and pasting. Until vector-based or other retrieval methods become smart enough, I don't think a developer who knows their codebase and uses something like symbex can be beaten on single-shot accuracy.
The hot mess theory of AI misalignment: More intelligent agents behave less coherently
The first informative resource I've found on AI alignment. Sohl-Dickstein works at Google Brain and pioneered diffusion models. By his own theory he must be very incoherent: read the article to understand what I mean :)
Want more? Follow me on Twitter! @ricklamers