Assistants API, xAI Prompt IDE, and other tools to Boost LLM app development
Week 45 of Coding with Intelligence
On the Hungarian National High School Math Exam (which was released in May 2023), Grok passed the exam with a C (59%), while Claude-2 achieved the same grade (55%), and GPT-4 got a B with 68%.
Performance on coding tasks is _extremely_ impressive. You can also play with the model directly without downloading it from Hugging Face at https://coder.deepseek.com/chat. Its HumanEval benchmark score is 79.3%, but as always, the best way to validate generalization (and rule out overfitting to benchmarks) is to vibe check with some of your personal favorite coding questions.
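A vibe check can be as simple as a small harness that runs your personal questions through the model so you can eyeball the answers side by side. Below is a minimal sketch; `ask_model` is a hypothetical stub you would replace with a real call to whichever model you're evaluating (the questions are illustrative, not from the benchmark):

```python
# Minimal "vibe check" harness: run a fixed set of personal coding questions
# through a model and inspect the answers manually.
VIBE_CHECK_QUESTIONS = [
    "Write a Python function that merges two sorted lists in O(n).",
    "Explain the bug in: for i in range(len(xs)): xs.pop(i)",
]

def ask_model(question: str) -> str:
    # Hypothetical stand-in; swap in a real model/API call here.
    return f"<model answer to: {question}>"

def run_vibe_check(questions):
    """Return (question, answer) pairs for manual review."""
    return [(q, ask_model(q)) for q in questions]

for question, answer in run_vibe_check(VIBE_CHECK_QUESTIONS):
    print(f"Q: {question}\nA: {answer}\n")
```

Keeping the question set fixed lets you compare answers across models and releases.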
It is claimed that they want to be the OpenAI of China. Also see this TechCrunch article https://techcrunch.com/2023/11/05/valued-at-1b-kai-fu-lees-llm-startup-unveils-open-source-model/
In case you'd rather avoid the OpenAI walled garden.
I like their Mermaid generator https://github.com/google/labs-prototypes/blob/main/seeds/graph-playground/docs/graphs/math.md Accompanying blog posts: https://glazkov.com/2023/08/22/composing-graphs-with-breadboard/ and https://glazkov.com/2023/11/03/why-ai-orchestration/
Interesting improvements in ability: "For instance, the merger of WizardLM and WizardMath increases the GSM8K zero-shot accuracy of WizardLM from 2.2 to 66.3."
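The core idea behind this kind of model merging can be sketched as parameter-wise linear interpolation between two checkpoints of the same architecture. Plain floats stand in for tensors here to keep the example self-contained; real merges operate on PyTorch state dicts, and the paper's methods go beyond simple averaging:

```python
# Toy sketch of linear model merging ("weight averaging"): combine two
# checkpoints parameter by parameter.
def merge_weights(state_a, state_b, alpha=0.5):
    """Return alpha * a + (1 - alpha) * b for every shared parameter."""
    assert state_a.keys() == state_b.keys(), "checkpoints must share an architecture"
    return {k: alpha * state_a[k] + (1 - alpha) * state_b[k] for k in state_a}

# Illustrative stand-ins for two fine-tuned checkpoints of one base model.
wizard_lm = {"layer0.weight": 0.2, "layer0.bias": -0.1}
wizard_math = {"layer0.weight": 0.6, "layer0.bias": 0.3}

merged = merge_weights(wizard_lm, wizard_math, alpha=0.5)
print(merged)
```

Because both checkpoints descend from the same base model, their parameters live in compatible coordinates, which is what makes naive interpolation work at all.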
Whisper is OpenAI's speech recognition model; this release reduces errors by 10% to 20% compared to large-v2.
There is more and more focus on RAG systems to improve the task performance of LLMs. Open-source models can be refitted for ranking, a crucial building block in your RAG pipeline.
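To make the ranking stage concrete, here is a minimal sketch of reranking retrieved candidates against a query. A real ranker would score pairs with an embedding or cross-encoder model; a toy bag-of-words cosine similarity stands in for it here:

```python
# Sketch of a rerank stage in a RAG pipeline: score each candidate passage
# against the query and keep the top-k. The scoring function is a toy
# bag-of-words cosine similarity standing in for a learned ranker.
from collections import Counter
import math

def score(query: str, doc: str) -> float:
    q, d = Counter(query.lower().split()), Counter(doc.lower().split())
    dot = sum(q[t] * d[t] for t in q)
    norm = (math.sqrt(sum(v * v for v in q.values()))
            * math.sqrt(sum(v * v for v in d.values())))
    return dot / norm if norm else 0.0

def rerank(query, candidates, top_k=2):
    """Return the top_k candidates, best score first."""
    return sorted(candidates, key=lambda doc: score(query, doc), reverse=True)[:top_k]

docs = [
    "Whisper is a speech recognition model",
    "Rankers order retrieved passages for RAG",
    "RAG pipelines retrieve passages before generation",
]
print(rerank("rag retrieve passages", docs))
```

The retriever optimizes recall over a large corpus; the reranker spends more compute per candidate to optimize precision over the shortlist.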
Very interesting paper published in Nature about an approach that sits in-between rigid symbolic systems and overly flexible neural networks. This might unlock more efficient learners that emphasize learning underlying patterns for increased sample efficiency.
This paper expands on some of the other work in circuit analysis showing through which mechanisms Transformer architectures produce outputs. My point in sharing papers like these is to help you understand how to get maximal performance from LLMs for your systems (if you know how they produce outputs you can become better at producing desirable outputs from them).
tl;dr GPT-4 performs best, CoT doesn't always improve performance, humans beat GPT-4 by a wide margin. Flan-T5-11B is better than GPT-3.5.
It supports multi-function calling, file upload/retrieval, and Code Interpreter. Interesting set of primitives: Threads, Runs, and Run Steps.
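To illustrate how those primitives nest, here is a toy data model: a Thread holds the conversation's messages, a Run is one assistant execution over a Thread, and Run Steps record what the Run did (tool calls, message creation). This mirrors the concepts only; it is not the OpenAI client, and the field names are illustrative:

```python
# Toy model of the Assistants API primitives (illustrative, not the real SDK).
from dataclasses import dataclass, field

@dataclass
class RunStep:
    kind: str    # e.g. "tool_call" or "message_creation"
    detail: str

@dataclass
class Run:
    status: str = "queued"
    steps: list = field(default_factory=list)

@dataclass
class Thread:
    messages: list = field(default_factory=list)
    runs: list = field(default_factory=list)

thread = Thread()
thread.messages.append({"role": "user", "content": "Plot sin(x) for me"})

run = Run(status="in_progress")
run.steps.append(RunStep("tool_call", "code_interpreter invoked"))
run.steps.append(RunStep("message_creation", "Here is your plot."))
run.status = "completed"
thread.runs.append(run)

print(run.status, [s.kind for s in run.steps])
```

The useful design point: state (Thread) is separated from execution (Run), so multiple Runs can replay over the same conversation history.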
Learn from this example (!) or be doomed to ship insecure LLM apps.
Some interesting ideas for developing tests for LLM prompts; check out the docs at https://x.ai/ide/docs
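One pattern for prompt tests is to assert properties of the output (required sections, length bounds) rather than exact strings, since LLM output varies run to run. A minimal sketch, where `complete` is a hypothetical stub you would replace with a real model call:

```python
# Assertion-style test for an LLM prompt: check structural properties of the
# output instead of exact text.
def complete(prompt: str) -> str:
    # Hypothetical stand-in for an LLM call; returns a canned response here.
    return "SUMMARY: the user asked about the refund policy. SENTIMENT: negative."

def test_support_ticket_prompt():
    out = complete("Summarize this support ticket and label its sentiment: ...")
    assert "SUMMARY:" in out, "output must contain a summary section"
    assert "SENTIMENT:" in out, "output must label sentiment"
    assert len(out) < 500, "summaries should stay short"

test_support_ticket_prompt()
print("prompt test passed")
```

Property-based checks like these are cheap to run on every prompt change, much like unit tests for ordinary code.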
Want more? Follow me on Twitter! @ricklamers