Assistants API, xAI Prompt IDE, and other tools to boost LLM app development
Week 45 of Coding with Intelligence
📰 News
xAI publishes results about their second LLM: Grok-1
On the Hungarian National High School Math Exam (released in May 2023), Grok passed with a C (59%), Claude-2 achieved the same grade (55%), and GPT-4 got a B (68%).
DeepSeek releases Coder: a 33B/7B family of coding models
Performance on coding tasks is _extremely_ impressive. You can also play with the model directly at https://coder.deepseek.com/chat without downloading it from Hugging Face. The HumanEval benchmark score is 79.3%, but as always, the best way to validate generalization (lack of overfitting to benchmarks) is to vibe check with some of your personal favorite coding questions.
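If you'd rather run it locally, here is a minimal sketch with Hugging Face transformers; treat the model id deepseek-ai/deepseek-coder-6.7b-instruct and the chat-template call as assumptions on my part and check the model card for the exact snippet.

```python
# Minimal local-inference sketch (assumption: the instruct checkpoint is published
# as "deepseek-ai/deepseek-coder-6.7b-instruct" and ships a chat template).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/deepseek-coder-6.7b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", trust_remote_code=True)

messages = [{"role": "user", "content": "Write a Python function that checks if a string is a palindrome."}]
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to(model.device)

outputs = model.generate(inputs, max_new_tokens=256, do_sample=False)
# Decode only the newly generated tokens, not the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```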
Valued at $1B, Kai-Fu Lee's LLM startup unveils 34B open source model
It is claimed that they want to be the OpenAI of China. Also see this TechCrunch article https://techcrunch.com/2023/11/05/valued-at-1b-kai-fu-lees-llm-startup-unveils-open-source-model/
📦 Repos
MSFT's DeepSpeed releases vLLM inference alternative FastGen: 2.3x higher effective throughput
In case you prefer not to be locked into the OpenAI walled garden.
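FastGen is exposed through DeepSpeed-MII; a minimal serving sketch is below. The pipeline call mirrors the FastGen announcement, but treat the exact model id and arguments as assumptions and check the repo before relying on them.

```python
# Sketch of non-persistent FastGen inference via DeepSpeed-MII.
# Assumption: a recent deepspeed-mii release and a HF model id supported by FastGen.
import mii

pipe = mii.pipeline("mistralai/Mistral-7B-v0.1")
responses = pipe(["DeepSpeed is", "Seattle is"], max_new_tokens=64)
for r in responses:
    print(r)
```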
Breadboard: LangChain alternative by Googler
I like their Mermaid generator https://github.com/google/labs-prototypes/blob/main/seeds/graph-playground/docs/graphs/math.md Accompanying blog posts: https://glazkov.com/2023/08/22/composing-graphs-with-breadboard/ and https://glazkov.com/2023/11/03/why-ai-orchestration/
Model Merge strategy by Alibaba researchers
Interesting improvements in ability: "For instance, the merger of WizardLM and WizardMath increases the GSM8K zero-shot accuracy of WizardLM from 2.2 to 66.3"
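To make the merging idea concrete, here is a toy sketch in the spirit of delta-parameter merging (subtract a shared base, sparsify and rescale the deltas, add them back). It is not the paper's exact algorithm, so consult the paper for the real method.

```python
# Toy sketch: merge two fine-tunes of the same base model by combining their
# "delta" parameters. Inspired by drop-and-rescale style merging; NOT the
# paper's exact algorithm.
import torch

def merge_state_dicts(base, model_a, model_b, drop_p=0.9):
    merged = {}
    for name, base_w in base.items():
        delta_a = model_a[name] - base_w
        delta_b = model_b[name] - base_w
        # Randomly drop most delta entries, rescale the survivors,
        # then add both sparsified deltas back onto the base weights.
        mask_a = (torch.rand_like(delta_a) > drop_p).float()
        mask_b = (torch.rand_like(delta_b) > drop_p).float()
        merged[name] = base_w + (delta_a * mask_a + delta_b * mask_b) / (1 - drop_p)
    return merged
```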
OpenAI open sources Whisper large-v3
A 10% to 20% reduction in errors compared to large-v2. Whisper is a speech recognition model.
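Usage is unchanged from earlier versions; a minimal sketch with the openai-whisper package is below (assumption: a release recent enough to include the large-v3 checkpoint).

```python
# pip install -U openai-whisper  (assumption: a version that ships large-v3)
import whisper

model = whisper.load_model("large-v3")
result = model.transcribe("meeting.mp3")  # hypothetical local audio file
print(result["text"])
```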
📄 Papers
Fine-Tuning LLaMA for Multi-Stage Text Retrieval
There is increasing focus on RAG systems to improve the task performance of LLMs. Open-source models can be refitted for ranking, a crucial building block of your RAG pipeline.
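To illustrate the retrieve-then-rerank pattern (a generic cross-encoder stand-in, not the paper's fine-tuned LLaMA ranker), here is a sketch with sentence-transformers; the model id is just a commonly used example checkpoint.

```python
# Sketch of second-stage reranking on top of a first-stage retriever.
# Assumption: the ms-marco cross-encoder as a stand-in reranker,
# not the fine-tuned LLaMA ranker from the paper.
from sentence_transformers import CrossEncoder

query = "How do I fine-tune an open-source LLM for ranking?"
candidates = [
    "LLMs can be fine-tuned as rerankers for retrieval pipelines.",
    "The weather in Amsterdam is usually rainy in November.",
    "RAG pipelines retrieve documents and feed them to a generator.",
]

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = reranker.predict([(query, doc) for doc in candidates])
ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
for doc, score in ranked:
    print(f"{score:.3f}  {doc}")
```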
Pretraining Data Mixtures Enable Narrow Model Selection Capabilities in Transformer Models
Human-like systematic generalization through a meta-learning neural network
Very interesting paper published in Nature about an approach that sits between rigid symbolic systems and overly flexible neural networks. This might unlock more efficient learners that emphasize learning underlying patterns for increased sample efficiency.
Language models implement simple word2vec-style vector arithmetic
This paper expands on other work in circuit analysis, showing through which mechanisms Transformer architectures produce outputs. I share papers like these because understanding how LLMs produce outputs helps you get better at eliciting the outputs you want from them in your systems.
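As a refresher on what "word2vec-style vector arithmetic" means, here is the classic analogy computation with pretrained static embeddings via gensim. This is a concept illustration only; the paper analyzes analogous arithmetic happening inside Transformer hidden states.

```python
# Classic embedding arithmetic: king - man + woman ≈ queen.
# Concept illustration, not the paper's internal-circuit analysis.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")  # downloads the vectors on first run
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```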
CRoW: Benchmarking Commonsense Reasoning in Real-World Tasks
tl;dr GPT-4 performs best, CoT doesn't always improve performance, humans beat GPT-4 by a wide margin. Flan-T5-11B is better than GPT-3.5.
📚 Resources
OpenAI releases Assistants API
It supports multi-function calling, file upload with Retrieval, and Code Interpreter. Interesting set of primitives: Threads, Runs, and Run Steps.
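A minimal sketch of the Threads/Runs flow with the openai Python SDK (v1.x beta endpoints at the time of writing; names and polling details may change, so check the API reference).

```python
# Sketch of the Assistants API primitives: Assistant -> Thread -> Run.
# Assumption: openai Python SDK v1.x with the beta Assistants endpoints.
import time
from openai import OpenAI

client = OpenAI()

assistant = client.beta.assistants.create(
    name="Data helper",
    instructions="You help analyze CSV files the user uploads.",
    tools=[{"type": "code_interpreter"}],
    model="gpt-4-1106-preview",
)

thread = client.beta.threads.create()
client.beta.threads.messages.create(
    thread_id=thread.id, role="user", content="Plot a histogram of column 'age'."
)

run = client.beta.threads.runs.create(thread_id=thread.id, assistant_id=assistant.id)
# Runs are asynchronous: poll until the run leaves the queued/in_progress states.
while run.status in ("queued", "in_progress"):
    time.sleep(1)
    run = client.beta.threads.runs.retrieve(thread_id=thread.id, run_id=run.id)

for message in client.beta.threads.messages.list(thread_id=thread.id).data:
    print(message.role, message.content)
```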
Google Bard susceptible to data exfiltration attack through Prompt Injection
Learn from this example (!) or be doomed to ship insecure LLM apps.
xAI Prompt IDE in private preview (docs open)
Some interesting ideas to develop tests for LLM prompts: check out the docs here https://x.ai/ide/docs
Want more? Follow me on Twitter! @ricklamers