Assistants API, xAI Prompt IDE, and other tools to Boost LLM app development
Week 45 of Coding with Intelligence
On the Hungarian National High School Math Exam (which was released in May 2023), Grok passed the exam with a C (59%), while Claude-2 achieved the same grade (55%), and GPT-4 got a B with 68%.
Performance on coding tasks is _extremely_ impressive. You can also play with the model directly without downloading it from Hugging Face at https://coder.deepseek.com/chat. Its HumanEval benchmark score is 79.3%, but as always, the best way to validate generalization (and rule out overfitting to benchmarks) is to vibe check with some of your personal favorite coding questions.
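A vibe check can be as simple as a small harness that runs your personal questions through the model so you can eyeball the answers side by side. Below is a minimal sketch; `ask_model` is a hypothetical stub you would replace with a real call to whichever model you're evaluating (the questions are illustrative, not from the benchmark):

```python
# Minimal "vibe check" harness: run a fixed set of personal coding questions
# through a model and inspect the answers manually.
VIBE_CHECK_QUESTIONS = [
    "Write a Python function that merges two sorted lists in O(n).",
    "Explain the bug in: for i in range(len(xs)): xs.pop(i)",
]

def ask_model(question: str) -> str:
    # Hypothetical stand-in; swap in a real model/API call here.
    return f"<model answer to: {question}>"

def run_vibe_check(questions):
    """Return (question, answer) pairs for manual review."""
    return [(q, ask_model(q)) for q in questions]

for question, answer in run_vibe_check(VIBE_CHECK_QUESTIONS):
    print(f"Q: {question}\nA: {answer}\n")
```

Keeping the question set fixed lets you compare answers across models and releases.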
It is claimed that they want to be the OpenAI of China. Also see this TechCrunch article https://techcrunch.com/2023/11/05/valued-at-1b-kai-fu-lees-llm-startup-unveils-open-source-model/
In case you'd rather avoid the OpenAI walled garden.
I like their Mermaid generator https://github.com/google/labs-prototypes/blob/main/seeds/graph-playground/docs/graphs/math.md Accompanying blog posts: https://glazkov.com/2023/08/22/composing-graphs-with-breadboard/ and https://glazkov.com/2023/11/03/why-ai-orchestration/
Interesting improvements in ability: "For instance, the merger of WizardLM and WizardMath increases the GSM8K zero-shot accuracy of WizardLM from 2.2 to 66.3."
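The core idea behind this kind of model merging can be sketched as parameter-wise linear interpolation between two checkpoints of the same architecture. Plain floats stand in for tensors here to keep the example self-contained; real merges operate on PyTorch state dicts, and the paper's methods go beyond simple averaging:

```python
# Toy sketch of linear model merging ("weight averaging"): combine two
# checkpoints parameter by parameter.
def merge_weights(state_a, state_b, alpha=0.5):
    """Return alpha * a + (1 - alpha) * b for every shared parameter."""
    assert state_a.keys() == state_b.keys(), "checkpoints must share an architecture"
    return {k: alpha * state_a[k] + (1 - alpha) * state_b[k] for k in state_a}

# Illustrative stand-ins for two fine-tuned checkpoints of one base model.
wizard_lm = {"layer0.weight": 0.2, "layer0.bias": -0.1}
wizard_math = {"layer0.weight": 0.6, "layer0.bias": 0.3}

merged = merge_weights(wizard_lm, wizard_math, alpha=0.5)
print(merged)
```

Because both checkpoints descend from the same base model, their parameters live in compatible coordinates, which is what makes naive interpolation work at all.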
Whisper is OpenAI's speech recognition model; this release reduces errors by 10% to 20% compared to large-v2.
There is more and more focus on RAG systems to improve the task performance of LLMs. Open-source models can be refitted for ranking, a crucial building block in your RAG pipeline.
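To make the ranking stage concrete, here is a minimal sketch of reranking retrieved candidates against a query. A real ranker would score pairs with an embedding or cross-encoder model; a toy bag-of-words cosine similarity stands in for it here:

```python
# Sketch of a rerank stage in a RAG pipeline: score each candidate passage
# against the query and keep the top-k. The scoring function is a toy
# bag-of-words cosine similarity standing in for a learned ranker.
from collections import Counter
import math

def score(query: str, doc: str) -> float:
    q, d = Counter(query.lower().split()), Counter(doc.lower().split())
    dot = sum(q[t] * d[t] for t in q)
    norm = (math.sqrt(sum(v * v for v in q.values()))
            * math.sqrt(sum(v * v for v in d.values())))
    return dot / norm if norm else 0.0

def rerank(query, candidates, top_k=2):
    """Return the top_k candidates, best score first."""
    return sorted(candidates, key=lambda doc: score(query, doc), reverse=True)[:top_k]

docs = [
    "Whisper is a speech recognition model",
    "Rankers order retrieved passages for RAG",
    "RAG pipelines retrieve passages before generation",
]
print(rerank("rag retrieve passages", docs))
```

The retriever optimizes recall over a large corpus; the reranker spends more compute per candidate to optimize precision over the shortlist.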
Very interesting paper published in Nature about an approach that sits in-between rigid symbolic systems and overly flexible neural networks. This might unlock more efficient learners that emphasize learning underlying patterns for increased sample efficiency.
This paper expands on some of the other work in circuit analysis showing through which mechanisms Transformer architectures produce outputs. My point in sharing papers like these is to help you understand how to get maximal performance from LLMs for your systems (if you know how they produce outputs you can become better at producing desirable outputs from them).
tl;dr GPT-4 performs best, CoT doesn't always improve performance, humans beat GPT-4 by a wide margin. Flan-T5-11B is better than GPT-3.5.
It supports multi-function calling, file upload/retrieval, and Code Interpreter. Interesting set of primitives: Threads, Runs, and Run Steps.
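To illustrate how those primitives nest, here is a toy data model: a Thread holds the conversation's messages, a Run is one assistant execution over a Thread, and Run Steps record what the Run did (tool calls, message creation). This mirrors the concepts only; it is not the OpenAI client, and the field names are illustrative:

```python
# Toy model of the Assistants API primitives (illustrative, not the real SDK).
from dataclasses import dataclass, field

@dataclass
class RunStep:
    kind: str    # e.g. "tool_call" or "message_creation"
    detail: str

@dataclass
class Run:
    status: str = "queued"
    steps: list = field(default_factory=list)

@dataclass
class Thread:
    messages: list = field(default_factory=list)
    runs: list = field(default_factory=list)

thread = Thread()
thread.messages.append({"role": "user", "content": "Plot sin(x) for me"})

run = Run(status="in_progress")
run.steps.append(RunStep("tool_call", "code_interpreter invoked"))
run.steps.append(RunStep("message_creation", "Here is your plot."))
run.status = "completed"
thread.runs.append(run)

print(run.status, [s.kind for s in run.steps])
```

The useful design point: state (Thread) is separated from execution (Run), so multiple Runs can replay over the same conversation history.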
Learn from this example (!) or be doomed to ship insecure LLM apps.
Some interesting ideas for developing tests for LLM prompts; check out the docs at https://x.ai/ide/docs
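One pattern for prompt tests is to assert properties of the output (required sections, length bounds) rather than exact strings, since LLM output varies run to run. A minimal sketch, where `complete` is a hypothetical stub you would replace with a real model call:

```python
# Assertion-style test for an LLM prompt: check structural properties of the
# output instead of exact text.
def complete(prompt: str) -> str:
    # Hypothetical stand-in for an LLM call; returns a canned response here.
    return "SUMMARY: the user asked about the refund policy. SENTIMENT: negative."

def test_support_ticket_prompt():
    out = complete("Summarize this support ticket and label its sentiment: ...")
    assert "SUMMARY:" in out, "output must contain a summary section"
    assert "SENTIMENT:" in out, "output must label sentiment"
    assert len(out) < 500, "summaries should stay short"

test_support_ticket_prompt()
print("prompt test passed")
```

Property-based checks like these are cheap to run on every prompt change, much like unit tests for ordinary code.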
Want more? Follow me on Twitter! @ricklamers