Breaking: new OSS LLM record: the 3B replit-code-instruct-glaive achieves 63.5% pass@1 on HumanEval at 75+ tokens/s on a 3090
Week 27 of Coding with Intelligence
We're still waiting on dataset details; if there is no contamination, this would be a huge milestone for open-source code-generation models! Evaluating locally, I measured 70 to 78 tokens per second without batching on an NVIDIA 3090 graphics card.
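For context on the headline metric: HumanEval results are usually reported with the unbiased pass@k estimator from the Codex paper, where n completions are sampled per problem and c of them pass the unit tests. A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k),
    where n samples were drawn and c passed the unit tests."""
    if n - c < k:
        return 1.0  # every size-k subset contains a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# pass@1 reduces to the fraction of passing samples:
print(pass_at_k(10, 6, 1))  # 0.6
```

For k=1 this is simply c/n, so a 63.5% pass@1 means roughly 63.5% of sampled completions passed their tests.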
A collection of Transformer training tricks wrapped into a convenient library. Think of it as Keras for Transformers. They claim a 7x speedup over naive approaches.
Cool idea and nice execution allowing business users to create LLM apps/automations. A bit like Zapier and n8n.
LMSYS, the UC Berkeley research group that also created Vicuna, has launched projects exploring fine-tuning techniques for larger context windows. LongEval attempts to measure how well the longer context is actually utilized. Not all context windows are created equal! Does the model really pick up all the provided information? It also probes themes like rule-following ability as the number of rules grows with the context window.
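To make "is the context actually utilized?" concrete, here is a minimal sketch of a LongEval-style retrieval probe: hide one key-value pair among many distractor lines and check whether the model can recall it exactly. The line format and helper names are illustrative, not LongEval's exact task definition:

```python
import random

def make_lines_prompt(n_lines: int, seed: int = 0):
    """Build a long-context retrieval test: n_lines numbered lines,
    each carrying a random value, plus a question about one line.
    Returns (prompt, expected_answer)."""
    rng = random.Random(seed)
    values = [rng.randint(10000, 99999) for _ in range(n_lines)]
    lines = [f"line {i}: REGISTER_CONTENT is <{v}>" for i, v in enumerate(values)]
    target = rng.randrange(n_lines)
    question = (f"What is the REGISTER_CONTENT in line {target}? "
                f"Answer with the number only.")
    return "\n".join(lines) + "\n" + question, str(values[target])

def exact_match(model_answer: str, expected: str) -> bool:
    # Score by exact match on the digits the model returns.
    return model_answer.strip().strip("<>") == expected
```

Sweeping `n_lines` (and thereby the prompt length) while tracking exact-match accuracy gives a curve of effective context utilization, which is the kind of signal LongEval is after.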
Combine and use LLMs in multiple stages of your ML pipeline. This paper gives good hints on how to best use LLMs for your downstream tasks.
It looks like larger context windows will become available for more open source models. Meta is backing up its claims of contribution with actual applied research; this is great!
If you're working within a context that can benefit from the symbolic capabilities of the Mathematica/Wolfram ecosystem you might want to check this out. LLMs are notoriously bad calculators so this looks like a powerful combination.
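The underlying pattern is simple: route arithmetic out of the model and into a tool that computes exactly. As a stand-in for the Mathematica/Wolfram integration (which this sketch is not), here is a minimal, safe calculator tool an LLM could call instead of computing in-context, using only Python's `ast` module:

```python
import ast
import operator

# Arithmetic AST nodes we allow; everything else is rejected,
# so arbitrary code in the expression cannot execute.
OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub,
    ast.Mult: operator.mul, ast.Div: operator.truediv,
    ast.Pow: operator.pow, ast.USub: operator.neg,
}

def calculate(expression: str) -> float:
    """Safely evaluate a plain arithmetic expression string."""
    def ev(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in OPS:
            return OPS[type(node.op)](ev(node.operand))
        raise ValueError("unsupported expression")
    return ev(ast.parse(expression, mode="eval").body)

print(calculate("3 * (17 + 4) ** 2"))  # 1323
```

The LLM only has to emit the expression string; the tool returns the exact result, which is precisely where symbolic backends like Wolfram shine over in-context arithmetic.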
A real-world deep-dive on what it’s like to fine-tune LLMs on AMD hardware. For LLMs built on the PyTorch stack, no code changes were necessary. The article highlights how components of the AMD stack map to their NVIDIA counterparts (RCCL to NCCL, ROCm to CUDA).
If you're deploying OSS models on GCP/TPUs, then this PyTorch update is a must-read. The article provides an overview of the latest "bag of tricks" available to improve inference performance, in particular latency at large max_len.
Want more? Follow me on Twitter! @ricklamers