The LLM build phase: these tools will help you build faster
Week 33 of Coding with Intelligence
📦 Repos
Outlines: Generative Model Programming
YALCC: Yet Another LangChain Clone. It does have some interesting ideas around the API for constrained predictions, though! For example:
generate.regex(model, r"\s*([Yy]es|[Nn]o|[Nn]ever|[Aa]lways)", max_tokens=30)(prompt)
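For context, here is a minimal sketch of how that call fits into a full script. It assumes Outlines' Hugging Face transformers loader and the module layout from around this release; the exact import paths may have changed since, so check the repo's README for the current API.

import outlines.models as models
import outlines.text.generate as generate

# Load a small Hugging Face model through Outlines' transformers integration
model = models.transformers("gpt2")

prompt = "Is constrained decoding useful for structured outputs? Answer:"

# Constrain sampling so the completion must match the regex
answer = generate.regex(model, r"\s*([Yy]es|[Nn]o|[Nn]ever|[Aa]lways)", max_tokens=30)(prompt)
print(answer)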
Demonstrate-Search-Predict Python
A framework for composing retrieval and language models for knowledge-intensive NLP. Good resource (paper & repo) to mine for ideas when building your own LLM apps.
Their stated goal for the framework: "making it 100x easier to experiment and ship AI applications using the LLMs with minimum boilerplate"
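To make the "composing retrieval and language models" idea concrete, here is a bare-bones sketch of the retrieve-then-predict pattern that frameworks like DSP formalize. This is not the DSP API; search and llm are hypothetical stand-ins for your retriever and model client.

def retrieve_then_predict(question, search, llm, k=3):
    # Search: fetch supporting passages for the question
    passages = search(question, k=k)
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    # Predict: condition the LM on the retrieved context
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
    return llm(prompt)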
📄 Papers
AdaTape: Foundation model with adaptive computation and dynamic read-and-write
AdaTape presents a new take on Transformers' traditional "fixed cost for every input" approach. A dynamic approach makes it possible to tune the cost of inference to the complexity of the input. This makes intuitive sense: not all problems are equally difficult, so allocating the same compute budget to every input is quite counter-intuitive. Compare, for example, "I hate this movie! Sentiment:" with "An alternative strategy to modern economic policy should at least include the following ideas:".
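The gist in code, as a conceptual illustration only (this is not the AdaTape architecture, and every name below is made up for the sketch): append a variable number of learned "tape" embeddings to the input so that harder inputs get a longer sequence, and therefore more compute, inside an otherwise standard Transformer.

import torch
import torch.nn as nn

class AdaptiveTapeInput(nn.Module):
    def __init__(self, d_model: int, bank_size: int = 32, max_tape: int = 8):
        super().__init__()
        self.tape_bank = nn.Parameter(torch.randn(bank_size, d_model))
        self.score = nn.Linear(d_model, 1)  # crude "how much extra compute?" head
        self.max_tape = max_tape

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, seq, d_model)
        # Decide how many tape tokens to append based on a pooled input score
        n_tape = int(torch.sigmoid(self.score(x.mean(dim=1))).mean() * self.max_tape)
        if n_tape == 0:
            return x
        tape = self.tape_bank[:n_tape].unsqueeze(0).expand(x.size(0), -1, -1)
        return torch.cat([x, tape], dim=1)  # longer sequence -> more compute downstream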
The False Promise of Imitating Proprietary LLMs
This paper suggests that knowledge distillation through example chat conversations does not perform as well as initial experiments seemed to show.
BOLAA: Benchmarking and Orchestrating LLM-augmented Autonomous Agents
Good to see research labs like Salesforce contribute to better evaluation of LLM-augmented Autonomous Agents (LAAs, a new term?). Based on the paper's results, it looks like a "many agent" approach built around specialization could work!
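A generic sketch of that "many agent" orchestration pattern (this is not BOLAA's code; the agent roles and the llm callable are hypothetical):

# A lightweight controller routes each task to a specialized agent prompt.
AGENTS = {
    "search": lambda task, llm: llm(f"You are a web-search agent. {task}"),
    "code":   lambda task, llm: llm(f"You are a coding agent. {task}"),
    "math":   lambda task, llm: llm(f"You are a math agent. {task}"),
}

def orchestrate(task, llm):
    # Ask the controller LLM which specialist should handle the task
    choice = llm(
        f"Task: {task}\nWhich specialist should handle this? "
        f"Reply with one of: {', '.join(AGENTS)}."
    ).strip().lower()
    handler = AGENTS.get(choice, AGENTS["search"])  # fall back to a default agent
    return handler(task, llm)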
📚 Resources
Code annotated explanation of training the Llama transformer model from scratch
If you're more into videos, you can also check out Andrej Karpathy's "Let's build GPT": https://www.youtube.com/watch?v=kCc8FmEb1nY. If you're more interested in fine-tuning Llama, you might want to check out this fine-tuning notebook: https://github.com/mshumer/gpt-llm-trainer/blob/main/One_Prompt___Fine_Tuned_LLaMA_2.ipynb
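As a small taste of what "from scratch" involves, here is a minimal PyTorch sketch of RMSNorm, the normalization layer Llama uses in place of LayerNorm (hyperparameters are illustrative):

import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root-mean-square layer norm as used in Llama: no mean subtraction, no bias."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize by the RMS over the last dimension, then apply a learned scale
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)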
Pre-train Megatron-LM on AWS custom silicon Trainium
It's good to know there are options beyond NVIDIA. The chip and its associated instances are available on AWS today. Note that you can also use the Neuron SDK for fine-tuning, in addition to pre-training from scratch.
Making AMD GPUs competitive for LLM inference
AMD GPUs are quickly becoming more common on both the training and inference side. This article gives a short overview of performance data gathered on an RX 7900 XTX using ROCm: it reaches 80% of the tokens/sec of an RTX 4090 for Llama2-7B/13B. Note that an RTX 4090 is about twice as expensive as the 7900 XTX at current retail prices, so the AMD card beats the NVIDIA GPU on cost-per-performance by a large margin! Keep in mind these are relatively small language models; multi-GPU inference setups with datacenter GPUs (A100, H100, MI300) might not show similar results.
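A quick back-of-the-envelope on those numbers (the dollar figures are illustrative assumptions; only the "~2x price" and "80% throughput" ratios come from the article):

tokens_per_sec_4090 = 1.0      # normalize the RTX 4090 to 1.0
tokens_per_sec_7900xtx = 0.8   # ~80% of the 4090, per the article

price_4090 = 1600              # assumed ~2x the 7900 XTX at retail
price_7900xtx = 800

perf_per_dollar_nvidia = tokens_per_sec_4090 / price_4090
perf_per_dollar_amd = tokens_per_sec_7900xtx / price_7900xtx
print(perf_per_dollar_amd / perf_per_dollar_nvidia)  # -> 1.6x in AMD's favor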
Want more? Follow me on Twitter! @ricklamers