The interplay of pre-training and fine-tuning: learn what's happening

Week 44 of Coding with Intelligence

Nov 01, 2023

“Open Source AI/LLMs” is about more than just model drops from open labs. It’s about the interplay of academic researchers & industry participants figuring out how to cost effectively get the most performance from deep learning based models. It’s great to see the open knowledge around how to get the most performance/dollar/watt flourishing. If you’re building LLM apps, there’s no shortage of thoughtful experiments you can run to optimize inference speed, cost & accuracy.

Many of the listings in this week’s CoWI touch on this rapidly evolving landscape. I hope these curated resources make you a better AI researcher & engineer!

📰 News

Mosaic scales to 128x MI250 cards for LLM training
Nathan Lambert moves to Allen AI for open RL/RLHF research
This could be interesting for improving OS models. I'd keep an eye on what he publishes over the next months!
Leak: GPT-3.5-Turbo likely 20B param model
Great news for OS model initiatives if true. Source is a Microsoft paper that's now retracted 👀

📦 Repos

funcchain: the most Pythonic way of using OpenAI functions
Just checkout the example code in the README, it's so clean! It builds on the strengths of Pydantic for validated structured outputs from LLMs.
Open Source version of Vercel's v0
Prompt to UI is gaining traction.
JudgeLM: Fine-tuned Large Language Models are Scalable Judges
This can be particularly useful for routing strategies/systems to LLMs at different cost/speed/quality levels.
Voyager: in-memory nearest-neighbor search by Spotify
Voyager is used extensively in production at Spotify, and is queried hundreds of millions of times per day to power numerous user-facing features.

📄 Papers

An Emulator for Fine-Tuning Large Language Models using Small Language Models
This Stanford paper introduces the idea of using a fine-tuned small model (e.g. Llama-2-chat-7B) and a pre-trained large model (e.g. Llama-2-base-70B) to get the benefits of fine-tuning without costly fine-tuning procedures on large pre-trained models. They pair the two models and use an EFT (Emulated Fine-Tuning) variant of the speculative decoding scheme to predict tokens with high throughput. The main goal of the paper is to disentangle the contribution of the pre-training and fine-tuning stage. The two stage process by which modern LLMs like Llama-2 and GPT-4 are trained. They summarize that fine-tuning generally improves helpfulness, while scaling up pre-training tends to improve factuality.
Google DeepMind creates a WebAgent that performs workflow automation tasks
Enhancing Chain-of-Thoughts Prompting with Iterative Bootstrapping in Large Language Models
Similar to the approach proposed by DeepMind in https://arxiv.org/abs/2310.01714
In-Context Learning Creates Task Vectors
Very interesting paper that gives more insight into how models are able to perform in-context learning. This is a related paper https://arxiv.org/abs/2310.15213
QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models
The paper is nicely done and shows the validation losses for familiar datasets in the RedPajama dataset for sources like StackExchange, GitHub, arXiv and Wikipedia.
Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla
Interpretability work is very exciting as success in the area will lead to more efficient architectures that improve performance on downstream tasks. This is a nice win for a technique in mechanistic interpretability called "circuit analysis". It looks at what happens inside the language model when performing a specific task, in this paper: answering multiple choice questions. Kudos DeepMind for contributing this result to the broader community!
MosaicML competitor CentML raises $27M
If you're training & serving your own OS models might be worth evaluating.

🛠️ Products

LLM Routing as-a-service
Interesting idea!

📚 Resources

Good summary of executive order and what it means for openness in AI
By the fine folks from AI Snake Oil, a blog by two Princeton scholars.
Math heavy explanation of Diffusion models; or sampling from an arbitrary distribution
Quite nice work by @dan_p_simpson. He narrates the mathematics nicely such that you can get a gist of the general areas of math involved in diffusion models. Good read if you're looking to get started making better diffusion models (for image creation or otherwise sampling from distributions we observe solely from available data).
RedPajama-Data-v2: an Open Dataset with 30 Trillion Tokens
This could be useful if you're fine-tuning too, finding relevant bits for synthetic dataset creation using frontier models for domain oriented transformation.
Good interview with Shane Legg (DeepMind Founder) - 2028 AGI, New Architectures, Aligning Superhuman Models
Especially his ideas about Solomonoff induction and the ability to go from sequence models to more powerful (AGI) systems is intriguing.
Lamini shares resources on finetuning LLMs on the AMD stack

Want more? Follow me on Twitter! @ricklamers

Coding with Intelligence

The interplay of pre-training and fine-tuning: learn what's happening

Week 44 of Coding with Intelligence

Discussion about this post