Falcon 180B lands & OpenAI's story in WIRED: how Alec Radford's discovery of the Transformer's potential changed everything
Week 36 of Coding with Intelligence
And it's better than Llama 2 70B (2% better on the Hugging Face Open LLM Leaderboard), albeit at nearly 2.6x the parameter budget (180B vs 70B).
Full F16 precision 34B Code Llama at >20 t/s on M2 Ultra
Cheer on the training from the sidelines by watching the loss plot on W&B: https://wandb.ai/lance777/lightning_logs/reports/metric-train_loss-23-09-04-23-38-15---Vmlldzo1MzA4MzIw?accessToken=5eu2sndit2mo6eqls8h38sklcgfwt660ek1f2czlgtqjv2c6tida47qm1oty8ik9. The first 105B-token checkpoint scores 43.50 on HellaSwag (acc_norm). ETA for wrapping up training is 2023-12-01. It's being trained on 16 A100-40G GPUs.
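If you'd rather follow along programmatically than refresh the report, the W&B public API can pull a run's logged metrics. A minimal sketch, assuming you swap in the real entity/project/run ID (the run path below is a hypothetical placeholder):

```python
# Minimal sketch: pull the logged train_loss series for a public W&B run.
# The run path is a placeholder; replace it with the actual entity/project/run_id.
import wandb

api = wandb.Api()
run = api.run("lance777/lightning_logs/RUN_ID")  # hypothetical run path
history = run.history(keys=["train_loss"])       # DataFrame of logged steps
print(history.tail())
```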
This is the technique used by llama.cpp.
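Assuming the technique in question is speculative decoding (which is how the fast full-precision Code Llama numbers mentioned above were reached): a small draft model proposes a few tokens and the large target model verifies them in a single forward pass, so most iterations produce several tokens for roughly one big-model forward. Below is a minimal greedy sketch of the idea; the model names, draft length k, and greedy-only acceptance rule are illustrative assumptions, not llama.cpp's implementation.

```python
# Minimal greedy speculative decoding sketch (illustrative only).
# A small draft model proposes k tokens; the larger target model checks them
# all with one forward pass and keeps the longest matching prefix.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

TARGET = "gpt2"       # stand-in for the big model (assumption)
DRAFT = "distilgpt2"  # stand-in for the small draft model (assumption)

tok = AutoTokenizer.from_pretrained(TARGET)
target = AutoModelForCausalLM.from_pretrained(TARGET).eval()
draft = AutoModelForCausalLM.from_pretrained(DRAFT).eval()

@torch.no_grad()
def speculative_generate(prompt: str, max_new_tokens: int = 48, k: int = 4) -> str:
    ids = tok(prompt, return_tensors="pt").input_ids
    start = ids.shape[1]
    while ids.shape[1] - start < max_new_tokens:
        # 1) Draft model proposes k tokens greedily.
        draft_ids = ids
        for _ in range(k):
            next_id = draft(draft_ids).logits[:, -1, :].argmax(dim=-1, keepdim=True)
            draft_ids = torch.cat([draft_ids, next_id], dim=-1)
        proposed = draft_ids[:, ids.shape[1]:]                      # (1, k)

        # 2) Target scores prompt + proposal in a single forward pass.
        logits = target(draft_ids).logits
        preds = logits[:, ids.shape[1] - 1 : -1, :].argmax(dim=-1)  # (1, k)

        # 3) Accept the longest prefix where draft and target agree,
        #    then take one corrected token from the target.
        n_accept = 0
        while n_accept < k and proposed[0, n_accept] == preds[0, n_accept]:
            n_accept += 1
        if n_accept < k:
            accepted = preds[:, : n_accept + 1]  # matches + target's correction
        else:
            accepted = proposed                  # all k draft tokens accepted
        ids = torch.cat([ids, accepted], dim=-1)
    return tok.decode(ids[0, start:], skip_special_tokens=True)

print(speculative_generate("def fibonacci(n):"))
```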
Very exciting to see that models can self-improve. This provides more evidence that there is more to come in terms of SOTA performance from LLMs.
Optimizations for RLHF combined with synthetically generated feedback data (RLAIF) mean we might get more powerful models from released open-source base models fairly soon. Stay tuned (haha, get it?)
YaRN (Yet another RoPE extensioN method): the authors "show that LLaMA models can effectively utilize and extrapolate to context lengths much longer than their original pre-training would allow", requiring 10x fewer tokens and 2.5x fewer training steps than previous methods.
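For intuition: RoPE encodes positions as rotations applied to query/key vectors, and context-extension methods work by rescaling those positions or frequencies so a model trained at, say, 4k tokens can be adapted to much longer contexts. The sketch below shows plain position interpolation (divide positions by a scale factor) as the simplest instance of the idea; it is not YaRN's actual interpolation scheme, and the sequence lengths and dimensions are made up for illustration.

```python
# Toy illustration of RoPE with position interpolation (not YaRN's scheme).
import torch

def rope_angles(seq_len, head_dim, base=10000.0, scale=1.0):
    # Per-dimension-pair frequencies, as in the original RoPE formulation.
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    # Position interpolation: squeeze positions back into the trained range.
    positions = torch.arange(seq_len).float() / scale
    return torch.outer(positions, inv_freq)          # (seq_len, head_dim/2)

def apply_rope(x, angles):
    # x: (seq_len, head_dim) query or key vectors for one attention head.
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = angles.cos(), angles.sin()
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# Pretend the model was trained with 4k context and we want to run at 16k.
q = torch.randn(16384, 128)
angles = rope_angles(seq_len=16384, head_dim=128, scale=16384 / 4096)
q_rot = apply_rope(q, angles)
```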
No you don't need this, yes it's really cool.
Run it locally using the llm-mlc plugin for Simon Willison's llm tool (from the Datasette ecosystem): https://github.com/simonw/llm-mlc
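Once the plugin and an MLC model are set up, llm also exposes a small Python API alongside the CLI. A minimal sketch, where the model ID below is a hypothetical placeholder you'd replace with whatever `llm models` lists after download:

```python
# Minimal sketch using llm's Python API; the model ID is a placeholder and
# depends on which MLC model you downloaded (check `llm models` for the name).
import llm

model = llm.get_model("mlc-chat-Llama-2-7b-chat-hf-q4f16_1")  # hypothetical ID
response = model.prompt("Explain speculative decoding in one paragraph.")
print(response.text())
```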
I think we should embrace every benchmark we can get; to improve, we need to measure, not guess. There's also a paper: https://arxiv.org/abs/2308.16884. On this benchmark, gpt-3.5-turbo performs about 23% better than llama-2-chat.
Want more? Follow me on Twitter! @ricklamers