Meet the offspring of Llama-2: a thriving ecosystem of commercial-use derivative models
Week 31 of Coding with Intelligence
📰 News
NewHope: Llama-2-13b fine-tuned for coding tasks achieves 66.5 pass@1 on HumanEval
We'll have to see if the results hold up to scrutiny; it might be another case of overfitting. But the path (Llama-2-13b as a base plus high-quality data for fine-tuning) is promising! Furthermore, this is commercially usable because of the Llama-2 commercial-use license. Overall, it might be the best commercial-use OSS coding LLM available today! HF link: https://huggingface.co/SLAM-group/NewHope
Upstage Llama-2-70b Orca-style fine-tuned LLM tops HF Open LLM Leaderboard
It also includes RoPE scaling to allow for 10k input tokens. Impressive work! OSS models continue to inch closer to proprietary models like GPT-3.5 et al.
Together releases 32K Llama 2 7B model
They make use of the position interpolation paper we reported on earlier. It's available on Hugging Face too, licensed under the original Llama 2 license it seems. https://huggingface.co/togethercomputer/LLaMA-2-7B-32K
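For intuition, here's a minimal sketch of the position interpolation trick (my own illustration, not Together's actual code): rather than extrapolating RoPE to positions the base model never saw, the position indices of a long sequence are compressed so they fall back inside the original 4K training range before the rotary embedding is applied.

```python
import torch

def interpolated_position_ids(seq_len: int, trained_ctx: int = 4096) -> torch.Tensor:
    # Compress positions by trained_ctx / seq_len so a longer sequence maps
    # into the position range the base model was pretrained on.
    scale = min(1.0, trained_ctx / seq_len)  # e.g. 4096 / 32768 = 0.125
    return torch.arange(seq_len, dtype=torch.float32) * scale

pos = interpolated_position_ids(32_768)  # all positions now lie in [0, 4096)
```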
📦 Repos
PromptTools: prompt testing and experimentation
If you're not yet running evaluations for your LLM apps, PromptTools might come in handy, especially if you're keen on using an existing framework for defining evals.
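If "defining evals" sounds abstract, here's a rough idea of what the smallest possible harness looks like (a generic sketch, not PromptTools' actual API):

```python
from typing import Callable

def exact_match(expected: str) -> Callable[[str], bool]:
    # Simplest possible grader: does the model output match the expected string?
    return lambda output: output.strip() == expected

EVALS = [
    ("What is 2 + 2?", exact_match("4")),
    ("What is the capital of France?", exact_match("Paris")),
]

def run_evals(llm: Callable[[str], str]) -> float:
    # Run every prompt through the model and report the pass rate.
    passed = sum(check(llm(prompt)) for prompt, check in EVALS)
    return passed / len(EVALS)
```

A framework like PromptTools layers experiment tracking and prompt/model comparison on top of this basic idea.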
Embedchain: Framework to easily create LLM-powered bots over any dataset.
FacTool: a self-hosted plug-in for ChatGPT to check facts
Preventing hallucination is difficult. Post-generation, retrieval-based verification strategies are interesting, and this locally hosted plug-in allows you to use SERPs and webpage scraping to verify GPT's claims. A great step in the right direction!
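The overall loop is roughly: extract claims, retrieve evidence, and ask the model to judge each claim against that evidence. A hedged sketch of that idea (the helpers `llm` and `search_serp` are hypothetical placeholders, not FacTool's API):

```python
from typing import Callable, List

def verify_response(response: str, llm: Callable[[str], str],
                    search_serp: Callable[[str], str]) -> List[dict]:
    # 1. Ask the model to break its own answer into atomic factual claims.
    claims = [c for c in llm(f"List the factual claims in:\n{response}").splitlines() if c.strip()]
    results = []
    for claim in claims:
        # 2. Retrieve evidence, e.g. top SERP snippets or scraped page text.
        evidence = search_serp(claim)
        # 3. Ask the model to judge the claim against the retrieved evidence.
        verdict = llm(
            f"Claim: {claim}\nEvidence: {evidence}\n"
            "Answer SUPPORTED, REFUTED, or NOT ENOUGH EVIDENCE."
        )
        results.append({"claim": claim, "verdict": verdict.strip()})
    return results
```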
LLM Reasoners: A library for advanced large language model reasoning
Extracting maximum performance from LLMs for complex reasoning tasks is an open problem. This framework from researchers at UC San Diego, the University of Florida, and the Mohamed bin Zayed University of AI attempts to use an explicit world model and reward model to optimally navigate reasoning tasks using MCTS. They also write up their findings in a paper: https://arxiv.org/abs/2305.14992
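To make the "world model + reward model" framing concrete, here's a heavily simplified sketch: a best-first search over candidate reasoning steps scored by a reward function. The library itself uses MCTS, and `propose_steps`, `world_model`, and `reward` are hypothetical stand-ins for the user-supplied components:

```python
import heapq

def best_first_reasoning(question, propose_steps, world_model, reward, max_depth: int = 6):
    # Frontier entries: (negative cumulative score, tiebreaker id, state, reasoning trace)
    frontier = [(-0.0, 0, question, [])]
    counter = 1
    while frontier:
        neg_score, _, state, trace = heapq.heappop(frontier)
        if len(trace) == max_depth or world_model.is_terminal(state):
            return trace  # highest-scoring complete reasoning chain found so far
        for step in propose_steps(state):                # candidate next steps, e.g. sampled from the LLM
            next_state = world_model.step(state, step)   # predicted effect of taking that step
            score = -neg_score + reward(state, step, next_state)
            heapq.heappush(frontier, (-score, counter, next_state, trace + [step]))
            counter += 1
    return []
```

MCTS adds exploration (rollouts and visit-count-based selection) on top of this greedy picture, which matters when the reward model is noisy.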
Ax: axeval and axgen. LLMs in TypeScript
Great to see options in TypeScript for not just inference but also eval pipelines. Check out the `llmRubric` in the README, simple & efficient prompt evaluation!
Continue: open source Copilot Chat for VS Code
It even shows you code diffs when replacing selected code!
LLMFlows: a LangChain alternative
Check out their abstractions and see which one you like more!
📄 Papers
The Hydra Effect: Emergent Self-repair in Language Model Computations
Interesting research on the inner workings of LLMs: "language model layers are typically relatively loosely coupled (ablations to one layer only affect a small number of downstream layers)"
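As a toy picture of what "ablating a layer" means in a residual-stream model and how one would measure downstream effects (my own illustration, not the paper's exact methodology; `layers` is a hypothetical list of callables mapping hidden states to residual updates):

```python
import torch

def downstream_effect_of_ablation(layers, x, ablate_idx: int):
    # Run the residual stream with and without one layer's contribution and
    # measure how much every downstream layer's hidden state changes.
    def run(skip=None):
        h, states = x, []
        for i, layer in enumerate(layers):
            if i != skip:
                h = h + layer(h)  # each layer writes a delta into the residual stream
            states.append(h)
        return states

    base, ablated = run(), run(skip=ablate_idx)
    return [torch.dist(a, b).item() for a, b in zip(base, ablated)]
```

The "Hydra effect" of the title refers to downstream layers partly compensating (self-repairing) when an earlier layer is ablated.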
Skeleton-of-Thought: Large Language Models Can Do Parallel Decoding
Interesting idea for speeding up inference! However, it will only work for specific queries (that match the structure of a 'skeleton of thought') and the impact on inference quality is tricky to evaluate.
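The core idea in sketch form (a hedged illustration, assuming a generic `llm` completion function rather than any specific API): first ask for a short outline, then expand each outline point concurrently and stitch the results together.

```python
from concurrent.futures import ThreadPoolExecutor

def skeleton_of_thought(question: str, llm) -> str:
    # 1. Get the skeleton: a short numbered outline of the answer.
    skeleton = llm(f"Give a short numbered outline (3-6 points) answering: {question}")
    points = [line for line in skeleton.splitlines() if line.strip()]
    # 2. Expand every point in parallel instead of decoding one long answer sequentially.
    with ThreadPoolExecutor() as pool:
        expansions = list(pool.map(
            lambda p: llm(f"Question: {question}\nExpand this outline point in 2-3 sentences: {p}"),
            points,
        ))
    # 3. Stitch the expansions back together into the final answer.
    return "\n\n".join(expansions)
```

The latency win comes from step 2: the per-point expansions are independent, so wall-clock time is roughly one outline plus one (parallel) expansion instead of one long sequential generation.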
WebArena: A "Virtual Machine" for Building Autonomous Web Agents
Benchmarking and evaluating web agents that perform intelligent autonomous action-taking is tricky if you're not operating in a sandbox: the consequences can cause real-world damage or introduce liabilities, and it's hard to run controlled experiments that separate signal from noise. This is an OpenAI Gym of sorts for web agents, and this approach is likely going to be adopted by all teams building in this area.
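The interaction model is the familiar gym-style loop, sketched below with hypothetical `env`/`agent` stand-ins (not WebArena's actual interface):

```python
def run_episode(env, agent, max_steps: int = 50) -> bool:
    observation = env.reset()              # e.g. the DOM / accessibility tree of the start page
    for _ in range(max_steps):
        action = agent.act(observation)    # e.g. click, type, navigate
        observation, done = env.step(action)
        if done:
            break
    return env.task_succeeded()            # a scripted checker scores the final state
```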
🛠️ Products
Gorilla CLI: LLMs for your CLI
Simply run `pip install gorilla-cli` and use it like `gorilla "list all files that start with hello"`. Really cool project from a UC Berkeley team. Extension of the https://gorilla.cs.berkeley.edu/ paper. Be careful: the commands are sent unencrypted to a GCP-hosted endpoint (presumably this is where the LLM runs/is called from).
📚 Resources
Asking ChatGPT 100s of questions in one minute
A vision for LLM use I very much agree with. Our anthropomorphism kicks in when using LLMs directly (treating the system as another human), and it limits our thinking on how best to utilize these systems. I believe nostalgebraist is onto something here.
San Francisco Compute – 512 H100s at <$2/hr for research and startups
Apparently someone loaned them a cool $20M to purchase 512 H100s. Supposedly they offer better prices than Lambda and low commitments. Startups only, it seems.
Patterns for Building LLM-based Systems & Products
By Eugene Yan, Senior Applied Scientist at Amazon. I largely agree with the taxonomy of the seven key patterns he's identified. It's a 65-minute read, so be ready for a deep dive. His treatment of evals is comprehensive and important, as evals are crucial not only for building good LLM applications but also for building better models (did trick X or trick Y yield more improvement?).
Want more? Follow me on Twitter! @ricklamers