Self-Alignment for fine-tuning takes flight with StarCoder2-Instruct
Week 18 of Coding with Intelligence
📰 News
StarCoder2-Instruct: Fully Transparent and Permissive Self-Alignment for Code Generation
Great news! It breaks the dependence on GPT-4 distillation data: the model generates, answers, and execution-validates its own instruction data 💪
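For intuition, here's a rough sketch of what such a self-alignment loop looks like (my simplification, not the authors' exact pipeline; `generate` is a hypothetical stand-in for calls to the StarCoder2 model itself):

```python
# A minimal sketch of self-alignment for code: the model writes its own
# instruction data and keeps only responses whose tests pass on execution.
# No GPT-4 outputs are involved anywhere.
import subprocess
import tempfile

def generate(prompt: str) -> str:
    """Hypothetical call into StarCoder2-15B; replace with your inference stack."""
    raise NotImplementedError

def self_align(seed_function: str) -> dict | None:
    # 1. Turn a seed snippet from permissively licensed code into an instruction.
    instruction = generate(f"Write a coding task that this function solves:\n{seed_function}")
    # 2. Let the same model answer its own instruction, including test cases.
    response = generate(f"{instruction}\nSolve the task and include asserts as tests.")
    # 3. Execution-based self-validation: keep the pair only if the tests pass.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(response)
    result = subprocess.run(["python", f.name], capture_output=True, timeout=30)
    return {"instruction": instruction, "response": response} if result.returncode == 0 else None
```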
128K Llama 3 70B by Abacus AI
Impressive initial results; further evals are needed to understand how effectively it uses the additional 120K tokens in the context window.
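If you want to kick the tires, a minimal long-context inference sketch with transformers could look like this (the repo id below is my guess; check Abacus AI's Hugging Face page for the actual name):

```python
# A minimal sketch of long-context inference; repo id is an assumption.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "abacusai/Llama-3-70B-Instruct-128k"  # hypothetical repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="auto")

long_doc = open("long_report.txt").read()  # e.g. a ~100K-token document of yours
prompt = f"Summarize the following document:\n\n{long_doc}\n\nSummary:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```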
📦 Repos
Llama 3 8B multimodal adaptation by BAAI
Bunny-Llama-3-8B-V is built upon SigLIP and Llama-3-8B-Instruct. On MME (perception) it outperforms LLaVA-v1.6-13B.
PLLaVA: Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning
Uses a Llama base + LLaVA + a parameter-free video adaptation to describe videos. Very impressive 🙌 Video demo, models, code & paper are available.
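The "parameter-free" part is essentially feature pooling: downsample per-frame vision features so the image-trained model can swallow many frames without any new weights. A toy sketch of that idea (shapes are illustrative assumptions, not PLLaVA's exact configuration):

```python
# Pool per-frame ViT patch features so an image-trained LLaVA can take video.
import torch
import torch.nn.functional as F

frames = 16          # sampled video frames
h = w = 24           # ViT patch grid per frame (e.g. 336px / 14px patches)
dim = 1024           # vision feature dimension

feats = torch.randn(frames, h, w, dim)                # per-frame patch features
x = feats.permute(3, 0, 1, 2).unsqueeze(0)            # (1, dim, T, H, W)
pooled = F.adaptive_avg_pool3d(x, (frames, 12, 12))   # downsample spatially only
tokens = pooled.squeeze(0).permute(1, 2, 3, 0).reshape(-1, dim)
print(tokens.shape)  # (16 * 12 * 12, 1024) -> fed to the LLM as visual tokens
```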
📄 Papers
FP6-LLM: Efficiently Serving Large Language Models Through FP6-Centric Algorithm-System Co-Design
It also comes with DeepSpeed support; see this commit.
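For intuition on the format itself: FP6 here is a 6-bit float, e.g. 1 sign + 3 exponent + 2 mantissa bits. The paper's real contribution is the GPU kernel work for packing and unpacking these odd-width values; the numerics alone look roughly like this sketch (assuming an e3m2 layout):

```python
# Sketch of an FP6 e3m2 value grid and nearest-value rounding onto it.
import numpy as np

def fp6_e3m2_grid(exp_bias: int = 3) -> np.ndarray:
    vals = []
    for e in range(8):           # 3 exponent bits
        for m in range(4):       # 2 mantissa bits
            if e == 0:           # subnormals
                v = (m / 4) * 2.0 ** (1 - exp_bias)
            else:
                v = (1 + m / 4) * 2.0 ** (e - exp_bias)
            vals.append(v)
    grid = np.array(sorted(set(vals)))
    return np.concatenate([-grid[::-1], grid])   # the sign bit doubles the grid

def quantize_fp6(w: np.ndarray) -> np.ndarray:
    grid = fp6_e3m2_grid()
    idx = np.abs(w[..., None] - grid).argmin(-1)  # round to nearest representable
    return grid[idx]

w = np.random.randn(4, 4).astype(np.float32)
print(quantize_fp6(w))
```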
Let's Think Dot by Dot: Hidden Computation in Transformer Language Models
Maybe Chain-of-Thought (CoT) prompting is not what we think it is! The paper's filler-token experiments suggest that hidden computation in the extra token positions, rather than the legible reasoning itself, may be what drives improved task performance.
Iterative Reasoning Preference Optimization
“increasing accuracy for Llama-2-70B-Chat from 55.6% to 81.6% on GSM8K”. You could call this “CoT-aware fine-tuning”.
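The objective, as I read the paper, is DPO on correct-vs-incorrect CoT pairs plus an NLL term on the winning sequence, iterated over rounds of sampling. A rough sketch of the loss (log-prob tensors assumed precomputed; the `alpha` value is illustrative):

```python
# Iterative RPO loss sketch: DPO margin term + NLL anchor on the chosen CoT.
import torch
import torch.nn.functional as F

def irpo_loss(pi_chosen_lp, pi_rejected_lp,     # policy log-probs (summed over tokens)
              ref_chosen_lp, ref_rejected_lp,   # frozen reference model log-probs
              chosen_len, beta=0.1, alpha=1.0):
    # DPO term: push the policy's chosen-vs-rejected margin past the reference's.
    logits = beta * ((pi_chosen_lp - ref_chosen_lp) - (pi_rejected_lp - ref_rejected_lp))
    dpo = -F.logsigmoid(logits)
    # NLL term keeps the winning CoT + answer high-probability in absolute terms.
    nll = -pi_chosen_lp / chosen_len
    return (dpo + alpha * nll).mean()
```

The outer loop samples CoT candidates, labels them correct or incorrect against the gold GSM8K answers to build the preference pairs, trains, and repeats for a few iterations.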
🛠️ Products
GitHub Copilot Workspace: Welcome to the Copilot-native developer environment
Try it to believe it; looking forward to broader access. Supposedly it can implement entire PRs for complex projects.
📚 Resources
Loss is the right X-axis for LLM emergent abilities
So no need to worry: abilities emerge gradually. I think we can conclude the X-risk is low.
Autodidax: learn Jax from first principles
If you ever wanted to go beyond PyTorch and explore Jax, this is a great resource to get started.
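In the same first-principles spirit, here's a self-contained taste of the core trick (my sketch, not Autodidax's code): forward-mode autodiff with dual numbers, the idea underlying JAX's jvp:

```python
# Forward-mode autodiff via dual numbers: carry a derivative along each value.
from dataclasses import dataclass
import math

@dataclass
class Dual:
    primal: float   # the value
    tangent: float  # the derivative carried alongside it

    def __add__(self, o): return Dual(self.primal + o.primal, self.tangent + o.tangent)
    def __mul__(self, o): return Dual(self.primal * o.primal,
                                      self.primal * o.tangent + self.tangent * o.primal)

def sin(x: Dual) -> Dual:
    return Dual(math.sin(x.primal), math.cos(x.primal) * x.tangent)

def grad(f):
    return lambda x: f(Dual(x, 1.0)).tangent

f = lambda x: sin(x * x)   # f(x) = sin(x^2)
print(grad(f)(3.0))        # 2x*cos(x^2) at x=3 ≈ -5.467
```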
AI Snake Oil analyses agent evaluations: from single metrics to pareto fronts
They introduce a useful distinction between model evaluation and downstream evaluation. Furthermore, they show that simple baselines for agentic code generation are crucial for interpreting the true merits of more complex agent approaches.
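Computing such a Pareto front is simple enough to sketch (the data structure and numbers below are illustrative assumptions, not from their post):

```python
# Compare agents on an (accuracy, cost) Pareto front, not a single metric.
def pareto_front(agents: list[dict]) -> list[dict]:
    """Keep agents not dominated by another with >= accuracy AND <= cost."""
    front = []
    for a in agents:
        dominated = any(b["accuracy"] >= a["accuracy"] and b["cost"] <= a["cost"]
                        and (b["accuracy"], b["cost"]) != (a["accuracy"], a["cost"])
                        for b in agents)
        if not dominated:
            front.append(a)
    return sorted(front, key=lambda a: a["cost"])

agents = [  # made-up numbers for illustration only
    {"name": "single-call baseline", "accuracy": 0.62, "cost": 0.01},
    {"name": "retry-5 baseline",     "accuracy": 0.71, "cost": 0.05},
    {"name": "complex agent",        "accuracy": 0.70, "cost": 0.90},
]
print(pareto_front(agents))  # the complex agent is dominated by a simple baseline
```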
Want more? Follow me on X! @ricklamers