Fine-tuning LLMs on Human Feedback (RLHF + DPO)

Shaw Talebi

Here, I discuss how to use reinforcement learning to fine-tune LLMs on human feedback (i.e., RLHF) and a more efficient reformulation of it (i.e., DPO).
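
To give a rough sense of why DPO is the simpler of the two, here is a minimal sketch of the per-example DPO loss in plain Python. It is an illustration under assumptions, not the code from the video: the function name `dpo_loss`, its argument names, and the default `beta=0.1` are hypothetical. The inputs are summed log-probabilities of a preferred ("chosen") and a dispreferred ("rejected") response under the model being trained and under a frozen reference model.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # hypothetical default; controls how far the policy may drift
) -> float:
    """DPO loss for one preference pair (illustrative sketch)."""
    # How much more (or less) likely each response is under the policy
    # than under the frozen reference model, in log space.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # Scaled margin: positive when the policy favors the chosen response
    # more strongly than the reference does.
    margin = beta * (chosen_logratio - rejected_logratio)

    # -log(sigmoid(margin)) is the same as softplus(-margin).
    return softplus(-margin)


if __name__ == "__main__":
    # Policy identical to the reference: margin is 0, so the loss is log(2).
    print(dpo_loss(-1.0, -2.0, -1.0, -2.0))
    # Policy shifted toward the chosen response: the loss drops below log(2).
    print(dpo_loss(-0.5, -3.0, -1.0, -2.0))
```

Unlike the RLHF pipeline, this objective needs no separately trained reward model and no sampling loop during training; it is an ordinary supervised loss over preference pairs, which is where the efficiency gain comes from.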