Studying RLHF (2025)

About this seminar

In this seminar, we will read the latest RLHF textbook (published in April 2025) to gain background knowledge on post-training methods, such as reinforcement learning and preference learning, and to identify the challenges of applying reinforcement learning techniques to language models.

  • Date and Time: 2Q, biweekly on Mondays, 17:15 - 18:55 (7 seminars in total)
  • Attendees: Anyone interested in post-training and reinforcement learning for language models
  • Textbook
    • Title: “Reinforcement Learning from Human Feedback - A short introduction to RLHF and post-training focused on language models.”
    • URL: https://rlhfbook.com/ (PDF / HTML)
    • Author: Nathan Lambert
  • Presentation Time
    • 30-minute presentation
    • 20-minute discussion
    • Two presenters per seminar
    • 100 minutes in total for the two presenters
  • Zoom is available; every seminar will be recorded

Schedule

Date | Content | Presenters | Note
2025/06/02 (Mon) 13:30 - 15:10 | Seminar 1: Ch. 1-3, Ch. 4-6 | Ota, Ichinose | Different time for the first session only
2025/06/16 (Mon) 17:15 - 18:55 | Seminar 2: Ch. 7, Ch. 8-10 | Matsushita, Takahashi |
2025/06/30 (Mon) 17:15 - 18:55 | Seminar 3: Ch. 11, Ch. 12 | Ma, Shimada |
2025/07/07 (Mon) 17:15 - 18:55 | Seminar 4: Presentations on RLHF-related papers (1) | Ohi, Saito | Not part of the biweekly schedule
2025/07/14 (Mon) 17:15 - 18:55 | Seminar 5: Ch. 13-16, Ch. 17-19 | Koike, Onami |
2025/07/28 (Mon) 17:15 - 18:55 | Seminar 6: Presentations on RLHF-related papers (2) | Mizuki, Katsumata | Not part of the biweekly schedule
2025/08/04 (Mon) 17:15 - 18:55 | Seminar 7: Presentations on RLHF-related papers (3) | Oba, (Name) | One additional presenter possible

Planned Seminars

2025/06/16 (Mon) 17:15 - 18:55 | Ch. 7, Ch. 8-10

  • Optimization Tools 1 (P.37-43, total 7 pages) (Presenter: Matsushita)
  • Optimization Tools 2 (P.44-55, total 12 pages) (Presenter: Takahashi)

2025/06/30 (Mon) 17:15 - 18:55 | Ch. 11, Ch. 12

  • Optimization Tools 3 (P.56-74, total 19 pages) (Presenter: Ma)
  • Optimization Tools 4 (P.75-83, total 9 pages) (Presenter: Shimada)

2025/07/07 (Mon) 17:15 - 18:55 | Presentations on RLHF-related papers (1)

  • (Presenter: Ohi) Paper to be decided later.
  • (Presenter: Saito) Shao et al. “DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models”. arXiv. 2024. Paper (I plan to cover GRPO. The presentation date and order can be adjusted as needed.)

2025/07/14 (Mon) 17:15 - 18:55 | Ch. 13-16, Ch. 17-19

  • Advanced (P.84-99, total 16 pages) (Presenter: Koike)
  • Open Questions (P.100-111, total 12 pages) (Presenter: Onami)

2025/07/28 (Mon) 17:15 - 18:55 | Presentations on RLHF-related papers (2)

  • (Presenter: Mizuki) Chen et al. “Preference Learning Algorithms Do Not Learn Preference Rankings”. NeurIPS 2024. Paper
  • (Presenter: Katsumata) “Rule Based Rewards for Language Model Safety”. NeurIPS 2024. Paper (The paper to be presented may change depending on circumstances.)

2025/08/04 (Mon) 17:15 - 18:55 | Presentations on RLHF-related papers (3)

  • (Presenter: Oba) Paper to be decided later.
  • (Presentation requested by: Name)

Past Seminars

2025/06/02 (Mon) 13:30 - 15:10 | Ch. 1-3, Ch. 4-6

  • Introductions (P.5-18, total 14 pages) (Presenter: Ota)
    • [slides], [supplementary slides about RL], [supplementary slides about distributed RL]
  • Problem Setup & Context (P.19-36, total 18 pages) (Presenter: Ichinose)
    • [slides]