Glad to see the author making a serious effort to fill the gap in public documentation of RLHF theory and practice. The current state of the art seems to be primarily documented in arXiv papers, but each paper is more like a "diff" than a "snapshot" - you need to patch together the knowledge from many previous papers to understand the current state. It's extremely valuable to "snapshot" the current state of the art in a way that is easy to reference.
My friendly feedback on this work-in-progress: I believe it could benefit from more introductory material to establish motivations and set expectations for what is achievable with RLHF. In particular, I think it would be useful to situate RLHF in comparison with supervised fine-tuning (SFT), which readers are likely familiar with.
Stuff I'd cover (from the background of an RLHF user but non-specialist):
Advantages of RLHF over SFT:
- Tunes on the full generation (which is what you ultimately care about), not just token-by-token.
- Can tune on problems where there are many acceptable answers (or ways to word the answer), and you don't want to push the model into one specific series of tokens.
- Can incorporate negative feedback (e.g. don't generate this).
Disadvantages of RLHF over SFT:
- Regularization (KL or otherwise) puts an upper bound on how much impact RLHF can have on the model. Because of this, RLHF is almost never enough to get you "all the way there" by itself (a minimal sketch of the KL-penalized objective follows these lists).
- Very sensitive to reward model quality, which can be hard to evaluate.
- Much more resource and time intensive.
Non-obvious practical considerations:
- How to evaluate quality? If you have a good measurement of quality, it's tempting to just incorporate it in your reward model. But you want to make sure you're able to measure "is this actually good for my final use-case", not just "does this score well on my reward model?".
- How prompt engineering interacts with fine-tuning (both SFT and RLHF). Often some iteration on the system prompt will make fine-tuning converge faster, and with higher quality. Conversely, attempting to tune on examples that don't include a task-specific prompt (surprisingly common) will often yield subpar results. This is a "boring" implementation detail that I don't normally see included in papers.
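To make the regularization point above concrete, here is a rough sketch (PyTorch-style, with assumed tensor shapes) of the KL-penalized objective most RLHF methods maximize; the beta term is what bounds how far the tuned policy can drift from the reference (SFT) model:

    def kl_penalized_reward(reward, policy_logprobs, ref_logprobs, beta=0.1):
        # reward: (batch,) scalar score per full generation, e.g. from a reward model
        # policy_logprobs, ref_logprobs: (batch, seq_len) log-probs of the sampled
        # tokens under the current policy and the frozen reference (SFT) model
        per_token_kl = policy_logprobs - ref_logprobs  # simple estimator of the KL term
        kl_penalty = beta * per_token_kl.sum(dim=-1)   # summed over the whole generation
        return reward - kl_penalty                     # larger beta = tighter leash on the policy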
Excited to see where this goes, and thanks to the author for willingness to share a work in progress!
gr3ml1n 1 days ago [-]
SFT can be used to give negative feedback/examples. That's one of the lesser-known benefits/tricks of system messages. E.g.:
System: You are a helpful chatbot.
User: What is 1+1?
Assistant: 2.
And
System: You are terrible at math.
User: What is 1+1?
Assistant: 0.
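In SFT data terms, that trick is just conditioning the target completion on the system message. A minimal sketch of how such examples might be laid out (Python; the field names are illustrative, not any particular library's chat format):

    # Two SFT examples: same question, with the system message controlling
    # which completion the model is trained to reproduce.
    sft_examples = [
        {"system": "You are a helpful chatbot.",
         "user": "What is 1+1?",
         "assistant": "2."},
        {"system": "You are terrible at math.",
         "user": "What is 1+1?",
         "assistant": "0."},
    ]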
cratermoon 15 hours ago [-]
System: It's a lovely morning in the village and you are a horrible goose.
User: Throw the rake into the lake
kadushka 1 days ago [-]
Has r1 made RLHF obsolete?
alexhutcheson 1 days ago [-]
DeepSeek-R1 had an RLHF step in their post-training pipeline (section 2.3.4 of their technical report[1]).
In addition, the "reasoning-oriented reinforcement learning" step (section 2.3.2) used an approach that is almost identical to RLHF in theory and implementation. The main difference is that they used a rule-based reward system, rather than a model trained on human preference data.
If you want to train a model like DeepSeek-R1, you'll need to know the fundamentals of reinforcement learning on language models, including RLHF.
[1] https://arxiv.org/pdf/2501.12948
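For a sense of what a rule-based reward looks like in practice, here is a rough sketch (my own illustration, not DeepSeek's code) of the two checks the report describes: an accuracy reward against a reference answer and a format reward for the expected reasoning tags. The tag names and the 0.1 weight are assumptions:

    import re

    def rule_based_reward(completion: str, reference_answer: str) -> float:
        # Accuracy reward: does the content of the <answer> tag match the reference?
        match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
        accuracy = 1.0 if match and match.group(1).strip() == reference_answer.strip() else 0.0
        # Format reward: did the model wrap its reasoning in the expected tags?
        formatted = 1.0 if "<think>" in completion and "</think>" in completion else 0.0
        return accuracy + 0.1 * formatted  # weighting is illustrative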
Yes, but these steps were not used in R1-Zero, where its reasoning capabilities were trained.
littlestymaar 20 hours ago [-]
And as a result, R1-Zero is way too crude to be used directly, which is a good indication that RLHF remains relevant.
natolambert 1 days ago [-]
As the other commenter said, R1 required very standard RLHF techniques too.
But a fun way to think about it is that reasoning models are going to be bigger and uplift the RLHF boat.
But we need a few years to establish basics before I can write a cumulative RL for LLMs book ;)
drmindle12358 19 hours ago [-]
Did you mean to ask "Has r1 made SFT obsolete?"
gr3ml1n 1 days ago [-]
This feels like a category mistake. Why would R1 make RLHF obsolete?
cratermoon 15 hours ago [-]
Is there not a survey paper on RLHF equivalent to the "A Survey on Large Language Model based Autonomous Agents" paper?
Someone should get on that.
_giorgio_ 12 hours ago [-]
https://arxiv.org/abs/2412.05265
Reinforcement Learning: An Overview, Kevin Murphy
This manuscript gives a big-picture, up-to-date overview of the field of (deep) reinforcement learning and sequential decision making, covering value-based RL, policy-gradient methods, model-based methods, and various other topics (including a very brief discussion of RL+LLMs).
natolambert 1 days ago [-]
Author here! Just wanted to say that this is indeed in a good place to share, with some very useful stuff, but it is also very much a work in progress. I'm maybe 60% or so of the way to my first draft. That said, progress is coming every day, and I happily welcome fixes or suggestions on GitHub.
pknerd 23 hours ago [-]
Thanks. Is there a PDF version? I find it a bit difficult to keep switching between links.
> Reinforcement learning from human feedback (RLHF)
In case anyone else didn’t know the definition.
Knowing the definition it sounds kind of like “learn what we tell you matters” in a sense.
Not unlike how the world seems to work today. High hopes for the future…
npollock 12 hours ago [-]
A quote I found helpful:
"reinforcement learning from human feedback .. is designed to optimize machine learning models in domains where specifically designing a reward function is hard"
brcmthrowaway 1 days ago [-]
What's the difference between RLHF and distillation?
tintor 1 days ago [-]
They are different processes.
- RLHF: Turns a pre-trained model (which just performs autocomplete of text) into a model you can talk to, i.e. one that answers user questions and refuses to provide harmful answers.
- Distillation: Transfers skills / knowledge / behavior from one model (and architecture) to a smaller model (and possibly a different architecture), by training the second model on the output log-probs of the first model.
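For the distillation half, a minimal sketch of the usual objective (PyTorch; the student and teacher logits are assumed to be aligned over the same tokens and vocabulary, and the temperature value is illustrative):

    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, temperature=2.0):
        # Soften both distributions, then pull the student toward the teacher with a
        # KL divergence over the vocabulary at every position.
        student_logp = F.log_softmax(student_logits / temperature, dim=-1)
        teacher_logp = F.log_softmax(teacher_logits / temperature, dim=-1)
        return F.kl_div(student_logp, teacher_logp, reduction="batchmean",
                        log_target=True) * (temperature ** 2)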
gr3ml1n 1 days ago [-]
Your description of distillation is largely correct, but not RLHF.
The process of taking a base model that is capable of continuing ('autocomplete') some text input and teaching it to respond to questions in a Q&A chatbot-style format is called instruction tuning. It's pretty much always done via supervised fine-tuning. Otherwise known as: show it a bunch of examples of chat transcripts.
RLHF is more granular and generally one of the last steps in a training pipeline. With RLHF you train a new model, the reward model.
You make that model by having the LLM output a bunch of responses, and then having humans rank the output. E.g.:
Q: What's the Capital of France? A: Paris
Might be scored as `1` by a human, while:
Q: What's the Capital of France? A: Fuck if I know
Would be scored as `0`.
You feed those rankings into the reward model. Then, you have the LLM generate a ton of responses, and have the reward model score them.
If the reward model says it's good, the LLM's output is reinforced, i.e.: it's told 'that was good, more like that'.
If the output scores low, you do the opposite.
Because the reward model is trained based on human preferences, and the reward model is used to reinforce the LLM's output based on those preferences, the whole process is called reinforcement learning from human feedback.
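A rough sketch of those two pieces (PyTorch; the pairwise loss is the standard Bradley-Terry-style objective commonly used for reward models, and the policy update shown is plain REINFORCE rather than the PPO used in real pipelines):

    import torch.nn.functional as F

    def reward_model_loss(chosen_scores, rejected_scores):
        # chosen_scores / rejected_scores: (batch,) scalar reward-model outputs for the
        # human-preferred and dispreferred responses to the same prompt.
        return -F.logsigmoid(chosen_scores - rejected_scores).mean()

    def reinforce_loss(sampled_logprobs, rewards):
        # Simplest possible "more like that" update: weight each sampled response's
        # log-prob by its reward-model score. Real pipelines add baselines and a KL
        # penalty (PPO/GRPO), but the intuition is the same.
        return -(rewards.detach() * sampled_logprobs.sum(dim=-1)).mean()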
tintor 5 hours ago [-]
Thanks.
Here is a presentation by Karpathy explaining the different stages of LLM training. It explains many details in a form suitable for beginners.
https://www.youtube.com/watch?v=bZQun8Y4L2A
> answer user questions and refuse providing harmful answers.
I wonder how this thing can get so much hype. Here is the NewGCC, a binary-only compiler that refuses to compile applications that it doesn't like... What happened to all the hackers who helped create the open-source movement? Where are they now?
brcmthrowaway 1 days ago [-]
So RLHF is the secret sauce behind modern LLMs?
anon373839 1 days ago [-]
No, this isn't quite right. LLMs are trained in stages:
1. Pre-training. In this stage, the model is trained on a gigantic corpus of web documents, books, papers, etc., and the objective is to predict the next token of each training sample correctly.
2. Supervised fine-tuning. In this stage, the model is shown examples of chat transcripts that are formatted with a chat template. The examples show a user asking a question and an assistant providing an answer. The training objective is the same as in #1: to predict the next token in the training example correctly.
3. Reinforcement learning. Prior to R1, this has mainly taken the form of training a reward model on top of the LLM to steer the model toward producing whole sequences that are preferred by human feedback (although AI feedback is often used instead as a similar reward signal). There are different ways to do this step. When OpenAI first published the technique (probably their last bit of interesting open research?), they were using PPO. There are now a variety of approaches, including methods like Direct Preference Optimization (DPO) that don't use a separate reward model at all and are easier to run (a minimal sketch is at the end of this comment).
Stage 1 teaches the model to understand language and imparts world knowledge. Stage 2 teaches the model to act like an assistant. This is where the "magic" is. Stage 3 makes the model do a better job of being an assistant. The traditional analogy is that Stage 1 is the cake; Stage 2 is the frosting; and Stage 3 is the cherry on top.
R1-Zero departs from this "recipe" in that the reasoning magic comes from the reinforcement learning (stage 3). What DeepSeek showed is that, given a reward to produce a correct response, the model will learn to output chain-of-thought material on its own. It will, essentially, develop a chain-of-thought language that helps it accomplish the end goal. This is the most interesting part of the paper, IMO, and it's a result that's already been replicated on smaller base models.
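Since Direct Preference Optimization came up in step 3, here is a minimal sketch of its loss (PyTorch; log-probs are assumed to be summed over each response's tokens, the reference model is the frozen SFT checkpoint, and beta is illustrative):

    import torch.nn.functional as F

    def dpo_loss(policy_chosen_logp, policy_rejected_logp,
                 ref_chosen_logp, ref_rejected_logp, beta=0.1):
        # Each response's implicit reward is beta * (policy log-prob - reference log-prob).
        chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
        rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
        # Increase the margin between the preferred and dispreferred response.
        return -F.logsigmoid(chosen_reward - rejected_reward).mean()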
marcosfelt 1 days ago [-]
This is a great summary!
noch 21 hours ago [-]
> So RLHF is the secret sauce behind modern LLMs?
Karpathy wrote[^0]:
"
RL is powerful. RLHF is not.
[…]
And yet, RLHF is a net helpful step of building an LLM Assistant. I think there's a few subtle reasons but my favorite one to point to is that through it, the LLM Assistant benefits from the generator-discriminator gap. That is, for many problem types, it is a significantly easier task for a human labeler to select the best of few candidate answers, instead of writing the ideal answer from scratch.
[…]
No production-grade actual RL on an LLM has so far been convincingly achieved and demonstrated in an open domain, at scale.
"
RL on any production system is very tricky, and so it seems difficult to make work in any open domain, not just LLMs. My suspicion is that RL training is a coalgebra to almost every other form of ML and statistical training, and we don't have a good mathematical understanding of how it behaves.
[^0]: https://x.com/karpathy/status/1821277264996352246