A: RLHF begins with collecting human preferences for how a model should respond to prompts, typically by having annotators compare candidate responses and mark which one they prefer. This preference data is used to train a separate neural network called a reward model, which grades the language model’s outputs with a predicted “human satisfaction” score.
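
As a rough illustration (not part of the original answer), here is a minimal PyTorch sketch of how a reward model can be trained on pairwise preference data with a Bradley–Terry style loss. The names (`ToyRewardModel`, the random token batches) are hypothetical stand-ins for a real language-model backbone and genuine labeled comparisons.

```python
# Minimal sketch of pairwise reward-model training, assuming PyTorch.
# ToyRewardModel is a hypothetical toy encoder, not any real RLHF library API.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyRewardModel(nn.Module):
    """Maps a token-id sequence to a single scalar 'satisfaction' score."""
    def __init__(self, vocab_size: int = 1000, dim: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.score_head = nn.Linear(dim, 1)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # Mean-pool token embeddings, then project to a scalar reward.
        pooled = self.embed(token_ids).mean(dim=1)
        return self.score_head(pooled).squeeze(-1)

def preference_loss(reward_chosen: torch.Tensor,
                    reward_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry style objective: push the preferred response's score
    # above the rejected response's score.
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# One toy training step on fake preference pairs (chosen vs. rejected).
model = ToyRewardModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

chosen = torch.randint(0, 1000, (4, 16))    # batch of preferred responses
rejected = torch.randint(0, 1000, (4, 16))  # batch of dispreferred responses

loss = preference_loss(model(chosen), model(rejected))
loss.backward()
optimizer.step()
print(f"pairwise preference loss: {loss.item():.4f}")
```

In practice the scalar head sits on top of a pretrained language model rather than a toy embedding layer, but the training signal is the same: the reward model learns to assign higher scores to the responses humans preferred.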