Training Language Models to Follow Instructions with Human Feedback
Paper Summary
This paper introduces InstructGPT, which uses reinforcement learning from human feedback (RLHF) to align language models with human intent. The three-step process of supervised fine-tuning, reward-model training, and PPO-based reinforcement learning produces models whose outputs human labelers prefer to GPT-3's: the 1.3B-parameter InstructGPT is preferred over the 175B-parameter GPT-3 despite having 100x fewer parameters, alongside gains in truthfulness and reductions in toxic output.
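As a minimal sketch of the second stage (reward-model training): the reward model is trained with a pairwise ranking loss over labeler comparisons, so that the preferred response to a prompt receives a higher scalar score than the rejected one. The snippet below assumes a PyTorch setting where the reward model already produces scalar scores; the function name pairwise_reward_loss and the numeric values are illustrative, not from the paper.

import torch
import torch.nn.functional as F

def pairwise_reward_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Pairwise ranking loss: -log sigmoid(r_w - r_l), averaged over the batch.
    # r_chosen / r_rejected are scalar reward-model scores for the preferred
    # and the rejected response to the same prompt.
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy usage: scores for four comparison pairs (hypothetical values).
r_chosen = torch.tensor([1.2, 0.3, 2.1, 0.9])
r_rejected = torch.tensor([0.4, 0.5, 1.0, -0.2])
print(pairwise_reward_loss(r_chosen, r_rejected).item())

Minimizing this loss is what turns labeler comparisons into the scalar reward signal used in the PPO stage.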
Abstract
We fine-tune language models with reinforcement learning from human feedback (RLHF) to follow a broad range of written instructions. Starting from GPT-3, we collect labeler-written demonstrations of desired behavior and human rankings of model outputs, and use them to train InstructGPT models that follow user intent more reliably.
Critical Analysis & Questions for Consideration
RLHF represents a major advance in AI alignment, but the paper's treatment of its limitations and the nature of "alignment" deserves critical examination.
Alignment Breakthrough
RLHF demonstrated that human preferences can be incorporated directly into language model training, producing models whose outputs labelers consistently prefer over those of the much larger GPT-3. This practical success is a major achievement for applied alignment.
Alignment vs Deception
The paper doesn't adequately address whether RLHF trains models to BE helpful or merely to APPEAR helpful. Because the reward signal only measures what evaluators approve of, a model could learn to satisfy evaluators without genuinely aligning with human values.
Labeler Bias Amplification
The paper acknowledges using contracted labelers but doesn't deeply examine how their specific backgrounds, values, and incentives shape the resulting model. Whose values are we aligning to?
Reward Hacking Underexplored
The paper briefly mentions reward model exploitation but doesn't thoroughly investigate how models might game the reward signal in subtle ways that are hard to detect.
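For context, the paper's main guard against over-optimizing the learned reward is a per-token KL penalty against the supervised fine-tuned (SFT) policy, which enters the RL objective roughly as:

objective(\phi) = \mathbb{E}_{(x,y) \sim \pi_\phi^{RL}} \left[ r_\theta(x, y) - \beta \log \frac{\pi_\phi^{RL}(y \mid x)}{\pi^{SFT}(y \mid x)} \right]

The \beta term only bounds how far the policy can drift from the SFT model; it does not detect exploits of r_\theta that stay within that KL budget, which is the gap this critique points at.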
Scalability Optimism
The paper presents RLHF as a promising path toward aligning increasingly capable systems, but the method depends on humans being able to evaluate model outputs. How can humans provide feedback on behavior they cannot themselves understand or verify?
Safety-Capability Tradeoff
The paper reports reduced performance on several public NLP benchmarks after RLHF (an "alignment tax") and adds a pretraining-mix term to offset it, but doesn't deeply explore the underlying tension. Are we making models safer by making them less capable, and is that sustainable as capabilities grow?
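The paper's partial mitigation of this tradeoff is the PPO-ptx variant, which mixes pretraining gradients back into the RL objective; schematically, it adds a pretraining log-likelihood term weighted by \gamma to the KL-penalized objective above:

objective(\phi) = \mathbb{E}_{(x,y) \sim \pi_\phi^{RL}} \left[ r_\theta(x, y) - \beta \log \frac{\pi_\phi^{RL}(y \mid x)}{\pi^{SFT}(y \mid x)} \right] + \gamma \, \mathbb{E}_{x \sim D_{\text{pretrain}}} \left[ \log \pi_\phi^{RL}(x) \right]

In the paper's experiments this largely reduces the benchmark regressions, but it is a patch on the tradeoff rather than an explanation of it, which is the question raised above.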