Policy Gradient Methods
Learning Policies Directly
While value-based methods like Q-learning learn a value function and derive a policy from it, policy gradient methods directly optimize the policy parameters. This approach is particularly powerful for continuous action spaces and stochastic policies.
Core Idea: Gradient Ascent on Expected Return
Policy gradient methods parameterize the policy π(a|s;θ) and optimize the parameters θ to maximize the expected return:
J(θ) = E[R | π_θ]
We use gradient ascent to improve the policy:
θ ← θ + α∇_θJ(θ)
Policy Gradient in Action
See how policy gradients shift the policy distribution towards high-reward actions. The policy is a Gaussian distribution over actions.
How it works:
- Sample actions from current policy π(a|s)
- Observe rewards (green = +1, red = -1)
- Gradient points towards high-reward actions
- Update shifts policy mean towards better actions
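A minimal NumPy sketch of this loop, assuming a one-dimensional Gaussian policy with a fixed standard deviation and a toy reward that pays +1 only in a "good" region; the specific numbers are illustrative, not taken from the demo:

```python
import numpy as np

rng = np.random.default_rng(0)

mu, sigma = 0.0, 1.0   # policy parameters: the mean is learned, the std is kept fixed
alpha = 0.05           # learning rate

def reward(a):
    # Toy reward: +1 in the high-reward region, -1 elsewhere (illustrative assumption)
    return 1.0 if a > 1.5 else -1.0

for step in range(200):
    actions = mu + sigma * rng.standard_normal(32)   # sample a batch from N(mu, sigma^2)
    rewards = np.array([reward(a) for a in actions])
    # d/d_mu log N(a; mu, sigma^2) = (a - mu) / sigma^2, so the gradient estimate is
    # the average of that score weighted by the observed reward
    grad_mu = np.mean((actions - mu) / sigma**2 * rewards)
    mu += alpha * grad_mu                            # gradient ascent on expected reward

print(f"learned mean: {mu:.2f}")   # drifts toward the high-reward region (a > 1.5)
```

Because high-reward samples pull the mean toward themselves and low-reward samples push it away, the distribution shifts without ever needing the gradient of the reward function itself.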
The Policy Gradient Theorem
The fundamental result that makes policy gradient methods possible:
∇_θJ(θ) = E[∇_θ log π(a|s;θ) · Q^π(s,a)]
This tells us how to estimate gradients using samples from the policy itself.
Key Insights
- Log probability trick: ∇_θ log π increases probability of good actions
- Weighted by returns: Better actions get larger gradient updates
- On-policy: Must sample from current policy π_θ
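As a hedged illustration of the log-probability trick, the sketch below estimates this expectation by sampling from a softmax policy over three discrete actions; the Q-values are made-up placeholders for a single state:

```python
import numpy as np

rng = np.random.default_rng(1)

theta = np.zeros(3)                       # logits for 3 discrete actions
q_values = np.array([0.0, 1.0, -1.0])     # illustrative Q^pi(s, a) values for one state

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

probs = softmax(theta)
grad = np.zeros_like(theta)
n_samples = 1000
for _ in range(n_samples):
    a = rng.choice(3, p=probs)            # sample an action from the current policy
    grad_log_pi = -probs.copy()
    grad_log_pi[a] += 1.0                 # d/d_theta log softmax(theta)[a] = onehot(a) - probs
    grad += grad_log_pi * q_values[a]     # weight the score by Q(s, a)
grad /= n_samples

print(grad)   # points toward raising the logit of the highest-Q action
```

Because only actions the policy actually samples contribute, the estimator is on-policy: it is only valid for the π_θ that generated the samples.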
REINFORCE in Action
Watch how REINFORCE learns a policy to reach the goal. The arrow opacity shows action probabilities at each state.
REINFORCE Update:
θ ← θ + α ∇_θ log π(a_t|s_t;θ) · G_t
where G_t is the return from time step t
Observations:
- High variance in early training
- Policy converges to the optimal path
- Exploration decreases over time
REINFORCE Algorithm
The simplest policy gradient algorithm uses Monte Carlo returns:
- Sample trajectory τ ~ π_θ
- Calculate returns G_t for each step
- Update: θ ← θ + α Σ_t ∇_θ log π(a_t|s_t;θ) · G_t
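A minimal PyTorch sketch of these three steps, assuming a small softmax policy for a discrete-action task; the network sizes, learning rate, and trajectory format are illustrative assumptions:

```python
import torch
import torch.nn as nn

# Small softmax policy for a 4-dimensional observation, 2-action task (illustrative sizes)
policy = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
gamma = 0.99

def reinforce_update(states, actions, rewards):
    """One REINFORCE update from a single sampled trajectory.

    states: list of observation tensors, actions: list of ints, rewards: list of floats.
    """
    # Compute Monte Carlo returns G_t by accumulating rewards backwards
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    returns = torch.tensor(list(reversed(returns)), dtype=torch.float32)

    logits = policy(torch.stack(states))
    log_probs = torch.log_softmax(logits, dim=-1)
    acts = torch.tensor(actions)
    chosen = log_probs[torch.arange(len(acts)), acts]   # log pi(a_t | s_t)

    # Minimizing the negative weighted log-likelihood performs the gradient-ascent update
    loss = -(chosen * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```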
Variance Reduction
REINFORCE suffers from high variance. Common solutions:
- Baseline subtraction: Use G_t - b(s_t) instead of G_t
- Advantage estimation: Use A(s,a) = Q(s,a) - V(s)
- Temporal structure: Use TD methods instead of Monte Carlo
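A short sketch combining the first two ideas, using a learned state-value baseline so the policy is updated with advantages G_t − V(s_t); the 0.5 weight on the value loss is a common but arbitrary choice, not part of REINFORCE itself:

```python
import torch
import torch.nn.functional as F

def pg_loss_with_baseline(log_probs, returns, values):
    """Policy-gradient loss with a learned state-value baseline.

    log_probs: log pi(a_t|s_t) for the sampled actions, shape [T]
    returns:   Monte Carlo returns G_t, shape [T]
    values:    V(s_t) predicted by a value network (the baseline), shape [T]
    """
    # Detach the baseline so policy gradients do not flow into the value network
    advantages = returns - values.detach()
    policy_loss = -(log_probs * advantages).sum()
    value_loss = F.mse_loss(values, returns)   # fit the baseline by regression onto G_t
    return policy_loss + 0.5 * value_loss
```

Subtracting a baseline leaves the gradient estimate unbiased (its expectation is unchanged) but can reduce its variance substantially.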
Advanced Policy Gradient Methods
1. Natural Policy Gradient
Accounts for the geometry of the policy space using the Fisher information matrix. Provides more stable and efficient updates.
2. Trust Region Policy Optimization (TRPO)
Constrains policy updates to ensure monotonic improvement by limiting KL divergence between old and new policies.
3. Proximal Policy Optimization (PPO)
Simplifies TRPO using a clipped surrogate objective. Currently one of the most popular and effective RL algorithms.
PPO Clipping Mechanism
PPO prevents large policy updates by clipping the objective function. This ensures stable training while maintaining good sample efficiency.
PPO Benefits:
- Prevents destructively large policy updates
- Simpler than TRPO (no conjugate gradient needed)
- Excellent performance on continuous control tasks
- Can reuse sampled data for multiple update epochs
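A compact sketch of the clipped surrogate objective, written as a loss to minimize; the 0.2 clip range is the commonly reported default, not a requirement:

```python
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """PPO clipped surrogate objective (negated so an optimizer can minimize it)."""
    ratio = torch.exp(new_log_probs - old_log_probs)   # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Taking the elementwise minimum removes any incentive to push the ratio
    # far outside [1 - eps, 1 + eps], which is what keeps updates small
    return -torch.min(unclipped, clipped).mean()
```

Because old_log_probs are frozen from the data-collecting policy, the same batch can be passed through this loss for several epochs, which is where PPO's sample-efficiency gain comes from.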
Continuous Action Spaces
Policy gradient methods excel at continuous control tasks. Common parameterizations:
- Gaussian policy: π(a|s) = N(μ(s;θ), σ²)
- Beta distribution: For bounded actions
- Mixture of Gaussians: For multimodal action distributions
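A minimal PyTorch sketch of the Gaussian case, with a state-dependent mean and a state-independent log standard deviation; the observation and action dimensions are placeholders:

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """Diagonal Gaussian policy pi(a|s) = N(mu(s; theta), sigma^2)."""

    def __init__(self, obs_dim=8, act_dim=2):
        super().__init__()
        self.mu_net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim)
        )
        self.log_std = nn.Parameter(torch.zeros(act_dim))  # learned, state-independent

    def forward(self, obs):
        mu = self.mu_net(obs)
        dist = torch.distributions.Normal(mu, self.log_std.exp())
        action = dist.sample()
        log_prob = dist.log_prob(action).sum(-1)   # sum over independent action dimensions
        return action, log_prob
```

Parameterizing the standard deviation through its log keeps it positive without constraints, and letting it shrink over training naturally reduces exploration as the policy becomes more confident.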
Why Policy Gradients Matter
- Continuous actions: Natural handling of continuous control
- Stochastic policies: Can learn to explore and handle uncertainty
- Convergence: Converges to a local optimum under standard conditions (suitable step sizes, sufficiently accurate gradient estimates)
- Differentiable: End-to-end learning with neural networks
- High-dimensional: Scales to complex action spaces