Policy Gradient Methods
Learning Policies Directly
While value-based methods like Q-learning learn a value function and derive a policy from it, policy gradient methods directly optimize the policy parameters. This approach is particularly powerful for continuous action spaces and stochastic policies.
Core Idea: Gradient Ascent on Expected Return
Policy gradient methods parameterize the policy π(a|s;θ) and optimize the parameters θ to maximize the expected return:
J(θ) = E[R | π_θ]
We use gradient ascent to improve the policy:
θ ← θ + α∇_θJ(θ)
Policy Gradient in Action
See how policy gradients shift the policy distribution towards high-reward actions. The policy is a Gaussian distribution over actions.
How it works:
- Sample actions from current policy π(a|s)
- Observe rewards (green = +1, red = -1)
- Gradient points towards high-reward actions
- Update shifts policy mean towards better actions
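A minimal NumPy sketch of this loop, assuming a one-dimensional Gaussian policy with a fixed standard deviation and a toy reward that pays +1 only in a "good" region; the specific numbers are illustrative, not taken from the demo:

```python
import numpy as np

rng = np.random.default_rng(0)

mu, sigma = 0.0, 1.0   # policy parameters: the mean is learned, the std is kept fixed
alpha = 0.05           # learning rate

def reward(a):
    # Toy reward: +1 in the high-reward region, -1 elsewhere (illustrative assumption)
    return 1.0 if a > 1.5 else -1.0

for step in range(200):
    actions = mu + sigma * rng.standard_normal(32)   # sample a batch from N(mu, sigma^2)
    rewards = np.array([reward(a) for a in actions])
    # d/d_mu log N(a; mu, sigma^2) = (a - mu) / sigma^2, so the gradient estimate is
    # the average of that score weighted by the observed reward
    grad_mu = np.mean((actions - mu) / sigma**2 * rewards)
    mu += alpha * grad_mu                            # gradient ascent on expected reward

print(f"learned mean: {mu:.2f}")   # drifts toward the high-reward region (a > 1.5)
```

Because high-reward samples pull the mean toward themselves and low-reward samples push it away, the distribution shifts without ever needing the gradient of the reward function itself.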
The Policy Gradient Theorem
The fundamental result that makes policy gradient methods possible:
∇_θJ(θ) = E[∇_θ log π(a|s;θ) · Q^π(s,a)]
This tells us how to estimate gradients using samples from the policy itself.
Key Insights
- Log probability trick: ∇_θ log π increases probability of good actions
- Weighted by returns: Better actions get larger gradient updates
- On-policy: Must sample from current policy π_θ
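As a hedged illustration of the log-probability trick, the sketch below estimates this expectation by sampling from a softmax policy over three discrete actions; the Q-values are made-up placeholders for a single state:

```python
import numpy as np

rng = np.random.default_rng(1)

theta = np.zeros(3)                       # logits for 3 discrete actions
q_values = np.array([0.0, 1.0, -1.0])     # illustrative Q^pi(s, a) values for one state

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

probs = softmax(theta)
grad = np.zeros_like(theta)
n_samples = 1000
for _ in range(n_samples):
    a = rng.choice(3, p=probs)            # sample an action from the current policy
    grad_log_pi = -probs.copy()
    grad_log_pi[a] += 1.0                 # d/d_theta log softmax(theta)[a] = onehot(a) - probs
    grad += grad_log_pi * q_values[a]     # weight the score by Q(s, a)
grad /= n_samples

print(grad)   # points toward raising the logit of the highest-Q action
```

Because only actions the policy actually samples contribute, the estimator is on-policy: it is only valid for the π_θ that generated the samples.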
REINFORCE in Action
Watch how REINFORCE learns a policy to reach the goal. The arrow opacity shows action probabilities at each state.
REINFORCE Update:
θ ← θ + α ∇_θ log π(a_t|s_t;θ) · G_t
where G_t is the return from time step t
Observations:
- High variance in early training
- Policy converges to the optimal path
- Exploration decreases over time
REINFORCE Algorithm
The simplest policy gradient algorithm uses Monte Carlo returns:
- Sample trajectory τ ~ π_θ
- Calculate returns G_t for each step
- Update: θ ← θ + α Σ_t ∇_θ log π(a_t|s_t;θ) · G_t
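A minimal PyTorch sketch of these three steps, assuming a small softmax policy for a discrete-action task; the network sizes, learning rate, and trajectory format are illustrative assumptions:

```python
import torch
import torch.nn as nn

# Small softmax policy for a 4-dimensional observation, 2-action task (illustrative sizes)
policy = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
gamma = 0.99

def reinforce_update(states, actions, rewards):
    """One REINFORCE update from a single sampled trajectory.

    states: list of observation tensors, actions: list of ints, rewards: list of floats.
    """
    # Compute Monte Carlo returns G_t by accumulating rewards backwards
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    returns = torch.tensor(list(reversed(returns)), dtype=torch.float32)

    logits = policy(torch.stack(states))
    log_probs = torch.log_softmax(logits, dim=-1)
    acts = torch.tensor(actions)
    chosen = log_probs[torch.arange(len(acts)), acts]   # log pi(a_t | s_t)

    # Minimizing the negative weighted log-likelihood performs the gradient-ascent update
    loss = -(chosen * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```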
Variance Reduction
REINFORCE suffers from high variance. Common solutions:
- Baseline subtraction: Use G_t - b(s_t) instead of G_t
- Advantage estimation: Use A(s,a) = Q(s,a) - V(s)
- Temporal structure: Use TD methods instead of Monte Carlo
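A short sketch combining the first two ideas, using a learned state-value baseline so the policy is updated with advantages G_t − V(s_t); the 0.5 weight on the value loss is a common but arbitrary choice, not part of REINFORCE itself:

```python
import torch
import torch.nn.functional as F

def pg_loss_with_baseline(log_probs, returns, values):
    """Policy-gradient loss with a learned state-value baseline.

    log_probs: log pi(a_t|s_t) for the sampled actions, shape [T]
    returns:   Monte Carlo returns G_t, shape [T]
    values:    V(s_t) predicted by a value network (the baseline), shape [T]
    """
    # Detach the baseline so policy gradients do not flow into the value network
    advantages = returns - values.detach()
    policy_loss = -(log_probs * advantages).sum()
    value_loss = F.mse_loss(values, returns)   # fit the baseline by regression onto G_t
    return policy_loss + 0.5 * value_loss
```

Subtracting a baseline leaves the gradient estimate unbiased (its expectation is unchanged) but can reduce its variance substantially.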
Advanced Policy Gradient Methods
1. Natural Policy Gradient
Accounts for the geometry of the policy space using the Fisher information matrix. Provides more stable and efficient updates.
2. Trust Region Policy Optimization (TRPO)
Constrains policy updates to ensure monotonic improvement by limiting KL divergence between old and new policies.
3. Proximal Policy Optimization (PPO)
Simplifies TRPO using a clipped surrogate objective. Currently one of the most popular and effective RL algorithms.
PPO Clipping Mechanism
PPO prevents large policy updates by clipping the objective function. This ensures stable training while maintaining good sample efficiency.
PPO Benefits:
- Prevents destructively large policy updates
- Simpler than TRPO (no conjugate gradient needed)
- Excellent performance on continuous control tasks
- Can reuse sampled data for multiple update epochs
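A compact sketch of the clipped surrogate objective, written as a loss to minimize; the 0.2 clip range is the commonly reported default, not a requirement:

```python
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """PPO clipped surrogate objective (negated so an optimizer can minimize it)."""
    ratio = torch.exp(new_log_probs - old_log_probs)   # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Taking the elementwise minimum removes any incentive to push the ratio
    # far outside [1 - eps, 1 + eps], which is what keeps updates small
    return -torch.min(unclipped, clipped).mean()
```

Because old_log_probs are frozen from the data-collecting policy, the same batch can be passed through this loss for several epochs, which is where PPO's sample-efficiency gain comes from.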
Continuous Action Spaces
Policy gradient methods excel at continuous control tasks. Common parameterizations:
- Gaussian policy: π(a|s) = N(μ(s;θ), σ²)
- Beta distribution: For bounded actions
- Mixture of Gaussians: For multimodal action distributions
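A minimal PyTorch sketch of the Gaussian case, with a state-dependent mean and a state-independent log standard deviation; the observation and action dimensions are placeholders:

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """Diagonal Gaussian policy pi(a|s) = N(mu(s; theta), sigma^2)."""

    def __init__(self, obs_dim=8, act_dim=2):
        super().__init__()
        self.mu_net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim)
        )
        self.log_std = nn.Parameter(torch.zeros(act_dim))  # learned, state-independent

    def forward(self, obs):
        mu = self.mu_net(obs)
        dist = torch.distributions.Normal(mu, self.log_std.exp())
        action = dist.sample()
        log_prob = dist.log_prob(action).sum(-1)   # sum over independent action dimensions
        return action, log_prob
```

Parameterizing the standard deviation through its log keeps it positive without constraints, and letting it shrink over training naturally reduces exploration as the policy becomes more confident.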
Why Policy Gradients Matter
- Continuous actions: Natural handling of continuous control
- Stochastic policies: Can learn to explore and handle uncertainty
- Convergence: Converges to a local optimum under standard conditions (suitable step sizes, sufficiently accurate gradient estimates)
- Differentiable: End-to-end learning with neural networks
- High-dimensional: Scales to complex action spaces