Actor-Critic & A3C

Best of Both Worlds

Actor-Critic methods combine the benefits of policy gradient methods (actor) with value function approximation (critic). The actor learns the policy while the critic evaluates actions, reducing variance and improving stability.

The Actor-Critic Architecture

Two neural networks work together:

  • Actor π(a|s;θ): Outputs action probabilities
  • Critic V(s;w): Estimates state values

The critic helps reduce variance by providing a baseline:

∇_θJ(θ) = E[∇_θ log π(a|s;θ) · A(s,a)]

where A(s,a) = Q(s,a) - V(s) is the advantage function.

Actor-Critic Architecture

The actor and critic networks work in tandem: the actor learns the policy while the critic provides the value estimates used to compute advantages.
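
As a concrete illustration, here is a minimal PyTorch sketch of the two networks (a hypothetical setup; the layer sizes, observation dimension, and action count are arbitrary placeholders):

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Policy network pi(a|s; theta): maps a state to a distribution over actions."""
    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs: torch.Tensor) -> torch.distributions.Categorical:
        return torch.distributions.Categorical(logits=self.net(obs))

class Critic(nn.Module):
    """Value network V(s; w): maps a state to a scalar value estimate."""
    def __init__(self, obs_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs).squeeze(-1)

actor, critic = Actor(obs_dim=4, n_actions=2), Critic(obs_dim=4)
state = torch.randn(4)
dist = actor(state)        # action distribution pi(.|s)
action = dist.sample()     # sampled action
value = critic(state)      # scalar estimate V(s)
```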

Advantage Function

The advantage tells us how much better an action is compared to the average:

TD Error as Advantage

A(s,a) ≈ δ = r + γV(s') - V(s)
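
In code, with `v_s` and `v_next` standing for the critic's outputs V(s) and V(s') (placeholder names, not from any library), this one-step estimate is just:

```python
import torch

def td_advantage(r: float, v_s: torch.Tensor, v_next: torch.Tensor,
                 gamma: float = 0.99, done: bool = False) -> torch.Tensor:
    """One-step TD error delta = r + gamma * V(s') - V(s), used as the advantage."""
    bootstrap = 0.0 if done else gamma * v_next   # no bootstrapping past a terminal state
    return r + bootstrap - v_s
```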

Generalized Advantage Estimation (GAE)

Balances bias and variance using exponentially weighted TD errors:

A^GAE_t = Σ_{l≥0} (γλ)^l · δ_{t+l}
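
Setting λ = 0 recovers the one-step TD error above, while λ = 1 recovers the full Monte Carlo advantage. Below is a sketch of the usual backward-pass computation, assuming `rewards` has shape (T,) and `values` carries one extra bootstrap entry V(s_T):

```python
import torch

def gae(rewards: torch.Tensor, values: torch.Tensor,
        gamma: float = 0.99, lam: float = 0.95) -> torch.Tensor:
    """Generalized Advantage Estimation over one trajectory.

    rewards: shape (T,)     rewards r_0 .. r_{T-1}
    values:  shape (T + 1,) critic values V(s_0) .. V(s_T); the last entry bootstraps
    Returns advantages of shape (T,).
    """
    values = values.detach()          # advantages are treated as constants in the actor loss
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # TD error delta_t
        running = delta + gamma * lam * running                  # exponentially weighted sum
        advantages[t] = running
    return advantages
```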

A3C: Asynchronous Advantage Actor-Critic

A3C revolutionized deep RL by using multiple parallel actors to:

  • Decorrelate data: Each actor explores different parts of the environment
  • Stabilize training: No need for experience replay
  • Speed up learning: Parallel collection of experience

A3C Architecture

Multiple workers collect experience in parallel and asynchronously update the global network. Each worker has its own environment instance.

A3C Algorithm

  1. Initialize global network parameters θ, w
  2. Launch multiple worker threads
  3. Each worker:
    • Copy global parameters
    • Collect trajectory using local policy
    • Compute actor and critic gradients from the advantage (n-step TD error)
    • Update global parameters asynchronously
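
A compact sketch of this loop using torch.multiprocessing is below. Everything here is illustrative: DummyEnv, the network sizes, the rollout length, and the per-worker Adam optimizer are placeholder choices (the A3C paper used RMSProp with shared statistics and real environments), so treat it as a shape of the algorithm rather than a reference implementation.

```python
import torch
import torch.nn as nn
import torch.multiprocessing as mp

class ActorCritic(nn.Module):
    """Tiny two-headed network used by the global model and every worker."""
    def __init__(self, obs_dim=4, n_actions=2, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh())
        self.pi = nn.Linear(hidden, n_actions)   # actor head: action logits
        self.v = nn.Linear(hidden, 1)            # critic head: state value

    def forward(self, obs):
        h = self.body(obs)
        return self.pi(h), self.v(h).squeeze(-1)

class DummyEnv:
    """Placeholder environment: random observations, reward 1 for action 0."""
    def reset(self):
        return torch.randn(4)
    def step(self, action):
        return torch.randn(4), float(action == 0)

def worker(global_model, n_updates=100, rollout_len=5, gamma=0.99, beta=0.01):
    local = ActorCritic()
    # Per-worker optimizer over the *shared* global parameters.
    opt = torch.optim.Adam(global_model.parameters(), lr=1e-3)
    env = DummyEnv()
    obs = env.reset()
    for _ in range(n_updates):
        local.load_state_dict(global_model.state_dict())      # 1. copy global params
        log_probs, values, rewards, entropies = [], [], [], []
        for _ in range(rollout_len):                           # 2. short rollout
            logits, v = local(obs)
            dist = torch.distributions.Categorical(logits=logits)
            a = dist.sample()
            obs, r = env.step(a.item())
            log_probs.append(dist.log_prob(a))
            values.append(v)
            rewards.append(r)
            entropies.append(dist.entropy())
        with torch.no_grad():
            R = local(obs)[1]                                  # bootstrap from V(s_T)
        actor_loss, critic_loss = 0.0, 0.0
        for t in reversed(range(rollout_len)):                 # 3. n-step returns & advantages
            R = rewards[t] + gamma * R
            adv = R - values[t]
            critic_loss = critic_loss + adv.pow(2)
            actor_loss = actor_loss - log_probs[t] * adv.detach() - beta * entropies[t]
        local.zero_grad()
        (actor_loss + 0.5 * critic_loss).backward()
        for gp, lp in zip(global_model.parameters(), local.parameters()):
            gp.grad = lp.grad                                  # 4. push local grads to global
        opt.step()

if __name__ == "__main__":
    global_model = ActorCritic()
    global_model.share_memory()                                # parameters live in shared memory
    procs = [mp.Process(target=worker, args=(global_model,)) for _ in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```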

Loss Functions

Actor loss (policy gradient):

L_actor = -log π(a|s) · A(s,a) - β·H(π)

Critic loss (value function), where R is the bootstrapped n-step return:

L_critic = (R - V(s))²
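
A sketch of both losses combined into a single objective, as is typical when the actor and critic share parameters (the tensor names and coefficients are placeholders):

```python
import torch
import torch.nn.functional as F

def actor_critic_loss(log_prob: torch.Tensor, entropy: torch.Tensor,
                      value: torch.Tensor, ret: torch.Tensor,
                      beta: float = 0.01, value_coef: float = 0.5) -> torch.Tensor:
    """log_prob, entropy, value, ret: per-step tensors of shape (T,)."""
    advantage = (ret - value).detach()              # treat the advantage as a constant for the actor
    actor_loss = -(log_prob * advantage).mean() - beta * entropy.mean()
    critic_loss = F.mse_loss(value, ret)            # (R - V(s))^2 averaged over the batch
    return actor_loss + value_coef * critic_loss    # one loss, one optimizer step
```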

A3C Training Progress

Multiple workers explore different strategies while all contributing to a shared global policy.

A3C Training:

  • Workers explore independently
  • Asynchronous updates to the global network
  • Natural exploration from parallelism

Advantages over DQN:

  • No experience replay needed
  • Works with any network architecture
  • More robust to hyperparameters

Modern Variants

A2C (Advantage Actor-Critic)

Synchronous version of A3C. Waits for all workers before updating. Often more stable and easier to debug.

SAC (Soft Actor-Critic)

Off-policy method that maximizes an entropy-augmented objective for better exploration. A strong default for continuous control tasks.

IMPALA

Importance Weighted Actor-Learner Architecture. Scales to thousands of actors by correcting off-policy data with V-trace.

Implementation Tips

  • Shared layers: Actor and critic can share early layers (see the sketch after this list)
  • Gradient clipping: Prevent exploding gradients
  • Entropy bonus: Encourage exploration
  • Learning rate scheduling: Decay over time
  • Proper initialization: Critical for convergence
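
For the shared-layers and gradient-clipping tips, here is a sketch of a two-headed network and a clipped update step (the layer sizes, the 0.5 clipping norm, and the Adam optimizer are illustrative choices, not prescriptions):

```python
import torch
import torch.nn as nn

class SharedActorCritic(nn.Module):
    """Actor and critic heads on top of a shared feature trunk."""
    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(),
                                   nn.Linear(hidden, hidden), nn.Tanh())
        self.pi_head = nn.Linear(hidden, n_actions)   # actor: action logits
        self.v_head = nn.Linear(hidden, 1)            # critic: state value

    def forward(self, obs: torch.Tensor):
        h = self.trunk(obs)
        return self.pi_head(h), self.v_head(h).squeeze(-1)

model = SharedActorCritic(obs_dim=4, n_actions=2)
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)

def clipped_step(loss: torch.Tensor, max_norm: float = 0.5) -> None:
    """One optimizer step with global-norm gradient clipping."""
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimizer.step()
```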

Why Actor-Critic Methods Excel

  • Lower variance: Critic provides learned baseline
  • Online learning: Can learn from incomplete episodes
  • Continuous actions: Natural for continuous control
  • Sample efficient: Better than pure policy gradient
  • Scalable: Parallelization with A3C/A2C