Actor-Critic & A3C

Best of Both Worlds

Actor-Critic methods combine the benefits of policy gradient methods (actor) with value function approximation (critic). The actor learns the policy while the critic evaluates actions, reducing variance and improving stability.

The Actor-Critic Architecture

Two neural networks work together:

  • Actor π(a|s;θ): Outputs action probabilities
  • Critic V(s;w): Estimates state values

The critic helps reduce variance by providing a baseline:

∇_θJ(θ) = E[∇_θ log π(a|s;θ) · A(s,a)]

where A(s,a) = Q(s,a) - V(s) is the advantage function.

Actor-Critic Architecture

The actor and critic networks work in tandem: the actor learns the policy while the critic provides the value estimates used to compute advantages.
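
As a concrete illustration, here is a minimal PyTorch sketch of the two networks (a hypothetical setup; the layer sizes, observation dimension, and action count are arbitrary placeholders):

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Policy network pi(a|s; theta): maps a state to a distribution over actions."""
    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs: torch.Tensor) -> torch.distributions.Categorical:
        return torch.distributions.Categorical(logits=self.net(obs))

class Critic(nn.Module):
    """Value network V(s; w): maps a state to a scalar value estimate."""
    def __init__(self, obs_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs).squeeze(-1)

actor, critic = Actor(obs_dim=4, n_actions=2), Critic(obs_dim=4)
state = torch.randn(4)
dist = actor(state)        # action distribution pi(.|s)
action = dist.sample()     # sampled action
value = critic(state)      # scalar estimate V(s)
```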

Advantage Function

The advantage tells us how much better an action is compared to the average:

TD Error as Advantage

A(s,a) ≈ δ = r + γV(s') - V(s)
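
In code, with `v_s` and `v_next` standing for the critic's outputs V(s) and V(s') (placeholder names, not from any library), this one-step estimate is just:

```python
import torch

def td_advantage(r: float, v_s: torch.Tensor, v_next: torch.Tensor,
                 gamma: float = 0.99, done: bool = False) -> torch.Tensor:
    """One-step TD error delta = r + gamma * V(s') - V(s), used as the advantage."""
    bootstrap = 0.0 if done else gamma * v_next   # no bootstrapping past a terminal state
    return r + bootstrap - v_s
```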

Generalized Advantage Estimation (GAE)

Balances bias and variance using exponentially weighted TD errors:

A^GAE_t = Σ_{l≥0} (γλ)^l · δ_{t+l}
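
Setting λ = 0 recovers the one-step TD error above, while λ = 1 recovers the full Monte Carlo advantage. Below is a sketch of the usual backward-pass computation, assuming `rewards` has shape (T,) and `values` carries one extra bootstrap entry V(s_T):

```python
import torch

def gae(rewards: torch.Tensor, values: torch.Tensor,
        gamma: float = 0.99, lam: float = 0.95) -> torch.Tensor:
    """Generalized Advantage Estimation over one trajectory.

    rewards: shape (T,)     rewards r_0 .. r_{T-1}
    values:  shape (T + 1,) critic values V(s_0) .. V(s_T); the last entry bootstraps
    Returns advantages of shape (T,).
    """
    values = values.detach()          # advantages are treated as constants in the actor loss
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # TD error delta_t
        running = delta + gamma * lam * running                  # exponentially weighted sum
        advantages[t] = running
    return advantages
```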

A3C: Asynchronous Advantage Actor-Critic

A3C revolutionized deep RL by using multiple parallel actors to:

  • Decorrelate data: Each actor explores different parts of the environment
  • Stabilize training: No need for experience replay
  • Speed up learning: Parallel collection of experience

A3C Architecture

Multiple workers collect experience in parallel and asynchronously update the global network. Each worker has its own environment instance.

A3C Algorithm

  1. Initialize global network parameters θ, w
  2. Launch multiple worker threads
  3. Each worker:
    • Copy global parameters
    • Collect trajectory using local policy
    • Compute actor and critic gradients from the advantage (n-step TD error)
    • Update global parameters asynchronously
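
A compact sketch of this loop using torch.multiprocessing is below. Everything here is illustrative: DummyEnv, the network sizes, the rollout length, and the per-worker Adam optimizer are placeholder choices (the A3C paper used RMSProp with shared statistics and real environments), so treat it as a shape of the algorithm rather than a reference implementation.

```python
import torch
import torch.nn as nn
import torch.multiprocessing as mp

class ActorCritic(nn.Module):
    """Tiny two-headed network used by the global model and every worker."""
    def __init__(self, obs_dim=4, n_actions=2, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh())
        self.pi = nn.Linear(hidden, n_actions)   # actor head: action logits
        self.v = nn.Linear(hidden, 1)            # critic head: state value

    def forward(self, obs):
        h = self.body(obs)
        return self.pi(h), self.v(h).squeeze(-1)

class DummyEnv:
    """Placeholder environment: random observations, reward 1 for action 0."""
    def reset(self):
        return torch.randn(4)
    def step(self, action):
        return torch.randn(4), float(action == 0)

def worker(global_model, n_updates=100, rollout_len=5, gamma=0.99, beta=0.01):
    local = ActorCritic()
    # Per-worker optimizer over the *shared* global parameters.
    opt = torch.optim.Adam(global_model.parameters(), lr=1e-3)
    env = DummyEnv()
    obs = env.reset()
    for _ in range(n_updates):
        local.load_state_dict(global_model.state_dict())      # 1. copy global params
        log_probs, values, rewards, entropies = [], [], [], []
        for _ in range(rollout_len):                           # 2. short rollout
            logits, v = local(obs)
            dist = torch.distributions.Categorical(logits=logits)
            a = dist.sample()
            obs, r = env.step(a.item())
            log_probs.append(dist.log_prob(a))
            values.append(v)
            rewards.append(r)
            entropies.append(dist.entropy())
        with torch.no_grad():
            R = local(obs)[1]                                  # bootstrap from V(s_T)
        actor_loss, critic_loss = 0.0, 0.0
        for t in reversed(range(rollout_len)):                 # 3. n-step returns & advantages
            R = rewards[t] + gamma * R
            adv = R - values[t]
            critic_loss = critic_loss + adv.pow(2)
            actor_loss = actor_loss - log_probs[t] * adv.detach() - beta * entropies[t]
        local.zero_grad()
        (actor_loss + 0.5 * critic_loss).backward()
        for gp, lp in zip(global_model.parameters(), local.parameters()):
            gp.grad = lp.grad                                  # 4. push local grads to global
        opt.step()

if __name__ == "__main__":
    global_model = ActorCritic()
    global_model.share_memory()                                # parameters live in shared memory
    procs = [mp.Process(target=worker, args=(global_model,)) for _ in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```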

Loss Functions

Actor loss (policy gradient):

L_actor = -log π(a|s) · A(s,a) - β·H(π)

Critic loss (value function), where R is the bootstrapped n-step return:

L_critic = (R - V(s))²
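
A sketch of both losses combined into a single objective, as is typical when the actor and critic share parameters (the tensor names and coefficients are placeholders):

```python
import torch
import torch.nn.functional as F

def actor_critic_loss(log_prob: torch.Tensor, entropy: torch.Tensor,
                      value: torch.Tensor, ret: torch.Tensor,
                      beta: float = 0.01, value_coef: float = 0.5) -> torch.Tensor:
    """log_prob, entropy, value, ret: per-step tensors of shape (T,)."""
    advantage = (ret - value).detach()              # treat the advantage as a constant for the actor
    actor_loss = -(log_prob * advantage).mean() - beta * entropy.mean()
    critic_loss = F.mse_loss(value, ret)            # (R - V(s))^2 averaged over the batch
    return actor_loss + value_coef * critic_loss    # one loss, one optimizer step
```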

A3C Training Progress

Multiple workers explore different strategies while all contributing to a shared global policy.

A3C Training:

  • Workers explore independently
  • Asynchronous updates to the global network
  • Natural exploration from parallelism

Advantages over DQN:

  • No experience replay needed
  • Works with any network architecture
  • More robust to hyperparameters

Modern Variants

A2C (Advantage Actor-Critic)

Synchronous version of A3C. Waits for all workers before updating. Often more stable and easier to debug.

SAC (Soft Actor-Critic)

Off-policy method that maximizes an entropy-augmented objective for better exploration. A strong default for continuous control tasks.

IMPALA

Importance Weighted Actor-Learner Architecture. Scales to thousands of actors by correcting off-policy data with V-trace.

Implementation Tips

  • Shared layers: Actor and critic can share early layers (see the sketch after this list)
  • Gradient clipping: Prevent exploding gradients
  • Entropy bonus: Encourage exploration
  • Learning rate scheduling: Decay over time
  • Proper initialization: Critical for convergence
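
For the shared-layers and gradient-clipping tips, here is a sketch of a two-headed network and a clipped update step (the layer sizes, the 0.5 clipping norm, and the Adam optimizer are illustrative choices, not prescriptions):

```python
import torch
import torch.nn as nn

class SharedActorCritic(nn.Module):
    """Actor and critic heads on top of a shared feature trunk."""
    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(),
                                   nn.Linear(hidden, hidden), nn.Tanh())
        self.pi_head = nn.Linear(hidden, n_actions)   # actor: action logits
        self.v_head = nn.Linear(hidden, 1)            # critic: state value

    def forward(self, obs: torch.Tensor):
        h = self.trunk(obs)
        return self.pi_head(h), self.v_head(h).squeeze(-1)

model = SharedActorCritic(obs_dim=4, n_actions=2)
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)

def clipped_step(loss: torch.Tensor, max_norm: float = 0.5) -> None:
    """One optimizer step with global-norm gradient clipping."""
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimizer.step()
```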

Why Actor-Critic Methods Excel

  • Lower variance: Critic provides learned baseline
  • Online learning: Can learn from incomplete episodes
  • Continuous actions: Natural for continuous control
  • Sample efficient: Better than pure policy gradient
  • Scalable: Parallelization with A3C/A2C