Actor-Critic & A3C
Best of Both Worlds
Actor-Critic methods combine the benefits of policy gradient methods (the actor) with value function approximation (the critic). The actor learns the policy while the critic estimates values used to judge the actor's actions, which reduces the variance of the policy gradient and improves training stability.
The Actor-Critic Architecture
Two neural networks work together:
- Actor π(a|s;θ): Outputs action probabilities
- Critic V(s;w): Estimates state values
The critic helps reduce variance by providing a baseline:
∇_θJ(θ) = E[∇_θ log π(a|s;θ) · A(s,a)]
where A(s,a) = Q(s,a) - V(s) is the advantage function.
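As a minimal sketch, the two networks can be written as a single module with a shared trunk and two heads. The layer sizes, the discrete action space, and the `ActorCritic` name below are illustrative assumptions, not a fixed recipe:

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    """Shared trunk with two heads: actor outputs action logits, critic a state value."""
    def __init__(self, obs_dim, n_actions, hidden=128):  # sizes are illustrative
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh())
        self.actor = nn.Linear(hidden, n_actions)   # π(a|s; θ): logits over actions
        self.critic = nn.Linear(hidden, 1)          # V(s; w): scalar state value

    def forward(self, obs):
        h = self.trunk(obs)
        dist = torch.distributions.Categorical(logits=self.actor(h))
        return dist, self.critic(h).squeeze(-1)
```

For example, `dist, value = ActorCritic(4, 2)(torch.randn(1, 4))` returns an action distribution and a value estimate for a batch of one observation. Sharing the trunk is optional (see Implementation Tips below); fully separate networks also work, at the cost of more parameters.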
Actor-Critic Architecture
See how the actor and critic networks work together. The actor learns the policy while the critic provides value estimates for computing advantages.
Advantage Function
The advantage tells us how much better an action is compared to the average:
TD Error as Advantage
A(s,a) ≈ δ = r + γV(s') - V(s)
Generalized Advantage Estimation (GAE)
Balances bias and variance using exponentially weighted TD errors:
A_t^GAE = Σ_{l=0}^∞ (γλ)^l δ_{t+l},  where δ_t = r_t + γV(s_{t+1}) - V(s_t)
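In code, GAE reduces to a single backward pass over the rollout. The sketch below assumes lists of per-step rewards and value estimates plus a bootstrap value for the final state; episode-termination (done) handling is omitted for brevity:

```python
def compute_gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Exponentially weighted sum of TD errors: A_t = Σ_l (γλ)^l δ_{t+l}."""
    advantages = [0.0] * len(rewards)
    gae = 0.0
    next_value = last_value                                   # bootstrap value for the state after the rollout
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]   # TD error δ_t
        gae = delta + gamma * lam * gae                       # recursive GAE accumulation
        advantages[t] = gae
        next_value = values[t]
    return advantages
```

Setting λ = 0 recovers the one-step TD error above, while λ = 1 approaches the full Monte Carlo advantage.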
A3C: Asynchronous Advantage Actor-Critic
A3C revolutionized deep RL by using multiple parallel actors to:
- Decorrelate data: Each actor explores different parts of the environment
- Stabilize training: No need for experience replay
- Speed up learning: Parallel collection of experience
A3C Architecture
Multiple workers collect experience in parallel and asynchronously update the global network. Each worker has its own environment instance.
A3C Algorithm
- Initialize global network parameters θ, w
- Launch multiple worker threads
- Each worker, in a loop:
  - Copy the global parameters into its local network
  - Collect a trajectory using the local policy
  - Compute actor and critic gradients from the TD errors (advantages)
  - Update the global parameters asynchronously (a code sketch of this loop follows)
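One worker's inner loop might look roughly like the sketch below. It reuses the `ActorCritic` and `compute_gae` sketches from above; `collect_rollout` and `actor_critic_loss` (sketched under Loss Functions below) are hypothetical helpers, and the shared optimizer is assumed to be built over the global network's parameters. A full A3C implementation runs many such workers in parallel processes against a shared-memory global model (e.g. with torch.multiprocessing); that machinery is omitted here.

```python
def worker_step(global_model, local_model, optimizer, env):
    # 1. Copy the global parameters into the worker's local network
    local_model.load_state_dict(global_model.state_dict())

    # 2. Collect a short trajectory with the local policy (hypothetical helper)
    obs, actions, rewards, values, last_value = collect_rollout(env, local_model)

    # 3. Compute advantages and the combined actor-critic loss locally
    advantages = compute_gae(rewards, values, last_value)
    loss = actor_critic_loss(local_model, obs, actions, advantages, values)

    # 4. Push the local gradients into the global network and update it asynchronously
    optimizer.zero_grad()
    local_model.zero_grad()
    loss.backward()
    for g_param, l_param in zip(global_model.parameters(), local_model.parameters()):
        g_param.grad = l_param.grad   # hand local gradients to the global parameters
    optimizer.step()
```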
Loss Functions
Actor loss (policy gradient):
L_actor = -log π(a|s) · A(s,a) - β·H(π)
Critic loss (value function):
L_critic = (R - V(s))²
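Both losses, plus the entropy bonus, are typically combined into a single scalar objective. The sketch below reuses the `ActorCritic` module from above and bootstraps the target R as advantage + value; the coefficients (0.5 on the value loss, β = 0.01) are common but illustrative choices:

```python
import torch

def actor_critic_loss(model, obs, actions, advantages, values, beta=0.01, value_coef=0.5):
    """Combined loss: policy-gradient term + weighted value regression - entropy bonus."""
    dist, value_pred = model(obs)
    advantages = torch.as_tensor(advantages, dtype=torch.float32)
    returns = advantages + torch.as_tensor(values, dtype=torch.float32)  # R ≈ A + V(s), the critic's target

    actor_loss = -(dist.log_prob(actions) * advantages.detach()).mean()  # -log π(a|s) · A(s,a)
    critic_loss = (returns.detach() - value_pred).pow(2).mean()          # (R - V(s))²
    entropy = dist.entropy().mean()                                      # H(π): encourages exploration

    return actor_loss + value_coef * critic_loss - beta * entropy
```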
A3C Training Progress
Watch how multiple workers explore different strategies while contributing to a shared global policy. The red line shows the average performance.
A3C Training:
- Workers explore independently
- Asynchronous updates to the global network
- Natural exploration from parallelism
Advantages over DQN:
- No experience replay needed
- Works with any network architecture
- More robust to hyperparameters
Modern Variants
A2C (Advantage Actor-Critic)
Synchronous version of A3C. Waits for all workers before updating. Often more stable and easier to debug.
SAC (Soft Actor-Critic)
An off-policy actor-critic that maximizes an entropy-regularized objective for better exploration. Among the strongest methods for continuous control tasks.
IMPALA
Importance Weighted Actor-Learner Architecture. Decouples acting from learning and scales to thousands of workers, using V-trace off-policy corrections.
Implementation Tips
- Shared layers: Actor and critic can share early layers
- Gradient clipping: Prevent exploding gradients
- Entropy bonus: Encourage exploration
- Learning rate scheduling: Decay over time
- Proper initialization: Critical for convergence (several of these tips are sketched in code after this list)
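A few of these tips in code form, reusing the `ActorCritic` sketch from above; the clip norm, learning-rate schedule, and initialization gains shown here are common defaults, not prescriptions:

```python
import torch
from torch import optim

# Reusing the ActorCritic sketch from above (obs_dim=4, n_actions=2 are placeholders)
model = ActorCritic(4, 2)
optimizer = optim.Adam(model.parameters(), lr=7e-4)

# Proper initialization: orthogonal weights, with a small gain on the policy head
for layer in model.trunk:
    if isinstance(layer, torch.nn.Linear):
        torch.nn.init.orthogonal_(layer.weight, gain=torch.nn.init.calculate_gain("tanh"))
torch.nn.init.orthogonal_(model.actor.weight, gain=0.01)

# Learning rate scheduling: linear decay over the course of training
scheduler = optim.lr_scheduler.LinearLR(optimizer, start_factor=1.0, end_factor=0.1, total_iters=100_000)

# Inside the update loop: clip the global gradient norm before stepping
# loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.5)
# optimizer.step(); scheduler.step()
```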
Why Actor-Critic Methods Excel
- Lower variance: Critic provides learned baseline
- Online learning: Can learn from incomplete episodes
- Continuous actions: Natural for continuous control
- Sample efficient: Better than pure policy gradient
- Scalable: Parallelization with A3C/A2C