Deep Q-Networks (DQN)
Scaling Q-Learning with Deep Learning
DQN revolutionized reinforcement learning by successfully combining Q-learning with deep neural networks. This breakthrough enabled RL to tackle complex problems with high-dimensional state spaces, like playing Atari games from raw pixels.
The Challenge
Traditional Q-learning stores values in a table, which becomes infeasible for:
- Large state spaces: Chess has ~10^47 states
- Continuous states: Robot joint angles, velocities
- High-dimensional inputs: Images, sensor arrays
Solution: Use a neural network to approximate Q(s,a) → Q(s,a;θ)
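To make the idea concrete, here is a minimal PyTorch sketch of Q(s,a;θ) for a low-dimensional task. The state size, number of actions, and hidden width are illustrative placeholders, not values from the original DQN:

```python
import torch
import torch.nn as nn

# Minimal sketch: replace the Q-table with a function approximator.
# state_dim and n_actions are placeholders for a low-dimensional task
# (e.g. a 4-dimensional CartPole observation with 2 discrete actions).
class QNetwork(nn.Module):
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),  # one Q-value per discrete action
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)  # shape: (batch, n_actions)
```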
DQN Network Architecture
The network takes stacked game frames as input and outputs one Q-value for each possible action; a code sketch of the classic architecture follows the two lists below.
Input Processing:
- 4 frames stacked for temporal info
- Grayscale to reduce complexity
- Resized to 84×84 pixels
Output Interpretation:
- One Q-value per action
- Higher Q → better action
- Action selection: argmax or ε-greedy
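A PyTorch sketch of this pixel-based architecture, following the widely cited layer sizes from the Nature DQN paper (preprocessing details such as frame skipping and reward clipping are omitted):

```python
import torch
import torch.nn as nn

# Atari-style convolutional Q-network: 4 stacked 84x84 grayscale frames
# in, one Q-value per action out. Layer sizes follow the commonly cited
# Nature-DQN configuration.
class AtariQNetwork(nn.Module):
    def __init__(self, n_actions: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        self.head = nn.Sequential(
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions),
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, 4, 84, 84), pixel values scaled to [0, 1]
        return self.head(self.features(frames))
```

At evaluation time the agent takes the argmax over the output vector; during training it follows an ε-greedy policy over the same Q-values.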
Key Innovations in DQN
1. Experience Replay
Store transitions (s,a,r,s') in a replay buffer and sample random minibatches for training. This breaks correlations and improves sample efficiency.
2. Target Network
Use a separate target network Q(s,a;θ⁻) for computing targets, updated periodically. This stabilizes training by keeping targets fixed.
3. Frame Stacking
Stack multiple frames as input to capture motion and temporal information.
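A minimal frame-stacking helper, assuming frames have already been converted to 84×84 grayscale; the class name and interface are illustrative, not part of the original DQN code:

```python
from collections import deque
import numpy as np

# Illustrative frame stacker: keeps the last k preprocessed frames and
# concatenates them into the (k, 84, 84) array the network consumes.
class FrameStack:
    def __init__(self, k: int = 4):
        self.k = k
        self.frames = deque(maxlen=k)

    def reset(self, first_frame: np.ndarray) -> np.ndarray:
        # At episode start, repeat the first frame k times.
        for _ in range(self.k):
            self.frames.append(first_frame)
        return np.stack(self.frames)   # shape: (k, 84, 84)

    def step(self, next_frame: np.ndarray) -> np.ndarray:
        self.frames.append(next_frame)
        return np.stack(self.frames)
```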
Experience Replay Mechanism
Experience replay stores transitions in a buffer and samples random minibatches for training, which breaks temporal correlations and makes learning more stable; a minimal buffer is sketched after the list below.
Why Experience Replay Works:
- Decorrelation: Sequential experiences are highly correlated
- Efficiency: Reuse each experience multiple times
- Stability: Avoid feedback loops from online learning
- Prioritization: Can sample important experiences more often
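A simple uniform replay buffer sketch; the capacity and batch size are illustrative hyperparameters rather than the original paper's settings:

```python
import random
from collections import deque

# Uniform replay buffer: store (s, a, r, s', done) tuples and sample
# decorrelated minibatches for training.
class ReplayBuffer:
    def __init__(self, capacity: int = 100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size: int = 32):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```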
The DQN Algorithm
- Initialize the replay buffer D, the online network Q(s,a;θ), and the target network Q(s,a;θ⁻) with θ⁻ = θ
- For each step of each episode:
  - Select an action using an ε-greedy policy based on Q(s,a;θ)
  - Execute the action; observe the reward r and the next state s'
  - Store the transition (s, a, r, s') in D
  - Sample a random minibatch of transitions from D
  - Compute the target: y = r if s' is terminal, otherwise y = r + γ max_a' Q(s',a';θ⁻)
  - Update θ by gradient descent on (y - Q(s,a;θ))² (a single update step is sketched in code below)
  - Every C steps: θ⁻ ← θ
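A sketch of one gradient step, assuming networks like the earlier sketches and a minibatch that has already been converted to PyTorch tensors; the function name and hyperparameters are illustrative:

```python
import torch
import torch.nn.functional as F

# One DQN update. Expects float32 tensors for states/rewards/dones and
# int64 tensors for actions.
def dqn_update(q_net, target_net, optimizer, batch, gamma=0.99):
    states, actions, rewards, next_states, dones = batch

    # Q(s, a; theta) for the actions actually taken
    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # y = r + gamma * max_a' Q(s', a'; theta^-), with y = r at terminal states
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * next_q * (1.0 - dones)

    loss = F.smooth_l1_loss(q_values, targets)  # Huber loss, as in DQN
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```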
DQN Training Process
During training, the main and target networks play complementary roles; the target network is updated only periodically to stabilize learning. Both update schemes are sketched in code after the lists below.
Main Network θ
- Updated every step
- Minimizes TD error
- Used for action selection
Target Network θ⁻
- Updated every C steps
- Provides stable targets
- Prevents oscillations
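Two common ways to refresh the target network. The original DQN uses the periodic hard copy; the soft (Polyak) variant shown for comparison is used by some later algorithms, and the value of tau here is purely illustrative:

```python
# Hard update, as in the original DQN: copy weights every C steps.
def hard_update(target_net, q_net):
    target_net.load_state_dict(q_net.state_dict())

# Soft (Polyak) update, used by some later variants instead of the
# periodic copy: blend a small fraction tau of the online weights in
# at every step.
def soft_update(target_net, q_net, tau: float = 0.005):
    for t_param, param in zip(target_net.parameters(), q_net.parameters()):
        t_param.data.mul_(1.0 - tau).add_(tau * param.data)
```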
Improvements and Variants
Double DQN
Addresses overestimation bias by decoupling action selection and evaluation:
y = r + γ Q(s', argmax_a' Q(s',a';θ); θ⁻)
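A sketch of the Double DQN target, assuming q_net and target_net like the earlier sketches; compare it with the max-based target in the update function above:

```python
import torch

# Double DQN target: the online network selects the action, the target
# network evaluates it, which reduces overestimation bias.
def double_dqn_targets(q_net, target_net, rewards, next_states, dones, gamma=0.99):
    with torch.no_grad():
        best_actions = q_net(next_states).argmax(dim=1, keepdim=True)
        next_q = target_net(next_states).gather(1, best_actions).squeeze(1)
        return rewards + gamma * next_q * (1.0 - dones)
```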
Dueling DQN
Separate value and advantage streams: Q(s,a) = V(s) + A(s,a) - mean(A)
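A sketch of a dueling head that could sit on top of the convolutional features above; the layer sizes are illustrative:

```python
import torch
import torch.nn as nn

# Dueling head: shared features split into a scalar value stream V(s)
# and an advantage stream A(s, a), recombined with the mean advantage
# subtracted so the decomposition is identifiable.
class DuelingHead(nn.Module):
    def __init__(self, feature_dim: int, n_actions: int, hidden: int = 512):
        super().__init__()
        self.value = nn.Sequential(
            nn.Linear(feature_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.advantage = nn.Sequential(
            nn.Linear(feature_dim, hidden), nn.ReLU(), nn.Linear(hidden, n_actions))

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        v = self.value(features)                    # (batch, 1)
        a = self.advantage(features)                # (batch, n_actions)
        return v + a - a.mean(dim=1, keepdim=True)  # Q(s, a)
```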
Prioritized Replay
Sample transitions based on TD error magnitude, focusing on surprising experiences.
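A minimal sketch of proportional prioritization. Real implementations typically use a sum-tree for efficient sampling; the exponents alpha and beta shown here are the usual prioritization and bias-correction parameters, with illustrative default values:

```python
import numpy as np

# Proportional prioritized sampling: sampling probability grows with
# TD-error magnitude, and importance weights correct the resulting bias.
def sample_indices(td_errors: np.ndarray, batch_size: int,
                   alpha: float = 0.6, beta: float = 0.4, eps: float = 1e-6):
    priorities = (np.abs(td_errors) + eps) ** alpha
    probs = priorities / priorities.sum()
    idx = np.random.choice(len(td_errors), size=batch_size, p=probs)
    weights = (len(td_errors) * probs[idx]) ** (-beta)
    weights /= weights.max()   # normalize weights for stability
    return idx, weights
```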
Rainbow DQN
Combines all improvements: Double, Dueling, Prioritized, Multi-step, Distributional, Noisy.
Applications and Impact
- Game Playing: Human-level play on many Atari games; later deep RL systems built on these ideas reached superhuman performance in Go and StarCraft
- Robotics: Manipulation, locomotion, drone control
- Resource Management: Data center cooling, traffic control
- Finance: Portfolio optimization, trading strategies
- Healthcare: Treatment planning, drug discovery
Why DQN Matters
- Breakthrough: First deep RL agent to reach human-level play on a broad suite of Atari games directly from pixels
- General Purpose: Same algorithm works across many domains
- Sample Efficient: Experience replay reuses data effectively
- Stable Training: Target networks prevent divergence
- Foundation: Spawned numerous improvements and variants