Deep Q-Networks (DQN)
Scaling Q-Learning with Deep Learning
DQN revolutionized reinforcement learning by successfully combining Q-learning with deep neural networks. This breakthrough enabled RL to tackle complex problems with high-dimensional state spaces, like playing Atari games from raw pixels.
The Challenge
Traditional Q-learning stores values in a table, which becomes infeasible for:
- Large state spaces: Chess has ~10^47 states
- Continuous states: Robot joint angles, velocities
- High-dimensional inputs: Images, sensor arrays
Solution: Use a neural network to approximate Q(s,a) → Q(s,a;θ)
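To make the idea concrete, here is a minimal PyTorch sketch of Q(s,a;θ) for a low-dimensional task. The state size, number of actions, and hidden width are illustrative placeholders, not values from the original DQN:

```python
import torch
import torch.nn as nn

# Minimal sketch: replace the Q-table with a function approximator.
# state_dim and n_actions are placeholders for a low-dimensional task
# (e.g. a 4-dimensional CartPole observation with 2 discrete actions).
class QNetwork(nn.Module):
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),  # one Q-value per discrete action
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)  # shape: (batch, n_actions)
```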
DQN Network Architecture
The network takes stacked game frames as input and outputs one Q-value for each possible action; a code sketch of the classic architecture follows the two lists below.
Input Processing:
- 4 frames stacked for temporal info
- Grayscale to reduce complexity
- Resized to 84×84 pixels
Output Interpretation:
- One Q-value per action
- Higher Q → better action
- Action selection: argmax or ε-greedy
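A PyTorch sketch of this pixel-based architecture, following the widely cited layer sizes from the Nature DQN paper (preprocessing details such as frame skipping and reward clipping are omitted):

```python
import torch
import torch.nn as nn

# Atari-style convolutional Q-network: 4 stacked 84x84 grayscale frames
# in, one Q-value per action out. Layer sizes follow the commonly cited
# Nature-DQN configuration.
class AtariQNetwork(nn.Module):
    def __init__(self, n_actions: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        self.head = nn.Sequential(
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions),
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, 4, 84, 84), pixel values scaled to [0, 1]
        return self.head(self.features(frames))
```

At evaluation time the agent takes the argmax over the output vector; during training it follows an ε-greedy policy over the same Q-values.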
Key Innovations in DQN
1. Experience Replay
Store transitions (s,a,r,s') in a replay buffer and sample random minibatches for training. This breaks correlations and improves sample efficiency.
2. Target Network
Use a separate target network Q(s,a;θ⁻) for computing targets, updated periodically. This stabilizes training by keeping targets fixed.
3. Frame Stacking
Stack multiple frames as input to capture motion and temporal information.
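A minimal frame-stacking helper, assuming frames have already been converted to 84×84 grayscale; the class name and interface are illustrative, not part of the original DQN code:

```python
from collections import deque
import numpy as np

# Illustrative frame stacker: keeps the last k preprocessed frames and
# concatenates them into the (k, 84, 84) array the network consumes.
class FrameStack:
    def __init__(self, k: int = 4):
        self.k = k
        self.frames = deque(maxlen=k)

    def reset(self, first_frame: np.ndarray) -> np.ndarray:
        # At episode start, repeat the first frame k times.
        for _ in range(self.k):
            self.frames.append(first_frame)
        return np.stack(self.frames)   # shape: (k, 84, 84)

    def step(self, next_frame: np.ndarray) -> np.ndarray:
        self.frames.append(next_frame)
        return np.stack(self.frames)
```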
Experience Replay Mechanism
Experience replay stores transitions in a buffer and samples random minibatches for training, which breaks temporal correlations and makes learning more stable; a minimal buffer is sketched after the list below.
Why Experience Replay Works:
- Decorrelation: Sequential experiences are highly correlated
- Efficiency: Reuse each experience multiple times
- Stability: Avoid feedback loops from online learning
- Prioritization: Can sample important experiences more often
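A simple uniform replay buffer sketch; the capacity and batch size are illustrative hyperparameters rather than the original paper's settings:

```python
import random
from collections import deque

# Uniform replay buffer: store (s, a, r, s', done) tuples and sample
# decorrelated minibatches for training.
class ReplayBuffer:
    def __init__(self, capacity: int = 100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size: int = 32):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```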
The DQN Algorithm
- Initialize the replay buffer D, the online network Q(s,a;θ), and the target network Q(s,a;θ⁻) with θ⁻ = θ
- For each step of each episode:
  - Select an action using an ε-greedy policy based on Q(s,a;θ)
  - Execute the action; observe the reward r and the next state s'
  - Store the transition (s, a, r, s') in D
  - Sample a random minibatch of transitions from D
  - Compute the target: y = r if s' is terminal, otherwise y = r + γ max_a' Q(s',a';θ⁻)
  - Update θ by gradient descent on (y - Q(s,a;θ))² (a single update step is sketched in code below)
  - Every C steps: θ⁻ ← θ
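A sketch of one gradient step, assuming networks like the earlier sketches and a minibatch that has already been converted to PyTorch tensors; the function name and hyperparameters are illustrative:

```python
import torch
import torch.nn.functional as F

# One DQN update. Expects float32 tensors for states/rewards/dones and
# int64 tensors for actions.
def dqn_update(q_net, target_net, optimizer, batch, gamma=0.99):
    states, actions, rewards, next_states, dones = batch

    # Q(s, a; theta) for the actions actually taken
    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # y = r + gamma * max_a' Q(s', a'; theta^-), with y = r at terminal states
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * next_q * (1.0 - dones)

    loss = F.smooth_l1_loss(q_values, targets)  # Huber loss, as in DQN
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```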
DQN Training Process
During training, the main and target networks play complementary roles; the target network is updated only periodically to stabilize learning. Both update schemes are sketched in code after the lists below.
Main Network θ
- Updated every step
- Minimizes TD error
- Used for action selection
Target Network θ⁻
- Updated every C steps
- Provides stable targets
- Prevents oscillations
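Two common ways to refresh the target network. The original DQN uses the periodic hard copy; the soft (Polyak) variant shown for comparison is used by some later algorithms, and the value of tau here is purely illustrative:

```python
# Hard update, as in the original DQN: copy weights every C steps.
def hard_update(target_net, q_net):
    target_net.load_state_dict(q_net.state_dict())

# Soft (Polyak) update, used by some later variants instead of the
# periodic copy: blend a small fraction tau of the online weights in
# at every step.
def soft_update(target_net, q_net, tau: float = 0.005):
    for t_param, param in zip(target_net.parameters(), q_net.parameters()):
        t_param.data.mul_(1.0 - tau).add_(tau * param.data)
```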
Improvements and Variants
Double DQN
Addresses overestimation bias by decoupling action selection and evaluation:
y = r + γ Q(s', argmax_a' Q(s',a';θ); θ⁻)
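A sketch of the Double DQN target, assuming q_net and target_net like the earlier sketches; compare it with the max-based target in the update function above:

```python
import torch

# Double DQN target: the online network selects the action, the target
# network evaluates it, which reduces overestimation bias.
def double_dqn_targets(q_net, target_net, rewards, next_states, dones, gamma=0.99):
    with torch.no_grad():
        best_actions = q_net(next_states).argmax(dim=1, keepdim=True)
        next_q = target_net(next_states).gather(1, best_actions).squeeze(1)
        return rewards + gamma * next_q * (1.0 - dones)
```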
Dueling DQN
Separate value and advantage streams: Q(s,a) = V(s) + A(s,a) - mean(A)
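A sketch of a dueling head that could sit on top of the convolutional features above; the layer sizes are illustrative:

```python
import torch
import torch.nn as nn

# Dueling head: shared features split into a scalar value stream V(s)
# and an advantage stream A(s, a), recombined with the mean advantage
# subtracted so the decomposition is identifiable.
class DuelingHead(nn.Module):
    def __init__(self, feature_dim: int, n_actions: int, hidden: int = 512):
        super().__init__()
        self.value = nn.Sequential(
            nn.Linear(feature_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.advantage = nn.Sequential(
            nn.Linear(feature_dim, hidden), nn.ReLU(), nn.Linear(hidden, n_actions))

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        v = self.value(features)                    # (batch, 1)
        a = self.advantage(features)                # (batch, n_actions)
        return v + a - a.mean(dim=1, keepdim=True)  # Q(s, a)
```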
Prioritized Replay
Sample transitions based on TD error magnitude, focusing on surprising experiences.
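A minimal sketch of proportional prioritization. Real implementations typically use a sum-tree for efficient sampling; the exponents alpha and beta shown here are the usual prioritization and bias-correction parameters, with illustrative default values:

```python
import numpy as np

# Proportional prioritized sampling: sampling probability grows with
# TD-error magnitude, and importance weights correct the resulting bias.
def sample_indices(td_errors: np.ndarray, batch_size: int,
                   alpha: float = 0.6, beta: float = 0.4, eps: float = 1e-6):
    priorities = (np.abs(td_errors) + eps) ** alpha
    probs = priorities / priorities.sum()
    idx = np.random.choice(len(td_errors), size=batch_size, p=probs)
    weights = (len(td_errors) * probs[idx]) ** (-beta)
    weights /= weights.max()   # normalize weights for stability
    return idx, weights
```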
Rainbow DQN
Combines all improvements: Double, Dueling, Prioritized, Multi-step, Distributional, Noisy.
Applications and Impact
- Game Playing: Human-level play on many Atari games; later deep RL systems built on these ideas reached superhuman performance in Go and StarCraft
- Robotics: Manipulation, locomotion, drone control
- Resource Management: Data center cooling, traffic control
- Finance: Portfolio optimization, trading strategies
- Healthcare: Treatment planning, drug discovery
Why DQN Matters
- Breakthrough: First deep RL agent to reach human-level play on a broad suite of Atari games directly from pixels
- General Purpose: Same algorithm works across many domains
- Sample Efficient: Experience replay reuses data effectively
- Stable Training: Target networks prevent divergence
- Foundation: Spawned numerous improvements and variants