Q-Learning & SARSA
Model-Free Reinforcement Learning
Q-Learning is a model-free algorithm that learns the optimal action-value function Q* directly from experience, without knowing the environment's dynamics. It's one of the most important algorithms in RL, forming the foundation for modern deep RL methods.
The Q-Function
Q*(s,a) is the expected cumulative discounted reward of taking action a in state s and then following the optimal policy thereafter:
Q*(s,a) = E[R_{t+1} + γ max_a' Q*(s',a') | S_t = s, A_t = a]
Once we have Q*, the optimal policy is simply: π*(s) = argmax_a Q*(s,a)
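For instance, with a tabular Q stored as a NumPy array (a standalone illustration; the state and action counts below are made up), extracting the greedy policy is a one-liner:

```python
import numpy as np

# Hypothetical tabular setup: 5 states, 4 actions (illustrative only)
n_states, n_actions = 5, 4
Q = np.random.rand(n_states, n_actions)   # stand-in for a learned Q-table

# Optimal policy extraction: pi*(s) = argmax_a Q*(s, a)
pi = Q.argmax(axis=1)                      # one greedy action index per state
```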
Q-Learning in Action
Watch as the agent learns Q-values through exploration. Arrows show learned Q-values for each action. The agent uses ε-greedy exploration.
Observations:
- Q-values propagate backwards from rewards
- Exploration (ε) decreases over time
- Optimal path emerges as Q-values converge
- Negative Q-values appear near the pit
The Q-Learning Algorithm
- Initialize Q(s,a) arbitrarily (often to 0)
- For each episode, repeat at every step until the episode terminates:
  - Choose an action using the ε-greedy policy
  - Take the action; observe the reward and next state
  - Update the Q-value using the update rule below
Update Rule:
Q(s,a) ← Q(s,a) + α[r + γ max_a' Q(s',a') - Q(s,a)]
Where α is the learning rate and γ is the discount factor.
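As a concrete reference, here is a minimal sketch of tabular Q-learning with ε-greedy exploration. The `env` object and its Gym-style `reset()`/`step()` interface are assumptions for illustration, not the demo's actual code:

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, eps=1.0, eps_decay=0.995, eps_min=0.05):
    """Tabular Q-learning sketch; `env` is assumed to expose Gym-style
    reset() -> state and step(a) -> (next_state, reward, done, info)."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # ε-greedy action selection
            if np.random.rand() < eps:
                a = np.random.randint(n_actions)
            else:
                a = int(Q[s].argmax())
            s_next, r, done, _ = env.step(a)
            # Q-learning update: bootstrap from the best next action (off-policy)
            td_target = r + gamma * Q[s_next].max() * (not done)
            Q[s, a] += alpha * (td_target - Q[s, a])
            s = s_next
        eps = max(eps_min, eps * eps_decay)  # decay exploration over time
    return Q
```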
Q-Learning vs SARSA
Q-Learning (Off-Policy)
- Updates using max Q(s',a')
- Learns optimal policy
- Can be more unstable
- Better final performance
SARSA (On-Policy)
- Updates using actual next action
- Learns policy being followed
- More stable learning
- Safer for online learning
The key difference: Q-Learning uses the maximum Q-value of the next state (off-policy), while SARSA uses the Q-value of the actual next action taken (on-policy).
Q-Learning
Always updates toward the value of the best possible next action, even if the agent doesn't actually take it. This can lead to faster convergence but may be overoptimistic.
SARSA
Updates based on the actual policy being followed, including exploration. More conservative and stable, especially important for risky environments.
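To see the difference in code, here is a minimal side-by-side sketch of the two targets for a single transition (all values below are made-up placeholders):

```python
import numpy as np

# Toy Q-table and one hypothetical transition (illustrative values only)
Q = np.zeros((5, 4))                          # 5 states, 4 actions
alpha, gamma = 0.1, 0.99
s, a, r, s_next, a_next = 0, 1, -1.0, 2, 3    # a_next = action actually taken in s_next

# Q-learning (off-policy): bootstrap from the greedy action in s_next
q_learning_target = r + gamma * Q[s_next].max()

# SARSA (on-policy): bootstrap from the action the agent actually takes in s_next
sarsa_target = r + gamma * Q[s_next, a_next]

# Both use the same incremental update toward their respective targets
Q[s, a] += alpha * (q_learning_target - Q[s, a])
```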
Exploration vs Exploitation
ε-Greedy Policy
With probability ε, take a random action (explore). Otherwise, take the best known action (exploit). ε typically decays over time.
Other Strategies
- Boltzmann (Softmax) Exploration: Sample actions with probability proportional to exp(Q/τ); see the sketch below
- UCB (Upper Confidence Bound): Act optimistically in the face of uncertainty by adding an exploration bonus to Q
- Thompson Sampling: A Bayesian approach that samples from a posterior over values
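A minimal sketch of Boltzmann (softmax) exploration, assuming a tabular Q and a hand-picked temperature `tau`:

```python
import numpy as np

def boltzmann_action(q_values, tau=1.0):
    """Sample an action with probability proportional to exp(Q / tau).
    Higher tau -> more uniform (exploratory); lower tau -> closer to greedy."""
    prefs = q_values / tau
    prefs -= prefs.max()                          # subtract max for numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return np.random.choice(len(q_values), p=probs)

# Usage: pick an action for a state from a toy Q-table row
Q_row = np.array([0.1, 0.5, -0.2, 0.0])
a = boltzmann_action(Q_row, tau=0.5)
```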
Convergence & Challenges
Convergence Conditions
- All state-action pairs must be visited infinitely often
- The learning rate must decay appropriately (Σ_t α_t = ∞ and Σ_t α_t² < ∞)
- The environment must be stationary
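One textbook schedule that meets the learning-rate condition is α_t(s,a) = 1/N(s,a), where N counts visits to the pair; a small sketch (the table sizes here are illustrative):

```python
import numpy as np

n_states, n_actions = 5, 4
visit_counts = np.zeros((n_states, n_actions))

def learning_rate(s, a):
    """alpha_t = 1 / N(s,a): the sum of the alphas diverges while the sum of
    their squares converges, as the convergence conditions require."""
    visit_counts[s, a] += 1
    return 1.0 / visit_counts[s, a]
```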
Common Challenges
- Large State Spaces: Need function approximation
- Continuous Actions: Discretization or policy gradient methods
- Sample Efficiency: Requires many interactions
- Overestimation Bias: The max operator can overestimate values (see the sketch below)
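The overestimation point can be seen numerically: even when every action is truly worth zero, the max over noisy estimates averages well above zero (a standalone illustration with synthetic noise):

```python
import numpy as np

rng = np.random.default_rng(0)
true_q = np.zeros(4)                              # all 4 actions are truly worth 0

# Noisy Q estimates, e.g. from limited samples per action
noisy_q = true_q + rng.normal(scale=1.0, size=(10000, 4))

# Roughly 1.03 > 0: E[max(estimates)] exceeds max(true values)
print(noisy_q.max(axis=1).mean())
```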
Why Q-Learning Matters
- Foundation: Basis for DQN and modern deep RL
- Simplicity: Easy to understand and implement
- Model-Free: No need to know environment dynamics
- Optimal: Converges to the optimal policy in the tabular case, given the convergence conditions above
- Practical: Works well in many real applications