Q-Learning & SARSA
Model-Free Reinforcement Learning
Q-Learning is a model-free algorithm that learns the optimal action-value function Q* directly from experience, without knowing the environment's dynamics. It's one of the most important algorithms in RL, forming the foundation for modern deep RL methods.
The Q-Function
Q*(s,a) is the expected cumulative discounted reward of taking action a in state s and then following the optimal policy thereafter:
Q*(s,a) = E[R_{t+1} + γ max_a' Q*(s',a') | S_t = s, A_t = a]
Once we have Q*, the optimal policy is simply: π*(s) = argmax_a Q*(s,a)
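For instance, with a tabular Q stored as a NumPy array (a standalone illustration; the state and action counts below are made up), extracting the greedy policy is a one-liner:

```python
import numpy as np

# Hypothetical tabular setup: 5 states, 4 actions (illustrative only)
n_states, n_actions = 5, 4
Q = np.random.rand(n_states, n_actions)   # stand-in for a learned Q-table

# Optimal policy extraction: pi*(s) = argmax_a Q*(s, a)
pi = Q.argmax(axis=1)                      # one greedy action index per state
```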
Q-Learning in Action
Watch as the agent learns Q-values through exploration. Arrows show learned Q-values for each action. The agent uses ε-greedy exploration.
Observations:
- Q-values propagate backwards from rewards
- Exploration (ε) decreases over time
- Optimal path emerges as Q-values converge
- Negative Q-values appear near the pit
The Q-Learning Algorithm
- Initialize Q(s,a) arbitrarily (often to 0)
- For each episode, repeat at every step until the episode terminates:
  - Choose an action using the ε-greedy policy
  - Take the action; observe the reward and next state
  - Update the Q-value using the update rule below
Update Rule:
Q(s,a) ← Q(s,a) + α[r + γ max_a' Q(s',a') - Q(s,a)]
Where α is the learning rate and γ is the discount factor.
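As a concrete reference, here is a minimal sketch of tabular Q-learning with ε-greedy exploration. The `env` object and its Gym-style `reset()`/`step()` interface are assumptions for illustration, not the demo's actual code:

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, eps=1.0, eps_decay=0.995, eps_min=0.05):
    """Tabular Q-learning sketch; `env` is assumed to expose Gym-style
    reset() -> state and step(a) -> (next_state, reward, done, info)."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # ε-greedy action selection
            if np.random.rand() < eps:
                a = np.random.randint(n_actions)
            else:
                a = int(Q[s].argmax())
            s_next, r, done, _ = env.step(a)
            # Q-learning update: bootstrap from the best next action (off-policy)
            td_target = r + gamma * Q[s_next].max() * (not done)
            Q[s, a] += alpha * (td_target - Q[s, a])
            s = s_next
        eps = max(eps_min, eps * eps_decay)  # decay exploration over time
    return Q
```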
Q-Learning vs SARSA
Q-Learning (Off-Policy)
- Updates using max Q(s',a')
- Learns optimal policy
- Can be more unstable
- Better final performance
SARSA (On-Policy)
- Updates using actual next action
- Learns policy being followed
- More stable learning
- Safer for online learning
The key difference: Q-Learning uses the maximum Q-value of the next state (off-policy), while SARSA uses the Q-value of the actual next action taken (on-policy).
Q-Learning
Always updates toward the value of the best possible next action, even if the agent doesn't actually take it. This can lead to faster convergence but may be overoptimistic.
SARSA
Updates based on the actual policy being followed, including exploration. More conservative and stable, especially important for risky environments.
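To see the difference in code, here is a minimal side-by-side sketch of the two targets for a single transition (all values below are made-up placeholders):

```python
import numpy as np

# Toy Q-table and one hypothetical transition (illustrative values only)
Q = np.zeros((5, 4))                          # 5 states, 4 actions
alpha, gamma = 0.1, 0.99
s, a, r, s_next, a_next = 0, 1, -1.0, 2, 3    # a_next = action actually taken in s_next

# Q-learning (off-policy): bootstrap from the greedy action in s_next
q_learning_target = r + gamma * Q[s_next].max()

# SARSA (on-policy): bootstrap from the action the agent actually takes in s_next
sarsa_target = r + gamma * Q[s_next, a_next]

# Both use the same incremental update toward their respective targets
Q[s, a] += alpha * (q_learning_target - Q[s, a])
```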
Exploration vs Exploitation
ε-Greedy Policy
With probability ε, take a random action (explore). Otherwise, take the best known action (exploit). ε typically decays over time.
Other Strategies
- Boltzmann (Softmax) Exploration: Sample actions with probability proportional to exp(Q/τ); see the sketch below
- UCB (Upper Confidence Bound): Act optimistically in the face of uncertainty by adding an exploration bonus to Q
- Thompson Sampling: A Bayesian approach that samples from a posterior over values
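A minimal sketch of Boltzmann (softmax) exploration, assuming a tabular Q and a hand-picked temperature `tau`:

```python
import numpy as np

def boltzmann_action(q_values, tau=1.0):
    """Sample an action with probability proportional to exp(Q / tau).
    Higher tau -> more uniform (exploratory); lower tau -> closer to greedy."""
    prefs = q_values / tau
    prefs -= prefs.max()                          # subtract max for numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return np.random.choice(len(q_values), p=probs)

# Usage: pick an action for a state from a toy Q-table row
Q_row = np.array([0.1, 0.5, -0.2, 0.0])
a = boltzmann_action(Q_row, tau=0.5)
```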
Convergence & Challenges
Convergence Conditions
- All state-action pairs must be visited infinitely often
- The learning rate must decay appropriately (Σ_t α_t = ∞ and Σ_t α_t² < ∞)
- The environment must be stationary
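One textbook schedule that meets the learning-rate condition is α_t(s,a) = 1/N(s,a), where N counts visits to the pair; a small sketch (the table sizes here are illustrative):

```python
import numpy as np

n_states, n_actions = 5, 4
visit_counts = np.zeros((n_states, n_actions))

def learning_rate(s, a):
    """alpha_t = 1 / N(s,a): the sum of the alphas diverges while the sum of
    their squares converges, as the convergence conditions require."""
    visit_counts[s, a] += 1
    return 1.0 / visit_counts[s, a]
```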
Common Challenges
- Large State Spaces: Need function approximation
- Continuous Actions: Discretization or policy gradient methods
- Sample Efficiency: Requires many interactions
- Overestimation Bias: The max operator can overestimate values (see the sketch below)
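The overestimation point can be seen numerically: even when every action is truly worth zero, the max over noisy estimates averages well above zero (a standalone illustration with synthetic noise):

```python
import numpy as np

rng = np.random.default_rng(0)
true_q = np.zeros(4)                              # all 4 actions are truly worth 0

# Noisy Q estimates, e.g. from limited samples per action
noisy_q = true_q + rng.normal(scale=1.0, size=(10000, 4))

# Roughly 1.03 > 0: E[max(estimates)] exceeds max(true values)
print(noisy_q.max(axis=1).mean())
```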
Why Q-Learning Matters
- Foundation: Basis for DQN and modern deep RL
- Simplicity: Easy to understand and implement
- Model-Free: No need to know environment dynamics
- Optimal: Converges to the optimal policy in the tabular case, given the convergence conditions above
- Practical: Works well in many real applications