Value Functions & Bellman Equations
The Heart of RL
Value functions estimate how good it is to be in a given state (or state-action pair). The Bellman equations provide a recursive relationship that these values must satisfy, forming the foundation for most RL algorithms.
Types of Value Functions
State Value Function V(s)
Expected return starting from state s and following policy π:
V^π(s) = E_π[G_t | S_t = s] = E_π[Σ_{k=0}^∞ γ^k R_{t+k+1} | S_t = s]
Action Value Function Q(s,a)
Expected return starting from state s, taking action a, then following policy π:
Q^π(s,a) = E_π[G_t | S_t = s, A_t = a]
Relationship
V^π(s) = Σ_a π(a|s) Q^π(s,a)
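As a quick sanity check, this relationship can be computed directly from tabular arrays. A minimal sketch (the array shapes and values below are purely illustrative, not from the original):

```python
import numpy as np

# Illustrative tables for a 3-state, 2-action problem (values made up).
policy = np.array([[0.5, 0.5],   # pi(a|s), one row per state
                   [0.9, 0.1],
                   [0.2, 0.8]])
Q = np.array([[1.0, 3.0],        # Q^pi(s, a)
              [0.0, 2.0],
              [4.0, 4.0]])

# V^pi(s) = sum_a pi(a|s) * Q^pi(s, a), evaluated row by row.
V = (policy * Q).sum(axis=1)
print(V)   # values: 2.0, 0.2, 4.0
```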
State Value Function Visualization
Colors represent the state values V(s). Click a cell to see its exact value, and toggle the policy overlay to see the optimal actions derived from those values.
Key Insights:
- States closer to the goal have higher values
- States near the pit have lower values
- The optimal policy follows the value gradient
- Walls block value propagation
The Bellman Equations
Bellman Expectation Equations
For a given policy π:
V^π(s) = Σ_a π(a|s) Σ_s' P(s'|s,a)[R(s,a,s') + γV^π(s')]
Q^π(s,a) = Σ_s' P(s'|s,a)[R(s,a,s') + γ Σ_a' π(a'|s')Q^π(s',a')]
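Turning the expectation equation into code gives iterative policy evaluation: sweep over all states, replacing V(s) with the right-hand side until the values stop changing. A minimal tabular sketch, assuming the MDP is supplied as arrays P[s, a, s'] and R[s, a, s'] (these names and shapes are assumptions for illustration):

```python
import numpy as np

def policy_evaluation(P, R, policy, gamma=0.9, tol=1e-8):
    """Iteratively apply the Bellman expectation backup until convergence.

    P[s, a, s']  : transition probabilities
    R[s, a, s']  : immediate rewards
    policy[s, a] : action probabilities pi(a|s)
    """
    V = np.zeros(P.shape[0])
    while True:
        # Q_pi(s,a) = sum_s' P(s'|s,a) * (R(s,a,s') + gamma * V(s'))
        Q = (P * (R + gamma * V)).sum(axis=2)
        # V_pi(s) = sum_a pi(a|s) * Q_pi(s,a)
        V_new = (policy * Q).sum(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```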
Bellman Optimality Equations
For the optimal policy π*:
V*(s) = max_a Σ_s' P(s'|s,a)[R(s,a,s') + γV*(s')]
Q*(s,a) = Σ_s' P(s'|s,a)[R(s,a,s') + γ max_a' Q*(s',a')]
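One immediate consequence of the optimality equations is that V*(s) = max_a Q*(s,a), so an optimal policy can be read off by acting greedily with respect to Q*. A small sketch, assuming a tabular Q* array:

```python
import numpy as np

def greedy_from_q(Q_star):
    """Recover V* and a greedy (optimal) policy from a tabular Q*[s, a]."""
    V_star = Q_star.max(axis=1)       # V*(s) = max_a Q*(s, a)
    pi_star = Q_star.argmax(axis=1)   # pi*(s) = argmax_a Q*(s, a)
    return V_star, pi_star
```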
Bellman Backup Operation
See how the Bellman equation computes state values by looking ahead one step and backing up the values.
The Backup Process:
- Start at current state s
- Consider all possible actions
- For each action, consider all possible next states
- Weight outcomes by transition probabilities
- Add immediate rewards and discounted future values
- Take maximum over all actions (for V*)
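These steps translate almost line for line into a one-state optimality backup. A minimal sketch, again assuming the MDP is given as arrays P[s, a, s'] and R[s, a, s']:

```python
import numpy as np

def bellman_optimality_backup(s, P, R, V, gamma=0.9):
    """One-step lookahead from state s, backing up the best action value."""
    action_values = []
    for a in range(P.shape[1]):                  # consider every action
        q_sa = 0.0
        for s_next in range(P.shape[2]):         # consider every possible next state
            # weight by transition probability, add reward + discounted future value
            q_sa += P[s, a, s_next] * (R[s, a, s_next] + gamma * V[s_next])
        action_values.append(q_sa)
    return max(action_values)                    # max over actions gives the V* backup
```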
Solving the Bellman Equations
Dynamic Programming
- Policy Evaluation: Compute V^π for a given policy
- Policy Improvement: Improve policy using current value function
- Policy Iteration: Alternate evaluation and improvement
- Value Iteration: Collapse evaluation and improvement into a single sweep of optimality backups (sketched after this list)
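Value iteration is the shortest of these to write down: apply the optimality backup to every state on every sweep, and only extract a policy at the end. A minimal sketch under the same assumed P/R array layout:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Repeatedly apply the Bellman optimality backup to all states."""
    V = np.zeros(P.shape[0])
    while True:
        # Q(s,a) = sum_s' P(s'|s,a) * (R(s,a,s') + gamma * V(s'))
        Q = (P * (R + gamma * V)).sum(axis=2)
        V_new = Q.max(axis=1)                  # V(s) <- max_a Q(s,a)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)     # values and greedy policy
        V = V_new
```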
Sample-Based Methods
- Monte Carlo: Average complete returns
- TD Learning: Bootstrap from current estimates
- Q-Learning: Off-policy TD control (see the update sketches after this list)
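These methods replace the summations over the (possibly unknown) transition model with sampled transitions (s, a, r, s'). A minimal sketch of the TD(0) and Q-learning updates, with step size and discount chosen purely for illustration:

```python
import numpy as np

def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """On-policy TD(0) prediction: move V(s) toward the bootstrapped target."""
    V[s] += alpha * (r + gamma * V[s_next] - V[s])
    return V

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Off-policy TD control: bootstrap from the best next-action estimate."""
    td_target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q
```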
Policy Iteration Algorithm
Watch how policy iteration alternates between evaluating the current policy and improving it based on the computed values; a compact tabular sketch follows the observations below.
Observations:
- The policy starts out random but improves quickly
- Values propagate backward from the goal
- Convergence typically occurs within 5-10 iterations
- The final policy is optimal for this MDP
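The loop shown above can be written compactly for a tabular MDP. A minimal, self-contained sketch under the same assumed P[s, a, s'] and R[s, a, s'] layout (evaluation is done iteratively to a small tolerance rather than by solving the linear system exactly):

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9, tol=1e-8):
    """Alternate policy evaluation and greedy improvement until the policy is stable."""
    n_states, n_actions = P.shape[0], P.shape[1]
    policy = np.ones((n_states, n_actions)) / n_actions   # start from the uniform random policy
    V = np.zeros(n_states)
    while True:
        # Policy evaluation: iterate the expectation backup for the current policy.
        while True:
            Q = (P * (R + gamma * V)).sum(axis=2)          # Q(s,a) under the current V
            V_new = (policy * Q).sum(axis=1)
            delta = np.max(np.abs(V_new - V))
            V = V_new
            if delta < tol:
                break
        # Policy improvement: act greedily with respect to the evaluated values.
        Q = (P * (R + gamma * V)).sum(axis=2)
        greedy = Q.argmax(axis=1)
        new_policy = np.eye(n_actions)[greedy]             # deterministic greedy policy
        if np.array_equal(new_policy, policy):
            return V, greedy                               # policy stable: optimal
        policy = new_policy
```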
Convergence Properties
Contraction Mapping
The Bellman operators are γ-contractions in the max norm, so repeated application converges to a unique fixed point: the true value function.
Error Bounds
||V_{k+1} - V*|| ≤ γ ||V_k - V*||
Practical Implications
- Convergence is geometric: the error shrinks by a factor of at least γ per sweep
- A discount factor closer to 1 slows convergence
- Initial values don't affect the fixed point, only how many sweeps are needed (see the numeric check below)
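The contraction bound is easy to check numerically on the smallest possible MDP: one state, one action, reward 1, and discount γ, whose fixed point is V* = 1/(1 - γ). A toy sketch (this example MDP is an illustration, not from the original):

```python
gamma = 0.9
v_star = 1.0 / (1.0 - gamma)        # analytic fixed point: V* = 1 + gamma * V*
v = 0.0                             # arbitrary starting value
prev_error = abs(v - v_star)
for k in range(1, 6):
    v = 1.0 + gamma * v             # Bellman backup
    error = abs(v - v_star)
    # The ratio printed in the last column is exactly gamma at every iteration.
    print(k, round(error, 4), round(error / prev_error, 4))
    prev_error = error
```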
Why Value Functions Matter
- Decision Making: Optimal actions follow from optimal values
- Credit Assignment: Propagate rewards to earlier decisions
- Generalization: Function approximation for large state spaces
- Transfer Learning: Value knowledge transfers across similar tasks
- Theoretical Foundation: Convergence guarantees and optimality