Value Functions & Bellman Equations
The Heart of RL
Value functions estimate how good it is to be in a given state (or state-action pair). The Bellman equations provide a recursive relationship that these values must satisfy, forming the foundation for most RL algorithms.
Types of Value Functions
State Value Function V(s)
Expected return starting from state s and following policy π:
V^π(s) = E_π[G_t | S_t = s] = E_π[Σ_{k=0}^∞ γ^k R_{t+k+1} | S_t = s]
Action Value Function Q(s,a)
Expected return starting from state s, taking action a, then following policy π:
Q^π(s,a) = E_π[G_t | S_t = s, A_t = a]
Relationship
V^π(s) = Σ_a π(a|s) Q^π(s,a)
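As a quick sanity check, this relationship can be computed directly from tabular arrays. A minimal sketch (the array shapes and values below are purely illustrative, not from the original):

```python
import numpy as np

# Illustrative tables for a 3-state, 2-action problem (values made up).
policy = np.array([[0.5, 0.5],   # pi(a|s), one row per state
                   [0.9, 0.1],
                   [0.2, 0.8]])
Q = np.array([[1.0, 3.0],        # Q^pi(s, a)
              [0.0, 2.0],
              [4.0, 4.0]])

# V^pi(s) = sum_a pi(a|s) * Q^pi(s, a), evaluated row by row.
V = (policy * Q).sum(axis=1)
print(V)   # values: 2.0, 0.2, 4.0
```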
State Value Function Visualization
Colors represent the state values V(s). Click a cell to see its exact value, and toggle the policy overlay to see the optimal actions derived from those values.
Key Insights:
- States closer to the goal have higher values
- States near the pit have lower values
- The optimal policy follows the value gradient
- Walls block value propagation
The Bellman Equations
Bellman Expectation Equations
For a given policy π:
V^π(s) = Σ_a π(a|s) Σ_s' P(s'|s,a)[R(s,a,s') + γV^π(s')]
Q^π(s,a) = Σ_s' P(s'|s,a)[R(s,a,s') + γ Σ_a' π(a'|s')Q^π(s',a')]
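Turning the expectation equation into code gives iterative policy evaluation: sweep over all states, replacing V(s) with the right-hand side until the values stop changing. A minimal tabular sketch, assuming the MDP is supplied as arrays P[s, a, s'] and R[s, a, s'] (these names and shapes are assumptions for illustration):

```python
import numpy as np

def policy_evaluation(P, R, policy, gamma=0.9, tol=1e-8):
    """Iteratively apply the Bellman expectation backup until convergence.

    P[s, a, s']  : transition probabilities
    R[s, a, s']  : immediate rewards
    policy[s, a] : action probabilities pi(a|s)
    """
    V = np.zeros(P.shape[0])
    while True:
        # Q_pi(s,a) = sum_s' P(s'|s,a) * (R(s,a,s') + gamma * V(s'))
        Q = (P * (R + gamma * V)).sum(axis=2)
        # V_pi(s) = sum_a pi(a|s) * Q_pi(s,a)
        V_new = (policy * Q).sum(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```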
Bellman Optimality Equations
For the optimal policy π*:
V*(s) = max_a Σ_s' P(s'|s,a)[R(s,a,s') + γV*(s')]
Q*(s,a) = Σ_s' P(s'|s,a)[R(s,a,s') + γ max_a' Q*(s',a')]
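One immediate consequence of the optimality equations is that V*(s) = max_a Q*(s,a), so an optimal policy can be read off by acting greedily with respect to Q*. A small sketch, assuming a tabular Q* array:

```python
import numpy as np

def greedy_from_q(Q_star):
    """Recover V* and a greedy (optimal) policy from a tabular Q*[s, a]."""
    V_star = Q_star.max(axis=1)       # V*(s) = max_a Q*(s, a)
    pi_star = Q_star.argmax(axis=1)   # pi*(s) = argmax_a Q*(s, a)
    return V_star, pi_star
```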
Bellman Backup Operation
See how the Bellman equation computes state values by looking ahead one step and backing up the values.
The Backup Process:
- Start at current state s
- Consider all possible actions
- For each action, consider all possible next states
- Weight outcomes by transition probabilities
- Add immediate rewards and discounted future values
- Take maximum over all actions (for V*)
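These steps translate almost line for line into a one-state optimality backup. A minimal sketch, again assuming the MDP is given as arrays P[s, a, s'] and R[s, a, s']:

```python
import numpy as np

def bellman_optimality_backup(s, P, R, V, gamma=0.9):
    """One-step lookahead from state s, backing up the best action value."""
    action_values = []
    for a in range(P.shape[1]):                  # consider every action
        q_sa = 0.0
        for s_next in range(P.shape[2]):         # consider every possible next state
            # weight by transition probability, add reward + discounted future value
            q_sa += P[s, a, s_next] * (R[s, a, s_next] + gamma * V[s_next])
        action_values.append(q_sa)
    return max(action_values)                    # max over actions gives the V* backup
```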
Solving the Bellman Equations
Dynamic Programming
- Policy Evaluation: Compute V^π for a given policy
- Policy Improvement: Improve policy using current value function
- Policy Iteration: Alternate evaluation and improvement
- Value Iteration: Collapse evaluation and improvement into a single sweep of optimality backups (sketched after this list)
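Value iteration is the shortest of these to write down: apply the optimality backup to every state on every sweep, and only extract a policy at the end. A minimal sketch under the same assumed P/R array layout:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Repeatedly apply the Bellman optimality backup to all states."""
    V = np.zeros(P.shape[0])
    while True:
        # Q(s,a) = sum_s' P(s'|s,a) * (R(s,a,s') + gamma * V(s'))
        Q = (P * (R + gamma * V)).sum(axis=2)
        V_new = Q.max(axis=1)                  # V(s) <- max_a Q(s,a)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)     # values and greedy policy
        V = V_new
```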
Sample-Based Methods
- Monte Carlo: Average complete returns
- TD Learning: Bootstrap from current estimates
- Q-Learning: Off-policy TD control (see the update sketches after this list)
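These methods replace the summations over the (possibly unknown) transition model with sampled transitions (s, a, r, s'). A minimal sketch of the TD(0) and Q-learning updates, with step size and discount chosen purely for illustration:

```python
import numpy as np

def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """On-policy TD(0) prediction: move V(s) toward the bootstrapped target."""
    V[s] += alpha * (r + gamma * V[s_next] - V[s])
    return V

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Off-policy TD control: bootstrap from the best next-action estimate."""
    td_target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q
```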
Policy Iteration Algorithm
Watch how policy iteration alternates between evaluating the current policy and improving it based on the computed values; a compact tabular sketch follows the observations below.
Observations:
- The policy starts out random but improves quickly
- Values propagate backward from the goal
- Convergence typically occurs within 5-10 iterations
- The final policy is optimal for this MDP
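The loop shown above can be written compactly for a tabular MDP. A minimal, self-contained sketch under the same assumed P[s, a, s'] and R[s, a, s'] layout (evaluation is done iteratively to a small tolerance rather than by solving the linear system exactly):

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9, tol=1e-8):
    """Alternate policy evaluation and greedy improvement until the policy is stable."""
    n_states, n_actions = P.shape[0], P.shape[1]
    policy = np.ones((n_states, n_actions)) / n_actions   # start from the uniform random policy
    V = np.zeros(n_states)
    while True:
        # Policy evaluation: iterate the expectation backup for the current policy.
        while True:
            Q = (P * (R + gamma * V)).sum(axis=2)          # Q(s,a) under the current V
            V_new = (policy * Q).sum(axis=1)
            delta = np.max(np.abs(V_new - V))
            V = V_new
            if delta < tol:
                break
        # Policy improvement: act greedily with respect to the evaluated values.
        Q = (P * (R + gamma * V)).sum(axis=2)
        greedy = Q.argmax(axis=1)
        new_policy = np.eye(n_actions)[greedy]             # deterministic greedy policy
        if np.array_equal(new_policy, policy):
            return V, greedy                               # policy stable: optimal
        policy = new_policy
```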
Convergence Properties
Contraction Mapping
The Bellman operators are γ-contractions in the max norm, so repeated application converges to a unique fixed point: the true value function.
Error Bounds
||V_{k+1} - V*|| ≤ γ ||V_k - V*||
Practical Implications
- Convergence is geometric: the error shrinks by a factor of at least γ per sweep
- A discount factor closer to 1 slows convergence
- Initial values don't affect the fixed point, only how many sweeps are needed (see the numeric check below)
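The contraction bound is easy to check numerically on the smallest possible MDP: one state, one action, reward 1, and discount γ, whose fixed point is V* = 1/(1 - γ). A toy sketch (this example MDP is an illustration, not from the original):

```python
gamma = 0.9
v_star = 1.0 / (1.0 - gamma)        # analytic fixed point: V* = 1 + gamma * V*
v = 0.0                             # arbitrary starting value
prev_error = abs(v - v_star)
for k in range(1, 6):
    v = 1.0 + gamma * v             # Bellman backup
    error = abs(v - v_star)
    # The ratio printed in the last column is exactly gamma at every iteration.
    print(k, round(error, 4), round(error / prev_error, 4))
    prev_error = error
```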
Why Value Functions Matter
- Decision Making: Optimal actions follow from optimal values
- Credit Assignment: Propagate rewards to earlier decisions
- Generalization: Function approximation for large state spaces
- Transfer Learning: Value knowledge transfers across similar tasks
- Theoretical Foundation: Convergence guarantees and optimality