Value Functions & Bellman Equations

The Heart of RL

Value functions estimate how good it is to be in a given state (or state-action pair). The Bellman equations provide a recursive relationship that these values must satisfy, forming the foundation for most RL algorithms.

Types of Value Functions

State Value Function V(s)

Expected return starting from state s and following policy π:

V^π(s) = E_π[G_t | S_t = s] = E_π[Σ_{k=0}^∞ γ^k R_{t+k+1} | S_t = s]

Action Value Function Q(s,a)

Expected return starting from state s, taking action a, then following policy π:

Q^π(s,a) = E_π[G_t | S_t = s, A_t = a]

Relationship

V^π(s) = Σ_a π(a|s) Q^π(s,a)
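
This identity is just an expectation over actions under π. A minimal numpy sketch, assuming a tabular setting where Q and pi are small arrays made up purely for illustration:

```python
import numpy as np

# Assumed shapes: Q is (num_states, num_actions) with Q^pi(s, a),
# pi is (num_states, num_actions) with pi(a|s); each row of pi sums to 1.
Q = np.array([[1.0, 2.0],
              [0.5, 0.0]])
pi = np.array([[0.5, 0.5],
               [1.0, 0.0]])

# V^pi(s) = sum_a pi(a|s) * Q^pi(s, a): a row-wise expectation over actions
V = np.sum(pi * Q, axis=1)
print(V)  # -> [1.5 0.5]
```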

State Value Function Visualization

In the grid-world visualization, colors represent the state values V(s); the optimal action in each cell can be derived directly from these values.

Key Insights:

  • States closer to the goal have higher values
  • States near the pit have lower values
  • The optimal policy follows the value gradient
  • Walls block value propagation

The Bellman Equations

Bellman Expectation Equations

For a given policy π:

V^π(s) = Σ_a π(a|s) Σ_s' P(s'|s,a)[R(s,a,s') + γV^π(s')]
Q^π(s,a) = Σ_s' P(s'|s,a)[R(s,a,s') + γ Σ_a' π(a'|s')Q^π(s',a')]
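
Both equations translate directly into one-step backups over a known model. A minimal sketch, assuming a tabular MDP stored in hypothetical arrays P[s, a, s'] (transition probabilities), R[s, a, s'] (rewards), pi[s, a] (the policy), and a scalar discount gamma:

```python
import numpy as np

def bellman_expectation_backup(V, P, R, pi, gamma):
    """One sweep of V^pi(s) = sum_a pi(a|s) sum_s' P(s'|s,a)[R(s,a,s') + gamma V^pi(s')]."""
    # Expected one-step return for every (s, a): sum over s' of P * (R + gamma * V)
    q = np.einsum('san,san->sa', P, R + gamma * V[None, None, :])
    # Average over actions according to the policy to get the updated V^pi
    return np.sum(pi * q, axis=1)

def q_from_v(V, P, R, gamma):
    """Q^pi(s,a) = sum_s' P(s'|s,a)[R(s,a,s') + gamma sum_a' pi(a'|s') Q^pi(s',a')],
    written here using the identity V^pi(s') = sum_a' pi(a'|s') Q^pi(s',a')."""
    return np.einsum('san,san->sa', P, R + gamma * V[None, None, :])
```

Iterating bellman_expectation_backup until the values stop changing is exactly iterative policy evaluation.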

Bellman Optimality Equations

For the optimal policy π*:

V*(s) = max_a Σ_s' P(s'|s,a)[R(s,a,s') + γV*(s')]
Q*(s,a) = Σ_s' P(s'|s,a)[R(s,a,s') + γ max_a' Q*(s',a')]
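
These are the same backups with the policy-weighted average over actions replaced by a max. A minimal single-state sketch, using the same assumed tabular arrays P[s, a, s'] and R[s, a, s']:

```python
import numpy as np

def optimal_state_backup(s, V, P, R, gamma):
    """V*(s) = max_a sum_s' P(s'|s,a)[R(s,a,s') + gamma V*(s')]."""
    # One-step lookahead value of each action from state s
    q_s = np.einsum('an,an->a', P[s], R[s] + gamma * V[None, :])
    return np.max(q_s)

def optimal_q_backup(s, a, Q, P, R, gamma):
    """Q*(s,a) = sum_s' P(s'|s,a)[R(s,a,s') + gamma max_a' Q*(s',a')]."""
    return np.sum(P[s, a] * (R[s, a] + gamma * np.max(Q, axis=1)))
```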

Bellman Backup Operation

The Bellman equation computes a state's value by looking ahead one step and backing up the values of the possible successor states.

The Backup Process:

  1. Start at current state s
  2. Consider all possible actions
  3. For each action, consider all possible next states
  4. Weight outcomes by transition probabilities
  5. Add immediate rewards and discounted future values
  6. Take the maximum over all actions (for V*); a worked numeric example follows
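
To make the six steps concrete, here is a worked backup from a hypothetical two-state, two-action MDP; all numbers are made up for illustration:

```python
import numpy as np

gamma = 0.9
V = np.array([0.0, 10.0])          # current value estimates for states 0 and 1

# Steps 1-2: from state s = 0, consider actions a = 0 and a = 1.
# Steps 3-4: each action has a distribution over next states (transition probabilities).
P = np.array([[0.8, 0.2],          # P(s'|s=0, a=0)
              [0.1, 0.9]])         # P(s'|s=0, a=1)
R = np.array([[0.0, 1.0],          # R(s=0, a=0, s')
              [0.0, 1.0]])         # R(s=0, a=1, s')

# Step 5: immediate reward plus discounted future value, weighted by P.
q0 = np.sum(P[0] * (R[0] + gamma * V))   # action 0: 0.8*(0+0) + 0.2*(1+9) = 2.0
q1 = np.sum(P[1] * (R[1] + gamma * V))   # action 1: 0.1*(0+0) + 0.9*(1+9) = 9.0

# Step 6: take the maximum over actions to get the backed-up V*(0).
print(max(q0, q1))                       # -> 9.0
```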

Solving the Bellman Equations

Dynamic Programming

  • Policy Evaluation: Compute V^π for a given policy
  • Policy Improvement: Improve policy using current value function
  • Policy Iteration: Alternate evaluation and improvement
  • Value Iteration: Combine evaluation and improvement into a single backup (sketched below)
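
As a sketch of the last item, value iteration simply repeats the optimality backup over all states until the values stop changing; the arrays P[s, a, s'] and R[s, a, s'] are again assumed rather than given on this page:

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-8):
    """Repeatedly apply V(s) <- max_a sum_s' P(s'|s,a)[R(s,a,s') + gamma V(s')]."""
    V = np.zeros(P.shape[0])
    while True:
        # One full sweep of the Bellman optimality backup over all states
        q = np.einsum('san,san->sa', P, R + gamma * V[None, None, :])
        V_new = q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:   # stop when the sup-norm change is tiny
            return V_new, q.argmax(axis=1)    # values and a greedy policy
        V = V_new
```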

Sample-Based Methods

  • Monte Carlo: Average complete returns
  • TD Learning: Bootstrap from current estimates
  • Q-Learning: Off-policy TD control (update rule sketched below)
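
When the transition probabilities and rewards are not known, these backups are approximated from sampled transitions. A minimal sketch of the tabular Q-learning update; the learning rate alpha and the exploration scheme that generates (s, a, r, s') are assumptions, not specified here:

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha, gamma):
    """Q(s,a) <- Q(s,a) + alpha [r + gamma max_a' Q(s',a') - Q(s,a)].

    A sample-based, off-policy approximation of the Bellman optimality
    backup: the max over next actions replaces the average over the model.
    """
    td_target = r + gamma * np.max(Q[s_next])      # bootstrapped one-step target
    td_error = td_target - Q[s, a]                 # temporal-difference error
    Q[s, a] += alpha * td_error
    return Q
```

The analogous TD(0) update for state values bootstraps from V(s_next) instead of the max over next actions.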

Policy Iteration Algorithm

Policy iteration alternates between evaluating the current policy and improving it greedily with respect to the computed values.
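
A minimal sketch of that alternation, again assuming hypothetical tabular arrays P[s, a, s'] and R[s, a, s'] and a known discount gamma:

```python
import numpy as np

def policy_iteration(P, R, gamma, eval_tol=1e-8):
    num_states = P.shape[0]
    policy = np.zeros(num_states, dtype=int)        # start from an arbitrary policy
    while True:
        # Policy evaluation: iterate the expectation backup for the current policy
        V = np.zeros(num_states)
        while True:
            q = np.einsum('san,san->sa', P, R + gamma * V[None, None, :])
            V_new = q[np.arange(num_states), policy]
            done = np.max(np.abs(V_new - V)) < eval_tol
            V = V_new
            if done:
                break
        # Policy improvement: act greedily with respect to the evaluated values
        q = np.einsum('san,san->sa', P, R + gamma * V[None, None, :])
        new_policy = q.argmax(axis=1)
        if np.array_equal(new_policy, policy):      # stable policy => optimal
            return V, policy
        policy = new_policy
```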

Observations:

  • Policy starts random but quickly improves
  • Values propagate backward from the goal
  • Convergence typically occurs within 5-10 iterations
  • The final policy is optimal for this MDP

Convergence Properties

Contraction Mapping

The Bellman operators are γ-contractions in the sup norm (∞-norm), guaranteeing convergence to a unique fixed point: the true value function.

Error Bounds

||V_{k+1} - V*||_∞ ≤ γ ||V_k - V*||_∞
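
Unrolling this bound over k iterations gives the geometric rate noted below; a short derivation under the stated contraction property:

```latex
\|V_k - V^*\|_\infty
  \le \gamma \,\|V_{k-1} - V^*\|_\infty
  \le \gamma^2 \,\|V_{k-2} - V^*\|_\infty
  \le \cdots
  \le \gamma^k \,\|V_0 - V^*\|_\infty
```

So reaching accuracy ε requires on the order of log(||V_0 - V*||_∞ / ε) / log(1/γ) iterations, which is why a smaller discount factor γ converges faster.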

Practical Implications

  • Geometric convergence rate
  • Discount factor affects speed
  • Initial values don't affect the final result

Why Value Functions Matter

  • Decision Making: Optimal actions follow from optimal values
  • Credit Assignment: Propagate rewards to earlier decisions
  • Generalization: Function approximation for large state spaces
  • Transfer Learning: Value knowledge transfers across similar tasks
  • Theoretical Foundation: Convergence guarantees and optimality