Markov Decision Processes (MDPs)
Foundation of Reinforcement Learning
MDPs provide the mathematical framework for modeling decision-making in situations where outcomes are partly random and partly under the control of an agent. They form the theoretical foundation for virtually all reinforcement learning algorithms.
Core Components
- States (S): All possible situations the agent can be in
- Actions (A): All possible moves the agent can make
- Transition Function P(s'|s,a): Probability of reaching state s' from state s via action a
- Reward Function R(s,a,s'): Immediate reward for transitions
- Discount Factor (γ): How much to value future rewards (0 ≤ γ ≤ 1)
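To make these components concrete, here is a minimal Python sketch of a toy two-state MDP. The state names, action names, probabilities, and rewards are illustrative assumptions, not part of any standard problem.

```python
# A minimal sketch of the five MDP components for a toy two-state problem.
# All names and numbers below are illustrative.

states = ["s0", "s1"]
actions = ["stay", "move"]

# P[(s, a)] maps each possible next state to its probability (sums to 1).
P = {
    ("s0", "stay"): {"s0": 0.9, "s1": 0.1},
    ("s0", "move"): {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "move"): {"s0": 0.7, "s1": 0.3},
}

# R[(s, a, s')] is the immediate reward for that transition;
# unlisted transitions give 0.0 via R.get(..., 0.0).
R = {
    ("s0", "move", "s1"): 1.0,
    ("s1", "move", "s0"): -0.5,
}

gamma = 0.95  # discount factor, 0 <= gamma <= 1
```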
MDP Grid World
Example: the agent must navigate a grid to reach the green goal (+1) while avoiding the red pit (-1). Each step costs -0.04, which encourages finding the shortest path.
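A minimal sketch of this grid world's step and reward logic. Only the +1 goal, -1 pit, and -0.04 step cost come from the description above; the grid size, cell coordinates, and deterministic moves are assumptions.

```python
# Hypothetical cell coordinates for the goal and pit.
GOAL, PIT = (3, 2), (3, 1)
MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}

def step(state, action, width=4, height=3):
    """Apply one move; return (next_state, reward, done)."""
    dx, dy = MOVES[action]
    x = min(max(state[0] + dx, 0), width - 1)   # clip to grid bounds
    y = min(max(state[1] + dy, 0), height - 1)
    next_state = (x, y)
    if next_state == GOAL:
        return next_state, 1.0, True
    if next_state == PIT:
        return next_state, -1.0, True
    return next_state, -0.04, False             # per-step cost

# Example episode fragment
s = (0, 0)
s, r, done = step(s, "right")
print(s, r, done)   # (1, 0) -0.04 False
```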
The Markov Property
The key assumption: the future depends only on the current state, not on the history of how we got there. Mathematically:
P(s_{t+1} | s_t, a_t, s_{t-1}, a_{t-1}, ...) = P(s_{t+1} | s_t, a_t)
This memoryless property makes MDPs computationally tractable while still being expressive enough for many real-world problems.
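One way to see the memoryless property in action is to simulate a small Markov chain and check that the empirical distribution of the next state, given the current state, does not change with the state that came before it. The two-state chain below is an illustrative assumption.

```python
import random
from collections import defaultdict

# P[s][s'] for an assumed 2-state Markov chain.
P = {0: [0.7, 0.3], 1: [0.4, 0.6]}
random.seed(0)

# counts[(s_prev, s_curr)] tallies how often each next state follows.
counts = defaultdict(lambda: [0, 0])
s_prev, s_curr = 0, 0
for _ in range(200_000):
    s_next = random.choices([0, 1], weights=P[s_curr])[0]
    counts[(s_prev, s_curr)][s_next] += 1
    s_prev, s_curr = s_curr, s_next

# For a fixed current state, the estimate of P(next=1) should be
# (approximately) the same no matter which state preceded it.
for (sp, sc), c in sorted(counts.items()):
    print(f"prev={sp} curr={sc}: P(next=1) ~= {c[1] / sum(c):.3f}")
```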
Policies and Goals
Policy (π)
A policy defines the agent's behavior: which action to take in each state. It can be deterministic, π(s) = a, or stochastic, where π(a|s) ∈ [0, 1] gives the probability of taking action a in state s.
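A small sketch of the two policy forms, using illustrative state and action names: a deterministic policy is a direct lookup table, while a stochastic policy stores a distribution per state and is sampled from.

```python
import random

# Deterministic policy: a direct mapping pi(s) = a
pi_det = {"s0": "move", "s1": "stay"}

# Stochastic policy: pi(a|s) is a probability distribution over actions
pi_stoch = {
    "s0": {"stay": 0.2, "move": 0.8},
    "s1": {"stay": 0.5, "move": 0.5},
}

def sample_action(pi, s):
    """Draw an action from a stochastic policy pi(a|s)."""
    actions, probs = zip(*pi[s].items())
    return random.choices(actions, weights=probs)[0]

print(pi_det["s0"])                    # 'move'
print(sample_action(pi_stoch, "s0"))   # 'move' with probability 0.8
```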
Objective
Find the optimal policy π* that maximizes expected cumulative reward:
G_t = R_{t+1} + γR_{t+2} + γ²R_{t+3} + ... = Σ_{k=0}^{∞} γ^k R_{t+k+1}
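A minimal sketch of computing G_t for a finite reward sequence; the reward values and γ below are illustrative.

```python
def discounted_return(rewards, gamma):
    """G_t = sum_k gamma^k * R_{t+k+1} for a finite reward sequence."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Rewards R_{t+1}, R_{t+2}, ... observed after time t
rewards = [1.0, 0.0, 0.0, 2.0]
print(discounted_return(rewards, gamma=0.9))   # 1.0 + 0.9**3 * 2.0 = 2.458
```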
Stochastic Transition Dynamics
Key Insight:
Actions don't guarantee outcomes - they define probability distributions over next states. This stochasticity is what makes MDPs challenging and realistic.
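A short sketch of this idea: sampling the same state-action pair repeatedly can land the agent in different next states. The slip probabilities below are illustrative.

```python
import random

# The intended move succeeds most of the time, but the agent can slip.
P = {
    ("s0", "right"): {"s0": 0.1, "s1": 0.8, "s2": 0.1},
}

def sample_next_state(s, a):
    """Sample s' from the transition distribution P(s'|s,a)."""
    next_states, probs = zip(*P[(s, a)].items())
    return random.choices(next_states, weights=probs)[0]

# The same (state, action) pair can yield different outcomes across calls.
print([sample_next_state("s0", "right") for _ in range(5)])
```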
Types of MDPs
Finite MDP
Finite number of states and actions - most common in theoretical analysis.
Continuous MDP
Continuous state/action spaces - common in robotics and control.
Partially Observable MDP (POMDP)
Agent doesn't fully observe the state - adds belief states and observations.
Multi-Agent MDP
Multiple agents with potentially conflicting objectives.
Solution Methods
- Dynamic Programming: When the model is known (value/policy iteration); a value-iteration sketch follows this list
- Monte Carlo: Learn from complete episodes
- Temporal Difference: Learn from partial episodes (Q-learning, SARSA)
- Policy Gradient: Directly optimize the policy
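As referenced above, here is a sketch of value iteration for the dynamic-programming case, assuming the transition and reward tables are fully known. The toy two-state MDP reuses the illustrative example from the Core Components section.

```python
# Value iteration on an assumed, fully known toy MDP.
states = ["s0", "s1"]
actions = ["stay", "move"]
P = {
    ("s0", "stay"): {"s0": 0.9, "s1": 0.1},
    ("s0", "move"): {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "move"): {"s0": 0.7, "s1": 0.3},
}
R = {("s0", "move", "s1"): 1.0, ("s1", "move", "s0"): -0.5}
gamma = 0.95

V = {s: 0.0 for s in states}
for _ in range(1000):
    # Bellman optimality backup: best expected one-step return + discounted value.
    V_new = {
        s: max(
            sum(p * (R.get((s, a, s2), 0.0) + gamma * V[s2])
                for s2, p in P[(s, a)].items())
            for a in actions
        )
        for s in states
    }
    if max(abs(V_new[s] - V[s]) for s in states) < 1e-8:  # convergence check
        V = V_new
        break
    V = V_new

# Extract a greedy policy from the converged values.
pi = {
    s: max(actions, key=lambda a: sum(p * (R.get((s, a, s2), 0.0) + gamma * V[s2])
                                      for s2, p in P[(s, a)].items()))
    for s in states
}
print(V, pi)
```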
Discount Factor Impact
The discount factor γ determines how much the agent values future rewards: γ = 0 means only the immediate reward matters, while γ = 1 weights all future rewards equally.
Low γ (≈ 0)
Myopic agent, focuses on immediate rewards
High γ (≈ 1)
Far-sighted agent, values future rewards
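A small sketch comparing how γ trades off an immediate reward against a delayed one; the reward sequences below are illustrative.

```python
def discounted_sum(rewards, gamma):
    """Discounted sum of a finite reward sequence."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

immediate = [1.0] + [0.0] * 9   # small reward right now
delayed = [0.0] * 9 + [10.0]    # big reward ten steps away

for gamma in (0.1, 0.5, 0.99):
    print(f"gamma={gamma}: immediate={discounted_sum(immediate, gamma):.3f}, "
          f"delayed={discounted_sum(delayed, gamma):.3f}")

# A myopic agent (gamma near 0) prefers the immediate reward;
# a far-sighted agent (gamma near 1) prefers the larger delayed reward.
```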
Why MDPs Matter
- General Framework: Can model a wide range of sequential decision problems
- Optimal Solutions: For finite MDPs, an optimal policy exists and can be computed in principle
- Foundation: All modern RL algorithms build on MDP theory
- Real Applications: Robotics, game AI, resource management, finance
- Bridge to Deep RL: Neural networks approximate MDP components