Markov Decision Processes (MDPs)

Foundation of Reinforcement Learning

MDPs provide the mathematical framework for modeling decision-making in situations where outcomes are partly random and partly under the control of an agent. They form the theoretical foundation for virtually all reinforcement learning algorithms.

Core Components

  • States (S): All possible situations the agent can be in
  • Actions (A): All possible moves the agent can make
  • Transition Function P(s'|s,a): Probability of reaching state s' from state s via action a
  • Reward Function R(s,a,s'): Immediate reward for transitions
  • Discount Factor (γ): How much to value future rewards (0 ≤ γ ≤ 1)
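To make these components concrete, here is a tiny finite MDP written out as plain Python dictionaries; the states, actions, probabilities, and rewards below are made up purely for illustration.

    # A tiny finite MDP written out explicitly (all numbers are illustrative).
    # P[(s, a)] maps each next state s' to its probability P(s'|s,a);
    # R[(s, a, s')] gives the immediate reward for that transition.

    states = ["s0", "s1", "terminal"]
    actions = ["stay", "go"]

    P = {
        ("s0", "stay"): {"s0": 0.9, "s1": 0.1},
        ("s0", "go"):   {"s1": 0.8, "s0": 0.2},
        ("s1", "stay"): {"s1": 1.0},
        ("s1", "go"):   {"terminal": 1.0},
    }

    R = {
        ("s1", "go", "terminal"): 1.0,   # reaching the terminal state pays off
        # transitions not listed here are taken to have reward 0.0
    }

    gamma = 0.9  # discount factor, 0 <= gamma <= 1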

MDP Grid World

A classic example is a small grid world. The agent starts in one cell and must reach the goal cell (reward +1) while avoiding the pit (reward -1); every other step costs -0.04, which encourages finding a short path. An episode ends when the agent reaches the goal or the pit, and its total reward is the sum of the rewards collected along the way.
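A minimal environment for this kind of grid world might look like the sketch below. Only the +1 goal, -1 pit, and -0.04 step cost come from the description above; the 4×3 layout, the blocked cell, and the 0.8/0.1/0.1 slip probabilities are assumptions borrowed from the classic textbook version of the example.

    import random

    # Sketch of a 4x3 grid world in the spirit of the example above.
    GOAL, PIT, WALL = (3, 2), (3, 1), (1, 1)
    STEP_COST = -0.04
    MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}

    def step(state, action):
        """Apply one action. The intended move succeeds with probability 0.8;
        otherwise the agent slips to one of the two perpendicular directions."""
        slips = ["left", "right"] if action in ("up", "down") else ["up", "down"]
        actual = random.choices([action] + slips, weights=[0.8, 0.1, 0.1])[0]
        dx, dy = MOVES[actual]
        nx, ny = state[0] + dx, state[1] + dy
        if not (0 <= nx < 4 and 0 <= ny < 3) or (nx, ny) == WALL:
            nx, ny = state  # bumping into a border or the wall leaves the agent in place
        if (nx, ny) == GOAL:
            return (nx, ny), 1.0, True     # reached the goal: +1, episode ends
        if (nx, ny) == PIT:
            return (nx, ny), -1.0, True    # fell into the pit: -1, episode ends
        return (nx, ny), STEP_COST, False  # ordinary move: small step cost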

The Markov Property

The key assumption: the future depends only on the current state, not on the history of how we got there. Mathematically:

P(s_t+1 | s_t, a_t, s_t-1, a_t-1, ...) = P(s_t+1 | s_t, a_t)

This memoryless property makes MDPs computationally tractable while still being expressive enough for many real-world problems.

Policies and Goals

Policy (π)

A policy defines the agent's behavior - what action to take in each state. It can be deterministic, mapping each state to a single action (π(s) = a), or stochastic, assigning each action a probability in each state (π(a|s) ∈ [0, 1]).
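Both kinds of policy can be represented as plain mappings; the sketch below uses hypothetical state and action names.

    import random

    # Deterministic: each state maps to exactly one action, pi(s) = a
    deterministic_policy = {"s0": "go", "s1": "go"}

    # Stochastic: each state maps to a probability distribution over actions,
    # pi(a|s) in [0, 1]
    stochastic_policy = {
        "s0": {"stay": 0.3, "go": 0.7},
        "s1": {"stay": 0.1, "go": 0.9},
    }

    def act(policy, state):
        """Look the action up directly for a deterministic policy,
        sample from the action distribution for a stochastic one."""
        choice = policy[state]
        if isinstance(choice, str):
            return choice
        actions, probs = zip(*choice.items())
        return random.choices(actions, weights=probs)[0]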

Objective

Find the optimal policy π* that maximizes expected cumulative reward:

G_t = R_t+1 + γR_t+2 + γ²R_t+3 + ... = Σ_{k=0}^{∞} γ^k R_t+k+1
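A quick worked example of this sum, using an arbitrary reward sequence and γ = 0.9:

    # Worked example of the return G_t for an arbitrary reward sequence.
    # With gamma = 0.9 and rewards [1, 0, 0, 1]:
    #   G_t = 1 + 0.9*0 + 0.81*0 + 0.729*1 = 1.729

    def discounted_return(rewards, gamma):
        return sum(gamma ** k * r for k, r in enumerate(rewards))

    print(discounted_return([1, 0, 0, 1], 0.9))  # ~1.729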

Stochastic Transition Dynamics

For a given state s and action a, the transition function P(s'|s,a) assigns a probability to each possible next state s'.

Key Insight:

Actions don't guarantee outcomes - they define probability distributions over next states. This stochasticity is what makes MDPs challenging and realistic.
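A short sketch of that idea: drawing repeatedly from an assumed transition distribution for a single (state, action) pair produces different next states on different tries.

    import random

    # An assumed distribution P(s'|s,a) for one (state, action) pair
    transition = {"intended_cell": 0.8, "slip_up": 0.1, "slip_down": 0.1}

    def sample_next_state(dist):
        next_states, probs = zip(*dist.items())
        return random.choices(next_states, weights=probs)[0]

    # The same action, repeated from the same state, yields varying outcomes
    print([sample_next_state(transition) for _ in range(10)])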

Types of MDPs

Finite MDP

Finite number of states and actions - most common in theoretical analysis.

Continuous MDP

Continuous state/action spaces - common in robotics and control.

Partially Observable MDP (POMDP)

Agent doesn't fully observe the state - adds belief states and observations.

Multi-Agent MDP

Multiple agents with potentially conflicting objectives.

Solution Methods

  • Dynamic Programming: When the model is known (value/policy iteration; see the sketch after this list)
  • Monte Carlo: Learn from complete episodes
  • Temporal Difference: Learn from partial episodes (Q-learning, SARSA)
  • Policy Gradient: Directly optimize the policy
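As one concrete instance of the dynamic-programming approach, here is a minimal value-iteration sketch for an MDP stored as dictionaries in the style of the earlier example; the convergence tolerance is an arbitrary choice.

    # Minimal value iteration over a finite MDP given as dictionaries
    # P[(s, a)] -> {s': prob} and R[(s, a, s')] -> reward.

    def value_iteration(states, actions, P, R, gamma, theta=1e-6):
        V = {s: 0.0 for s in states}
        while True:
            delta = 0.0
            for s in states:
                available = [a for a in actions if (s, a) in P]
                if not available:
                    continue  # terminal state: no outgoing transitions, value stays 0
                best = max(
                    sum(p * (R.get((s, a, s2), 0.0) + gamma * V[s2])
                        for s2, p in P[(s, a)].items())
                    for a in available
                )
                delta = max(delta, abs(best - V[s]))
                V[s] = best
            if delta < theta:
                return V  # values have stopped changing: V is (approximately) optimal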

Discount Factor Impact

The discount factor γ determines how much the agent values future rewards. With γ = 0, only immediate rewards matter; as γ approaches 1, future rewards count almost as much as immediate ones (γ = 1 itself is only well-defined when episodes terminate).

Low γ (≈ 0)

Myopic agent, focuses on immediate rewards

High γ (≈ 1)

Far-sighted agent, values future rewards
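A small numerical illustration: the present value of a single reward of 1.0 that arrives 10 steps in the future, under several (arbitrary) discount factors.

    # Present value of a reward of 1.0 received 10 steps from now
    def discounted_value(reward, delay, gamma):
        return gamma ** delay * reward

    for gamma in (0.0, 0.5, 0.9, 0.99):
        print(f"gamma={gamma}: worth {discounted_value(1.0, 10, gamma):.4f} now")
    # gamma=0.0 ignores the future entirely; gamma=0.99 values it at about 0.90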

Why MDPs Matter

  • Universal Framework: Provides a common model for a wide range of sequential decision problems
  • Optimal Solutions: For a finite MDP with a known model, an optimal policy exists and can be computed exactly
  • Foundation: Virtually all modern RL algorithms build on MDP theory
  • Real Applications: Robotics, game AI, resource management, finance
  • Bridge to Deep RL: Neural networks approximate MDP components