Markov Decision Processes (MDPs)

Foundation of Reinforcement Learning

MDPs provide the mathematical framework for modeling decision-making in situations where outcomes are partly random and partly under the control of an agent. They form the theoretical foundation for virtually all reinforcement learning algorithms.

Core Components

  • States (S): All possible situations the agent can be in
  • Actions (A): All possible moves the agent can make
  • Transition Function P(s'|s,a): Probability of reaching state s' from state s via action a
  • Reward Function R(s,a,s'): Immediate reward for transitions
  • Discount Factor (γ): How much to value future rewards (0 ≤ γ ≤ 1)
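To make these components concrete, here is a tiny finite MDP written out as plain Python dictionaries; the states, actions, probabilities, and rewards below are made up purely for illustration.

    # A tiny finite MDP written out explicitly (all numbers are illustrative).
    # P[(s, a)] maps each next state s' to its probability P(s'|s,a);
    # R[(s, a, s')] gives the immediate reward for that transition.

    states = ["s0", "s1", "terminal"]
    actions = ["stay", "go"]

    P = {
        ("s0", "stay"): {"s0": 0.9, "s1": 0.1},
        ("s0", "go"):   {"s1": 0.8, "s0": 0.2},
        ("s1", "stay"): {"s1": 1.0},
        ("s1", "go"):   {"terminal": 1.0},
    }

    R = {
        ("s1", "go", "terminal"): 1.0,   # reaching the terminal state pays off
        # transitions not listed here are taken to have reward 0.0
    }

    gamma = 0.9  # discount factor, 0 <= gamma <= 1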

MDP Grid World

A classic example is a small grid world. The agent starts in one cell and must reach the goal cell (reward +1) while avoiding the pit (reward -1); every other step costs -0.04, which encourages finding a short path. An episode ends when the agent reaches the goal or the pit, and its total reward is the sum of the rewards collected along the way.
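A minimal environment for this kind of grid world might look like the sketch below. Only the +1 goal, -1 pit, and -0.04 step cost come from the description above; the 4×3 layout, the blocked cell, and the 0.8/0.1/0.1 slip probabilities are assumptions borrowed from the classic textbook version of the example.

    import random

    # Sketch of a 4x3 grid world in the spirit of the example above.
    GOAL, PIT, WALL = (3, 2), (3, 1), (1, 1)
    STEP_COST = -0.04
    MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}

    def step(state, action):
        """Apply one action. The intended move succeeds with probability 0.8;
        otherwise the agent slips to one of the two perpendicular directions."""
        slips = ["left", "right"] if action in ("up", "down") else ["up", "down"]
        actual = random.choices([action] + slips, weights=[0.8, 0.1, 0.1])[0]
        dx, dy = MOVES[actual]
        nx, ny = state[0] + dx, state[1] + dy
        if not (0 <= nx < 4 and 0 <= ny < 3) or (nx, ny) == WALL:
            nx, ny = state  # bumping into a border or the wall leaves the agent in place
        if (nx, ny) == GOAL:
            return (nx, ny), 1.0, True     # reached the goal: +1, episode ends
        if (nx, ny) == PIT:
            return (nx, ny), -1.0, True    # fell into the pit: -1, episode ends
        return (nx, ny), STEP_COST, False  # ordinary move: small step cost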

The Markov Property

The key assumption: the future depends only on the current state, not on the history of how we got there. Mathematically:

P(s_t+1 | s_t, a_t, s_t-1, a_t-1, ...) = P(s_t+1 | s_t, a_t)

This memoryless property makes MDPs computationally tractable while still being expressive enough for many real-world problems.

Policies and Goals

Policy (π)

A policy defines the agent's behavior - what action to take in each state. It can be deterministic, mapping each state to a single action (π(s) = a), or stochastic, assigning each action a probability in each state (π(a|s) ∈ [0, 1]).
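Both kinds of policy can be represented as plain mappings; the sketch below uses hypothetical state and action names.

    import random

    # Deterministic: each state maps to exactly one action, pi(s) = a
    deterministic_policy = {"s0": "go", "s1": "go"}

    # Stochastic: each state maps to a probability distribution over actions,
    # pi(a|s) in [0, 1]
    stochastic_policy = {
        "s0": {"stay": 0.3, "go": 0.7},
        "s1": {"stay": 0.1, "go": 0.9},
    }

    def act(policy, state):
        """Look the action up directly for a deterministic policy,
        sample from the action distribution for a stochastic one."""
        choice = policy[state]
        if isinstance(choice, str):
            return choice
        actions, probs = zip(*choice.items())
        return random.choices(actions, weights=probs)[0]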

Objective

Find the optimal policy π* that maximizes expected cumulative reward:

G_t = R_t+1 + γR_t+2 + γ²R_t+3 + ... = Σ_{k=0}^{∞} γ^k R_t+k+1
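A quick worked example of this sum, using an arbitrary reward sequence and γ = 0.9:

    # Worked example of the return G_t for an arbitrary reward sequence.
    # With gamma = 0.9 and rewards [1, 0, 0, 1]:
    #   G_t = 1 + 0.9*0 + 0.81*0 + 0.729*1 = 1.729

    def discounted_return(rewards, gamma):
        return sum(gamma ** k * r for k, r in enumerate(rewards))

    print(discounted_return([1, 0, 0, 1], 0.9))  # ~1.729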

Stochastic Transition Dynamics

For a given state s and action a, the transition function P(s'|s,a) assigns a probability to each possible next state s'.

Key Insight:

Actions don't guarantee outcomes - they define probability distributions over next states. This stochasticity is what makes MDPs challenging and realistic.
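A short sketch of that idea: drawing repeatedly from an assumed transition distribution for a single (state, action) pair produces different next states on different tries.

    import random

    # An assumed distribution P(s'|s,a) for one (state, action) pair
    transition = {"intended_cell": 0.8, "slip_up": 0.1, "slip_down": 0.1}

    def sample_next_state(dist):
        next_states, probs = zip(*dist.items())
        return random.choices(next_states, weights=probs)[0]

    # The same action, repeated from the same state, yields varying outcomes
    print([sample_next_state(transition) for _ in range(10)])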

Types of MDPs

Finite MDP

Finite number of states and actions - most common in theoretical analysis.

Continuous MDP

Continuous state/action spaces - common in robotics and control.

Partially Observable MDP (POMDP)

Agent doesn't fully observe the state - adds belief states and observations.

Multi-Agent MDP

Multiple agents with potentially conflicting objectives.

Solution Methods

  • Dynamic Programming: When the model is known (value/policy iteration; see the sketch after this list)
  • Monte Carlo: Learn from complete episodes
  • Temporal Difference: Learn from partial episodes (Q-learning, SARSA)
  • Policy Gradient: Directly optimize the policy
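As one concrete instance of the dynamic-programming approach, here is a minimal value-iteration sketch for an MDP stored as dictionaries in the style of the earlier example; the convergence tolerance is an arbitrary choice.

    # Minimal value iteration over a finite MDP given as dictionaries
    # P[(s, a)] -> {s': prob} and R[(s, a, s')] -> reward.

    def value_iteration(states, actions, P, R, gamma, theta=1e-6):
        V = {s: 0.0 for s in states}
        while True:
            delta = 0.0
            for s in states:
                available = [a for a in actions if (s, a) in P]
                if not available:
                    continue  # terminal state: no outgoing transitions, value stays 0
                best = max(
                    sum(p * (R.get((s, a, s2), 0.0) + gamma * V[s2])
                        for s2, p in P[(s, a)].items())
                    for a in available
                )
                delta = max(delta, abs(best - V[s]))
                V[s] = best
            if delta < theta:
                return V  # values have stopped changing: V is (approximately) optimal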

Discount Factor Impact

The discount factor γ determines how much the agent values future rewards. With γ = 0, only immediate rewards matter; as γ approaches 1, future rewards count almost as much as immediate ones (γ = 1 itself is only well-defined when episodes terminate).

Low γ (≈ 0)

Myopic agent, focuses on immediate rewards

High γ (≈ 1)

Far-sighted agent, values future rewards
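A small numerical illustration: the present value of a single reward of 1.0 that arrives 10 steps in the future, under several (arbitrary) discount factors.

    # Present value of a reward of 1.0 received 10 steps from now
    def discounted_value(reward, delay, gamma):
        return gamma ** delay * reward

    for gamma in (0.0, 0.5, 0.9, 0.99):
        print(f"gamma={gamma}: worth {discounted_value(1.0, 10, gamma):.4f} now")
    # gamma=0.0 ignores the future entirely; gamma=0.99 values it at about 0.90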

Why MDPs Matter

  • Universal Framework: Provides a common model for a wide range of sequential decision problems
  • Optimal Solutions: For a finite MDP with a known model, an optimal policy exists and can be computed exactly
  • Foundation: Virtually all modern RL algorithms build on MDP theory
  • Real Applications: Robotics, game AI, resource management, finance
  • Bridge to Deep RL: Neural networks approximate MDP components