Multi-Layer Perceptrons (MLPs)

Introduction

Multi-Layer Perceptrons (MLPs) are the foundation of deep learning. By stacking multiple layers of neurons, they can learn complex, non-linear decision boundaries that a single perceptron cannot represent, such as the famous XOR problem.
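
A minimal sketch of this idea in numpy (not tied to the interactive demo later on this page): a tiny 2-2-1 MLP with hand-picked weights computes XOR, which no single-layer perceptron can represent. The weights below are illustrative choices, not the result of training.

  import numpy as np

  # A 2-2-1 MLP with hand-picked (untrained) weights that computes XOR.
  # Columns of W1 are the two hidden units; both compute x1 + x2, but the
  # second is shifted down by its bias, and the output subtracts it twice.
  def relu(z):
      return np.maximum(0.0, z)

  W1 = np.array([[1.0, 1.0],
                 [1.0, 1.0]])
  b1 = np.array([0.0, -1.0])
  W2 = np.array([1.0, -2.0])
  b2 = 0.0

  def mlp_xor(x):
      h = relu(x @ W1 + b1)          # hidden layer supplies the non-linearity
      return h @ W2 + b2             # output > 0.5 is treated as class 1

  for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
      y = mlp_xor(np.array(x, dtype=float))
      print(x, "->", int(y > 0.5))   # prints 0, 1, 1, 0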

Universal Approximation

The Universal Approximation Theorem states that an MLP with a single hidden layer and a suitable non-linear activation can approximate any continuous function on a compact domain to arbitrary accuracy, given enough hidden neurons.
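
One way to build intuition for this (a sketch of the idea, not a proof): fix a single wide tanh hidden layer with random weights and fit only the output weights by least squares; with enough hidden units the network tracks a target such as sin(x) closely. The layer width, weight scale, and target function below are arbitrary choices.

  import numpy as np

  # One tanh hidden layer with fixed random weights; only the output
  # weights are fitted (by least squares). Width, scales, and the target
  # sin(x) are arbitrary illustrative choices.
  rng = np.random.default_rng(0)
  n_hidden = 200

  x = np.linspace(-np.pi, np.pi, 500)[:, None]    # inputs, shape (500, 1)
  y = np.sin(x).ravel()                           # target function values

  W = rng.normal(scale=2.0, size=(1, n_hidden))   # random input-to-hidden weights
  b = rng.uniform(-np.pi, np.pi, size=n_hidden)   # random hidden biases
  H = np.tanh(x @ W + b)                          # hidden activations, (500, n_hidden)

  # Solve for hidden-to-output weights that best reproduce y.
  w_out, *_ = np.linalg.lstsq(H, y, rcond=None)
  y_hat = H @ w_out

  print("max abs error:", np.max(np.abs(y_hat - y)))  # shrinks as n_hidden grows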

Key Components:

  • Hidden Layers: Transform inputs into increasingly abstract representations
  • Activation Functions: Introduce non-linearity (ReLU, Sigmoid, Tanh)
  • Backpropagation: Algorithm to train the network by propagating errors backward

Interactive MLP Trainer

Explore how MLPs solve the XOR problem and approximate functions in the interactive trainer, which provides the following panels:

  • Decision Boundary: click to add a blue point (class 1); shift+click to add a red point (class 0)
  • Network Architecture: visualization of the network with activation intensities
  • Network Configuration & Training Controls: stochastic gradient descent settings
  • Data & Presets
  • Activation Function: plot of the selected activation function
  • Loss Curve: shows training progress

Backpropagation Algorithm

Backpropagation trains MLPs by computing gradients of the loss function with respect to each weight:

  1. Forward Pass: Compute predictions by propagating inputs forward through layers
  2. Loss Calculation: Compare predictions with true labels
  3. Backward Pass: Compute gradients by propagating errors backward
  4. Weight Update: Update each weight using gradient descent: w ← w − η ∂L/∂w (see the sketch after this list)
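
Below is a minimal numpy sketch of these four steps for a small sigmoid MLP trained on XOR with mean squared error; the layer sizes, learning rate, and iteration count are illustrative choices rather than recommendations.

  import numpy as np

  # Train a 2-4-1 sigmoid MLP on XOR with full-batch gradient descent.
  rng = np.random.default_rng(0)

  X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
  y = np.array([[0], [1], [1], [0]], dtype=float)

  W1 = rng.normal(size=(2, 4)); b1 = np.zeros(4)
  W2 = rng.normal(size=(4, 1)); b2 = np.zeros(1)
  eta = 0.5                                # learning rate

  def sigmoid(z):
      return 1.0 / (1.0 + np.exp(-z))

  for step in range(10000):
      # 1. Forward pass
      z1 = X @ W1 + b1; a1 = sigmoid(z1)
      z2 = a1 @ W2 + b2; a2 = sigmoid(z2)

      # 2. Loss calculation (mean squared error)
      loss = np.mean((a2 - y) ** 2)

      # 3. Backward pass (chain rule, layer by layer)
      d_a2 = 2 * (a2 - y) / len(X)         # ∂L/∂a2
      d_z2 = d_a2 * a2 * (1 - a2)          # ∂L/∂z2 via the sigmoid derivative
      d_W2 = a1.T @ d_z2; d_b2 = d_z2.sum(axis=0)
      d_a1 = d_z2 @ W2.T                   # error propagated to the hidden layer
      d_z1 = d_a1 * a1 * (1 - a1)
      d_W1 = X.T @ d_z1; d_b1 = d_z1.sum(axis=0)

      # 4. Weight update: w <- w - eta * ∂L/∂w
      W1 -= eta * d_W1; b1 -= eta * d_b1
      W2 -= eta * d_W2; b2 -= eta * d_b2

  print("final loss:", loss)
  print("predictions:", a2.round(2).ravel())   # should approach [0, 1, 1, 0]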

The Chain Rule

Backpropagation uses the chain rule of calculus to compute gradients efficiently:

∂L/∂w = ∂L/∂a × ∂a/∂z × ∂z/∂w

Where L is the loss, a = f(z) is the activation, z = w·x + b is the pre-activation, and w is the weight.
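
A quick numeric sanity check of this factorization for a single sigmoid neuron with squared loss (the input, target, and weight values below are arbitrary): the product of the three factors should match a finite-difference estimate of ∂L/∂w.

  import numpy as np

  # Chain rule for one sigmoid neuron with squared loss, checked against
  # a finite-difference estimate. The constants are arbitrary.
  x, target = 1.5, 1.0
  w, b = 0.3, -0.1

  def loss_of(w_):
      z = w_ * x + b                  # pre-activation
      a = 1.0 / (1.0 + np.exp(-z))    # activation
      return (a - target) ** 2        # loss

  z = w * x + b
  a = 1.0 / (1.0 + np.exp(-z))

  dL_da = 2 * (a - target)            # ∂L/∂a
  da_dz = a * (1 - a)                 # ∂a/∂z (sigmoid derivative)
  dz_dw = x                           # ∂z/∂w
  grad = dL_da * da_dz * dz_dw        # ∂L/∂w by the chain rule

  eps = 1e-6
  grad_fd = (loss_of(w + eps) - loss_of(w - eps)) / (2 * eps)
  print(grad, grad_fd)                # the two values should agree closely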

Activation Functions

ReLU

f(x) = max(0, x)

Most popular choice; mitigates vanishing gradients and produces sparse activations

Sigmoid

f(x) = 1/(1+e^(-x))

Output between 0 and 1, smooth, but saturates and suffers from vanishing gradients

Tanh

f(x) = tanh(x)

Output between -1 and 1; zero-centered, which generally makes optimization easier than with sigmoid, though it still saturates for large inputs
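
For reference, a small numpy sketch of the three activations and their derivatives; the shapes of the derivatives are what drive the vanishing-gradient behaviour noted above.

  import numpy as np

  # The three activations above and their derivatives, written as plain
  # functions of the pre-activation x.
  def relu(x):
      return np.maximum(0.0, x)

  def relu_grad(x):
      return (x > 0).astype(float)     # 0 for x < 0, 1 for x > 0

  def sigmoid(x):
      return 1.0 / (1.0 + np.exp(-x))

  def sigmoid_grad(x):
      s = sigmoid(x)
      return s * (1 - s)               # never exceeds 0.25, hence vanishing gradients

  def tanh_grad(x):
      return 1.0 - np.tanh(x) ** 2     # peaks at 1.0 around x = 0

  x = np.linspace(-4, 4, 9)
  print(relu(x), sigmoid(x), np.tanh(x), sep="\n")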

Key Takeaways

  • MLPs can solve non-linearly separable problems that single perceptrons cannot
  • Hidden layers learn hierarchical feature representations
  • Backpropagation enables efficient training of deep networks through the chain rule
  • Choice of activation function significantly impacts learning dynamics
  • Weight initialization affects convergence speed and stability
  • Batch size influences the trade-off between gradient accuracy and computation
  • MLPs are the foundation for more complex architectures like CNNs and RNNs
  • Understanding MLPs is crucial for building modern deep learning models

Next Steps