Multi-Layer Perceptrons (MLPs)

Introduction

Multi-Layer Perceptrons (MLPs) are the foundation of deep learning. By stacking multiple layers of neurons, they can learn complex, non-linear decision boundaries that a single perceptron cannot represent, such as the famous XOR problem.
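
A minimal sketch of this idea in numpy (not tied to the interactive demo later on this page): a tiny 2-2-1 MLP with hand-picked weights computes XOR, which no single-layer perceptron can represent. The weights below are illustrative choices, not the result of training.

  import numpy as np

  # A 2-2-1 MLP with hand-picked (untrained) weights that computes XOR.
  # Columns of W1 are the two hidden units; both compute x1 + x2, but the
  # second is shifted down by its bias, and the output subtracts it twice.
  def relu(z):
      return np.maximum(0.0, z)

  W1 = np.array([[1.0, 1.0],
                 [1.0, 1.0]])
  b1 = np.array([0.0, -1.0])
  W2 = np.array([1.0, -2.0])
  b2 = 0.0

  def mlp_xor(x):
      h = relu(x @ W1 + b1)          # hidden layer supplies the non-linearity
      return h @ W2 + b2             # output > 0.5 is treated as class 1

  for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
      y = mlp_xor(np.array(x, dtype=float))
      print(x, "->", int(y > 0.5))   # prints 0, 1, 1, 0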

Universal Approximation

The Universal Approximation Theorem states that an MLP with a single hidden layer and a suitable non-linear activation can approximate any continuous function on a compact domain to arbitrary accuracy, given enough hidden neurons.
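
One way to build intuition for this (a sketch of the idea, not a proof): fix a single wide tanh hidden layer with random weights and fit only the output weights by least squares; with enough hidden units the network tracks a target such as sin(x) closely. The layer width, weight scale, and target function below are arbitrary choices.

  import numpy as np

  # One tanh hidden layer with fixed random weights; only the output
  # weights are fitted (by least squares). Width, scales, and the target
  # sin(x) are arbitrary illustrative choices.
  rng = np.random.default_rng(0)
  n_hidden = 200

  x = np.linspace(-np.pi, np.pi, 500)[:, None]    # inputs, shape (500, 1)
  y = np.sin(x).ravel()                           # target function values

  W = rng.normal(scale=2.0, size=(1, n_hidden))   # random input-to-hidden weights
  b = rng.uniform(-np.pi, np.pi, size=n_hidden)   # random hidden biases
  H = np.tanh(x @ W + b)                          # hidden activations, (500, n_hidden)

  # Solve for hidden-to-output weights that best reproduce y.
  w_out, *_ = np.linalg.lstsq(H, y, rcond=None)
  y_hat = H @ w_out

  print("max abs error:", np.max(np.abs(y_hat - y)))  # shrinks as n_hidden grows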

Key Components:

  • Hidden Layers: Transform inputs into increasingly abstract representations
  • Activation Functions: Introduce non-linearity (ReLU, Sigmoid, Tanh)
  • Backpropagation: Algorithm to train the network by propagating errors backward

Interactive MLP Trainer

Explore how MLPs solve the XOR problem and approximate functions in the interactive trainer, which provides the following panels:

  • Decision Boundary: click to add a blue point (class 1); shift+click to add a red point (class 0)
  • Network Architecture: visualization of the network with activation intensities
  • Network Configuration & Training Controls: stochastic gradient descent settings
  • Data & Presets
  • Activation Function: plot of the selected activation function
  • Loss Curve: shows training progress

Backpropagation Algorithm

Backpropagation trains MLPs by computing gradients of the loss function with respect to each weight:

  1. Forward Pass: Compute predictions by propagating inputs forward through layers
  2. Loss Calculation: Compare predictions with true labels
  3. Backward Pass: Compute gradients by propagating errors backward
  4. Weight Update: Update each weight using gradient descent: w ← w − η ∂L/∂w (see the sketch after this list)
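
Below is a minimal numpy sketch of these four steps for a small sigmoid MLP trained on XOR with mean squared error; the layer sizes, learning rate, and iteration count are illustrative choices rather than recommendations.

  import numpy as np

  # Train a 2-4-1 sigmoid MLP on XOR with full-batch gradient descent.
  rng = np.random.default_rng(0)

  X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
  y = np.array([[0], [1], [1], [0]], dtype=float)

  W1 = rng.normal(size=(2, 4)); b1 = np.zeros(4)
  W2 = rng.normal(size=(4, 1)); b2 = np.zeros(1)
  eta = 0.5                                # learning rate

  def sigmoid(z):
      return 1.0 / (1.0 + np.exp(-z))

  for step in range(10000):
      # 1. Forward pass
      z1 = X @ W1 + b1; a1 = sigmoid(z1)
      z2 = a1 @ W2 + b2; a2 = sigmoid(z2)

      # 2. Loss calculation (mean squared error)
      loss = np.mean((a2 - y) ** 2)

      # 3. Backward pass (chain rule, layer by layer)
      d_a2 = 2 * (a2 - y) / len(X)         # ∂L/∂a2
      d_z2 = d_a2 * a2 * (1 - a2)          # ∂L/∂z2 via the sigmoid derivative
      d_W2 = a1.T @ d_z2; d_b2 = d_z2.sum(axis=0)
      d_a1 = d_z2 @ W2.T                   # error propagated to the hidden layer
      d_z1 = d_a1 * a1 * (1 - a1)
      d_W1 = X.T @ d_z1; d_b1 = d_z1.sum(axis=0)

      # 4. Weight update: w <- w - eta * ∂L/∂w
      W1 -= eta * d_W1; b1 -= eta * d_b1
      W2 -= eta * d_W2; b2 -= eta * d_b2

  print("final loss:", loss)
  print("predictions:", a2.round(2).ravel())   # should approach [0, 1, 1, 0]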

The Chain Rule

Backpropagation uses the chain rule of calculus to compute gradients efficiently:

∂L/∂w = ∂L/∂a × ∂a/∂z × ∂z/∂w

Where L is the loss, a = f(z) is the activation, z = w·x + b is the pre-activation, and w is the weight.
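
A quick numeric sanity check of this factorization for a single sigmoid neuron with squared loss (the input, target, and weight values below are arbitrary): the product of the three factors should match a finite-difference estimate of ∂L/∂w.

  import numpy as np

  # Chain rule for one sigmoid neuron with squared loss, checked against
  # a finite-difference estimate. The constants are arbitrary.
  x, target = 1.5, 1.0
  w, b = 0.3, -0.1

  def loss_of(w_):
      z = w_ * x + b                  # pre-activation
      a = 1.0 / (1.0 + np.exp(-z))    # activation
      return (a - target) ** 2        # loss

  z = w * x + b
  a = 1.0 / (1.0 + np.exp(-z))

  dL_da = 2 * (a - target)            # ∂L/∂a
  da_dz = a * (1 - a)                 # ∂a/∂z (sigmoid derivative)
  dz_dw = x                           # ∂z/∂w
  grad = dL_da * da_dz * dz_dw        # ∂L/∂w by the chain rule

  eps = 1e-6
  grad_fd = (loss_of(w + eps) - loss_of(w - eps)) / (2 * eps)
  print(grad, grad_fd)                # the two values should agree closely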

Activation Functions

ReLU

f(x) = max(0, x)

Most popular choice; mitigates vanishing gradients and produces sparse activations

Sigmoid

f(x) = 1/(1+e^(-x))

Output between 0 and 1, smooth, but saturates and suffers from vanishing gradients

Tanh

f(x) = tanh(x)

Output between -1 and 1; zero-centered, which generally makes optimization easier than with sigmoid, though it still saturates for large inputs
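
For reference, a small numpy sketch of the three activations and their derivatives; the shapes of the derivatives are what drive the vanishing-gradient behaviour noted above.

  import numpy as np

  # The three activations above and their derivatives, written as plain
  # functions of the pre-activation x.
  def relu(x):
      return np.maximum(0.0, x)

  def relu_grad(x):
      return (x > 0).astype(float)     # 0 for x < 0, 1 for x > 0

  def sigmoid(x):
      return 1.0 / (1.0 + np.exp(-x))

  def sigmoid_grad(x):
      s = sigmoid(x)
      return s * (1 - s)               # never exceeds 0.25, hence vanishing gradients

  def tanh_grad(x):
      return 1.0 - np.tanh(x) ** 2     # peaks at 1.0 around x = 0

  x = np.linspace(-4, 4, 9)
  print(relu(x), sigmoid(x), np.tanh(x), sep="\n")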

Key Takeaways

  • MLPs can solve non-linearly separable problems that single perceptrons cannot
  • Hidden layers learn hierarchical feature representations
  • Backpropagation enables efficient training of deep networks through the chain rule
  • Choice of activation function significantly impacts learning dynamics
  • Weight initialization affects convergence speed and stability
  • Batch size influences the trade-off between gradient accuracy and computation
  • MLPs are the foundation for more complex architectures like CNNs and RNNs
  • Understanding MLPs is crucial for building modern deep learning models

Next Steps