LSTM & GRU

Introduction

Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks mitigate the vanishing gradient problem of traditional RNNs through gating mechanisms that control how information flows through the network over time.

Key Innovations

LSTM Features

  • Separate cell state and hidden state (see the sketch after this list)
  • Three gates: forget, input, and output
  • Explicit memory control mechanism
  • Better long-term dependency learning
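
A minimal PyTorch sketch (the layer sizes and the use of torch.nn.LSTM with batch_first=True are illustrative choices, not part of the original text) showing that an LSTM exposes a hidden state and a separate cell state:

import torch
import torch.nn as nn

# Illustrative sizes: 8-dimensional inputs, 16-dimensional states, 5 time steps
input_size, hidden_size, seq_len, batch = 8, 16, 5, 1

lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
x = torch.randn(batch, seq_len, input_size)

# The LSTM returns the per-step outputs plus BOTH final states:
# h_n is the hidden state, c_n is the separate cell state (the long-term memory).
output, (h_n, c_n) = lstm(x)
print(output.shape)  # torch.Size([1, 5, 16])
print(h_n.shape)     # torch.Size([1, 1, 16])
print(c_n.shape)     # torch.Size([1, 1, 16])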

GRU Features

  • Simpler architecture with fewer parameters (see the sketch after this list)
  • Two gates: reset and update
  • Combines forget and input gates
  • Often comparable performance to LSTM
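
A matching sketch for the GRU, again with illustrative sizes, showing that it carries only a hidden state and no separate cell state:

import torch
import torch.nn as nn

input_size, hidden_size, seq_len, batch = 8, 16, 5, 1

gru = nn.GRU(input_size, hidden_size, batch_first=True)
x = torch.randn(batch, seq_len, input_size)

# The GRU returns only the per-step outputs and a final hidden state;
# there is no separate cell state to carry around.
output, h_n = gru(x)
print(output.shape)  # torch.Size([1, 5, 16])
print(h_n.shape)     # torch.Size([1, 1, 16])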

Interactive Cell Visualizations

Explore how LSTM and GRU cells process sequences step by step:

[Interactive demo panels: LSTM cell, GRU cell, LSTM gate activations over time (blue: high activation, red: low activation), interactive gate manipulation, and an LSTM vs GRU hidden state comparison showing average hidden state values over time for both architectures.]

Mathematical Foundations

LSTM Equations

f_t = σ(W_f · [h_{t-1}, x_t] + b_f)
i_t = σ(W_i · [h_{t-1}, x_t] + b_i)
C̃_t = tanh(W_C · [h_{t-1}, x_t] + b_C)
C_t = f_t * C_{t-1} + i_t * C̃_t
o_t = σ(W_o · [h_{t-1}, x_t] + b_o)
h_t = o_t * tanh(C_t)
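
A minimal NumPy sketch of one LSTM step that follows the equations above literally; the weight shapes, the explicit concatenation [h_{t-1}, x_t], and the random initialization are assumptions made for illustration:

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, C_prev, W_f, W_i, W_C, W_o, b_f, b_i, b_C, b_o):
    z = np.concatenate([h_prev, x_t])        # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)             # forget gate
    i_t = sigmoid(W_i @ z + b_i)             # input gate
    C_tilde = np.tanh(W_C @ z + b_C)         # candidate cell state
    C_t = f_t * C_prev + i_t * C_tilde       # new cell state
    o_t = sigmoid(W_o @ z + b_o)             # output gate
    h_t = o_t * np.tanh(C_t)                 # new hidden state
    return h_t, C_t

# Tiny random parameters, purely for illustration
hidden, inputs = 4, 3
rng = np.random.default_rng(0)
W = lambda: 0.1 * rng.standard_normal((hidden, hidden + inputs))
b = lambda: np.zeros(hidden)
h, C = np.zeros(hidden), np.zeros(hidden)
x = rng.standard_normal(inputs)
h, C = lstm_step(x, h, C, W(), W(), W(), W(), b(), b(), b(), b())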

GRU Equations

r_t = σ(W_r · [h_{t-1}, x_t] + b_r)
z_t = σ(W_z · [h_{t-1}, x_t] + b_z)
h̃_t = tanh(W_h · [r_t * h_{t-1}, x_t] + b_h)
h_t = (1 - z_t) * h_{t-1} + z_t * h̃_t
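
The same kind of sketch for one GRU step, with the reset gate applied to h_{t-1} before the candidate is computed, as in the equations above (shapes and initialization again illustrative):

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, W_r, W_z, W_h, b_r, b_z, b_h):
    z_in = np.concatenate([h_prev, x_t])           # [h_{t-1}, x_t]
    r_t = sigmoid(W_r @ z_in + b_r)                # reset gate
    z_t = sigmoid(W_z @ z_in + b_z)                # update gate
    cand_in = np.concatenate([r_t * h_prev, x_t])  # [r_t * h_{t-1}, x_t]
    h_tilde = np.tanh(W_h @ cand_in + b_h)         # candidate hidden state
    h_t = (1 - z_t) * h_prev + z_t * h_tilde       # blend old state and candidate
    return h_t

# Tiny random parameters, purely for illustration
hidden, inputs = 4, 3
rng = np.random.default_rng(0)
W = lambda: 0.1 * rng.standard_normal((hidden, hidden + inputs))
h = np.zeros(hidden)
x = rng.standard_normal(inputs)
h = gru_step(x, h, W(), W(), W(), np.zeros(hidden), np.zeros(hidden), np.zeros(hidden))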

Gate Functions Explained

Forget Gate

Decides what information to discard from the cell state

Input Gate

Determines which values in the cell state to update

Output Gate

Controls which parts of the cell state contribute to the output

Reset Gate (GRU)

Controls how much of the previous hidden state feeds into the candidate state

Update Gate (GRU)

Combines the roles of the forget and input gates, deciding how much of the old hidden state to keep versus replace with the candidate
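
To make the gate roles concrete, here is a small toy computation (all values invented for the example) of the cell-state update C_t = f_t * C_{t-1} + i_t * C̃_t, showing how the forget gate scales old memory while the input gate admits new information:

import numpy as np

C_prev = np.array([1.0, -2.0, 0.5])    # previous cell state (invented values)
C_tilde = np.array([0.3, 0.8, -0.4])   # candidate values from the current input

# Forget gate near 1 keeps the old memory, near 0 erases it;
# input gate near 1 writes the candidate, near 0 ignores it.
for f, i in [(0.99, 0.01), (0.01, 0.99), (0.5, 0.5)]:
    C_t = f * C_prev + i * C_tilde
    print(f"f_t={f:.2f}, i_t={i:.2f} -> C_t={np.round(C_t, 3)}")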

LSTM vs GRU Comparison

Aspect      | LSTM                            | GRU
Gates       | 3 gates (forget, input, output) | 2 gates (reset, update)
Parameters  | More parameters                 | Fewer parameters
Memory      | Separate cell and hidden state  | Hidden state only
Performance | Better on complex tasks         | Often comparable, faster training
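
As a rough sanity check on the parameter row, the built-in PyTorch modules can be compared directly; the sizes below are arbitrary, and exact counts depend on bias terms and layer count:

import torch.nn as nn

def n_params(module):
    return sum(p.numel() for p in module.parameters())

input_size, hidden_size = 32, 64
lstm = nn.LSTM(input_size, hidden_size)
gru = nn.GRU(input_size, hidden_size)

# LSTM stores weights for 4 transformations (3 gates + candidate), GRU for 3,
# so the GRU is roughly 25% smaller at the same hidden size.
print("LSTM parameters:", n_params(lstm))  # 4 * (hidden*(input+hidden) + 2*hidden) = 25088
print("GRU parameters: ", n_params(gru))   # 3 * (hidden*(input+hidden) + 2*hidden) = 18816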

Key Takeaways

  • LSTM and GRU mitigate the vanishing gradient problem through gating mechanisms
  • Gates control the flow of information, enabling selective memory
  • LSTM has a more complex architecture but potentially better long-term memory
  • GRU is simpler and often achieves comparable performance with fewer parameters
  • Both architectures enable learning of long-term dependencies
  • Choice between LSTM and GRU often depends on specific task requirements
  • Understanding gating mechanisms is crucial for modern sequence modeling