LSTM & GRU
Introduction
Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks mitigate the vanishing gradient problem of traditional RNNs through gating mechanisms that control how information flows through the sequence.
Key Innovations
LSTM Features
- Separate cell state and hidden state (see the sketch after these lists)
- Three gates: forget, input, and output
- Explicit memory control mechanism
- Better long-term dependency learning
GRU Features
- Simpler architecture with fewer parameters
- Two gates: reset and update
- Combines forget and input gates
- Often comparable performance to LSTM
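To make the state-handling difference in the lists above concrete, here is a minimal sketch using PyTorch's nn.LSTM and nn.GRU layers (the layer sizes are arbitrary choices for illustration): the LSTM carries both a hidden state and a cell state, while the GRU carries only a hidden state.

```python
import torch
import torch.nn as nn

seq_len, batch, input_size, hidden_size = 5, 1, 8, 16
x = torch.randn(seq_len, batch, input_size)   # a toy input sequence

lstm = nn.LSTM(input_size, hidden_size)
gru = nn.GRU(input_size, hidden_size)

# The LSTM returns outputs plus BOTH final states: hidden (h_n) and cell (c_n)
lstm_out, (h_n, c_n) = lstm(x)
# The GRU returns outputs plus only the final hidden state
gru_out, h_n_gru = gru(x)

print(lstm_out.shape, h_n.shape, c_n.shape)   # [5, 1, 16] [1, 1, 16] [1, 1, 16]
print(gru_out.shape, h_n_gru.shape)           # [5, 1, 16] [1, 1, 16]
```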
Interactive Cell Visualizations
Explore how LSTM and GRU cells process sequences step by step:
[Interactive demo: configure an input sequence and step through it to watch each cell update. Panels include the LSTM cell, the GRU cell, LSTM gate activations over time (blue: high activation, red: low activation), interactive gate manipulation, and an LSTM vs GRU hidden state comparison plotting average hidden state values over time for both architectures.]
Mathematical Foundations
LSTM Equations
f_t = σ(W_f · [h_{t-1}, x_t] + b_f)
i_t = σ(W_i · [h_{t-1}, x_t] + b_i)
C̃_t = tanh(W_C · [h_{t-1}, x_t] + b_C)
C_t = f_t * C_{t-1} + i_t * C̃_t
o_t = σ(W_o · [h_{t-1}, x_t] + b_o)
h_t = o_t * tanh(C_t)
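These equations translate almost line for line into code. Below is a minimal NumPy sketch of one LSTM step, assuming each gate has its own weight matrix acting on the concatenated [h_{t-1}, x_t]; the function and variable names are illustrative, not taken from any particular library.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W_f, W_i, W_C, W_o, b_f, b_i, b_C, b_o):
    """One LSTM time step following the equations above.
    Each W_* has shape (hidden, hidden + input); each b_* has shape (hidden,)."""
    z = np.concatenate([h_prev, x_t])        # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)             # forget gate
    i_t = sigmoid(W_i @ z + b_i)             # input gate
    c_tilde = np.tanh(W_C @ z + b_C)         # candidate cell state C̃_t
    c_t = f_t * c_prev + i_t * c_tilde       # new cell state
    o_t = sigmoid(W_o @ z + b_o)             # output gate
    h_t = o_t * np.tanh(c_t)                 # new hidden state
    return h_t, c_t
```

Iterating this function over a sequence while carrying (h_t, C_t) forward reproduces the full recurrence.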
GRU Equations
r_t = σ(W_r · [h_{t-1}, x_t] + b_r)
z_t = σ(W_z · [h_{t-1}, x_t] + b_z)
h̃_t = tanh(W_h · [r_t * h_{t-1}, x_t] + b_h)
h_t = (1 - z_t) * h_{t-1} + z_t * h̃_t
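A matching NumPy sketch of one GRU step, under the same illustrative conventions; note that the reset gate scales the previous hidden state before the candidate is computed.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, W_r, W_z, W_h, b_r, b_z, b_h):
    """One GRU time step following the equations above.
    Each W_* has shape (hidden, hidden + input); each b_* has shape (hidden,)."""
    z_in = np.concatenate([h_prev, x_t])            # [h_{t-1}, x_t]
    r_t = sigmoid(W_r @ z_in + b_r)                 # reset gate
    z_t = sigmoid(W_z @ z_in + b_z)                 # update gate
    h_in = np.concatenate([r_t * h_prev, x_t])      # [r_t * h_{t-1}, x_t]
    h_tilde = np.tanh(W_h @ h_in + b_h)             # candidate state h̃_t
    h_t = (1 - z_t) * h_prev + z_t * h_tilde        # interpolate old vs. new
    return h_t
```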
Gate Functions Explained
Forget Gate
Decides which parts of the previous cell state to discard (see the numerical sketch after this list)
Input Gate
Determines which new candidate values are written into the cell state
Output Gate
Controls which parts of the cell state are exposed through the hidden state
Reset Gate (GRU)
Controls how much of the previous hidden state is used when computing the candidate state
Update Gate (GRU)
Combines the roles of the forget and input gates, balancing old memory against new information
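To see the forget/input interplay numerically, consider the cell-state update C_t = f_t * C_{t-1} + i_t * C̃_t with the gates pushed toward their extremes (all numbers below are made up for illustration):

```python
import numpy as np

c_prev = np.array([0.9, -0.5, 0.3])     # previous cell state C_{t-1}
c_tilde = np.array([0.2, 0.8, -0.6])    # candidate values C̃_t

# Forget gate near 1, input gate near 0: old memory is kept almost unchanged
f_t, i_t = np.full(3, 0.99), np.full(3, 0.01)
print(f_t * c_prev + i_t * c_tilde)     # ≈ c_prev

# Forget gate near 0, input gate near 1: old memory is replaced by the candidate
f_t, i_t = np.full(3, 0.01), np.full(3, 0.99)
print(f_t * c_prev + i_t * c_tilde)     # ≈ c_tilde
```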
LSTM vs GRU Comparison
| Aspect | LSTM | GRU |
|---|---|---|
| Gates | 3 gates (forget, input, output) | 2 gates (reset, update) |
| Parameters | More parameters | Fewer parameters |
| Memory | Separate cell and hidden state | Only hidden state |
| Performance | Better on complex tasks | Often comparable, faster training |
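The parameter row can be checked directly: an LSTM layer learns four weight/bias blocks (three gates plus the candidate) while a GRU layer learns three, so at equal sizes a GRU has roughly three-quarters as many parameters. A quick check with PyTorch (layer sizes chosen arbitrarily):

```python
import torch.nn as nn

input_size, hidden_size = 64, 128
lstm = nn.LSTM(input_size, hidden_size)
gru = nn.GRU(input_size, hidden_size)

def n_params(module):
    return sum(p.numel() for p in module.parameters())

print("LSTM parameters:", n_params(lstm))   # 4 blocks of weights + biases
print("GRU parameters: ", n_params(gru))    # 3 blocks, roughly 3/4 of the LSTM
```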
Key Takeaways
- LSTM and GRU mitigate the vanishing gradient problem through gating mechanisms
- Gates control the flow of information, enabling selective memory
- LSTM has a more complex architecture but potentially better long-term memory
- GRU is simpler and often achieves comparable performance with fewer parameters
- Both architectures enable learning of long-term dependencies
- Choice between LSTM and GRU often depends on specific task requirements
- Understanding gating mechanisms is crucial for modern sequence modeling