LSTM & GRU

Introduction

Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks mitigate the vanishing gradient problem of traditional RNNs through gating mechanisms that control how information flows through the network over time.

Key Innovations

LSTM Features

  • Separate cell state and hidden state (see the sketch after this list)
  • Three gates: forget, input, and output
  • Explicit memory control mechanism
  • Better long-term dependency learning
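
A minimal PyTorch sketch (the layer sizes and the use of torch.nn.LSTM with batch_first=True are illustrative choices, not part of the original text) showing that an LSTM exposes a hidden state and a separate cell state:

import torch
import torch.nn as nn

# Illustrative sizes: 8-dimensional inputs, 16-dimensional states, 5 time steps
input_size, hidden_size, seq_len, batch = 8, 16, 5, 1

lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
x = torch.randn(batch, seq_len, input_size)

# The LSTM returns the per-step outputs plus BOTH final states:
# h_n is the hidden state, c_n is the separate cell state (the long-term memory).
output, (h_n, c_n) = lstm(x)
print(output.shape)  # torch.Size([1, 5, 16])
print(h_n.shape)     # torch.Size([1, 1, 16])
print(c_n.shape)     # torch.Size([1, 1, 16])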

GRU Features

  • Simpler architecture with fewer parameters (see the sketch after this list)
  • Two gates: reset and update
  • Combines forget and input gates
  • Often comparable performance to LSTM
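
A matching sketch for the GRU, again with illustrative sizes, showing that it carries only a hidden state and no separate cell state:

import torch
import torch.nn as nn

input_size, hidden_size, seq_len, batch = 8, 16, 5, 1

gru = nn.GRU(input_size, hidden_size, batch_first=True)
x = torch.randn(batch, seq_len, input_size)

# The GRU returns only the per-step outputs and a final hidden state;
# there is no separate cell state to carry around.
output, h_n = gru(x)
print(output.shape)  # torch.Size([1, 5, 16])
print(h_n.shape)     # torch.Size([1, 1, 16])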

Interactive Cell Visualizations

Explore how LSTM and GRU cells process sequences step by step:

[Interactive demo panels: LSTM cell, GRU cell, LSTM gate activations over time (blue: high activation, red: low activation), interactive gate manipulation, and an LSTM vs GRU hidden state comparison showing average hidden state values over time for both architectures.]

Mathematical Foundations

LSTM Equations

f_t = σ(W_f · [h_{t-1}, x_t] + b_f)
i_t = σ(W_i · [h_{t-1}, x_t] + b_i)
C̃_t = tanh(W_C · [h_{t-1}, x_t] + b_C)
C_t = f_t * C_{t-1} + i_t * C̃_t
o_t = σ(W_o · [h_{t-1}, x_t] + b_o)
h_t = o_t * tanh(C_t)
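
A minimal NumPy sketch of one LSTM step that follows the equations above literally; the weight shapes, the explicit concatenation [h_{t-1}, x_t], and the random initialization are assumptions made for illustration:

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, C_prev, W_f, W_i, W_C, W_o, b_f, b_i, b_C, b_o):
    z = np.concatenate([h_prev, x_t])        # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)             # forget gate
    i_t = sigmoid(W_i @ z + b_i)             # input gate
    C_tilde = np.tanh(W_C @ z + b_C)         # candidate cell state
    C_t = f_t * C_prev + i_t * C_tilde       # new cell state
    o_t = sigmoid(W_o @ z + b_o)             # output gate
    h_t = o_t * np.tanh(C_t)                 # new hidden state
    return h_t, C_t

# Tiny random parameters, purely for illustration
hidden, inputs = 4, 3
rng = np.random.default_rng(0)
W = lambda: 0.1 * rng.standard_normal((hidden, hidden + inputs))
b = lambda: np.zeros(hidden)
h, C = np.zeros(hidden), np.zeros(hidden)
x = rng.standard_normal(inputs)
h, C = lstm_step(x, h, C, W(), W(), W(), W(), b(), b(), b(), b())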

GRU Equations

r_t = σ(W_r · [h_{t-1}, x_t] + b_r)
z_t = σ(W_z · [h_{t-1}, x_t] + b_z)
h̃_t = tanh(W_h · [r_t * h_{t-1}, x_t] + b_h)
h_t = (1 - z_t) * h_{t-1} + z_t * h̃_t
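
The same kind of sketch for one GRU step, with the reset gate applied to h_{t-1} before the candidate is computed, as in the equations above (shapes and initialization again illustrative):

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, W_r, W_z, W_h, b_r, b_z, b_h):
    z_in = np.concatenate([h_prev, x_t])           # [h_{t-1}, x_t]
    r_t = sigmoid(W_r @ z_in + b_r)                # reset gate
    z_t = sigmoid(W_z @ z_in + b_z)                # update gate
    cand_in = np.concatenate([r_t * h_prev, x_t])  # [r_t * h_{t-1}, x_t]
    h_tilde = np.tanh(W_h @ cand_in + b_h)         # candidate hidden state
    h_t = (1 - z_t) * h_prev + z_t * h_tilde       # blend old state and candidate
    return h_t

# Tiny random parameters, purely for illustration
hidden, inputs = 4, 3
rng = np.random.default_rng(0)
W = lambda: 0.1 * rng.standard_normal((hidden, hidden + inputs))
h = np.zeros(hidden)
x = rng.standard_normal(inputs)
h = gru_step(x, h, W(), W(), W(), np.zeros(hidden), np.zeros(hidden), np.zeros(hidden))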

Gate Functions Explained

Forget Gate

Decides what information to discard from the cell state

Input Gate

Determines which values in the cell state to update

Output Gate

Controls which parts of the cell state contribute to the output

Reset Gate (GRU)

Controls how much of the previous hidden state feeds into the candidate state

Update Gate (GRU)

Combines the roles of the forget and input gates, deciding how much of the old hidden state to keep versus replace with the candidate
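
To make the gate roles concrete, here is a small toy computation (all values invented for the example) of the cell-state update C_t = f_t * C_{t-1} + i_t * C̃_t, showing how the forget gate scales old memory while the input gate admits new information:

import numpy as np

C_prev = np.array([1.0, -2.0, 0.5])    # previous cell state (invented values)
C_tilde = np.array([0.3, 0.8, -0.4])   # candidate values from the current input

# Forget gate near 1 keeps the old memory, near 0 erases it;
# input gate near 1 writes the candidate, near 0 ignores it.
for f, i in [(0.99, 0.01), (0.01, 0.99), (0.5, 0.5)]:
    C_t = f * C_prev + i * C_tilde
    print(f"f_t={f:.2f}, i_t={i:.2f} -> C_t={np.round(C_t, 3)}")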

LSTM vs GRU Comparison

Aspect      | LSTM                            | GRU
Gates       | 3 gates (forget, input, output) | 2 gates (reset, update)
Parameters  | More parameters                 | Fewer parameters
Memory      | Separate cell and hidden state  | Hidden state only
Performance | Better on complex tasks         | Often comparable, faster training
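
As a rough sanity check on the parameter row, the built-in PyTorch modules can be compared directly; the sizes below are arbitrary, and exact counts depend on bias terms and layer count:

import torch.nn as nn

def n_params(module):
    return sum(p.numel() for p in module.parameters())

input_size, hidden_size = 32, 64
lstm = nn.LSTM(input_size, hidden_size)
gru = nn.GRU(input_size, hidden_size)

# LSTM stores weights for 4 transformations (3 gates + candidate), GRU for 3,
# so the GRU is roughly 25% smaller at the same hidden size.
print("LSTM parameters:", n_params(lstm))  # 4 * (hidden*(input+hidden) + 2*hidden) = 25088
print("GRU parameters: ", n_params(gru))   # 3 * (hidden*(input+hidden) + 2*hidden) = 18816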

Key Takeaways

  • LSTM and GRU mitigate the vanishing gradient problem through gating mechanisms
  • Gates control the flow of information, enabling selective memory
  • LSTM has a more complex architecture but potentially better long-term memory
  • GRU is simpler and often achieves comparable performance with fewer parameters
  • Both architectures enable learning of long-term dependencies
  • Choice between LSTM and GRU often depends on specific task requirements
  • Understanding gating mechanisms is crucial for modern sequence modeling