Recurrent Neural Networks (RNNs)

Introduction

Recurrent Neural Networks (RNNs) are designed to process sequential data by maintaining an internal memory. Unlike feedforward networks, RNNs can use information from previous time steps to inform the current prediction.

Key Concepts

Sequential Processing

  • Processes one element at a time
  • Maintains hidden state between steps
  • Can handle variable-length sequences
  • Shares parameters across time steps (see the sketch after this list)
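
A minimal sketch of these properties in plain NumPy (the dimensions, weights, and the run_rnn helper are all made up for illustration, not taken from the visualizer): the same W_xh and W_hh are applied at every step, so one loop handles sequences of any length.

import numpy as np

rng = np.random.default_rng(0)
H, D = 4, 3                                  # hidden size, input size (assumed toy values)
W_xh = 0.1 * rng.normal(size=(H, D))         # input-to-hidden weights
W_hh = 0.1 * rng.normal(size=(H, H))         # hidden-to-hidden weights
b_h = np.zeros((H, 1))

def run_rnn(sequence):
    h = np.zeros((H, 1))                     # hidden state carried between steps
    for x_t in sequence:                     # process one element at a time
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
    return h                                 # final state summarizes the whole sequence

short_seq = [rng.normal(size=(D, 1)) for _ in range(3)]
long_seq = [rng.normal(size=(D, 1)) for _ in range(12)]
print(run_rnn(short_seq).shape, run_rnn(long_seq).shape)   # same weights, both lengths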

Challenges

  • Vanishing gradient problem
  • Difficulty learning long-term dependencies
  • Sequential processing limits parallelization
  • Can suffer from exploding gradients

Interactive RNN Visualizer

Watch how RNNs process sequences step by step and see how hidden states evolve:

[Interactive demo: choose a network configuration and sequence type, browse sequence examples, and view the RNN unrolled through time along with the hidden state evolution (blue: positive activations, red: negative activations).]

Character-Level Text Generation

Watch how RNNs generate text character by character, maintaining context through hidden states:

[Interactive demo: set a temperature (lower = more conservative, higher = more creative) and watch the generated output together with the probability distribution over the top 10 most likely next characters.]

How It Works:

  1. Each character is converted to a one-hot vector
  2. RNN processes the character and updates hidden state
  3. Output layer produces probability distribution over vocabulary
  4. Next character is sampled from this distribution, sharpened or flattened by the temperature
  5. Process repeats with the new character (see the sketch below)
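
A hedged sketch of those five steps with randomly initialized, untrained weights and a made-up five-character vocabulary; every name below (vocab, one_hot, the weight shapes) is an illustration rather than the demo's actual implementation, so the output is gibberish, but the control flow matches the steps above.

import numpy as np

rng = np.random.default_rng(0)
vocab = list("helo ")                         # made-up toy vocabulary
V, H = len(vocab), 8
W_xh = 0.1 * rng.normal(size=(H, V))
W_hh = 0.1 * rng.normal(size=(H, H))
W_hy = 0.1 * rng.normal(size=(V, H))
b_h, b_y = np.zeros((H, 1)), np.zeros((V, 1))

def one_hot(i):
    v = np.zeros((V, 1)); v[i] = 1.0          # step 1: character -> one-hot vector
    return v

temperature = 0.8
h = np.zeros((H, 1))
idx = vocab.index("h")
generated = ["h"]
for _ in range(10):
    h = np.tanh(W_xh @ one_hot(idx) + W_hh @ h + b_h)   # step 2: update hidden state
    logits = W_hy @ h + b_y                             # step 3: scores over the vocabulary
    p = np.exp(logits / temperature)                    # step 4: temperature-scaled softmax...
    p = (p / p.sum()).ravel()
    idx = int(rng.choice(V, p=p))                       # ...then sample the next character
    generated.append(vocab[idx])                        # step 5: repeat with the new character
print("".join(generated))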

💡 Temperature Control

  • Low (0.1-0.5): Conservative, repetitive text
  • Medium (0.8-1.2): Balanced creativity
  • High (1.5-2.0): Creative but potentially nonsensical (illustrated below)
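
A tiny numeric illustration of this knob (the logits are invented example scores): dividing them by a low temperature concentrates the softmax on the top character, while a high temperature flattens the distribution.

import numpy as np

logits = np.array([2.0, 1.0, 0.5, 0.1])       # assumed example scores
for T in (0.2, 1.0, 2.0):
    p = np.exp(logits / T)
    p /= p.sum()
    print(f"T={T}: {np.round(p, 2)}")          # low T: peaked; high T: nearly uniform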

Mathematical Foundation

RNN Forward Pass:

h_t = tanh(W_xh × x_t + W_hh × h_{t-1} + b_h)
y_t = W_hy × h_t + b_y

Where h_t is the hidden state, x_t is the input, and y_t is the output at time t; W_xh, W_hh, and W_hy are the input-to-hidden, hidden-to-hidden, and hidden-to-output weight matrices
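
The same two equations written as a short NumPy sketch (parameter names follow the formulas above; the function itself is illustrative, not the visualizer's code):

import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, W_hy, b_h, b_y):
    """One forward step of the vanilla RNN defined by the equations above."""
    h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)   # hidden state update
    y_t = W_hy @ h_t + b_y                            # output at time t
    return h_t, y_t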

Backpropagation Through Time (BPTT):

∂L/∂W_hh = Σ_t ∂L_t/∂W_hh
∂L_t/∂h_k = ∂L_t/∂h_t × Π_{i=k+1}^t ∂h_i/∂h_{i-1}

Gradients must be propagated back through all time steps
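
A compact sketch of BPTT for the W_hh gradient, assuming the tanh RNN above and a squared-error loss at each step; the loss choice and the bptt_grad_Whh helper are assumptions made for illustration.

import numpy as np

def bptt_grad_Whh(xs, targets, W_xh, W_hh, W_hy, b_h, b_y):
    """Accumulate dL/dW_hh over all time steps (squared-error loss assumed)."""
    H = W_hh.shape[0]
    hs = {-1: np.zeros((H, 1))}
    # Forward pass: store every hidden state for reuse on the way back.
    for t, x_t in enumerate(xs):
        hs[t] = np.tanh(W_xh @ x_t + W_hh @ hs[t - 1] + b_h)
    dW_hh = np.zeros_like(W_hh)
    dh_next = np.zeros((H, 1))
    # Backward pass: each step adds its term of the sum, chaining ∂h_i/∂h_{i-1}.
    for t in reversed(range(len(xs))):
        y_t = W_hy @ hs[t] + b_y
        dy = 2.0 * (y_t - targets[t])          # dL_t/dy_t for squared error
        dh = W_hy.T @ dy + dh_next             # gradient arriving at h_t
        dz = (1.0 - hs[t] ** 2) * dh           # through tanh: derivative is 1 - tanh(z)^2
        dW_hh += dz @ hs[t - 1].T              # contribution of time step t
        dh_next = W_hh.T @ dz                  # propagate to h_{t-1}
    return dW_hh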

Vanishing Gradient Problem

The vanishing gradient problem occurs when gradients become exponentially small as they propagate back through time, making it difficult to learn long-term dependencies.

Why it happens:

  • Gradients are products of derivatives through time
  • Derivatives of activation functions are typically < 1
  • Long sequences lead to very small gradient products (see the numeric sketch after this list)
  • Early layers receive negligible gradient updates
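
A small numeric sketch of the effect (the recurrent weights are random, with a deliberately small assumed initialization): the norm of the chained Jacobians Π ∂h_i/∂h_{i-1} collapses toward zero as the number of steps grows.

import numpy as np

rng = np.random.default_rng(0)
H = 16
W_hh = rng.normal(0.0, 0.3 / np.sqrt(H), size=(H, H))   # assumed small initialization
h = np.tanh(rng.normal(size=(H, 1)))
jac = np.eye(H)
for t in range(1, 51):
    h = np.tanh(W_hh @ h)
    jac = (np.diagflat(1.0 - h ** 2) @ W_hh) @ jac       # one step's Jacobian, chained in
    if t % 10 == 0:
        print(f"t={t:2d}  norm of Jacobian product = {np.linalg.norm(jac):.2e}")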

Solutions:

  • LSTM and GRU architectures with gating mechanisms
  • Gradient clipping to prevent exploding gradients (sketched after this list)
  • Better initialization strategies
  • Residual connections and attention mechanisms
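
One of these remedies, gradient clipping by global norm, fits in a few lines; this is a generic sketch rather than any particular framework's API.

import numpy as np

def clip_by_global_norm(grads, max_norm=5.0):
    """Rescale a list of gradient arrays so their combined L2 norm is at most max_norm."""
    total = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    scale = min(1.0, max_norm / (total + 1e-8))
    return [g * scale for g in grads]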

Applications

Language Modeling

Predicting the next word in a sequence for text generation and completion

Time Series

Forecasting stock prices, weather patterns, and sensor data

Sequence Classification

Sentiment analysis, document classification, and sequence labeling

Key Takeaways

  • RNNs can process variable-length sequences using internal memory
  • Hidden states carry information from previous time steps
  • Vanishing gradients make learning long-term dependencies challenging
  • BPTT extends backpropagation to handle temporal sequences
  • RNNs laid the foundation for more advanced architectures like LSTMs and Transformers
  • Understanding RNNs is crucial for working with sequential data in deep learning