Autoregressive Models
The foundation of sequential generation in modern AI
Introduction
Autoregressive models are a class of statistical models that predict future values based on past values in a sequence. In machine learning, they form the backbone of modern language models, music generation, and time series forecasting. These models generate outputs one element at a time, using previously generated elements as context for predicting the next.
Key Insight: Autoregressive models decompose the joint probability distribution of a sequence into a product of conditional distributions, making complex generation tasks tractable.
Core Concept
The fundamental principle of autoregressive modeling is the chain rule of probability:
P(x₁, x₂, ..., xₙ) = P(x₁) × P(x₂|x₁) × P(x₃|x₁,x₂) × ... × P(xₙ|x₁,...,xₙ₋₁)
Each element in the sequence is predicted based on all previous elements, creating a natural left-to-right generation process.
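As a minimal sketch of this factorization in code (plain Python; the `next_token_prob` callable is a hypothetical stand-in for a trained model's conditional distribution), a sequence is scored by accumulating one conditional term per position:

import math

def sequence_log_prob(tokens, next_token_prob):
    # next_token_prob(context, token) -> P(token | context) is assumed to be
    # supplied by some trained model; here it is only a placeholder interface.
    log_prob = 0.0
    for i, token in enumerate(tokens):
        context = tokens[:i]                 # x_1, ..., x_{i-1}
        p = next_token_prob(context, token)  # P(x_i | x_1, ..., x_{i-1})
        log_prob += math.log(p)              # sum of logs = log of the product
    return log_prob

# Toy check with a uniform "model" over a 4-token vocabulary:
uniform = lambda context, token: 0.25
print(sequence_log_prob(["the", "cat", "sat"], uniform))  # 3 * log(0.25) ≈ -4.16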
Types of Autoregressive Models
Language Models
Generate text by predicting the next word/token:
- GPT series (GPT-3, GPT-4)
- LSTM language models
- Character-level RNNs
Time Series Models
Forecast future values from historical data:
- ARIMA models
- DeepAR
- WaveNet for audio
Image Generation
Generate pixels sequentially:
- PixelRNN/PixelCNN
- Image GPT
- DALL-E (partially autoregressive)
Audio Generation
Generate audio samples sequentially:
- WaveNet
- SampleRNN
- Jukebox
Architecture Components
1. Context Window
The number of previous tokens the model can "see" when making predictions:
- Fixed window: only the last N tokens
- Full history: all previous tokens
- Hierarchical: context at multiple scales
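A tiny illustration of the first two options (plain Python; the window size is arbitrary): a fixed window simply truncates the context before each prediction, while full history keeps everything.

# Build the context passed to the model at each step (illustrative only)
def build_context(tokens, window_size=None):
    if window_size is None:          # full history: condition on everything so far
        return tokens
    return tokens[-window_size:]     # fixed window: drop tokens older than the last N

tokens = list("autoregressive")
print(build_context(tokens, window_size=4))  # ['s', 'i', 'v', 'e']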
2. Prediction Head
Converts hidden representations into probability distributions:
# Simplified prediction head
hidden_state = model(context)   # [batch, hidden_dim]
logits = linear(hidden_state)   # [batch, vocab_size]
probs = softmax(logits)         # [batch, vocab_size]
next_token = sample(probs)      # [batch, 1]
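For a more concrete version of the same steps, here is a self-contained PyTorch sketch (the hidden state is random, standing in for a real model's output, and the dimensions are arbitrary):

import torch
import torch.nn.functional as F

batch, hidden_dim, vocab_size = 2, 64, 1000

hidden_state = torch.randn(batch, hidden_dim)          # stand-in for model(context)
lm_head = torch.nn.Linear(hidden_dim, vocab_size)      # the prediction head itself

logits = lm_head(hidden_state)                         # [batch, vocab_size]
probs = F.softmax(logits, dim=-1)                      # [batch, vocab_size]
next_token = torch.multinomial(probs, num_samples=1)   # [batch, 1] sampled token ids
print(next_token.shape)                                # torch.Size([2, 1])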
3. Attention Mechanisms
Modern autoregressive models use attention to process long sequences. In Transformer decoders, causal (masked) self-attention lets each position attend only to earlier positions, which preserves the left-to-right factorization while still giving every prediction access to the full history.
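The core of the mechanism is a causal mask applied to the attention scores, so that position i can only attend to positions ≤ i. A minimal single-head sketch in PyTorch (sequence length, dimensionality, and the random tensors are placeholders rather than any particular model):

import torch
import torch.nn.functional as F

seq_len, d = 5, 16
q = torch.randn(seq_len, d)   # queries
k = torch.randn(seq_len, d)   # keys
v = torch.randn(seq_len, d)   # values

scores = q @ k.T / d ** 0.5                            # [seq_len, seq_len] similarity scores
future = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(future, float("-inf"))     # block attention to future positions
attn = F.softmax(scores, dim=-1)                       # each row sums to 1 over allowed positions
output = attn @ v                                      # [seq_len, d] causally attended values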
Training Autoregressive Models
Teacher Forcing
During training, models use the ground truth previous tokens rather than their own predictions:
Training time: the model conditions on the ground-truth previous tokens, so every position can be trained in parallel.
Inference time: the model conditions on its own previously generated tokens, one step at a time.
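In practice, teacher forcing usually amounts to shifting the target sequence by one position so that every position is trained to predict the token that actually follows it. A minimal sketch (PyTorch; `model` is a hypothetical network mapping token ids to per-position logits of shape [batch, seq_len - 1, vocab_size]):

import torch.nn.functional as F

def teacher_forcing_loss(model, tokens):
    # tokens: [batch, seq_len] ground-truth token ids
    inputs = tokens[:, :-1]     # the model always sees the true previous tokens...
    targets = tokens[:, 1:]     # ...and is trained to predict the token that follows
    logits = model(inputs)      # [batch, seq_len - 1, vocab_size] (assumed)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # flatten positions into one batch
        targets.reshape(-1),
    )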
Exposure Bias
Models may struggle when their predictions deviate from training data, as errors compound during generation.
Scheduled Sampling
Gradually transition from teacher forcing to using model predictions during training to reduce exposure bias.
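One common way to implement this is to replace each ground-truth input token with the model's own prediction with some probability that grows over training. A rough sketch in plain Python (the step-wise `model_predict` interface and the `use_model_prob` schedule are assumptions for illustration):

import random

def scheduled_sampling_inputs(true_tokens, model_predict, use_model_prob):
    # true_tokens: the ground-truth sequence x_1 ... x_n
    # model_predict(prefix) -> predicted next token (hypothetical interface)
    # use_model_prob: chance of feeding the model its own prediction; scheduled
    #   to increase from 0 (pure teacher forcing) toward 1 over training
    inputs = [true_tokens[0]]                      # always start from the true first token
    for t in range(1, len(true_tokens)):
        if random.random() < use_model_prob:
            inputs.append(model_predict(inputs))   # feed the model's own prediction
        else:
            inputs.append(true_tokens[t])          # feed the ground truth
    return inputs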
Sampling Strategies
Greedy Decoding
Always select the highest probability token:
✓ Advantages:
- Deterministic
- Fast
✗ Disadvantages:
- Repetitive
- Less creative
Sampling with Temperature
Control randomness by dividing the logits by a temperature T before the softmax: T < 1 sharpens the distribution toward high-probability tokens, while T > 1 flattens it and increases diversity.
Top-k Sampling
Sample from the k most likely tokens only
Top-p (Nucleus) Sampling
Sample from tokens with cumulative probability ≤ p
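All four strategies operate on the same vector of logits and differ only in how the next token is drawn from it. A minimal PyTorch sketch (the logits are random stand-ins and the values of T, k, and p are arbitrary):

import torch
import torch.nn.functional as F

logits = torch.randn(1000)                        # one decoding step over a toy vocabulary

# Greedy decoding: always take the single most likely token.
greedy_token = torch.argmax(logits)

# Temperature sampling: divide logits by T before softmax (T < 1 sharpens, T > 1 flattens).
T = 0.8
temp_probs = F.softmax(logits / T, dim=-1)
temp_token = torch.multinomial(temp_probs, num_samples=1)

# Top-k sampling: keep only the k most likely tokens, renormalize, then sample.
k = 50
topk_vals, topk_idx = torch.topk(logits, k)
topk_token = topk_idx[torch.multinomial(F.softmax(topk_vals, dim=-1), num_samples=1)]

# Top-p (nucleus) sampling: keep tokens while cumulative probability stays <= p.
p = 0.9
sorted_probs, sorted_idx = torch.sort(F.softmax(logits, dim=-1), descending=True)
nucleus = torch.cumsum(sorted_probs, dim=-1) <= p
nucleus[0] = True                                 # always keep at least the top token
nucleus_probs = sorted_probs[nucleus] / sorted_probs[nucleus].sum()
top_p_token = sorted_idx[nucleus][torch.multinomial(nucleus_probs, num_samples=1)]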
Applications in Modern AI
Code Generation
GitHub Copilot and similar tools for code completion
Multi-modal Generation
DALL-E 2/3 combining autoregressive and diffusion approaches
Advantages and Limitations
Advantages
- Tractable likelihood computation and training
- Natural for sequential data (text, audio, time series)
- Can generate variable-length outputs
- Interpretable generation process
Limitations
- Sequential generation can be slow
- Error accumulation in long sequences
- Limited ability to revise earlier predictions
- May struggle with global coherence