Text Generation

Introduction

Text generation is the task of producing fluent, coherent text one token at a time. Modern language models (GPT-style decoders) generate autoregressively: they predict the next token from everything generated so far, append it to the context, and repeat. The model itself only ever does one thing — output a probability distribution over the vocabulary. How we turn that distribution into an actual chosen token is the job of a decoding strategy, and that choice has an enormous effect on whether the output feels robotic, creative, repetitive, or wildly incoherent.

Key insight: the model gives you a distribution. Decoding is the art of choosing from it. Two systems with the exact same model can produce dramatically different text depending purely on their decoding strategy.

Autoregressive Generation

Generation is a loop. Given a context (the prompt plus everything generated so far), the model produces scores for every token in its vocabulary. We pick one token, append it, and feed the longer sequence back in to predict the next token. This continues until an end-of-sequence token is produced or a length limit is hit.

1. Predict

Feed the context through the model to get a logit (raw score) for every token in the vocabulary.

2. Choose

Convert logits to probabilities with softmax, then apply a decoding strategy to select one token.

3. Append & repeat

Add the chosen token to the context and loop until an end-of-sequence token or length limit.
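The three steps above can be sketched as a short loop. This is a minimal illustration, not a real model: `toy_logits`, `VOCAB`, and the `choose` callback are hypothetical names invented for this example, and the "model" is a hand-written rule that always favours the next word in a fixed list.

```python
import math

# Hypothetical five-token vocabulary, invented for this sketch.
VOCAB = ["<eos>", "the", "weather", "is", "sunny"]

def toy_logits(context):
    """Stand-in for a real model: returns one logit per vocabulary token.

    It favours whichever token follows the last context token in VOCAB
    order, wrapping around to <eos> after "sunny".
    """
    last = context[-1] if context else "<eos>"
    nxt = (VOCAB.index(last) + 1) % len(VOCAB)
    return [3.0 if i == nxt else 0.0 for i in range(len(VOCAB))]

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def greedy(probs):
    """Decoding strategy: index of the highest-probability token."""
    return max(range(len(probs)), key=probs.__getitem__)

def generate(prompt, choose, max_new_tokens=10):
    context = list(prompt)
    for _ in range(max_new_tokens):
        probs = softmax(toy_logits(context))  # 1. predict
        token = VOCAB[choose(probs)]          # 2. choose
        if token == "<eos>":                  # stop on end-of-sequence
            break
        context.append(token)                 # 3. append & repeat
    return context

print(generate(["the"], greedy))
# ['the', 'weather', 'is', 'sunny']
```

Swapping `greedy` for a different `choose` function changes the decoding strategy without touching the model, which is the separation the rest of this page relies on.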

The model converts logits z into a probability distribution using the softmax function. For token i out of a vocabulary of size V:

P(token i) = softmax(z)ᵢ = e^(zᵢ) / Σⱼ e^(zⱼ)

Here zᵢ is the logit for token i, e^(zᵢ) exponentiates it so everything is positive, and the denominator Σⱼ e^(zⱼ) sums over all tokens so the result is a valid probability distribution (every value in [0, 1], summing to 1).
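As a small worked example, the formula can be written in a few lines of Python. Subtracting the largest logit before exponentiating is a standard numerical-stability trick that the text above does not mention; it leaves the result unchanged because the common factor cancels in the ratio. The logit values are illustrative.

```python
import math

def softmax(logits):
    """Turn raw logits into probabilities that sum to 1."""
    m = max(logits)                           # stability shift; cancels out
    exps = [math.exp(z - m) for z in logits]  # all positive
    total = sum(exps)                         # the denominator
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.0])
print([round(p, 3) for p in probs])
# [0.665, 0.245, 0.09]

# Very large logits would overflow without the shift; with it they are fine.
print(softmax([1000.0, 1000.0]))
# [0.5, 0.5]
```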

Decoding Strategies — the heart of generation

Once we have a probability distribution, how do we pick the next token? This is where decoding strategies come in. They trade off coherence (sticking with high-probability, "safe" tokens) against diversity (allowing lower-probability, more surprising tokens).

Greedy decoding

Always pick the single highest-probability token: argmaxᵢ P(token i). It is deterministic and fast, but tends to be repetitive and bland — and because it is locally greedy, it can miss a more probable sequence that requires a temporarily lower-probability token. It is prone to degenerate loops ("the the the").

Beam search

Instead of committing to one token, beam search keeps the top b partial sequences ("beams") at each step. It expands each beam with every possible next token, scores the results by total sequence probability (in practice, the sum of token log-probabilities), and again keeps only the b best. With b = 1 this reduces to greedy decoding. Because it explores more of the search space than greedy, it usually finds higher-probability sequences, which makes it popular for tasks with a "correct" answer like translation and summarization. The downside: it tends to produce bland, generic, repetitive text for open-ended generation, because high-probability sequences are often the most predictable ones.
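The sketch below illustrates how beam search can find a more probable sequence than greedy decoding. `TOY_MODEL` is a hypothetical lookup table with made-up probabilities, chosen so that the best first token leads to a worse overall sequence; a real system would call a model here instead. Greedy decoding is run as beam search with a beam width of 1.

```python
import math

# Hypothetical next-token probabilities, keyed by the sequence so far.
# The numbers are invented so that greedy misses the best sequence.
TOY_MODEL = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.55, "y": 0.45},
    ("b",): {"x": 0.9, "y": 0.1},
}

def beam_search(model, beam_width, steps):
    """Return the surviving beams as (sequence, log-probability) pairs."""
    beams = [((), 0.0)]
    for _ in range(steps):
        candidates = []
        for seq, logp in beams:
            # Expand every beam with every possible next token.
            for token, p in model[seq].items():
                candidates.append((seq + (token,), logp + math.log(p)))
        # Keep the beam_width best by total log-probability.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams

greedy_seq, greedy_logp = beam_search(TOY_MODEL, beam_width=1, steps=2)[0]
beam_seq, beam_logp = beam_search(TOY_MODEL, beam_width=2, steps=2)[0]

print(greedy_seq, round(math.exp(greedy_logp), 2))  # ('a', 'x') 0.33
print(beam_seq, round(math.exp(beam_logp), 2))      # ('b', 'x') 0.36
```

Greedy commits to "a" because 0.6 beats 0.4, then can reach at most 0.6 × 0.55 = 0.33. Keeping "b" alive as a second beam lets the search reach 0.4 × 0.9 = 0.36.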

Sampling with temperature

Rather than always taking the maximum, sampling draws a token randomly according to its probability. Temperature T reshapes the distribution by scaling the logits before softmax:

P(token i) = softmax(z / T)ᵢ = e^(zᵢ/T) / Σⱼ e^(zⱼ/T)

T < 1: dividing by a number less than 1 magnifies the gaps between logits → distribution gets sharper (peaks exaggerated). Safer, more focused, more repetitive.
T = 1: the original distribution, unchanged.
T > 1: shrinks the gaps → distribution gets flatter (more uniform). More diverse, more random, more likely to go off the rails.
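T → 0: the highest-logit token takes essentially all of the probability mass → sampling becomes equivalent to greedy decoding. (T = 0 itself would divide by zero, so implementations typically treat it as a plain argmax.)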
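The effect of T can be checked numerically. The following is a minimal sketch with illustrative logits; `softmax_with_temperature` is a name chosen for this example, and rejecting T ≤ 0 is an assumption about how to handle the undefined case.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax over logits divided by the temperature T."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; use argmax for T = 0")
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # stability shift
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # illustrative values
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: {[round(p, 3) for p in probs]}")
```

The probability of the top token is largest at T = 0.5 and smallest at T = 2.0, while the ranking of the tokens never changes: temperature reshapes the distribution but does not reorder it.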

Top-k sampling

Even after temperature scaling, the long tail of unlikely tokens can occasionally be sampled and derail the text. Top-k sampling restricts the candidate set to the k highest-probability tokens, zeroes out the rest, and renormalizes so the survivors sum to 1. A fixed k is blunt, though: when the model is very confident, even k=40 can include junk; when it is unsure, k may be too small.
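A minimal sketch of the truncate-and-renormalize step, operating on a list of probabilities. `top_k_filter` is a name chosen for this example and the input distribution is made up.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens, zero the rest, renormalize."""
    if k <= 0:
        raise ValueError("k must be at least 1")
    # Indices of the k highest-probability tokens.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept = set(order[:k])
    total = sum(probs[i] for i in kept)
    # Survivors are rescaled so they sum to 1; everything else becomes 0.
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]

probs = [0.5, 0.3, 0.15, 0.05]  # illustrative distribution
print([round(p, 3) for p in top_k_filter(probs, k=2)])
# [0.625, 0.375, 0.0, 0.0]
```

Note that k is the same whether the distribution is peaked or flat, which is the bluntness described above.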

Top-p (nucleus) sampling

Top-p (also called nucleus sampling) adapts the candidate set to the model's confidence. Sort tokens by probability from highest to lowest and keep the smallest set whose cumulative probability is at least p; everything else is dropped and the survivors are renormalized. When the model is confident, that nucleus is tiny (maybe one or two tokens); when it is uncertain, the nucleus grows to admit many candidates. This dynamic sizing is why top-p is a common default for open-ended generation. Top-k and top-p are often combined, in which case a token must survive both filters.
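The adaptive behaviour is easy to see with two made-up distributions, one confident and one uncertain. This is a sketch under those assumptions; `top_p_filter` is a name chosen for this example.

```python
def top_p_filter(probs, p):
    """Keep the smallest top-ranked set with cumulative probability >= p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:  # nucleus is complete
            break
    kept_set = set(kept)
    total = sum(probs[i] for i in kept)
    # Renormalize the nucleus; everything outside it becomes 0.
    return [probs[i] / total if i in kept_set else 0.0 for i in range(len(probs))]

confident = [0.90, 0.05, 0.03, 0.02]       # illustrative: one clear winner
uncertain = [0.30, 0.25, 0.20, 0.15, 0.10]  # illustrative: no clear winner

for name, dist in (("confident", confident), ("uncertain", uncertain)):
    nucleus = top_p_filter(dist, p=0.8)
    size = sum(1 for q in nucleus if q > 0)
    print(name, "nucleus size:", size)
# confident nucleus size: 1
# uncertain nucleus size: 4
```

The same p = 0.8 yields a one-token nucleus in the first case and a four-token nucleus in the second, with no change to the setting.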

Repetition penalty

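A complementary trick to fight loops: down-weight tokens that have already appeared by adjusting their logits before softmax. A subtractive penalty lowers the logit by a fixed amount. A divisive penalty divides the logit by a factor greater than 1, which only lowers it when the logit is positive; a negative logit has to be multiplied by the factor instead, otherwise the penalty would make the token more likely. Either form discourages the model from repeating the same words and phrases, and is frequently applied alongside any of the strategies above.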
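One common way to implement the divisive form is sketched below. The function name, the default penalty of 1.2, and the example logits are assumptions made for illustration. The sign check matters: dividing a negative logit by a number greater than 1 would move it toward zero and make the token more likely, so negative logits are multiplied instead.

```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    """Lower the logits of token ids that were already generated.

    Positive logits are divided by the penalty, negative ones multiplied,
    so the adjusted logit is never higher than the original.
    """
    out = list(logits)
    for i in set(generated_ids):  # penalize each seen token once
        if out[i] > 0:
            out[i] = out[i] / penalty
        else:
            out[i] = out[i] * penalty
    return out

logits = [2.0, -1.0, 0.5]  # illustrative values
print(apply_repetition_penalty(logits, generated_ids=[0, 1, 0], penalty=2.0))
# [1.0, -2.0, 0.5]
```

The penalized logits then go through temperature scaling, truncation, and softmax as usual.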

Interactive: Decoding-Strategy Explorer

[Interactive widget: a bar chart of a fixed example next-token distribution for the context "The weather today is ___", with controls for strategy (greedy or sampling), temperature T, top-k, and top-p, plus a "Sample a token" button. The chart shows the distribution after temperature scaling and top-k / top-p truncation, renormalized to sum to 100%. Blue bars are candidates kept after filtering; grey bars are tokens zeroed out by top-k / top-p.]

Sampling repeatedly shows the stochasticity: low temperature with a small nucleus is almost deterministic and nearly always returns the same word, while high temperature with top-k and top-p switched off can produce surprising tokens.

Things to try: set T = 0.3 and watch the top bar tower over the rest. Raise T to 1.8 and watch the bars even out. Drop top-p to 0.5 and notice the nucleus shrink to just the leading tokens. Switch to Greedy and the candidate set collapses to a single token.
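The explorer's pipeline can be approximated offline with a short script. This is a sketch, not the page's actual code: the logits in `CANDIDATES` are made up (the real values are not given in the text), and applying top-p to the probabilities before the top-k renormalization is one possible design choice among several.

```python
import math
import random

# Hypothetical logits for the context "The weather today is ___".
# Invented for this sketch; the explorer's real values are not in the text.
CANDIDATES = {
    "sunny": 3.0, "cloudy": 2.5, "rainy": 2.2, "nice": 1.8, "cold": 1.5,
    "hot": 1.2, "windy": 0.8, "perfect": 0.3, "terrible": -0.2, "purple": -2.0,
}

def modified_distribution(logits, temperature=1.0, top_k=None, top_p=1.0):
    """Temperature-scale, truncate with top-k / top-p, and renormalize."""
    tokens = sorted(logits, key=logits.get, reverse=True)  # best first
    scaled = [logits[t] / temperature for t in tokens]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # top-k: keep at most k leading tokens (None means "off").
    n_keep = len(tokens) if top_k is None else min(top_k, len(tokens))
    # top-p: stop as soon as the cumulative probability reaches p.
    cumulative = 0.0
    for n, pr in enumerate(probs[:n_keep], start=1):
        cumulative += pr
        if cumulative >= top_p:
            n_keep = n
            break

    kept_total = sum(probs[:n_keep])
    return {t: (probs[i] / kept_total if i < n_keep else 0.0)
            for i, t in enumerate(tokens)}

def sample_token(dist, rng=random):
    """Draw one token according to the modified distribution."""
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]

def greedy_token(dist):
    """Greedy mode: the single most probable token."""
    return max(dist, key=dist.get)

for label, kwargs in [
    ("T=0.3", {"temperature": 0.3}),
    ("T=1.8", {"temperature": 1.8}),
    ("top-p=0.5", {"top_p": 0.5}),
]:
    dist = modified_distribution(CANDIDATES, **kwargs)
    kept = [t for t, p in dist.items() if p > 0]
    print(label, "kept:", len(kept), "top prob:", round(max(dist.values()), 2))
```

With these invented logits, top-p = 0.5 leaves only "sunny" and "cloudy" in the nucleus, and calling `sample_token` repeatedly on the T = 1.8 distribution returns a visibly wider mix of words than on the T = 0.3 one.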

Strategy Comparison

Strategy	How it picks	Determinism	Tends to be	Best for
Greedy	argmax of the distribution	Deterministic	Repetitive, bland; can loop	Short, factual answers
Beam search	Keeps top-b sequences by total probability	Deterministic	Coherent but generic/bland	Translation, summarization
Temperature sampling	Sample after scaling logits by 1/T	Stochastic	Tunable: safe ↔ wild	Creative writing, dialogue
Top-k	Sample from k highest-prob tokens	Stochastic	Diverse but cutoff is fixed	General sampling with a tail guard
Top-p (nucleus)	Sample from smallest set with cum. prob ≥ p	Stochastic	Adaptive, well-balanced	Default for open-ended generation

Common Pitfalls

Watch out for

• Greedy / beam producing repetitive loops on open-ended text
• Temperature too high → incoherent, off-topic rambling
• Temperature too low → dull, repetitive, "safe" output
• Forgetting to renormalize after top-k / top-p truncation
• Stacking aggressive top-k, top-p, and low T until only one token survives

Good defaults

• Factual / deterministic tasks: greedy or low temperature (T ≈ 0–0.3)
• Open-ended generation: sampling with T ≈ 0.7–1.0 and top-p ≈ 0.9
• Add a mild repetition penalty to discourage loops
• Tune one knob at a time so you can see its effect

Key Takeaways

Generation is autoregressive: predict the next token from the context, append it, and repeat.
The model outputs a probability distribution over the vocabulary via softmax: e^(zᵢ) / Σⱼ e^(zⱼ).
The decoding strategy — not the model — decides how a token is chosen from that distribution.
Greedy takes the argmax (fast, deterministic, repetitive); beam search keeps the top-b sequences (more coherent but often bland).
Temperature scales logits before softmax, softmax(z / T)ᵢ: T<1 sharpens, T>1 flattens, T→0 ≈ greedy.
Top-k keeps the k highest-probability tokens; top-p (nucleus) keeps the smallest set with cumulative probability ≥ p, then both renormalize.
A repetition penalty down-weights already-used tokens to fight degenerate loops.