Text Generation
Introduction
Text generation is the task of producing fluent, coherent text one token at a time. Modern language models (GPT-style decoders) generate autoregressively: they predict the next token from everything generated so far, append it to the context, and repeat. The model itself only ever does one thing — output a probability distribution over the vocabulary. How we turn that distribution into an actual chosen token is the job of a decoding strategy, and that choice has an enormous effect on whether the output feels robotic, creative, repetitive, or wildly incoherent.
Key insight: the model gives you a distribution. Decoding is the art of choosing from it. Two systems with the exact same model can produce dramatically different text depending purely on their decoding strategy.
Autoregressive Generation
Generation is a loop. Given a context (the prompt plus everything generated so far), the model produces scores for every token in its vocabulary. We pick one token, append it, and feed the longer sequence back in to predict the next token. This continues until an end-of-sequence token is produced or a length limit is hit.
1. Predict
Feed the context through the model to get a logit (raw score) for every token in the vocabulary.
2. Choose
Convert logits to probabilities with softmax, then apply a decoding strategy to select one token.
3. Append & repeat
Add the chosen token to the context and loop until an end-of-sequence token or length limit.
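The three steps above can be sketched as a short loop. This is a minimal illustration, not a real model: `toy_logits`, `VOCAB`, and `EOS_ID` are hypothetical stand-ins invented for the example, and the "choose" step here is plain sampling from the softmax distribution.

```python
import math
import random

# Hypothetical toy vocabulary; token id 0 plays the role of end-of-sequence.
VOCAB = ["<eos>", "the", "weather", "is", "sunny"]
EOS_ID = 0

def toy_logits(context):
    """Stand-in for a real model: favors the token id that follows the last one."""
    favored = (context[-1] + 1) % len(VOCAB)
    return [3.0 if i == favored else 0.0 for i in range(len(VOCAB))]

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def generate(prompt_ids, max_new_tokens=10, rng=None):
    rng = rng or random.Random(0)
    context = list(prompt_ids)
    for _ in range(max_new_tokens):
        probs = softmax(toy_logits(context))                        # 1. predict
        token = rng.choices(range(len(VOCAB)), weights=probs)[0]    # 2. choose
        context.append(token)                                       # 3. append
        if token == EOS_ID:                                         # stop condition
            break
    return context

print([VOCAB[t] for t in generate([1])])
```

Swapping the line marked "2. choose" for a different rule is all it takes to change the decoding strategy; the rest of the loop stays the same.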
The model converts logits z into a probability distribution using the softmax function. For token i out of a vocabulary of size V:
P(token i) = e^(zᵢ) / Σⱼ e^(zⱼ), where j runs over all V tokens
Here zᵢ is the logit for token i, e^(zᵢ) exponentiates it so everything is positive, and the denominator Σⱼ e^(zⱼ) sums over all tokens so the result is a valid probability distribution (every value in [0, 1], summing to 1).
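As a small worked example, softmax can be written in a few lines of plain Python. The three logits below are made-up values chosen for illustration. Subtracting the largest logit before exponentiating does not change the result, but it is a common way to avoid overflow when logits are large.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that are non-negative and sum to 1."""
    m = max(logits)                            # shift for numerical stability
    exps = [math.exp(z - m) for z in logits]   # e^(z_i), all positive
    total = sum(exps)                          # the denominator: sum over j of e^(z_j)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])  # [0.659, 0.242, 0.099]
```

The ordering of the logits is preserved, and the gap between the largest logit and the rest is exaggerated by the exponential.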
Decoding Strategies — the heart of generation
Once we have a probability distribution, how do we pick the next token? This is where decoding strategies come in. They trade off coherence (sticking with high-probability, "safe" tokens) against diversity (allowing lower-probability, more surprising tokens).
Greedy decoding
Always pick the single highest-probability token: argmaxᵢ P(token i). It is deterministic and fast, but tends to be repetitive and bland — and because it is locally greedy, it can miss a more probable sequence that requires a temporarily lower-probability token. It is prone to degenerate loops ("the the the").
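To see how greedy decoding can miss a more probable sequence, consider a tiny two-step example. The table `NEXT` is a hypothetical hand-made "model" (its probabilities are invented for illustration) that maps a prefix to a next-token distribution.

```python
# Hypothetical toy model: next-token probabilities for each prefix.
NEXT = {
    (): {"A": 0.6, "B": 0.4},
    ("A",): {"x": 0.5, "y": 0.5},
    ("B",): {"x": 0.9, "y": 0.1},
}

def greedy_decode(steps=2):
    """Pick the argmax token at every step and track the sequence probability."""
    prefix, prob = (), 1.0
    for _ in range(steps):
        dist = NEXT[prefix]
        token = max(dist, key=dist.get)  # argmax over the distribution
        prob *= dist[token]
        prefix += (token,)
    return prefix, prob

def exhaustive_best(steps=2):
    """Score every possible sequence (only feasible for toy problems)."""
    candidates = [((), 1.0)]
    for _ in range(steps):
        candidates = [
            (prefix + (tok,), prob * q)
            for prefix, prob in candidates
            for tok, q in NEXT[prefix].items()
        ]
    return max(candidates, key=lambda c: c[1])

print(greedy_decode())     # commits to "A" first, total probability about 0.30
print(exhaustive_best())   # ("B", "x") is better, total probability about 0.36
```

Greedy commits to "A" because 0.6 > 0.4, and never discovers that "B" leads to a much more confident second step.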
Beam search
Instead of committing to one token, beam search keeps the top b partial sequences ("beams") at each step, expanding each and again keeping the b best by total sequence probability. It explores more of the search space than greedy and produces more coherent output, which makes it popular for tasks with a "correct" answer like translation and summarization. The downside: it tends to produce bland, generic, repetitive text for open-ended generation, because high-probability sequences are often the most predictable ones.
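A minimal beam search sketch on a hand-made toy model might look like the following. The `NEXT` table and its probabilities are hypothetical, scores are accumulated as log-probabilities to avoid multiplying many small numbers, and the sketch deliberately omits end-of-sequence handling and length normalization, which practical implementations usually add.

```python
import math

# Hypothetical toy model: next-token probabilities for each prefix.
NEXT = {
    (): {"A": 0.6, "B": 0.4},
    ("A",): {"x": 0.5, "y": 0.5},
    ("B",): {"x": 0.9, "y": 0.1},
}

def beam_search(beam_width=2, steps=2):
    """Keep the beam_width best partial sequences by total log-probability."""
    beams = [((), 0.0)]  # (prefix, log-probability)
    for _ in range(steps):
        expanded = [
            (prefix + (tok,), score + math.log(p))
            for prefix, score in beams
            for tok, p in NEXT[prefix].items()
        ]
        expanded.sort(key=lambda c: c[1], reverse=True)
        beams = expanded[:beam_width]  # prune to the best b candidates
    return beams

for width in (1, 2):
    seq, score = beam_search(beam_width=width)[0]
    print(width, seq, round(math.exp(score), 2))
```

With a beam width of 1 the procedure reduces to greedy decoding and returns ("A", "x"); with a width of 2 the "B" prefix survives the first step and the higher-probability sequence ("B", "x") is found.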
Sampling with temperature
Rather than always taking the maximum, sampling draws a token randomly according to its probability. Temperature T reshapes the distribution by scaling the logits before softmax:
P(token i) = e^(zᵢ / T) / Σⱼ e^(zⱼ / T)
- T < 1: dividing by a number less than 1 magnifies the gaps between logits → distribution gets sharper (peaks exaggerated). Safer, more focused, more repetitive.
- T = 1: the original distribution, unchanged.
- T > 1: shrinks the gaps → distribution gets flatter (more uniform). More diverse, more random, more likely to go off the rails.
- T → 0: the highest-logit token takes essentially all of the probability mass → sampling becomes equivalent to greedy decoding.
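The effect of temperature is easy to check numerically. The sketch below divides each logit by T before applying softmax; the logits are made-up values, and T = 0 is rejected because the division is undefined there (in practice that case is usually handled by switching to greedy decoding).

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax of logits / T. T < 1 sharpens, T > 1 flattens."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; use argmax for T = 0")
    scaled = [z / temperature for z in logits]
    m = max(scaled)                            # shift for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

The top token's share of the probability mass shrinks as T grows, and approaches 1 as T approaches 0.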
Top-k sampling
Even after temperature scaling, the long tail of unlikely tokens can occasionally be sampled and derail the text. Top-k sampling restricts the candidate set to the k highest-probability tokens, zeroes out the rest, and renormalizes so the survivors sum to 1. A fixed k is blunt, though: when the model is very confident, even k=40 can include junk; when it is unsure, k may be too small.
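A minimal top-k filter, assuming the input is already a probability distribution (for example the output of softmax), can be sketched as follows. The example distribution is invented for illustration.

```python
def top_k_filter(probs, k):
    """Zero out everything except the k most probable tokens, then renormalize."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept = set(ranked[:k])
    total = sum(probs[i] for i in kept)  # mass of the survivors
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]

probs = [0.5, 0.3, 0.15, 0.05]
print(top_k_filter(probs, 2))  # roughly [0.625, 0.375, 0.0, 0.0]
```

The survivors keep their relative proportions (0.5 : 0.3), and the division by `total` is the renormalization step that makes them sum to 1 again.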
Top-p (nucleus) sampling
Top-p (also called nucleus sampling) adapts the candidate set to the model's confidence. Sort tokens by probability in descending order and keep the smallest set whose cumulative probability is ≥ p; everything else is dropped and the remainder renormalized. When the model is confident, that nucleus is tiny (maybe one or two tokens); when it is uncertain, the nucleus grows to admit many candidates. This dynamic sizing is why top-p is a widely used default for open-ended generation. Top-k and top-p are often combined, in which case a token must survive both filters.
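The adaptive behavior of the nucleus can be illustrated with two made-up distributions, one confident and one uncertain. This is a sketch that assumes the input is already a probability distribution; the function name `top_p_filter` is our own.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability is >= p."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = set(), 0.0
    for i in ranked:
        kept.add(i)
        cumulative += probs[i]
        if cumulative >= p:   # nucleus is complete
            break
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]

confident = [0.92, 0.05, 0.02, 0.01]
uncertain = [0.30, 0.25, 0.25, 0.20]
for dist in (confident, uncertain):
    nucleus = top_p_filter(dist, 0.9)
    print(sum(1 for q in nucleus if q > 0), "tokens kept")
```

With p = 0.9, the confident distribution keeps a single token while the uncertain one keeps all four, which is exactly the dynamic sizing that a fixed k cannot provide.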
Repetition penalty
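A complementary trick to fight loops: down-weight tokens that have already appeared by adjusting their logits before softmax, either by subtracting a penalty or by dividing by one. Division only lowers a positive logit; a negative logit has to be multiplied by the penalty instead, otherwise it moves toward zero and the token becomes more likely. This discourages the model from repeating the same words and phrases, and is frequently applied alongside any of the strategies above.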
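One common multiplicative convention can be sketched as below. The function name, the default penalty of 1.2, and the example logits are assumptions made for illustration; positive logits are divided by the penalty and negative ones multiplied by it, so that a penalized token always becomes less likely.

```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    """Return a copy of logits with already-generated tokens down-weighted."""
    if penalty <= 0:
        raise ValueError("penalty must be positive")
    adjusted = list(logits)
    for i in set(generated_ids):
        if adjusted[i] > 0:
            adjusted[i] /= penalty   # shrink a positive score
        else:
            adjusted[i] *= penalty   # push a negative score further down
    return adjusted

logits = [2.0, -1.0, 0.5]
print(apply_repetition_penalty(logits, generated_ids=[0, 1], penalty=2.0))
# [1.0, -2.0, 0.5]
```

A penalty of 1.0 leaves the logits unchanged, and token 2 is untouched here because it has not been generated yet.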
Interactive: Decoding-Strategy Explorer
Below is a fixed example next-token distribution for the context "The weather today is ___". Adjust temperature, top-k, and top-p, and toggle between greedy and sampling. The bar chart shows the modified distribution after applying temperature and the top-k / top-p truncation (renormalized). Grey bars are tokens that got zeroed out. Press Sample a token repeatedly to feel the stochasticity: low temperature with a small nucleus is almost deterministic; high temperature with everything "off" can surprise you.
- Strategy: greedy or sampling. Sampling draws randomly from the (modified) distribution.
- Temperature: T<1 sharpens (safer), T>1 flattens (more random), T→0 ≈ greedy.
- Top-k: keep only the k highest-probability tokens.
- Top-p: keep the smallest set whose cumulative prob ≥ p (nucleus).
- Sample a token: try pressing repeatedly. With high temperature or large top-k/top-p you will see different tokens; with low values it nearly always returns the same word.
Things to try: set T = 0.3 and watch the top bar tower over the rest. Crank T = 1.8 and watch the bars even out. Drop top-p to 0.5 and notice the nucleus shrink to just the leading tokens. Switch to Greedy and the candidate set collapses to a single token.
Strategy Comparison
| Strategy | How it picks | Determinism | Tends to be | Best for |
|---|---|---|---|---|
| Greedy | argmax of the distribution | Deterministic | Repetitive, bland; can loop | Short, factual answers |
| Beam search | Keeps top-b sequences by total probability | Deterministic | Coherent but generic/bland | Translation, summarization |
| Temperature sampling | Sample after scaling logits by 1/T | Stochastic | Tunable: safe ↔ wild | Creative writing, dialogue |
| Top-k | Sample from k highest-prob tokens | Stochastic | Diverse but cutoff is fixed | General sampling with a tail guard |
| Top-p (nucleus) | Sample from smallest set with cum. prob ≥ p | Stochastic | Adaptive, well-balanced | Default for open-ended generation |
Common Pitfalls
Watch out for
- Greedy / beam producing repetitive loops on open-ended text
- Temperature too high → incoherent, off-topic rambling
- Temperature too low → dull, repetitive, "safe" output
- Forgetting to renormalize after top-k / top-p truncation
- Stacking aggressive top-k, top-p, and low T until only one token survives
Good defaults
- Factual / deterministic tasks: greedy or low temperature (T ≈ 0–0.3)
- Open-ended generation: sampling with T ≈ 0.7–1.0 and top-p ≈ 0.9
- Add a mild repetition penalty to discourage loops
- Tune one knob at a time so you can see its effect
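Putting the knobs together, a single sampling step might be sketched as follows. The function name, its parameter defaults (taken from the open-ended suggestions above), and the order of operations (temperature, then top-k, then top-p) are assumptions for illustration; libraries differ in the details. Here `top_k=0` disables top-k, `top_p=1.0` disables top-p, and `temperature=0` falls back to greedy.

```python
import math
import random

def sample_next_token(logits, temperature=0.8, top_k=0, top_p=0.9, rng=None):
    """Return the index of the chosen token."""
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    if temperature == 0:                       # greedy fallback
        return max(range(len(logits)), key=lambda i: logits[i])
    rng = rng or random.Random()

    # Temperature-scaled softmax.
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-k: keep the k most probable tokens (if enabled).
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k > 0:
        ranked = ranked[:top_k]
    mass = sum(probs[i] for i in ranked)       # renormalize after top-k

    # Top-p: keep the smallest prefix whose renormalized mass reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i] / mass
        if cumulative >= top_p:
            break

    # random.choices normalizes the weights, which is the final renormalization.
    return rng.choices(kept, weights=[probs[i] for i in kept])[0]

logits = [2.0, 1.0, 0.1, -1.0]
print(sample_next_token(logits, rng=random.Random(0)))
```

Calling it with an aggressive combination such as `temperature=0.1, top_p=0.5` leaves a single surviving token, which reproduces the stacking pitfall listed above: the call is stochastic in name only.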
Key Takeaways
- Generation is autoregressive: predict the next token from the context, append it, and repeat.
- The model outputs a probability distribution over the vocabulary via softmax: e^(zᵢ) / Σⱼ e^(zⱼ).
- The decoding strategy — not the model — decides how a token is chosen from that distribution.
- Greedy takes the argmax (fast, deterministic, repetitive); beam search keeps the top-b sequences (more coherent but often bland).
- Temperature scales logits before softmax — softmax(zᵢ / T): T<1 sharpens, T>1 flattens, T→0 ≈ greedy.
- Top-k keeps the k highest-probability tokens; top-p (nucleus) keeps the smallest set with cumulative probability ≥ p, then both renormalize.
- A repetition penalty down-weights already-used tokens to fight degenerate loops.