Text Summarization
What Is Summarization?
Text summarization is the task of compressing a document into a shorter version that preserves its most important information. A good summary is concise (much shorter than the source), faithful (it does not state anything the source did not), and coherent (it reads as a sensible standalone text). Summarization powers news digests, search-result snippets, meeting notes, and the “TL;DR” that sits atop long articles.
The field splits along two axes. The first is how the summary is produced:
Extractive
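Select and stitch together sentences (or phrases) copied verbatim from the source. The summary is a subset of the original text. It cannot state anything the source did not say, although a sentence lifted out of its context (a dangling pronoun, a dropped qualifier) can still mislead. It can also read choppily, and it cannot rephrase or merge ideas.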
Abstractive
Generate new sentences that paraphrase the source — the way a person would. It reads fluently and can fuse information from many sentences, but it can also hallucinate: state details that were never in the source.
The second axis is the input: a single document (one article) versus multi-document (synthesizing across many sources, which adds redundancy removal and conflict handling). A related distinction is generic versus query-focused summarization, where the summary must answer a specific information need.
Approach 1: Extractive Summarization
Extractive summarization reframes the problem as sentence selection: score every sentence by how “central” it is, then keep the top few. The oldest and still-instructive scoring idea is word frequency: a sentence is important if it contains many of the document's frequently used content words. Concretely:
- Tokenize the document and count each content word (dropping stopwords like “the”).
- Normalize each count by the maximum count, giving every word a weight in [0, 1].
- Score each sentence as the sum of its word weights.
- Select the top-k sentences and emit them in their original order so the summary still flows.
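The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not the demo's actual code: the stopword list, the regex-based sentence splitter, and the names `frequency_summary`, `split_sentences`, and `content_words` are all assumptions made for the example.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "it", "that", "on", "for"}

def split_sentences(text):
    # Naive splitter: break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def content_words(sentence):
    # Lowercase word tokens with stopwords removed.
    return [w for w in re.findall(r"[a-z0-9']+", sentence.lower()) if w not in STOPWORDS]

def frequency_summary(text, k=2):
    sentences = split_sentences(text)
    # Step 1: count every content word in the document.
    counts = Counter(w for s in sentences for w in content_words(s))
    if not counts:
        return []
    # Step 2: normalize by the maximum count so the top word has weight 1.
    max_count = max(counts.values())
    weights = {w: c / max_count for w, c in counts.items()}
    # Step 3: a sentence's score is the sum of its word weights.
    scores = [sum(weights[w] for w in content_words(s)) for s in sentences]
    # Step 4: keep the k best sentences, then restore document order.
    top = sorted(range(len(sentences)), key=lambda i: (-scores[i], i))[:k]
    return [sentences[i] for i in sorted(top)]

text = "Cats chase mice. Dogs chase cats and mice. Birds sing."
print(frequency_summary(text, k=2))
# ['Cats chase mice.', 'Dogs chase cats and mice.']
```

Because the score is a plain sum, longer sentences tend to win. Dividing each score by the number of content words in the sentence is a common variation when that bias is unwanted.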
A graph-based refinement, TextRank, builds a graph where nodes are sentences and edge weights measure sentence similarity (e.g. shared words), then runs the PageRank algorithm: a sentence is important if it is similar to many other important sentences. It needs no training data and is still a strong baseline. The demo below uses the simpler frequency method so you can watch every score change as you edit.
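For comparison, the TextRank idea can be sketched with a plain power iteration. The sketch below is a simplification under stated assumptions: it uses Jaccard word overlap as the similarity measure (one of several possible choices), a fixed number of iterations instead of a convergence test, and hypothetical function names.

```python
import re

def tokenize(sentence):
    # Set of lowercase word tokens in the sentence.
    return set(re.findall(r"[a-z0-9']+", sentence.lower()))

def similarity(a, b):
    # Jaccard overlap of the two word sets (an illustrative choice).
    wa, wb = tokenize(a), tokenize(b)
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0

def textrank(sentences, damping=0.85, iterations=50):
    n = len(sentences)
    if n == 0:
        return []
    # Weighted, undirected graph: w[i][j] is the similarity of sentences i and j.
    w = [[0.0 if i == j else similarity(sentences[i], sentences[j])
          for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in w]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            # Each neighbor j passes on its score in proportion to edge weight.
            # Sentences with no neighbors pass nothing on (their mass is dropped).
            rank = sum(w[j][i] / out_sum[j] * scores[j]
                       for j in range(n) if out_sum[j] > 0)
            new_scores.append((1 - damping) / n + damping * rank)
        scores = new_scores
    return scores

sentences = [
    "the cat sat on the mat",
    "the cat lay on the mat",
    "stocks fell sharply today",
]
for sentence, score in zip(sentences, textrank(sentences)):
    print(f"{score:.3f}  {sentence}")
```

The two similar sentences reinforce each other and end up with equal, higher scores, while the unrelated third sentence keeps only the baseline `(1 - damping) / n`.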
Sample document for the demo:
Large language models are trained on enormous amounts of text gathered from the web.
During training, the model learns to predict the next token in a sequence, one token at a time.
This simple objective turns out to be remarkably powerful.
By predicting the next token over trillions of examples, the model absorbs grammar, facts, and reasoning patterns.
The resulting model can then be adapted to many tasks without retraining from scratch.
Summarization is one such task, where the model must compress a long document into a short, faithful overview.
Extractive summarization selects the most important sentences directly from the source text.
Abstractive summarization instead generates new sentences that paraphrase the key ideas.
Extractive methods cannot introduce factual errors because they only copy existing sentences.
Abstractive methods read more naturally but can hallucinate details that were never in the source.
Choosing between them depends on whether fluency or factual safety matters more for the application.
Sample summary (three sentences taken from the document above):
During training, the model learns to predict the next token in a sequence, one token at a time. By predicting the next token over trillions of examples, the model absorbs grammar, facts, and reasoning patterns. Summarization is one such task, where the model must compress a long document into a short, faithful overview.
Approach 2: Abstractive Summarization
Abstractive summarization treats the problem as sequence-to-sequence generation: read the document, then generate a summary token by token. The historical architecture was an encoder–decoder RNN with attention, later extended by the pointer-generator network, which could either generate a new word or copy one from the source to handle rare names and numbers. Modern systems fine-tune pretrained transformers such as BART, T5, or PEGASUS; the last was pretrained specifically by masking and regenerating whole “important” sentences, an objective tailored to summarization.
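The generate-or-copy choice of a pointer-generator can be illustrated with toy numbers. In the usual description of the mechanism, a gate `p_gen` mixes the decoder's vocabulary distribution with the attention weights over source tokens, so a word missing from the vocabulary can still be produced by copying. The sketch below shows a single decoding step; every probability in it is made up for illustration, and `final_distribution` is a hypothetical helper name.

```python
def final_distribution(p_gen, vocab_probs, attention, source_tokens):
    """Mix generation and copy probabilities for one decoding step."""
    # Generation part: scale the vocabulary distribution by p_gen.
    final = {word: p_gen * p for word, p in vocab_probs.items()}
    # Copy part: each source position contributes its attention weight,
    # scaled by (1 - p_gen), to the word found at that position.
    for token, weight in zip(source_tokens, attention):
        final[token] = final.get(token, 0.0) + (1 - p_gen) * weight
    return final

# Toy inputs (assumed values, not taken from any trained model).
vocab_probs = {"the": 0.5, "company": 0.3, "said": 0.2}   # "Acme" is out of vocabulary
source_tokens = ["Acme", "said", "Acme"]
attention = [0.5, 0.25, 0.25]

dist = final_distribution(0.5, vocab_probs, attention, source_tokens)
print(max(dist, key=dist.get))  # Acme
```

Here the rare name wins the step even though the decoder's vocabulary has no entry for it, which is the behavior that made the architecture useful for names and numbers.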
Today, instruction-tuned large language models summarize zero-shot from a prompt like “Summarize the following article in three sentences.” They are fluent and flexible, but the central risk remains faithfulness: a model can produce a confident, well-written summary that contradicts or invents facts. This is why faithfulness metrics and human review matter even when fluency looks perfect.
Why abstractive can be better
It compresses by rewriting — merging three rambling sentences into one crisp clause, resolving pronouns, and dropping redundancy. Extractive methods are stuck with whole sentences exactly as written.
Why it is riskier
Generation can hallucinate — assert a date, name, or causal claim absent from the source. Extractive output is automatically faithful at the sentence level; abstractive output must be checked.
How Do We Measure a Summary?
The standard automatic metric is ROUGE (Recall-Oriented Understudy for Gisting Evaluation), which compares the generated summary against one or more human reference summaries by counting overlapping units:
ROUGE-N
Overlap of n-grams. ROUGE-1 counts shared unigrams, ROUGE-2 shared bigrams — a proxy for “did the summary mention the right things?”
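A minimal sketch of ROUGE-N makes the counting concrete. It assumes whitespace tokenization and a single reference; published implementations typically add stemming and other normalization, so scores from this sketch will not match them exactly. The function names are illustrative.

```python
from collections import Counter

def ngrams(tokens, n):
    # Multiset of all contiguous n-grams in the token list.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, reference, n=1):
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    # Clipped overlap: an n-gram counts at most as often as it occurs in both.
    overlap = sum((cand & ref).values())
    recall = overlap / sum(ref.values()) if ref else 0.0
    precision = overlap / sum(cand.values()) if cand else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

reference = "the cat sat on the mat"
candidate = "the cat lay on the mat"
print(rouge_n(candidate, reference, n=1)["recall"])  # 5 of 6 unigrams match
print(rouge_n(candidate, reference, n=2)["recall"])  # 3 of 5 bigrams match
```

A single changed word costs one unigram but two bigrams, which is why ROUGE-2 is the stricter of the two.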
ROUGE-L
Longest common subsequence between summary and reference — rewards getting word order and longer matching spans right, not just isolated words.
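ROUGE-L can be sketched with the standard dynamic program for the longest common subsequence (LCS). As with any simplified version, this assumes whitespace tokenization and one reference, and it reports a plain F1 rather than the weighted F-measure some toolkits use; the function names are illustrative.

```python
def lcs_length(a, b):
    # Classic LCS dynamic program, keeping only one previous row.
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l(candidate, reference):
    cand = candidate.lower().split()
    ref = reference.lower().split()
    lcs = lcs_length(cand, ref)
    precision = lcs / len(cand) if cand else 0.0
    recall = lcs / len(ref) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

reference = "the cat sat on the mat"
print(rouge_l("the cat lay on the mat", reference)["recall"])  # LCS has 5 words
print(rouge_l("mat the on sat cat the", reference)["recall"])  # LCS has 3 words
```

The second candidate contains exactly the reference's words in reverse order. Its unigram overlap is perfect, yet its LCS is only three words long, which shows how ROUGE-L rewards word order.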
Beyond ROUGE
BERTScore compares embeddings (semantic, not exact-word, overlap), and faithfulness metrics (e.g. entailment- or QA-based) specifically check that claims are supported by the source.
ROUGE is the historical workhorse because it is cheap and correlates with human judgment on extractive content selection. Its blind spot is exactly abstractive quality: a fluent, faithful paraphrase that uses different words can score lower than a clumsy near-copy, and ROUGE cannot detect a hallucinated fact. That gap is why semantic and faithfulness metrics — plus human evaluation — are now reported alongside it.
Why Summarization Is Hard
Faithfulness & hallucination
The hardest problem in abstractive summarization: the output must contain only claims entailed by the source. Fluency makes errors harder to catch, not easier.
What counts as “important”?
Salience is subjective and audience-dependent. The right summary of a clinical trial differs for a doctor, a patient, and a regulator — yet most systems produce one summary.
Long inputs
Books, transcripts, and code bases exceed a model's context window. Hierarchical or chunk-then-combine strategies help but can lose global structure and drop cross-chunk connections.
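A chunk-then-combine strategy can be sketched as follows. The summarizer itself is passed in as `summarize_fn`, a hypothetical placeholder for whatever model or method is available; chunk sizes are counted in words here purely for simplicity, whereas real systems budget in model tokens.

```python
def chunk_words(text, chunk_size=200, overlap=20):
    # Split text into word windows that overlap, so a sentence cut at a
    # boundary still appears whole in at least one neighboring chunk.
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    words = text.split()
    if not words:
        return []
    step = chunk_size - overlap
    last_start = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + chunk_size]) for i in range(0, last_start, step)]

def summarize_long(text, summarize_fn, chunk_size=200, overlap=20):
    chunks = chunk_words(text, chunk_size, overlap)
    if len(chunks) <= 1:
        return summarize_fn(text)
    # Map: summarize each chunk independently.
    partials = [summarize_fn(chunk) for chunk in chunks]
    # Reduce: summarize the concatenated partial summaries.
    return summarize_fn(" ".join(partials))

# Stand-in summarizer for demonstration only: keep the first two words.
first_two_words = lambda t: " ".join(t.split()[:2])

text = " ".join(f"w{i}" for i in range(10))
print(chunk_words(text, chunk_size=4, overlap=1))
# ['w0 w1 w2 w3', 'w3 w4 w5 w6', 'w6 w7 w8 w9']
```

Each chunk is summarized without seeing the others, which is exactly where cross-chunk connections get lost. This sketch also performs only one combine step; if the joined partial summaries are still too long, the same procedure would have to be applied again.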
Redundancy & coherence
Picking the top-k sentences independently can select near-duplicates. Good systems add a redundancy penalty (e.g. Maximal Marginal Relevance) and worry about whether the kept sentences read coherently together.
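Maximal Marginal Relevance (MMR) can be sketched as a greedy loop: at each step, pick the sentence with the best trade-off between its own relevance and its similarity to what has already been chosen. In this illustration the relevance scores are supplied by the caller (for example, from a frequency or TextRank scorer), similarity is Jaccard word overlap, and the trade-off weight `lam` is an assumed parameter name.

```python
import re

def jaccard(a, b):
    # Word-set overlap between two sentences.
    wa = set(re.findall(r"[a-z0-9']+", a.lower()))
    wb = set(re.findall(r"[a-z0-9']+", b.lower()))
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0

def mmr_select(sentences, relevance, k=2, lam=0.7):
    selected = []
    candidates = list(range(len(sentences)))
    while candidates and len(selected) < k:
        def mmr_score(i):
            # Penalize similarity to the closest already-selected sentence.
            redundancy = max(
                (jaccard(sentences[i], sentences[j]) for j in selected),
                default=0.0,
            )
            return lam * relevance[i] - (1 - lam) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    # Emit in original document order.
    return [sentences[i] for i in sorted(selected)]

sentences = [
    "the model predicts the next token",
    "the model predicts the next token again",
    "summaries must stay faithful",
]
relevance = [0.9, 0.85, 0.5]  # made-up scores for illustration

print(mmr_select(sentences, relevance, k=2, lam=1.0))  # relevance only: the two near-duplicates
print(mmr_select(sentences, relevance, k=2, lam=0.5))  # with penalty: first and third sentence
```

With `lam=1.0` the loop reduces to plain top-k selection; lowering `lam` trades relevance for diversity.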
Key Takeaways
- Summarization compresses a document while staying concise, faithful, and coherent — the core tension is fluency versus factual safety.
- Extractive methods select sentences from the source (faithful at the sentence level, sometimes choppy); abstractive methods generate new text (fluent, but can hallucinate).
- Classic extractive scoring ranks sentences by content-word frequency; TextRank refines this by running PageRank over a sentence-similarity graph. Neither needs training data.
- Modern abstractive systems fine-tune transformers (BART, T5, PEGASUS) or prompt instruction-tuned LLMs zero-shot.
- ROUGE measures n-gram / longest-common-subsequence overlap with reference summaries, but it misses paraphrase quality and cannot detect hallucinations — so semantic and faithfulness metrics and human review are reported alongside it.
- The open problems are faithfulness, subjective salience, very long inputs, and redundancy — exactly where the simple frequency baseline in the demo above breaks down.