Part-of-Speech Tagging

What Is POS Tagging?

Part-of-speech (POS) tagging is the task of assigning a grammatical category — noun, verb, adjective, determiner, preposition, and so on — to every word in a sentence. It is one of the oldest and most fundamental problems in natural language processing, and a building block for parsing, named-entity recognition, information extraction, and machine translation.

Given the sentence “The old dogs run”, a tagger should output:

The/DET   old/ADJ   dogs/NOUN   run/VERB

This looks easy until you notice that the same word can take different tags in different sentences. POS tagging is therefore not a simple dictionary lookup — it is a sequence labeling problem where the right answer depends on context.

The Penn Treebank Tagset

A tagset is the fixed inventory of labels a tagger can produce. The most widely used English tagset is the Penn Treebank (PTB) tagset, which has about 36 word tags (plus punctuation tags). It is more fine-grained than the coarse school-grammar categories: instead of a single “verb” it distinguishes tense and form, and instead of a single “noun” it distinguishes singular, plural, and proper nouns.

PTB TagMeaningExample
DTDeterminerthe, a, some
NN / NNSNoun, singular / pluraldog / dogs
NNPProper nounLondon
VB / VBD / VBZVerb: base / past / 3rd-sing.run / ran / runs
JJAdjectiveold, green
RBAdverbquickly
INPreposition / subordinatorin, of, that
PRPPersonal pronounshe, it

The demo below uses a deliberately tiny tagset — DET, NOUN, VERB, ADJ — so the trellis stays readable. The Universal Dependencies project standardizes a similar coarse set of 17 tags across many languages.

Why It Is Ambiguous

The central difficulty is lexical ambiguity: a single word form can belong to several parts of speech, and only the surrounding context decides which one is meant. Classic examples:

book a flight”

Here book is a VERB (to reserve).

“read a book

Here book is a NOUN (an object).

Many of the most common English words are ambiguous this way: back, that, like, round, well. A tagger cannot resolve them word-by-word in isolation; it has to reason about the most plausible sequence of tags for the whole sentence. This is what motivates probabilistic sequence models.

The Hidden Markov Model Approach

A Hidden Markov Model (HMM) treats the tags as hidden states that generate the observed words. We never see the tags directly — only the words — and we want to infer the most likely hidden state sequence. An HMM tagger has three ingredients:

States = tags

Each hidden state is a POS tag (DET, NOUN, VERB, ADJ, …). A tagging of an n-word sentence is a path t₁ t₂ … tₙ through these states.

Transition probabilities P(tagₜ | tagₜ₋₁)

How likely one tag is to follow another. The Markov assumption is that the next tag depends only on the current tag, not the entire history. For example P(NOUN | DET) is high (determiners are usually followed by nouns), while P(VERB | DET) is low.

Emission probabilities P(word | tag)

How likely a tag is to “emit” a given word. For example P(“dogs” | NOUN) is moderate, while P(“dogs” | DET) is essentially zero.

The probability the model assigns to a full word/tag sequence factorizes into a product of transitions and emissions (with a start probability for the first tag):

P(w₁…wₙ, t₁…tₙ) = P(t₁)·P(w₁|t₁) × ∏ₜ₌₂ⁿ P(tₜ | tₜ₋₁) · P(wₜ | tₜ)

Tagging then means finding the tag sequence that maximizes this quantity: t̂ = argmaxt₁…tₙ P(t₁…tₙ | w₁…wₙ). The number of possible tag sequences is (number of tags)n, which is exponential — far too many to enumerate. We need a smarter search.

The Viterbi Algorithm

The Viterbi algorithm finds the single most likely state sequence using dynamic programming over a trellis. The trellis is a grid: one column per word, one row per tag. We sweep left to right, and at each cell we keep only the best way to arrive there.

Let δₜ(s) be the probability of the best tag path that ends in tag s at position t. The key recurrence is:

δₜ(s) = [ maxₛ δₜ₋₁(s') · P(s | s') ] · P(wₜ | s)

The max over the previous tag s' is the crucial step: instead of tracking every path, each cell records only the best incoming path and a backpointer to the previous tag that produced it. Because each cell's best path is built from the previous column's best paths, we never re-explore the exponential space.

The three phases

  1. Initialization (first column): δ₁(s) = P(s) · P(w₁ | s).
  2. Recursion (each later column): apply the recurrence above, storing a backpointer to the argmax previous tag.
  3. Termination & backtrace: take the highest-scoring cell in the last column, then follow backpointers right-to-left to read off the best tag sequence.

The cost is O(n · T²) for n words and T tags — linear in sentence length instead of exponential.

Why log-space? Multiplying many small probabilities underflows to zero in floating point. Since log is monotonic, we add log-probabilities instead of multiplying probabilities — the argmax is unchanged, and the numbers stay well-behaved. The demo below works entirely in log-space, so the scores shown are negative (log of a number < 1).

Interactive Viterbi Trellis

This widget runs the real Viterbi algorithm in your browser over a tiny hand-built HMM with the tags DET, NOUN, VERB, and ADJ. Pick a sentence and watch the trellis fill in. Each cell shows log δₜ(s), the orange edges trace the recovered best path, and the tagging is read off below.

Sentence:
DETNOUNVERBADJbookt = 0at = 1flightt = 2-3.79-3.10-2.90-4.89-4.51-6.59-5.31-6.41-9.00-6.82-8.80-7.37
Best tagging:bookVERBaDETflightNOUNlog P = -6.822

What you are seeing: each circle is a trellis cell holding the best log-probability of any tag path that ends in that tag at that word (Viterbi's δ value). The orange path is the single most likely tag sequence, recovered by following the backpointers from right to left.

Try this: compare “book a flight” with “read a book”. The word book is tagged VERB when it starts the sentence, but NOUN after the determiner a — even though P(book | VERB) alone is higher. The transition probabilities (here the strong DET → NOUN preference) override the emission and disambiguate the word from context.

From HMMs to Modern Neural Taggers

HMM taggers were the workhorses of the 1990s and 2000s and are still a great way to understand sequence labeling. But they have limits: the Markov assumption only looks back one tag, emission probabilities treat each word as an atomic symbol (no notion that happily and quickly are similar), and unknown words are handled crudely. Several families of models improved on them:

ModelIdeaTrade-off
HMMGenerative; transitions + emissions, decoded by ViterbiSimple, fast, interpretable; limited features
CRFDiscriminative; scores the whole sequence with rich featuresHigher accuracy; still uses Viterbi to decode
BiLSTM (+CRF)Reads the sentence both directions with learned word/char embeddingsCaptures long context; needs training data and compute
Transformer (BERT)Self-attention over contextual subword embeddings, fine-tunedState of the art (~97-98%); large pretrained model

Modern transformer taggers fine-tune a pretrained encoder such as BERT and put a small classification head on top of each token's contextual embedding. Because the embedding already encodes the surrounding sentence, the tagger usually just predicts each token's tag directly — often without an explicit Viterbi decode. These models reach roughly 97-98% per-token accuracy on English news text, but it is worth remembering that even simple HMM taggers reach the low-to-mid 90s, and a baseline that just assigns each word its most frequent tag already gets around 90%. The hard cases are exactly the ambiguous ones the Viterbi demo illustrates.

Key Takeaways

  • POS tagging assigns a grammatical category (noun, verb, adjective, …) to every word; it is sequence labeling, not dictionary lookup.
  • The Penn Treebank tagset (~36 tags) is the standard fine-grained English inventory; coarse tagsets like Universal Dependencies use ~17 tags.
  • Words are ambiguous out of context (“book a flight” vs “read a book”), so taggers must reason over the whole tag sequence.
  • An HMM models tags as hidden states with transition probabilities P(tagₜ | tagₜ₋₁) and emission probabilities P(word | tag).
  • The Viterbi algorithm is dynamic programming over the trellis: it keeps the best path into each cell plus a backpointer, running in O(n·T²) instead of exponential time.
  • Work in log-space (add log-probabilities) to avoid numerical underflow; the argmax is unchanged.
  • Modern taggers (CRFs, BiLSTMs, transformers like BERT) reach ~97-98% accuracy, but the HMM/Viterbi picture is still the clearest way to understand how context disambiguates tags.