Tokenization & Text Preprocessing

Introduction

Text preprocessing is the crucial first step in NLP pipelines. It transforms raw text into a format suitable for computational analysis. Tokenization breaks text into meaningful units, while preprocessing steps clean and normalize these tokens.

Key Preprocessing Steps

Tokenization

Breaking text into words, subwords, or characters. The foundation for all NLP tasks.
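
For example, the same sentence can be split at different granularities. A minimal illustration using only Python's built-in string operations:

    text = "Tokenizers split text."
    print(text.split())        # word-level: ['Tokenizers', 'split', 'text.']
    print(list(text[:10]))     # character-level: ['T', 'o', 'k', 'e', 'n', 'i', 'z', 'e', 'r', 's']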

Normalization

Standardizing text: lowercasing, removing punctuation, handling numbers and special characters.
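
A minimal normalization sketch; the specific rules (such as replacing digits with a <num> placeholder) are illustrative choices that depend on the task:

    import re

    def normalize(text):
        """Lowercase, replace digits, strip punctuation, and collapse whitespace."""
        text = text.lower()
        text = re.sub(r"\d+", "<num>", text)       # handle numbers
        text = re.sub(r"[^\w\s<>]", " ", text)     # drop punctuation and special characters
        return re.sub(r"\s+", " ", text).strip()   # collapse repeated whitespace

    print(normalize("The price rose 15% in Q3!"))  # "the price rose <num> in q<num>"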

Stemming/Lemmatization

Reducing words to a base form so that related words group together. Stemming strips suffixes with heuristic rules, while lemmatization maps words to dictionary forms using vocabulary and morphology.
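
A small sketch using NLTK, assuming the nltk package is installed and the WordNet data has been fetched with nltk.download("wordnet"):

    from nltk.stem import PorterStemmer, WordNetLemmatizer

    stemmer = PorterStemmer()          # heuristic suffix stripping
    lemmatizer = WordNetLemmatizer()   # dictionary-based lookup

    words = ["running", "studies", "better"]
    print([stemmer.stem(w) for w in words])                   # e.g. ['run', 'studi', 'better']
    print([lemmatizer.lemmatize(w, pos="v") for w in words])  # e.g. ['run', 'study', 'better']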

Interactive Text Preprocessor

(Interactive demo: enter input text and the tool applies whitespace tokenization, then reports total tokens, unique tokens, a word count, and a token frequency chart.)
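
The demo's behavior can be reproduced in a few lines; the "Words" count below simply counts tokens containing a letter, an assumption about what the original tool measured:

    from collections import Counter

    def whitespace_tokenize(text):
        """Split on whitespace characters, the simplest possible tokenizer."""
        return text.split()

    tokens = whitespace_tokenize("the cat sat on the mat, the end.")
    freq = Counter(tokens)

    print("Total tokens:", len(tokens))   # 8
    print("Unique tokens:", len(freq))    # 6
    print("Words:", sum(1 for t in tokens if any(c.isalpha() for c in t)))
    print("Most frequent:", freq.most_common(3))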

Tokenization Methods

Word-Level Tokenization

  • Whitespace: Simple split on spaces
  • Punctuation-aware: Handles punctuation as separate tokens
  • Regex-based: Custom patterns for complex rules (see the sketch after this list)
  • Language-specific: Handles contractions, compounds
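
A minimal sketch of the punctuation-aware and regex-based variants; the patterns are illustrative rather than a complete rule set:

    import re

    def punct_aware_tokenize(text):
        """Treat punctuation marks as separate tokens instead of gluing them to words."""
        return re.findall(r"\w+|[^\w\s]", text)

    def regex_tokenize(text):
        """Custom pattern: keep contractions and hyphenated compounds as single tokens."""
        return re.findall(r"\w+(?:['-]\w+)*|[^\w\s]", text)

    print(punct_aware_tokenize("Don't panic, Dr. Smith!"))
    # ['Don', "'", 't', 'panic', ',', 'Dr', '.', 'Smith', '!']
    print(regex_tokenize("Don't panic, Dr. Smith!"))
    # ["Don't", 'panic', ',', 'Dr', '.', 'Smith', '!']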

Subword Tokenization

  • BPE (Byte Pair Encoding): Used in GPT models (see the toy sketch after this list)
  • WordPiece: Used in BERT
  • SentencePiece: Language-agnostic
  • Character-level: Ultimate granularity
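
A toy sketch of the BPE training loop mentioned above. The corpus, word frequencies, and number of merges are invented for illustration; production tokenizers (such as GPT-2's) operate on bytes and add many practical details:

    import re
    from collections import Counter

    def pair_counts(vocab):
        """Count adjacent symbol pairs across all words, weighted by word frequency."""
        pairs = Counter()
        for word, freq in vocab.items():
            symbols = word.split()
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        return pairs

    def merge_pair(pair, vocab):
        """Replace every occurrence of the pair with a single merged symbol."""
        pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
        return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

    # Each word is a space-separated character sequence plus an end-of-word marker.
    vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
             "n e w e s t </w>": 6, "w i d e s t </w>": 3}

    for step in range(10):                 # 10 merges, an arbitrary choice here
        pairs = pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent adjacent pair
        vocab = merge_pair(best, vocab)
        print(step, best)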

Common Preprocessing Challenges

Language-Specific Issues

Different languages require different approaches: Chinese and Japanese are written without spaces between words, German forms long compound words, and Arabic is written right-to-left with complex morphology.

Information Loss

Aggressive preprocessing can remove important information: lowercasing and splitting "New York" into ["new", "york"] loses the named entity, and case can also signal proper nouns or sentence boundaries.

Domain-Specific Text

Social media (hashtags, @mentions), code (camelCase, snake_case), and medical text (abbreviations) all require custom handling.
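
For instance, a couple of illustrative heuristics for code identifiers and social-media text (simple sketches, not production rules):

    import re

    def split_camel_case(identifier):
        """Split camelCase / PascalCase identifiers into their component words."""
        return re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", identifier)

    def tokenize_social(text):
        """Keep hashtags and @mentions intact as single tokens."""
        return re.findall(r"[#@]\w+|\w+", text)

    print(split_camel_case("parseHTTPResponse"))  # ['parse', 'HTTP', 'Response']
    print(tokenize_social("Great #NLP thread, thanks @prof_smith!"))
    # ['Great', '#NLP', 'thread', 'thanks', '@prof_smith']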

Modern Tokenization in LLMs

Modern language models use sophisticated tokenization methods that balance vocabulary size with coverage:

GPT/ChatGPT

Byte-level BPE; GPT-2 uses a ~50k-token vocabulary, while newer OpenAI models use larger ones (~100k or more)

BERT

WordPiece with ~30k tokens

T5/mT5

SentencePiece unigram model
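
A quick way to inspect these vocabularies is the Hugging Face transformers library (assuming it is installed and the model files can be downloaded):

    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("bert-base-uncased")
    print(tok.tokenize("Tokenization handles unseen words gracefully."))
    # WordPiece output; '##' marks pieces that continue a word
    print(tok.vocab_size)  # roughly 30k for bert-base-uncased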

Key Takeaways

  • Text preprocessing is essential for converting raw text into a model-ready format
  • Tokenization strategy significantly impacts model performance and vocabulary size
  • Different tasks require different preprocessing: translation vs. sentiment analysis
  • Modern subword tokenization handles out-of-vocabulary words better than word-level
  • Preprocessing choices should preserve information relevant to your task
  • Always consider the trade-offs between normalization and information loss