Tokenization & Text Preprocessing

Introduction

Text preprocessing is the crucial first step in NLP pipelines. It transforms raw text into a format suitable for computational analysis. Tokenization breaks text into meaningful units, while preprocessing steps clean and normalize these tokens.

Key Preprocessing Steps

Tokenization

Breaking text into words, subwords, or characters. The foundation for all NLP tasks.
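
For example, the same sentence can be split at different granularities. A minimal illustration using only Python's built-in string operations:

    text = "Tokenizers split text."
    print(text.split())        # word-level: ['Tokenizers', 'split', 'text.']
    print(list(text[:10]))     # character-level: ['T', 'o', 'k', 'e', 'n', 'i', 'z', 'e', 'r', 's']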

Normalization

Standardizing text: lowercasing, removing punctuation, handling numbers and special characters.
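
A minimal normalization sketch; the specific rules (such as replacing digits with a <num> placeholder) are illustrative choices that depend on the task:

    import re

    def normalize(text):
        """Lowercase, replace digits, strip punctuation, and collapse whitespace."""
        text = text.lower()
        text = re.sub(r"\d+", "<num>", text)       # handle numbers
        text = re.sub(r"[^\w\s<>]", " ", text)     # drop punctuation and special characters
        return re.sub(r"\s+", " ", text).strip()   # collapse repeated whitespace

    print(normalize("The price rose 15% in Q3!"))  # "the price rose <num> in q<num>"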

Stemming/Lemmatization

Reducing words to a base form so that related words group together. Stemming strips suffixes with heuristic rules, while lemmatization maps words to dictionary forms using vocabulary and morphology.
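
A small sketch using NLTK, assuming the nltk package is installed and the WordNet data has been fetched with nltk.download("wordnet"):

    from nltk.stem import PorterStemmer, WordNetLemmatizer

    stemmer = PorterStemmer()          # heuristic suffix stripping
    lemmatizer = WordNetLemmatizer()   # dictionary-based lookup

    words = ["running", "studies", "better"]
    print([stemmer.stem(w) for w in words])                   # e.g. ['run', 'studi', 'better']
    print([lemmatizer.lemmatize(w, pos="v") for w in words])  # e.g. ['run', 'study', 'better']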

Interactive Text Preprocessor

(Interactive demo: enter input text and the tool applies whitespace tokenization, then reports total tokens, unique tokens, a word count, and a token frequency chart.)
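
The demo's behavior can be reproduced in a few lines; the "Words" count below simply counts tokens containing a letter, an assumption about what the original tool measured:

    from collections import Counter

    def whitespace_tokenize(text):
        """Split on whitespace characters, the simplest possible tokenizer."""
        return text.split()

    tokens = whitespace_tokenize("the cat sat on the mat, the end.")
    freq = Counter(tokens)

    print("Total tokens:", len(tokens))   # 8
    print("Unique tokens:", len(freq))    # 6
    print("Words:", sum(1 for t in tokens if any(c.isalpha() for c in t)))
    print("Most frequent:", freq.most_common(3))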

Tokenization Methods

Word-Level Tokenization

  • Whitespace: Simple split on spaces
  • Punctuation-aware: Handles punctuation as separate tokens
  • Regex-based: Custom patterns for complex rules (see the sketch after this list)
  • Language-specific: Handles contractions, compounds
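
A minimal sketch of the punctuation-aware and regex-based variants; the patterns are illustrative rather than a complete rule set:

    import re

    def punct_aware_tokenize(text):
        """Treat punctuation marks as separate tokens instead of gluing them to words."""
        return re.findall(r"\w+|[^\w\s]", text)

    def regex_tokenize(text):
        """Custom pattern: keep contractions and hyphenated compounds as single tokens."""
        return re.findall(r"\w+(?:['-]\w+)*|[^\w\s]", text)

    print(punct_aware_tokenize("Don't panic, Dr. Smith!"))
    # ['Don', "'", 't', 'panic', ',', 'Dr', '.', 'Smith', '!']
    print(regex_tokenize("Don't panic, Dr. Smith!"))
    # ["Don't", 'panic', ',', 'Dr', '.', 'Smith', '!']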

Subword Tokenization

  • BPE (Byte Pair Encoding): Used in GPT models (see the toy sketch after this list)
  • WordPiece: Used in BERT
  • SentencePiece: Language-agnostic
  • Character-level: Ultimate granularity
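
A toy sketch of the BPE training loop mentioned above. The corpus, word frequencies, and number of merges are invented for illustration; production tokenizers (such as GPT-2's) operate on bytes and add many practical details:

    import re
    from collections import Counter

    def pair_counts(vocab):
        """Count adjacent symbol pairs across all words, weighted by word frequency."""
        pairs = Counter()
        for word, freq in vocab.items():
            symbols = word.split()
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        return pairs

    def merge_pair(pair, vocab):
        """Replace every occurrence of the pair with a single merged symbol."""
        pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
        return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

    # Each word is a space-separated character sequence plus an end-of-word marker.
    vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
             "n e w e s t </w>": 6, "w i d e s t </w>": 3}

    for step in range(10):                 # 10 merges, an arbitrary choice here
        pairs = pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent adjacent pair
        vocab = merge_pair(best, vocab)
        print(step, best)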

Common Preprocessing Challenges

Language-Specific Issues

Different languages require different approaches: Chinese and Japanese are written without spaces between words, German forms long compound words, and Arabic is written right-to-left with complex morphology.

Information Loss

Aggressive preprocessing can remove important information: lowercasing and splitting "New York" into ["new", "york"] loses the named entity, and case can also signal proper nouns or sentence boundaries.

Domain-Specific Text

Social media (hashtags, @mentions), code (camelCase, snake_case), and medical text (abbreviations) all require custom handling.
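
For instance, a couple of illustrative heuristics for code identifiers and social-media text (simple sketches, not production rules):

    import re

    def split_camel_case(identifier):
        """Split camelCase / PascalCase identifiers into their component words."""
        return re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", identifier)

    def tokenize_social(text):
        """Keep hashtags and @mentions intact as single tokens."""
        return re.findall(r"[#@]\w+|\w+", text)

    print(split_camel_case("parseHTTPResponse"))  # ['parse', 'HTTP', 'Response']
    print(tokenize_social("Great #NLP thread, thanks @prof_smith!"))
    # ['Great', '#NLP', 'thread', 'thanks', '@prof_smith']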

Modern Tokenization in LLMs

Modern language models use sophisticated tokenization methods that balance vocabulary size with coverage:

GPT/ChatGPT

Byte-level BPE; GPT-2 uses a ~50k-token vocabulary, while newer OpenAI models use larger ones (~100k or more)

BERT

WordPiece with ~30k tokens

T5/mT5

SentencePiece unigram model
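
A quick way to inspect these vocabularies is the Hugging Face transformers library (assuming it is installed and the model files can be downloaded):

    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("bert-base-uncased")
    print(tok.tokenize("Tokenization handles unseen words gracefully."))
    # WordPiece output; '##' marks pieces that continue a word
    print(tok.vocab_size)  # roughly 30k for bert-base-uncased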

Key Takeaways

  • Text preprocessing is essential for converting raw text into a model-ready format
  • Tokenization strategy significantly impacts model performance and vocabulary size
  • Different tasks require different preprocessing: translation vs. sentiment analysis
  • Modern subword tokenization handles out-of-vocabulary words better than word-level
  • Preprocessing choices should preserve information relevant to your task
  • Always consider the trade-offs between normalization and information loss