by Michele Laurelli
The process of breaking text into smaller units (tokens), such as words, subwords, or characters, for downstream processing.
Tokenization is the first step in most NLP pipelines. Methods include word-level, character-level, and subword tokenization (e.g., BPE, WordPiece); subword methods balance vocabulary size against representation quality.
BPE (Byte-Pair Encoding) in GPT models
WordPiece in BERT
SentencePiece for multilingual models
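The subword idea can be made concrete with a minimal sketch of the BPE merge-learning loop: start from characters, repeatedly count adjacent symbol pairs across the corpus, and merge the most frequent pair into a new vocabulary symbol. The toy corpus and merge count below are illustrative, and the simple string replace is a simplification of production implementations.

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for word, freq in words.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, words):
    """Merge the chosen pair into a single symbol in every word.

    Note: plain string replace is fine for this demo but can match across
    symbol boundaries in edge cases; real implementations match whole symbols.
    """
    a, b = pair
    return {word.replace(f"{a} {b}", f"{a}{b}"): freq for word, freq in words.items()}

def learn_bpe(corpus, num_merges):
    """Learn BPE merge rules from a whitespace-split corpus."""
    # Represent each word as space-separated characters plus an end-of-word marker.
    words = Counter(" ".join(w) + " </w>" for w in corpus.split())
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        words = merge_pair(best, words)
        merges.append(best)
    return merges

merges = learn_bpe("low lower lowest low low", 4)
# First merges capture the frequent stem shared by "low", "lower", "lowest".
```

A learned merge list like this is exactly what a trained BPE tokenizer applies, in order, to segment new text into subwords.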