AI Blog

by Michele Laurelli

Tokenization

/ˌtoʊkənaɪˈzeɪʃən/
Technique
Definition

The process of breaking text into smaller units (tokens) like words, subwords, or characters for processing.

Tokenization is typically the first step in an NLP pipeline. Common methods include word-level, character-level, and subword tokenization (e.g., BPE, WordPiece); subword methods balance vocabulary size against representation quality, keeping frequent words whole while splitting rare words into reusable pieces.
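As an illustration of the subword idea, here is a minimal sketch of BPE (byte-pair encoding): it repeatedly merges the most frequent adjacent symbol pair in a small corpus, then applies the learned merges to segment new words. This is a simplified teaching version, not the exact implementation used in GPT models; the function names and the toy corpus are illustrative.

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    # Represent each word as a tuple of character symbols.
    vocab = Counter()
    for w in words:
        vocab[tuple(w)] += 1
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Replace every occurrence of the best pair with the merged symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

def bpe_tokenize(word, merges):
    # Apply the learned merges, in order, to segment a new word.
    symbols = list(word)
    for a, b in merges:
        i, out = 0, []
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

merges = learn_bpe_merges(["low", "low", "lower", "lowest"], 2)
print(bpe_tokenize("lowest", merges))  # a frequent stem stays whole: ['low', 'e', 's', 't']
```

Note how the frequent stem "low" becomes a single token after two merges, while the rarer suffix stays split into characters, which is exactly the vocabulary-size/representation trade-off described above.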

Examples

1. BPE in GPT models
2. WordPiece in BERT
3. SentencePiece for multilingual models

Michele Laurelli - AI Research & Engineering