by Michele Laurelli
The process of breaking text into smaller units (tokens), such as words, subwords, or characters, for downstream processing.
Tokenization is the first step in most NLP pipelines. Methods include word-level, character-level, and subword tokenization (e.g., BPE, WordPiece); subword methods balance vocabulary size against representation quality.
BPE (Byte-Pair Encoding) in GPT models
WordPiece in BERT
SentencePiece for multilingual models
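The subword idea can be made concrete with a minimal sketch of the BPE merge-learning loop: start from characters, repeatedly count adjacent symbol pairs across the corpus, and merge the most frequent pair into a new vocabulary symbol. The toy corpus and merge count below are illustrative, and the simple string replace is a simplification of production implementations.

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for word, freq in words.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, words):
    """Merge the chosen pair into a single symbol in every word.

    Note: plain string replace is fine for this demo but can match across
    symbol boundaries in edge cases; real implementations match whole symbols.
    """
    a, b = pair
    return {word.replace(f"{a} {b}", f"{a}{b}"): freq for word, freq in words.items()}

def learn_bpe(corpus, num_merges):
    """Learn BPE merge rules from a whitespace-split corpus."""
    # Represent each word as space-separated characters plus an end-of-word marker.
    words = Counter(" ".join(w) + " </w>" for w in corpus.split())
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        words = merge_pair(best, words)
        merges.append(best)
    return merges

merges = learn_bpe("low lower lowest low low", 4)
# First merges capture the frequent stem shared by "low", "lower", "lowest".
```

A learned merge list like this is exactly what a trained BPE tokenizer applies, in order, to segment new text into subwords.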