Skip to content

#preprocessing

1 approved public terms with this tag.

Tokenization

/ˌtoʊkənɪˈzeɪʃən/noun
AI & Technology

The process of converting raw text into discrete units called tokens that a language model can process. Tokens are typically subword units — common words become single tokens while rare words split into multiple tokens. All LLM pricing and context limits are measured in tokens, not characters or words.

The word "unbelievable" tokenized into three pieces: "un", "believ", "able".