RM Full Stack & AI Engineer · All guides · Roadmaps
AI & ML · guide

Tokens Explained (LLMs)

Tokens are the fundamental units of text that large language models read, process, and generate. Understanding tokens is essential for working effectively with any LLM API, controlling costs, and reasoning about model behavior and limitations.

What Is a Token?

A token is a chunk of text that a language model treats as a single unit; it is neither always a word nor always a character. Common words like 'cat' are typically one token, while longer or rarer words like 'tokenization' may be split into multiple tokens such as 'token' and 'ization'. Punctuation, whitespace, and special characters also consume tokens, although many tokenizers fold a leading space into the token of the word that follows it. As a rule of thumb for English text, one token corresponds to roughly 3–4 characters, or about 0.75 words.
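The rule of thumb above can be turned into a quick estimator. This is a minimal sketch, not a tokenizer: the function names and the default of 4 characters per token are illustrative assumptions, and the result is only a rough guide for English text.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough English-only estimate based on characters per token."""
    if not text:
        return 0
    return math.ceil(len(text) / chars_per_token)


def estimate_tokens_from_words(text: str, words_per_token: float = 0.75) -> int:
    """Alternative rough estimate based on about 0.75 words per token."""
    word_count = len(text.split())
    return math.ceil(word_count / words_per_token)


sample = "Tokens are the fundamental units of text."
print("by characters:", estimate_tokens(sample))
print("by words:", estimate_tokens_from_words(sample))
```

The two estimates usually disagree slightly, which is a reminder that neither replaces counting with the model's real tokenizer.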

How Tokenization Works

Before text enters a model, a tokenizer converts raw strings into sequences of integer IDs using a fixed vocabulary that is learned from a text corpus before the model itself is trained. One of the most common algorithms is Byte-Pair Encoding (BPE), which starts from individual characters or bytes and iteratively merges the most frequent adjacent pair of symbols into a new subword unit, balancing vocabulary size against coverage. Each model family (GPT, Llama, Gemini, etc.) ships its own tokenizer and vocabulary, so the same string can produce different token counts across models. You can inspect tokenization using tools like OpenAI's web-based Tokenizer tool or the Hugging Face tokenizers library.
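To make the merge loop concrete, here is a toy character-level BPE in plain Python. It is a simplified sketch for illustration only: production tokenizers typically work on bytes, pre-split text with regular expressions, and use much larger corpora. The function names (train_bpe, merge_pair, encode, build_vocab) and the tiny corpus are hypothetical.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace each adjacent occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def train_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words."""
    # Each word starts as a tuple of single characters.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties go to the first pair seen
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def encode(word, merges):
    """Split a word into subword pieces by replaying merges in learned order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


def build_vocab(words, merges):
    """Map base characters and merged symbols to integer IDs."""
    symbols = sorted({ch for word in words for ch in word})
    symbols += [left + right for left, right in merges]
    return {sym: i for i, sym in enumerate(dict.fromkeys(symbols))}


corpus = ["low", "lower", "lowest", "newer", "newer", "wider"]
merges = train_bpe(corpus, num_merges=5)
vocab = build_vocab(corpus, merges)
pieces = encode("lowest", merges)
print(pieces, [vocab[p] for p in pieces])
```

Raising num_merges grows the vocabulary and shortens the encoded sequences, which is the size-versus-coverage trade-off described above.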

Why Tokens Matter for Context and Cost

LLMs have a fixed context window measured in tokens (for example, 128,000 tokens), which caps how much text the model can 'see' at once, including both the input prompt and the generated output. API pricing is almost universally based on input and output token counts, so token awareness directly affects cost at scale. Verbose prompts, large documents, and long conversation histories consume context space quickly. Once the limit is reached, the request is rejected or older content has to be truncated, and the model has no access to anything that was cut.
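One common way to handle a growing conversation is to keep only the most recent messages that fit in the window while reserving room for the reply. The sketch below assumes a hypothetical trim_history helper and a caller-supplied count_tokens function; the whitespace counter in the demo is only a stand-in for a real tokenizer.

```python
from typing import Callable, List


def trim_history(
    messages: List[str],
    count_tokens: Callable[[str], int],
    context_window: int,
    reserved_for_output: int,
) -> List[str]:
    """Keep the newest messages whose total token count fits the input budget."""
    budget = context_window - reserved_for_output
    if budget <= 0:
        raise ValueError("reserved_for_output must be smaller than context_window")
    kept, used = [], 0
    # Walk backwards so the most recent messages are kept first.
    for message in reversed(messages):
        cost = count_tokens(message)
        if used + cost > budget:
            break
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


def count_words(text: str) -> int:
    """Stand-in counter (assumption): one token per whitespace-separated word."""
    return len(text.split())


history = ["first message here", "second one", "the most recent message"]
print(trim_history(history, count_words, context_window=10, reserved_for_output=3))
```

Dropping whole messages from the oldest end is only one policy; summarizing old turns is another, at the cost of an extra model call.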

Prompt and Completion Tokens

Most APIs split token counts into two categories: prompt tokens (everything you send in) and completion tokens (everything the model generates back). Completion tokens are often priced higher because output is generated one token at a time, whereas the input prompt can be processed in parallel. Setting a max_tokens parameter (the exact name varies by API) caps how many tokens the model will generate in its response, preventing unexpectedly long and costly outputs; a response that hits the cap is simply cut off, possibly mid-sentence.
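The arithmetic behind a per-request cost estimate is simple enough to sketch. The Pricing class, the function names, and the prices below are made-up placeholders, not real rates for any provider; substitute the figures from your provider's price list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Prices in currency units per one million tokens (hypothetical values)."""
    input_per_million: float
    output_per_million: float


def estimate_cost(prompt_tokens: int, completion_tokens: int, pricing: Pricing) -> float:
    """Cost of one request given its prompt and completion token counts."""
    total = (
        prompt_tokens * pricing.input_per_million
        + completion_tokens * pricing.output_per_million
    )
    return total / 1_000_000


def worst_case_cost(prompt_tokens: int, max_tokens: int, pricing: Pricing) -> float:
    """Upper bound on cost if the model generates the full max_tokens allowance."""
    return estimate_cost(prompt_tokens, max_tokens, pricing)


example_pricing = Pricing(input_per_million=2.0, output_per_million=8.0)  # placeholder rates
print(estimate_cost(1_200, 350, example_pricing))
print(worst_case_cost(1_200, 1_000, example_pricing))
```

Because max_tokens bounds the completion side, worst_case_cost gives a ceiling you can check before sending the request.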

Key Gotcha: Non-English and Special Content

Tokenization efficiency varies significantly by language and content type. Non-Latin scripts such as Chinese, Arabic, or Hebrew are often tokenized into far more tokens per word than equivalent English text, making multilingual use cases disproportionately expensive and context-hungry. Code, JSON, URLs, and numbers with many digits also tokenize inefficiently. Always test token counts for your specific content domain rather than relying on the English-language rule of thumb.
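One contributing factor is visible without any tokenizer. Byte-level tokenizers start from UTF-8 bytes, and many non-Latin scripts need two or three bytes per character, so they begin from a longer sequence before any merges apply. The sketch below measures bytes per character as a rough proxy only; it does not produce real token counts, and the helper name and sample strings are illustrative.

```python
def utf8_bytes_per_char(text: str) -> float:
    """Average number of UTF-8 bytes needed per character of `text`."""
    if not text:
        return 0.0
    return len(text.encode("utf-8")) / len(text)


samples = {
    "English": "hello world",
    "Hebrew": "שלום עולם",
    "Chinese": "你好世界",
}

for language, text in samples.items():
    print(f"{language}: {utf8_bytes_per_char(text):.2f} bytes per character")
```

Actual token counts also depend on how well each language was represented in the tokenizer's training corpus, so measure with the real tokenizer before estimating cost.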

Best Practice: Count Before You Send

Count tokens programmatically before sending a request, especially when building production applications with dynamic content. Use the official tokenizer library for your model (for OpenAI models, the tiktoken library; for Hugging Face models, AutoTokenizer from the transformers library) to get counts that match the model's own tokenizer. Chat-style APIs may add a few formatting tokens per message, so treat the usage figures returned by the API as the source of truth. Implement chunking strategies for large documents to stay within context limits, and monitor token usage in API responses to track spend and catch prompt-bloat regressions early.
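A chunking strategy can be sketched independently of any particular tokenizer by passing the encode and decode functions in. The chunk_by_tokens helper and its overlap parameter are illustrative assumptions, and the whitespace tokenizer in the demo is a stand-in so the example runs without third-party packages.

```python
from typing import Callable, List, Sequence


def chunk_by_tokens(
    text: str,
    encode: Callable[[str], Sequence],
    decode: Callable[[Sequence], str],
    max_tokens: int,
    overlap: int = 0,
) -> List[str]:
    """Split `text` into chunks of at most `max_tokens` tokens each."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be in the range [0, max_tokens)")
    tokens = encode(text)
    step = max_tokens - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(decode(tokens[start:start + max_tokens]))
        if start + max_tokens >= len(tokens):
            break  # the last window already reached the end
    return chunks


def whitespace_encode(text: str) -> List[str]:
    """Stand-in tokenizer (assumption): one token per whitespace-separated word."""
    return text.split()


def whitespace_decode(tokens: Sequence) -> str:
    return " ".join(tokens)


document = "one two three four five six seven"
print(chunk_by_tokens(document, whitespace_encode, whitespace_decode, max_tokens=3, overlap=1))
```

With tiktoken installed, the same function should work by passing the encode and decode methods of an encoding object, for example one obtained from tiktoken.get_encoding("cl100k_base"). Cutting at arbitrary token boundaries can split a multi-byte character or a word, so production code often prefers to break at sentence or paragraph boundaries.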

