You've probably noticed something strange about AI. Ask it to count the letters in a word and it sometimes gets it wrong. Ask it to reverse a sentence and it stumbles. These aren't signs of stupidity — they're direct consequences of something called tokenization. Once you understand what tokens are, the quirks of AI start making a lot more sense.
Tokenization is one of those concepts that sits quietly under every single AI interaction you've ever had, and almost nobody talks about it in plain terms. That changes today. By the time you finish this guide, you'll understand exactly how your words get processed, why token limits affect your conversations, and how to use AI more effectively because of it.
Tokenization in AI is the process of breaking text into smaller pieces called tokens before the model processes it. Tokens are the basic unit of language that AI works with — computers can't process words directly, so everything must be converted into tokens first, then into numbers.
- A token ≈ ¾ of a word. On average, 100 tokens is about 75 words of English text.
- Common words = 1 token. The word "the" is one token. A longer word like "unbelievable" is usually split into several.
- Tokens affect cost & limits. AI APIs charge per token, and every model has a max token limit (context window).
- Tokenization varies by language. Non-English text often uses significantly more tokens for the same content.
- It happens before anything else. Before the AI can understand or generate anything, it must tokenize your input.
01 The Plain English Definition
Let's start from the very beginning. Computers cannot read text the way humans do. When you see the word "hello," your brain processes it as a whole, familiar unit. A computer sees a string of characters: h, e, l, l, o. Before an AI model can do anything intelligent with language, it needs a way to convert human text into something it can work with mathematically — and that's exactly what tokenization does.
Tokenization is simply the process of splitting text into chunks called tokens. Think of it as breaking a sentence into LEGO bricks. Each brick is a token. The AI picks up these bricks, examines them, figures out how they fit together, and uses that understanding to build a response — one token at a time.
Imagine you want to send a very long letter, but the post office can only handle envelopes of a certain size. You'd need to split your letter across multiple envelopes — cutting it at sensible points (ideally between paragraphs or sentences) so each envelope contains a meaningful chunk.
Tokenization works similarly. Your text gets split into manageable, meaningful chunks that the AI's "sorting office" (the model architecture) can process individually. The AI then reassembles the meaning of all those chunks together to understand what you're saying and generate a useful reply.
The key difference: the splitting itself takes a fraction of a second, and the model's billions of parameters then work out how each token relates to every other token in your message.
This is deeply connected to how AI models process language overall. If you want the full picture of what happens once tokens are created, our guide on how AI understands human language walks through the complete pipeline from your first keystroke to the final response.
02 Why Tokenization Actually Matters to You
You might be thinking: "Okay, but why should I care about how AI splits up my text internally?" The answer is that tokenization has three very practical, real-world consequences that directly affect your experience with AI tools every single day.
1. Context Windows — Why AI "Forgets" Long Conversations
Every AI model has a maximum number of tokens it can process at once. This is called the context window. Think of it as the AI's working memory. If your conversation grows longer than the context window allows, the application typically has to drop (or summarise) the earliest parts of the conversation to make room for new content.
This is why, in a very long chat, an AI might seem to "forget" something you said at the start. It hasn't actually forgotten — that content has been pushed out of the context window. You can learn exactly how this works in our detailed explainer on what the context window in AI models really means.
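To make the "pushed out of the window" idea concrete, here is a minimal sketch of one common strategy: keep the newest messages that fit a token budget and drop the rest. The function names, the 35-token budget, and the "about 4 characters per token" counter are all assumptions for illustration; real chat applications may truncate, summarise, or manage history differently.

```python
def trim_to_window(messages, max_tokens, count_tokens):
    """Keep the most recent messages whose combined token count fits max_tokens."""
    kept = []
    used = 0
    # Walk backwards from the newest message and stop once the budget is full.
    for message in reversed(messages):
        cost = count_tokens(message)
        if used + cost > max_tokens:
            break
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


def rough_token_count(text):
    # Assumption: roughly 4 characters per token, rounded up.
    return max(1, -(-len(text) // 4))


history = [
    "My name is Priya and I live in Leeds.",
    "What's a good weekend trip from here?",
    "Can you suggest somewhere with hiking?",
    "Remind me, what city did I say I live in?",
]

# A deliberately tiny, hypothetical budget so the effect is visible.
window = trim_to_window(history, max_tokens=35, count_tokens=rough_token_count)
print(window)
```

With this small budget the first message no longer fits, so the detail about Leeds is exactly the part the model would never receive.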
2. API Costs — Tokens Are the Unit of Billing
If you use AI through an API (as developers do when building apps), you pay per token, both for the tokens you send in (your prompt) and the tokens the AI generates back (the response). Understanding token counts helps you write more efficient prompts and estimate costs accurately. The Token Cost Calculator section further down this page covers exactly that.
3. Accuracy on Character-Level Tasks
Here's something that surprises most people: AI is often bad at counting letters in words. Ask ChatGPT "how many R's are in 'strawberry'?" and it might get it wrong. This follows directly from tokenization. The AI doesn't see individual letters; it sees tokens. If "strawberry" arrives as one or two tokens, the model receives only those token IDs and has no direct view of the characters inside them, so it can only answer from whatever it learned about spelling during training. Once you know this, these "silly" AI mistakes stop being surprising.
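The gap between the two views can be shown in a few lines. The split of "strawberry" into "straw" and "berry" and the ID numbers below are hypothetical, chosen only for illustration; a real tokenizer may split the word differently.

```python
word = "strawberry"

# Character-level view: counting letters is trivial for ordinary code.
r_count = word.count("r")  # counts 3 occurrences of "r"

# Token-level view (hypothetical split and IDs, for illustration only).
toy_vocab = {"straw": 1001, "berry": 1002}
tokens = ["straw", "berry"]
token_ids = [toy_vocab[t] for t in tokens]

print(f"Letters seen by a character-level program: {list(word)}")
print(f"What the model receives instead: {token_ids}")
```

Nothing in the list of IDs says how many times "r" appears; that information lives inside the tokens, not in the numbers the model is given.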
03 How Tokenization Actually Works
Here's the step-by-step process of what happens from the moment you hit send on a message to an AI:
Pre-processing: Clean the text
Before anything else, the text is lightly normalised. For example, Unicode characters may be converted to a standard form, and whitespace and punctuation are handled in a consistent way. The exact steps vary by tokenizer, but the goal is the same: consistent processing regardless of how you typed things.
Splitting into tokens
The tokenizer splits your text using a learned vocabulary, which varies by model from roughly 50,000 to 200,000 possible tokens. Common words like "the," "is," and "you" each get their own single token. Rarer words and technical terms get split into sub-word pieces. The word "tokenization" itself might become two tokens: "token" and "ization."
Token IDs: Assigning numbers
Each token gets replaced by a unique integer, its token ID. In GPT-2's vocabulary, for example, "Hello" is token 15496 and " world" (with its leading space) is token 995. The AI model never actually sees words; it only ever works with these numbers from this point onwards.
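The splitting and ID steps can be sketched with a toy vocabulary. Everything here is an assumption for illustration: the six-piece vocabulary, the ID numbers, and the greedy longest-match rule, which is a simplification (real BPE tokenizers apply learned merge rules rather than longest match).

```python
import string

# Hypothetical toy vocabulary: a few multi-character pieces plus single
# lowercase letters as a fallback. IDs are just positions in this list.
PIECES = ["token", "ization", "the", "is", "fun", " "]
VOCAB = {piece: i for i, piece in enumerate(PIECES + list(string.ascii_lowercase))}


def tokenize(text):
    """Split text into vocabulary pieces, always taking the longest match."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest remaining substring first, then shrink it.
        for end in range(len(text), i, -1):
            piece = text[i:end]
            if piece in VOCAB:
                tokens.append(piece)
                i = end
                break
        else:
            # This toy vocabulary only covers lowercase letters and spaces.
            raise ValueError(f"no token covers {text[i]!r}")
    return tokens


def encode(text):
    """Replace each token with its integer ID."""
    return [VOCAB[token] for token in tokenize(text)]


print(tokenize("tokenization is fun"))
print(encode("tokenization is fun"))
print(tokenize("cat"))  # not in the vocabulary as a word, so it falls back to letters
```

The fallback to single letters is what keeps the tokenizer from ever getting stuck on an unfamiliar word, at the price of using more tokens.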
Embeddings: Numbers become meaning
Each token ID is then converted into a rich numerical vector called an embedding — a long list of numbers that encodes what the token means and how it relates to other tokens. These embeddings are what the transformer model actually processes. To go deeper on this step, read our piece on what happens inside a neural network.
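At its core, the embedding step is a table lookup: row number in, vector out. The sketch below assumes a tiny table of 1,000 rows with 4 numbers each, filled with random values; in a real model the table is far larger and its values are learned during training, which is where the "meaning" comes from.

```python
import random


def make_embedding_table(vocab_size, dim, seed=0):
    """Build a toy embedding table: one vector of `dim` floats per token ID."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(vocab_size)]


def embed(token_ids, table):
    """Look up the vector for each token ID."""
    return [table[token_id] for token_id in token_ids]


# Hypothetical sizes and token IDs, for illustration only.
table = make_embedding_table(vocab_size=1000, dim=4)
vectors = embed([15, 7, 15], table)

for token_id, vector in zip([15, 7, 15], vectors):
    print(token_id, [round(x, 2) for x in vector])
```

Note that the same token ID always maps to the same starting vector; it is the later layers of the model that adjust it according to the surrounding context.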
Generation: One token at a time
When the model generates a response, it doesn't write a whole sentence at once. It assigns a probability to every possible next token, picks one (often, but not always, the most likely), adds it to the context, then repeats the process for the next token, and the next, until the response is complete. This is why replies appear to stream in piece by piece as you watch the AI type.
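The generation loop itself is short. In this sketch a hand-written lookup table stands in for the model, and the probabilities and the `<start>` and `<end>` markers are invented for illustration. It also simplifies in two ways: it looks only at the last token (a real model conditions on the whole context) and it always takes the most likely token (real systems often sample).

```python
# Hypothetical next-token probabilities standing in for a trained model.
NEXT_TOKEN_PROBS = {
    "<start>": {"the": 0.9, "cat": 0.1},
    "the": {"cat": 0.6, "mat": 0.4},
    "cat": {"sat": 0.8, "<end>": 0.2},
    "sat": {"<end>": 1.0},
    "mat": {"<end>": 1.0},
}


def generate(max_tokens=10):
    """Build a response one token at a time, always taking the likeliest token."""
    context = ["<start>"]
    for _ in range(max_tokens):
        probs = NEXT_TOKEN_PROBS[context[-1]]
        next_token = max(probs, key=probs.get)
        if next_token == "<end>":
            break
        context.append(next_token)  # the new token becomes part of the context
    return context[1:]  # drop the <start> marker


print(generate())
```

Each pass through the loop corresponds to one token appearing on screen, which is also why longer responses take longer and cost more.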
04 Try It: Live Token Counter
Running a sentence through a tokenizer and looking at the pieces gives you an intuitive feel for how AI actually "sees" your words.
In a typical tokenizer visualisation, each coloured block represents a single token. Common words get one token, while longer or rarer words get split into multiple pieces.
05 What Is BPE? The Algorithm Behind Modern Tokenization
Most major AI models use a tokenization method called Byte Pair Encoding (BPE) or a closely related sub-word method. OpenAI's GPT models are the best-documented example. It sounds complicated, but the idea is elegantly simple.
BPE starts with the smallest units as separate tokens: individual characters (or, in many modern tokenizers, individual bytes). So "hello" would start as five tokens: h, e, l, l, o. Then it scans the training data for the most frequent pair of adjacent tokens. If "t" and "h" appear together very often, they get merged into a single token "th." Then the next most common pair gets merged, and so on. This process repeats tens of thousands of times until the vocabulary reaches its target size, which varies by model from roughly 50,000 to 200,000 tokens.
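The merge loop described above fits in a short sketch. This is a minimal, character-level version trained on a five-word toy corpus; the function names and corpus are assumptions for illustration, and production tokenizers add byte-level handling, pre-splitting rules, and much larger training data.

```python
from collections import Counter


def most_common_pair(words):
    """Return the most frequent adjacent pair of symbols, or None if there is none."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(symbols, pair):
    """Replace every occurrence of `pair` in a symbol sequence with one merged symbol."""
    merged = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a whitespace-separated corpus."""
    # Start with every word split into individual characters.
    words = Counter(tuple(word) for word in corpus.split())
    merges = []
    for _ in range(num_merges):
        pair = most_common_pair(words)
        if pair is None:
            break
        merges.append(pair)
        new_words = Counter()
        for symbols, freq in words.items():
            new_words[merge_pair(symbols, pair)] += freq
        words = new_words
    return merges


merges = train_bpe("the then they that this", num_merges=2)
print(merges)
```

On this corpus the first merge joins "t" and "h" (they are adjacent in all five words), and the second joins "th" and "e". Repeating the same step many thousands of times on real text is what produces whole-word tokens for common words.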
The result is a vocabulary where common English words each get their own single token, while rare words, technical jargon, or words in less common languages get split into smaller sub-word pieces. This is why:
| Word / Phrase | Approx. Tokens | Why |
|---|---|---|
| the | 1 | Extremely common — gets its own token |
| hello | 1 | Very common English word |
| tokenization | 2–3 | Less common — split into sub-words |
| unbelievable | 3–4 | Long word with common prefixes/suffixes |
| supercalifragilistic | 6–8 | Rare word — split into many pieces |
| 😊 | 1–3 | Emoji — often multi-byte, may split |
| 100 words of English | ~133 | Average ratio: 1 word ≈ 1.33 tokens |
This also explains something counterintuitive: when you add spaces before words or use unusual capitalisation, you can sometimes get different token counts. The tokenizer is sensitive to these details in ways that often surprise people who are new to working with AI.
Token Cost Calculator
If you're using AI through an API, every token has a price, and prices differ from model to model.
To estimate what a prompt costs, convert your word count to tokens (about 1.33 tokens per English word) and multiply by the model's price per token.
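As a rough sketch of the arithmetic behind such an estimate: convert words to tokens using the article's rule of thumb (100 tokens ≈ 75 words), then apply separate prices for input and output tokens. The prices in the example are placeholders, not real rates for any model; check your provider's current pricing.

```python
def estimate_tokens(word_count):
    """Estimate tokens from English words using the 100 tokens ≈ 75 words rule."""
    return round(word_count * 4 / 3)


def estimate_cost(prompt_words, response_words,
                  input_price_per_million, output_price_per_million):
    """Estimate the cost of one request in the same currency as the prices."""
    input_tokens = estimate_tokens(prompt_words)
    output_tokens = estimate_tokens(response_words)
    total = (input_tokens * input_price_per_million
             + output_tokens * output_price_per_million)
    return total / 1_000_000


# Hypothetical prices per million tokens, for illustration only.
cost = estimate_cost(
    prompt_words=750,
    response_words=1500,
    input_price_per_million=2.00,
    output_price_per_million=6.00,
)
print(f"Estimated cost: {cost:.4f}")
```

Input and output are priced separately here because the article notes you pay for both the prompt and the response, and the two rates are not always the same. For non-English text the words-to-tokens ratio will be higher, so treat the estimate as a lower bound.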
06 Different Languages, Wildly Different Token Counts
One of the most important — and least discussed — aspects of tokenization is how dramatically it varies across languages. The vocabulary of most major AI models is heavily weighted towards English, because the majority of their training data was in English. This has significant real-world consequences.
An idea that takes 10 tokens to express in English might take 20–30 tokens in Arabic, Hindi, or Thai. This happens because the tokenizer has fewer pre-built tokens for those languages, so it has to fall back to smaller character-level or byte-level pieces more often.
| Language | Approx. Token Count vs. English | Practical Impact |
|---|---|---|
| English | 1× (baseline, most efficient) | Lowest cost, most context capacity |
| French / German / Spanish | ~1.1–1.3× more tokens | Slightly higher cost, minor difference |
| Russian / Greek | ~1.5–2× more tokens | Noticeably more expensive, less context |
| Arabic / Hindi | ~2–3× more tokens | Significantly more expensive per message |
| Thai / Japanese / Chinese | ~2–4× more tokens | Much higher cost, context fills faster |
This is an active area of improvement in the AI industry. Newer models are increasingly being trained with more multilingual data, which helps their tokenizers become more efficient across languages. But for now, if you're building AI applications for a non-English audience, token efficiency is an important cost and performance consideration.
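One contributing factor is easy to check yourself. Many modern BPE tokenizers work on UTF-8 bytes, and non-Latin scripts need two or three bytes per character, so text in those scripts starts from more raw pieces before any merges apply. The sketch below measures bytes, not tokens, and the greetings are just sample words; actual token counts depend on the specific tokenizer.

```python
def utf8_stats(text):
    """Return (number of characters, number of UTF-8 bytes) for a string."""
    return len(text), len(text.encode("utf-8"))


# Sample greetings, chosen only for illustration.
samples = {
    "English": "hello",
    "Arabic": "مرحبا",
    "Hindi": "नमस्ते",
    "Thai": "สวัสดี",
}

for language, text in samples.items():
    chars, byte_count = utf8_stats(text)
    print(f"{language:8} {chars} characters, {byte_count} bytes "
          f"({byte_count / chars:.0f} per character)")
```

English letters take one byte each, Arabic letters two, and Devanagari and Thai characters three. A tokenizer with few learned merges for a script has to represent its text with pieces much closer to these raw bytes.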
07 Common Myths About AI Tokens, Debunked
There's a lot of confusion about tokens in the wild. Let's clear up the most common misunderstandings:
Myth: 1 token = 1 word
This is the most common misconception. One English word averages about 1.3 tokens, but can range from 1 token for very common words to 5 or more for complex technical terms.
Reality: 1 token ≈ 4 characters, or ¾ of a word
Use the rough rule: 100 tokens ≈ 75 words. For a more accurate estimate, run your text through a token counter such as OpenAI's Tokenizer tool.
Myth: AI reads your text letter by letter
Most language models don't process individual characters (character-level models exist, but they are rare). They process tokens: chunks that are usually larger than a single character.
Reality: AI processes sub-word chunks, not characters
This is why AI sometimes struggles to count letters or spell backwards: it doesn't "see" individual characters the way humans do when reading.
Myth: More tokens always means better understanding
Simply throwing more tokens at the model doesn't guarantee better results. Quality and relevance of context matter far more than raw quantity.
Reality: Focused, clear prompts often outperform long ones
A well-crafted 50-token prompt can get better results than a vague 500-token one. Writing good prompts is a skill, and understanding tokens helps. Vague or bloated context is a common source of AI errors; our guide on why AI sometimes gives wrong answers covers this in more detail.
Myth: Tokenization is the same across all AI models
OpenAI, Anthropic, and Google all use different tokenizers with different vocabularies. The same sentence can have different token counts on different platforms.
Reality: Each model family has its own tokenizer
GPT models use tiktoken. Claude uses Anthropic's own tokenizer. Gemini uses SentencePiece. Results are similar but not identical, so always test with the specific model you're using.
Understanding tokenization also helps explain why AI-generated images work so differently from text generation. Images aren't tokenized the same way words are — they use visual patches or latent space representations instead. If you're curious about that side of things, our article on how AI generates images from text covers the full picture.