What is a token in AI?

A token is the small chunk of text an AI reads and writes in - often a whole short word, sometimes part of a longer word, sometimes just punctuation. A useful rule of thumb is that one token is about four characters and 100 tokens is roughly 75 words in English. Everything you send the AI and everything it writes back is counted in tokens, and that count is what you are billed on.

Why does my AI cost go up over a long conversation?

In many setups, each new message re-sends the entire conversation so far as context, so a long thread quietly costs more with every turn even if your individual messages are short. The same happens when a system pastes a big document into the context on every request. The fix is to start fresh threads for new tasks and use retrieval to fetch only relevant passages instead of resending everything.

How can I reduce my AI token costs?

The biggest saver at scale is not pasting the same large document every time - use retrieval to fetch only relevant passages instead. Beyond that: keep prompts focused, start new threads for new tasks, match cheaper models to routine work, ask for only the output length you need, and reuse fixed context where possible. None of these hurt quality, and most improve it because a focused prompt produces a sharper answer.

Does a bigger context window cost more?

A larger window lets the model take in more at once, but you only pay for the tokens you actually use, not the window's full size. The cost comes when you fill that bigger window with lots of text on every request. So a big window is an enabler, not an automatic expense - the bill is driven by how many tokens you send in and get back, which is why efficient design matters more than window size.

AI Tokens and Context Windows, Explained (and Why They Cost You)

What is a context window?

The context window is the maximum amount of text, measured in tokens, that the model can hold in mind at once - your request plus its answer together. Think of it as the model's desk space: everything relevant has to fit at once, and anything that does not fit falls off. This is why an AI sometimes seems to forget earlier parts of a long conversation - they fell outside the window to make room for newer text.

AI tokens and context windows explained in plain English: what they are, why they decide your AI cost and limits, and practical tips to spend less without losing quality.

A token is the small chunk of text an AI reads and writes in - roughly a word or a piece of one - and a context window is the maximum amount of text the AI can hold in mind at once, measured in those tokens. They matter because AI tools charge you by the token and limit you by the context window. Understand these two ideas and you understand both why your AI bill is what it is and why the model sometimes seems to "forget" things.

Almost nobody explains tokens and context windows in plain terms before handing a business an AI tool, and it leads to surprise bills and confused expectations. In this guide I will define both clearly, show you why they directly drive cost and limits, give rough real-world examples, and share the practical tips I use to keep AI spending sensible without sacrificing quality. If you want the bigger picture of how these models work first, start with my guide to what an LLM is.

What is a token?

An AI model does not read text letter by letter or word by word, the way you do. It breaks text into tokens: small pieces that are often a whole short word, sometimes part of a longer word, and sometimes just a punctuation mark or a space. The exact split depends on the model's tokenizer, but typically "Cat" is one token, "Unbelievable" might be three, and a comma is its own token.

The rough rule of thumb that is good enough for planning: in English, one token is about four characters, and 100 tokens is roughly 75 words. So a short paragraph is around 100 tokens, a one-page document is maybe 500 to 800, and a long report can be many thousands. Hebrew and other non-English languages often use more tokens per word, which is worth knowing if you work bilingually - the same message can cost more in Hebrew than in English.
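To make the rule of thumb concrete, here is a minimal Python sketch that turns it into a planning estimate. The function names are my own, and the constants are just the rough English-language figures above. A real tokenizer will give different counts, especially for Hebrew or code, so treat this as a budgeting aid and not a meter.

```python
import math

# Rule-of-thumb constants from the text (English only, approximate).
CHARS_PER_TOKEN = 4       # one token is about four characters
WORDS_PER_TOKEN = 0.75    # 100 tokens is roughly 75 words


def estimate_tokens_from_chars(text: str) -> int:
    """Rough token estimate based on character count."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def estimate_tokens_from_words(word_count: int) -> int:
    """Rough token estimate based on word count."""
    return math.ceil(word_count / WORDS_PER_TOKEN)


print(estimate_tokens_from_words(75))          # 100
print(estimate_tokens_from_chars("a" * 400))   # 100
```

For anything you are actually billed on, use the token counts your provider reports instead of an estimate like this.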

Why does this odd unit exist? Because tokens are how the model actually processes language - it predicts text one token at a time. Everything the AI reads from you and everything it writes back is counted in tokens, and that count is the meter that runs your bill.

What is a context window?

The context window is the maximum number of tokens the model can consider at one time - the request you send plus the answer it generates, all together. Think of it as the model's working memory or its desk space. Everything relevant to the current task has to fit on that desk at once. If it does not fit, something has to come off.

This is the single best explanation for a behavior that confuses people: why an AI sometimes "forgets" what was said earlier in a long conversation. It did not forget in a human sense. The earliest parts simply fell outside the context window - off the edge of the desk - to make room for newer text. The model can only work with what is currently on the desk.
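The desk picture can be sketched in a few lines of Python. This is one simple strategy, keeping the newest messages and dropping the oldest, and I am assuming it purely for illustration. Real chat tools differ in how they handle overflow, and some summarize old turns instead of dropping them. The function names and the four-characters-per-token counter are hypothetical stand-ins.

```python
import math


def rough_token_count(text: str) -> int:
    """Stand-in token counter: about four characters per token (English)."""
    return math.ceil(len(text) / 4)


def trim_to_window(messages, window_tokens, reserve_for_answer):
    """Keep the most recent messages that fit on the 'desk'.

    The window must hold the request plus the answer, so room for the
    answer is set aside before any history is added.
    """
    budget = window_tokens - reserve_for_answer
    kept = []
    used = 0
    # Walk from newest to oldest; stop at the first message that no longer fits.
    for message in reversed(messages):
        cost = rough_token_count(message)
        if used + cost > budget:
            break
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


history = ["a" * 400, "b" * 400, "c" * 400]   # about 100 tokens each
visible = trim_to_window(history, window_tokens=350, reserve_for_answer=100)
print(len(visible))  # 2 - the oldest message fell off the desk
```

The model never sees the message that was trimmed, which is all that "forgetting" means here.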

Context windows have grown enormously and keep growing in 2026, from a few thousand tokens in early models to hundreds of thousands or more in the largest ones today. A bigger window means the AI can take in a whole long document, a big chunk of a codebase, or a long conversation in one go. But filling a bigger window is not free, which is exactly where cost comes in.

Why tokens and context windows cost you money

Here is the part that hits your budget. AI providers bill based on tokens - both the tokens you send in (your prompt and any documents) and the tokens the model writes back (its answer). More text in, more text out, higher cost. Two things follow from this directly.

What you do	Token impact	Cost effect
Ask a short question, get a short answer	Low tokens in and out	Cheap
Paste a long document for context every time	High tokens in, repeated	Adds up fast
Keep a very long chat going	The whole history is re-sent each turn	Grows with every message
Ask for a long, detailed output	High tokens out	More expensive per call

The non-obvious one that catches businesses out is the long chat. In many setups, each new message re-sends the entire conversation so far as context, so a very long thread quietly gets more expensive with every turn, even if your individual messages are short. The same goes for any system that stuffs a big document into the context on every single request - you pay to re-read that document each time.
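A small Python sketch shows how quickly the long chat grows. It assumes a simplified setup: every message and every reply is the same length, the full history is re-sent on each turn, and nothing is trimmed or cached. The function name is mine, and real conversations will be messier, but the shape of the curve is the point.

```python
def input_tokens_per_turn(turns: int, tokens_per_message: int) -> list[int]:
    """Input tokens sent on each turn when the whole history is re-sent."""
    history = 0
    per_turn = []
    for _ in range(turns):
        history += tokens_per_message   # your new message joins the history
        per_turn.append(history)        # the whole history goes in as input
        history += tokens_per_message   # the model's reply joins the history
    return per_turn


print(input_tokens_per_turn(3, 100))         # [100, 300, 500]
print(sum(input_tokens_per_turn(3, 100)))    # 900
print(sum(input_tokens_per_turn(20, 100)))   # 40000
```

Under these assumptions, 20 short turns send 40,000 input tokens in total, against 2,000 if each message travelled alone. The total grows with the square of the number of turns, which is why a long thread costs far more than its individual messages suggest.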

The context window also sets a hard ceiling. You cannot feed in more than fits, so a document larger than the window cannot go in whole. That is one of the practical reasons techniques like RAG (retrieval-augmented generation) exist: instead of cramming an entire knowledge base into the window every time, the system retrieves only the few relevant passages. That is cheaper, and it can handle far more material than any window could hold.
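Here is a toy Python sketch of the retrieval idea. To be clear about the assumptions: it scores passages by simple word overlap with the question, which is a deliberately crude stand-in for the embedding-based search that production RAG systems typically use. The function names and sample passages are invented for illustration.

```python
import re


def words(text: str) -> set[str]:
    """Lowercased set of words and numbers in the text, punctuation removed."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, passages: list[str], top_k: int = 2) -> list[str]:
    """Return up to top_k passages sharing the most words with the question."""
    question_words = words(question)
    scored = [(len(question_words & words(p)), p) for p in passages]
    # Highest overlap first; the sort is stable, so ties keep their original order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [p for score, p in scored[:top_k] if score > 0]


knowledge_base = [
    "Refunds are issued within 14 days of purchase.",
    "Shipping takes 3 to 5 business days.",
    "Our office is closed on public holidays.",
]
context = retrieve("How many days do refunds take?", knowledge_base, top_k=1)
print(context)  # ['Refunds are issued within 14 days of purchase.']
```

Only the retrieved passage goes into the prompt, so the token cost per question stays roughly flat no matter how large the knowledge base becomes.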

Rough cost examples to anchor your thinking

Exact prices change constantly and vary by model, so I will keep this in relative terms that stay useful. The point is the shape of the costs, not the cents.

A single quick question and answer: a few hundred tokens total. Effectively trivial - fractions of a cent on most models.
Summarizing a 10-page document: perhaps 5,000 to 8,000 tokens in, plus a few hundred out. Still cheap as a one-off, but multiply it by thousands of documents and it becomes a real line item.
A support bot answering with your full policy pasted in each time: the policy tokens are paid on every single customer question. At scale, this is where bills balloon, and where a smarter design pays for itself.
A long agent task with many back-and-forth steps: each step re-sends context, so a chatty multi-step job can cost many times a single question. This is one reason AI agents cost more to run than fixed automation, as I explain in my guide to what an AI agent is.
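The support bot case is worth putting into numbers. The Python sketch below compares pasting a full policy into every request with sending only a few retrieved passages. Every figure in it is a placeholder I made up to show the shape of the costs: the prices per million tokens, the token counts, and the question volume are not any provider's real numbers.

```python
def request_cost(input_tokens, output_tokens,
                 input_price_per_million, output_price_per_million):
    """Cost of one request, given per-million-token prices for input and output."""
    return (input_tokens * input_price_per_million
            + output_tokens * output_price_per_million) / 1_000_000


# Hypothetical prices, for illustration only.
INPUT_PRICE = 1.00    # currency units per million input tokens
OUTPUT_PRICE = 4.00   # currency units per million output tokens

QUESTIONS = 100_000   # assumed customer questions over some period

# Full policy (assumed 8,000 tokens) plus a 50-token question, 200-token answer.
full_policy = request_cost(8_000 + 50, 200, INPUT_PRICE, OUTPUT_PRICE)
# Retrieved passages (assumed 600 tokens) plus the same question and answer.
retrieved = request_cost(600 + 50, 200, INPUT_PRICE, OUTPUT_PRICE)

print(f"Full policy every time: {full_policy * QUESTIONS:.2f}")
print(f"Retrieved passages only: {retrieved * QUESTIONS:.2f}")
```

With these made-up figures the per-question difference is well under a cent, yet over 100,000 questions the full-policy design costs several times more. That is the volume-and-repetition effect in miniature.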

The pattern to internalize: a single use is almost always cheap, and the cost story is entirely about volume and repetition. Token costs are death by a thousand cuts, not one big bill - which is good news, because it means small design choices have a large cumulative effect.

Practical tips to spend less without losing quality

You do not need to be stingy with AI to control cost. You need to be deliberate. Here are the levers I actually pull for clients.

Do not paste the same big document every time. If an AI needs to answer from a large knowledge base, use a retrieval approach that fetches only the relevant passages instead of resending everything. This is the single biggest saver at scale.
Keep prompts focused. Send what the model needs, not your entire history. Trim boilerplate and irrelevant context out of each request.
Start fresh conversations for new tasks. A long-running chat re-sends its whole history. When you move to an unrelated task, start a new thread so you are not paying to re-read an old one.
Match the model to the job. Smaller, cheaper models handle routine tasks well. Save the largest, priciest models for the work that genuinely needs them.
Ask for the output length you need. If you want a one-line answer, say so. You pay for every token the model writes, so an unnecessarily long response costs more.
Cache and reuse where you can. If many requests share the same fixed context, well-built systems can avoid paying full price for it on every request. This is an engineering choice worth making for high-volume use.
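The caching lever can be roughed out in Python as well. This sketch assumes a provider that bills reused fixed context at a discount. The discount, the price, and the token counts are all hypothetical, and the model ignores details such as any charge for first storing the cached context, so check your provider's actual caching terms before relying on numbers like these.

```python
def input_cost(requests, fixed_tokens, variable_tokens,
               price_per_million, cached_discount=0.0):
    """Total input cost when every request shares the same fixed context.

    cached_discount is the assumed fraction taken off the normal input
    price for the fixed part (0.0 means no caching at all).
    """
    fixed_rate = price_per_million * (1 - cached_discount)
    per_request = fixed_tokens * fixed_rate + variable_tokens * price_per_million
    return requests * per_request / 1_000_000


# Hypothetical workload: 10,000 requests, each with a 5,000-token fixed
# context and a 100-token question, at a made-up price of 1.00 per million.
without_cache = input_cost(10_000, 5_000, 100, 1.00)
with_cache = input_cost(10_000, 5_000, 100, 1.00, cached_discount=0.9)

print(f"No caching: {without_cache:.2f}")
print(f"Fixed context cached: {with_cache:.2f}")
```

Because the fixed context dominates each request in this example, discounting it removes most of the input bill, while the short variable part is still paid at the normal rate.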

None of these hurt quality. Most actually improve it, because a focused prompt with only relevant context tends to produce a sharper answer than one buried in noise. Good token discipline and good results usually go hand in hand.

The bottom line on tokens and context windows

Tokens are the unit AI reads, writes, and bills in. The context window is how much it can hold at once. Together they explain your AI cost, your size limits, and why the model sometimes forgets. The businesses that run AI affordably are not the ones using it less - they are the ones designing for token efficiency so they get the same value for a fraction of the spend.

If you are seeing AI costs creep up, or you are planning a project and want to build it cost-efficiently from the start, book a call and tell me what you are running or planning. I will show you where the tokens are going and the design changes that cut the bill without cutting quality. You can also reach me through the contact form, and if you want to understand the model side first, start with my guide to what an LLM is.