Context Windows
A fixed window onto the conversation
An LLM has no long-term memory between calls. On each request it can only read a fixed number of tokens — its context window. Everything shares that one budget: the system prompt, the entire chat history, any retrieved documents, your new question, and the room left for the answer. Run out of room and something has to give.
Input and output are counted together against the same token limit.
Apps re-send the history every turn — the model doesn't remember on its own.
More context means more compute, higher latency and higher price per call.
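The shared budget described above can be sketched in a few lines. This is a minimal illustration, not how any particular API counts tokens: `count_tokens` is a hypothetical stand-in that treats each whitespace-separated word as one token (real tokenizers split text differently), and `remaining_budget` and its parameter names are invented for this example.

```python
def count_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def remaining_budget(window: int, system: str, history: list[str],
                     question: str, reply_reserve: int) -> int:
    """Tokens left once everything that shares the window is counted.

    A negative result means the request would overflow.
    """
    used = count_tokens(system)
    used += sum(count_tokens(turn) for turn in history)  # history is re-sent every turn
    used += count_tokens(question)
    used += reply_reserve  # output counts against the same limit as input
    return window - used


history = ["hi there", "hello how can I help"]
left = remaining_budget(
    window=24,
    system="be concise",
    history=history,
    question="what is a context window",
    reply_reserve=4,
)
print(left)  # 24 - (2 + 7 + 5 + 4) = 6
```

Because the app passes the whole `history` list on every call, each new turn shrinks the remaining budget even if the new question itself is short.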
Fill it, overflow it, and the middle slump
Watch a toy 24-token window fill up as a chat grows, overflow, and evict the oldest turns. Then see the "lost in the middle" effect: models tend to attend most to the start and end of a long context and least to the middle.
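One simple way to handle overflow is to drop turns from the front until the rest fits. The sketch below assumes that policy; `evict_oldest`, `fixed_cost` (system prompt plus reply reserve), and the word-count tokenizer are hypothetical names chosen for illustration, and real apps may evict differently.

```python
def count_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: one token per word."""
    return len(text.split())


def evict_oldest(turns: list[str], window: int, fixed_cost: int) -> list[str]:
    """Drop turns from the front until the remaining ones fit the window.

    fixed_cost covers whatever is always present, such as the system
    prompt and the tokens reserved for the reply.
    """
    budget = window - fixed_cost
    kept = list(turns)
    while kept and sum(count_tokens(t) for t in kept) > budget:
        kept.pop(0)  # the oldest turn is evicted first
    return kept


chat = ["my name is Ada", "nice to meet you", "what is my name"]
print(evict_oldest(chat, window=12, fixed_cost=4))
# ['nice to meet you', 'what is my name']
```

In this example the turn that introduced the name is the one that gets evicted, so a model given only the surviving turns has no way to answer the final question.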
Working with the limit
Replace old turns with a short running summary to free up tokens.
Retrieval-augmented generation (RAG) fetches only the relevant chunks instead of stuffing in whole documents.
Put the most important instructions near the start or end, not buried in the middle.
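Two of these tactics, the running summary and instruction placement, can be sketched as follows. Everything here is an assumption made for illustration: `compact_history` and `build_prompt` are hypothetical helpers, the summarizer is a placeholder (a real app would typically ask a model to write the summary), and restating the instructions at the end is just one way to keep them out of the middle.

```python
from typing import Callable


def compact_history(turns: list[str], keep_last: int,
                    summarize: Callable[[list[str]], str]) -> list[str]:
    """Replace all but the newest keep_last turns with a single summary entry."""
    if len(turns) <= keep_last:
        return list(turns)
    cut = len(turns) - keep_last
    old, recent = turns[:cut], turns[cut:]
    return [summarize(old)] + recent


def build_prompt(instructions: str, context: list[str], question: str) -> str:
    """Instructions first, bulky context in the middle, question and a reminder last."""
    parts = [instructions, *context, question, f"Reminder: {instructions}"]
    return "\n".join(parts)


def fake_summarize(old: list[str]) -> str:
    # Placeholder summarizer for the sketch; it only records how many turns it replaced.
    return f"[summary of {len(old)} earlier turns]"


turns = ["t1", "t2", "t3", "t4", "t5"]
compacted = compact_history(turns, keep_last=2, summarize=fake_summarize)
print(compacted)  # ['[summary of 3 earlier turns]', 't4', 't5']
print(build_prompt("Answer in French.", compacted, "What did we decide?"))
```

Retrieved chunks from a RAG step would go into the same `context` list, so they sit in the middle while the instructions bracket them.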
Even with 100k+ token windows, models often use the middle of a long context poorly ("lost in the middle"), and every extra token costs money and time. A focused, well-ordered prompt usually beats a giant one.
Run the budget yourself. Same toy 24-token window: the system prompt costs 2 tokens, every chat turn costs 3, and 4 are always reserved for the reply (dashed cells). Grow the conversation and watch it overflow — then flip on summarization and watch the same conversation fit with room to spare.
Six turns fill the window exactly (2 + 6 × 3 + 4 = 24). From the seventh turn on, the oldest turns start falling off the left edge, and the model genuinely forgets them. With summarization on, everything but the last two turns compresses into one orange 3-token summary, so the request stays at 15 tokens (2 + 3 + 2 × 3 + 4) and even 12 turns never overflow. Many chat apps do something similar behind the scenes.
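The toy budget can be checked with a few lines of arithmetic. The constants come from the description above; the function name and the rule that summarization applies as soon as there are more than two turns are assumptions made for this sketch.

```python
WINDOW = 24         # total tokens in the toy window
SYSTEM = 2          # system prompt
PER_TURN = 3        # each chat turn
REPLY_RESERVE = 4   # always held back for the reply
SUMMARY = 3         # one running summary
KEEP_RECENT = 2     # turns left uncompressed when summarizing


def tokens_needed(turns: int, summarize: bool = False) -> int:
    """Tokens required for a conversation of the given length."""
    if summarize and turns > KEEP_RECENT:
        history = SUMMARY + KEEP_RECENT * PER_TURN
    else:
        history = turns * PER_TURN
    return SYSTEM + history + REPLY_RESERVE


for n in (6, 7, 12):
    plain = tokens_needed(n)
    compact = tokens_needed(n, summarize=True)
    print(n, plain, plain > WINDOW, compact, compact > WINDOW)
# 6 24 False 15 False
# 7 27 True 15 False
# 12 42 True 15 False
```

With summarization the cost stops growing with the number of turns, which is why the longer conversation still fits with 9 tokens to spare.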