Context Windows · Suman Bhadra Notes

A fixed window onto the conversation

An LLM has no long-term memory between calls. On each request it can only read a fixed number of tokens — its context window. Everything shares that one budget: the system prompt, the entire chat history, any retrieved documents, your new question, and the room left for the answer. Run out of room and something has to give.

It's a budget prompt + reply

Input and output are counted together against the same token limit.

No memory re-sent each call

Apps re-send the history every turn — the model doesn't remember on its own.

Bigger ≠ free cost & speed

More context means more compute, higher latency and higher price per call.

Fill it, overflow it, and the middle slump

Watch a toy 24-token window fill up as a chat grows, overflow and evict the oldest turns, then see the "lost in the middle" effect — models attend most to the start and end of a long context.

Working with the limit

Summarize history compress

Replace old turns with a short running summary to free up tokens.

Retrieve, don't dump RAG

RAG fetches only the relevant chunks instead of stuffing whole documents in.

Place it well edges win

Put the most important instructions near the start or end, not buried in the middle.

Long context isn't a free lunch

Even with 100k+ token windows, models often use the middle of a long context poorly ("lost in the middle"), and every extra token costs money and time. A focused, well-ordered prompt usually beats a giant one.

Run the budget yourself. Same toy 24-token window: the system prompt costs 2 tokens, every chat turn costs 3, and 4 are always reserved for the reply (dashed cells). Grow the conversation and watch it overflow — then flip on summarization and watch the same conversation fit with room to spare.

conversation turns

At 7+ turns the window is full and the oldest turns start falling off the left edge — the model genuinely forgets them. With summarization on, everything but the last two turns compresses into one orange 3-token summary, and even 12 turns never overflow. This is exactly what chat apps do behind the scenes.