Hande Kafkas


April 22, 2026 · 2 min read

Notes on long context windows

A million tokens does not mean a million useful tokens. A short field guide to what context windows actually buy you, and where to spend the budget.

When the first 1M-token context windows shipped, a lot of the public discussion treated them as a retrieval replacement. Drop everything in. Skip the vector database. Just let the model read.

Two years in, that framing has aged badly. Long context is enormously useful, just not as a dumb replacement for selection. The question stopped being "can the model see all of it" and became "what should the model attend to, and in what order?"

These are working notes from production, not a survey.

What long context is actually for

The strongest pattern I keep seeing is evidence presentation. You still do the retrieval. You still rank, dedupe, and filter. But instead of cramming five chunks into a tiny window with aggressive truncation, you let the prompt contain the full document, the full conversation history, and the full citation trail — formatted for the model rather than for token efficiency.

The model is not better at finding the needle. It is better at reasoning about the haystack once you've pointed at the right region.
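The pattern above can be sketched in a few lines. This is a minimal illustration, not any library's API: the `Doc` type, `render_evidence`, and the chars-per-token heuristic are all hypothetical, and the upstream ranker is assumed to have already scored the documents.

```python
# A minimal sketch of the evidence-presentation pattern: retrieval still
# happens upstream; the window is spent on whole, structured documents.
# All names here (Doc, render_evidence) are hypothetical.
from dataclasses import dataclass

@dataclass
class Doc:
    source: str   # e.g. a filename or URL
    text: str     # full document text, not a truncated chunk
    score: float  # relevance score from the upstream ranker

def render_evidence(docs: list[Doc], budget_tokens: int,
                    chars_per_token: int = 4) -> str:
    """Dedupe by source, keep ranked order, and format whole documents
    under headings so the model can navigate, stopping at a token budget."""
    seen: set[str] = set()
    sections: list[str] = []
    used = 0
    for doc in sorted(docs, key=lambda d: d.score, reverse=True):
        if doc.source in seen:
            continue
        seen.add(doc.source)
        est = len(doc.text) // chars_per_token  # crude token estimate
        if used + est > budget_tokens:
            break  # spend the budget on whole documents, not fragments
        sections.append(f"## Source: {doc.source}\n\n{doc.text}")
        used += est
    return "\n\n".join(sections)
```

The deliberate choice here is that a document either fits whole or is dropped; aggressive mid-document truncation is exactly what the bigger window lets you stop doing.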

What it isn't for

Three things long context does not solve:

  1. Cost. Every doubling of the prompt roughly doubles inference cost. A 1M-token call is not a free lunch.
  2. Latency. Time-to-first-token grows with prompt length. For interactive products this is the constraint that bites first.
  3. Attention dilution. Past a certain length, models develop quiet preferences for the start and end of the prompt and treat the middle as ambient. The U-shape is real and shows up in evals even on frontier models.
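The cost point is just linear arithmetic, but it is worth running once. The per-token price below is a hypothetical placeholder, not any provider's real rate:

```python
# Back-of-envelope for point 1: input cost scales linearly with prompt
# length, so every doubling roughly doubles the bill.
PRICE_PER_MTOK = 3.00  # hypothetical: $3 per million input tokens

def daily_prompt_cost(prompt_tokens: int, calls_per_day: int) -> float:
    """Daily input-token spend in dollars for a fixed prompt size."""
    return prompt_tokens / 1_000_000 * PRICE_PER_MTOK * calls_per_day

for tokens in (8_000, 128_000, 1_000_000):
    print(f"{tokens:>9} tokens -> ${daily_prompt_cost(tokens, 10_000):,.2f}/day")
```

At 10,000 calls a day, the same workload goes from $240 at 8K tokens to $30,000 at 1M tokens, before caching or batching. That gap is why "just let the model read everything" rarely survives contact with a finance review.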

How I budget context now

I think of the window as having three regions, and I try to spend them deliberately:

  • The opening — system prompt, role, hard constraints. This is the most reliably attended-to part of the prompt. Put the rules here, not at the end.
  • The middle — supporting evidence, retrieved documents, conversation history. Long, but with structure. Headings and section breaks help the model navigate.
  • The closing — the actual ask. This is the second most attended-to region. Put the live instruction here, and re-state any constraint you can't afford to have softened.
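The three regions above can be made mechanical. This is an illustrative sketch, not any framework's API; the function and section names are mine:

```python
# A sketch of the three-region budget: rules up front, structured
# evidence in the middle, the live ask (with restated hard constraints)
# at the end. Names here are illustrative, not a real library.
def build_prompt(rules: str, evidence_sections: list[str], ask: str,
                 restate: list[str]) -> str:
    parts = [f"# Rules\n\n{rules}"]                     # opening: most attended
    for i, section in enumerate(evidence_sections, 1):  # middle: structured
        parts.append(f"# Evidence {i}\n\n{section}")
    closing = [f"# Task\n\n{ask}"]                      # closing: the live ask
    if restate:                                         # re-state what must not soften
        closing.append("Constraints to honor:\n" +
                       "\n".join(f"- {c}" for c in restate))
    parts.append("\n\n".join(closing))
    return "\n\n".join(parts)
```

The useful discipline is the `restate` argument: any rule that genuinely cannot be softened appears twice, once in the opening and again next to the task, so both high-attention regions carry it.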

The boring rule

If your retrieval is bad, a longer context window will mostly let you waste more money faster. Fix retrieval first. Then use the bigger window to give the model room to think, not to skip thinking.