The Hidden Cost of Long Context Windows (And the Pricing Cliffs Nobody Mentions)
A 1M-token context window doesn't mean 1M tokens at a flat rate. Here's what large prompts actually cost across Claude, GPT, and Gemini — including the tier cliffs that double your bill mid-prompt.
Send a 250,000-token prompt to Gemini 2.5 Pro and you'll pay $0.625. Send 200,000 tokens — 20% less text — and you'll pay $0.25. The smaller prompt is 60% cheaper, because crossing 200K tokens doesn't just bill the extra tokens at a higher rate. It re-bills the entire prompt at the higher tier.
Long context windows are sold as a convenience: dump your whole knowledge base in, skip the RAG pipeline. And sometimes that's the right call. But the pricing has cliffs, tiers, and tokenizer quirks that the marketing pages don't mention. This post walks through what large prompts actually cost — model by model, with the exact math — so you can see the cliffs before you hit them.
What you're actually paying for
The context window is a shared budget. Your system prompt, retrieved documents, conversation history, and the model's output all draw from it. Every token in that window is an input token you pay for, on every request — not amortized across sessions, not cached unless you explicitly set up caching.
A 1M-token context window is a capacity limit, not a pricing model. The price per token is set separately and varies by model and — on some providers — by how many tokens you actually send.
To ground this: at $3/M input (Claude Sonnet 4.6), a 100K-token prompt costs $0.30 per request. At 30,000 requests/month, that's $9,000/month in input alone — before output, before you've done anything clever. Context isn't free just because the window is large.
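The arithmetic behind that figure is just tokens times rate. A minimal sketch in TypeScript, using the Sonnet 4.6 rate from above (the request volume is illustrative):

```ts
// Flat per-token pricing: input cost is just tokens × rate.
const INPUT_RATE = 3.0; // $/1M input tokens (Claude Sonnet 4.6, from above)

function inputCostUSD(promptTokens: number): number {
  return (promptTokens / 1_000_000) * INPUT_RATE;
}

const perRequest = inputCostUSD(100_000); // $0.30
const perMonth = perRequest * 30_000;     // $9,000 at 30K requests/month
console.log(perRequest.toFixed(2), perMonth.toFixed(0));
```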
The pricing cliffs
Gemini 2.5 Pro / Gemini 3.1 Pro — the 200K cliff
For prompts at or below 200K input tokens, Gemini 2.5 Pro charges $1.25/M input and $10/M output. Cross that threshold and the rate jumps to $2.50/M input and $15/M output.
The critical detail: the higher rate applies to the entire prompt, not just the tokens past 200K. A 199K-token prompt costs $0.249. A 201K-token prompt costs $0.503. Adding 1% more text doubled the bill.
Gemini 3.1 Pro has the same cliff structure: $2/$12 up to 200K, $4/$18 above. You can model this precisely in the Gemini 2.5 Pro cost calculator — it detects which tier applies based on your actual prompt length.
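To make the cliff concrete in code: a minimal sketch of the tier rule as described above, with the Gemini 2.5 Pro rates hard-coded. This is an illustration, not the calculator's actual computeCost().

```ts
// Tiered pricing with whole-prompt re-billing: crossing the threshold
// re-prices the ENTIRE prompt, not just the overflow tokens.
const TIER_THRESHOLD = 200_000; // tokens
const RATE_LOW = 1.25;          // $/1M input, prompts up to 200K
const RATE_HIGH = 2.5;          // $/1M input, prompts above 200K

function geminiInputCostUSD(promptTokens: number): number {
  const rate = promptTokens > TIER_THRESHOLD ? RATE_HIGH : RATE_LOW;
  return (promptTokens / 1_000_000) * rate; // whole prompt at one rate
}

console.log(geminiInputCostUSD(199_000)); // 0.24875 → ~$0.249
console.log(geminiInputCostUSD(201_000)); // 0.5025  → ~$0.503: 1% more text, 2× the bill
```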
Claude — flat pricing, no cliff
Opus 4.7, Opus 4.6, and Sonnet 4.6 all support 1M context at flat per-token pricing. A 100K prompt and a 900K prompt bill at the same rate. No tiers, no thresholds, no re-billing.
This makes Claude the predictable choice for variable-length workloads. If your prompt size fluctuates — some requests pull 50K tokens of context, others pull 300K — you never hit a cliff that doubles a random subset of your traffic.
GPT-5.5 / GPT-5.4 — flat, but watch the tokenizer
OpenAI also prices flat across the 1M context window. No tier jumps. But o200k_base tokenizes differently than Claude's tokenizer. The same English document produces a different token count on each, which means the "same" prompt costs different amounts on different providers even at identical per-token rates.
The tokenizer multiplier — a cliff hiding inside the token count
Claude Opus 4.7's new tokenizer produces 1.0–1.46× as many tokens as Opus 4.6's for the same text. A document that measures 200K tokens on Opus 4.6's tokenizer could land anywhere from 200K to roughly 290K tokens on Opus 4.7.
On Gemini, that kind of variance is the difference between staying under the 200K cliff and falling off it. This is the cliff nobody sees because it's upstream of the pricing page entirely — your document didn't get longer, your tokenizer just counted more tokens.
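One way to spot the exposure before it bites: take the count you measured on the old tokenizer, apply the multiplier range, and check whether the cliff falls inside it. A sketch, assuming the 1.0–1.46× range above:

```ts
// Could a tokenizer change push this prompt over a pricing cliff?
const CLIFF = 200_000; // tokens (the Gemini tier threshold)

function crossesCliff(
  measuredTokens: number,
  multiplierLow = 1.0,   // assumed lower bound of the tokenizer ratio
  multiplierHigh = 1.46, // assumed upper bound (Opus 4.7 vs 4.6, from above)
): boolean {
  return (
    measuredTokens * multiplierLow <= CLIFF &&
    measuredTokens * multiplierHigh > CLIFF
  );
}

console.log(crossesCliff(150_000)); // true: 150K × 1.46 = 219K — over the cliff
console.log(crossesCliff(120_000)); // false: 120K × 1.46 = 175.2K — still under
```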
When long context beats RAG on cost
Long context wins when:
- Low-to-medium query volume (hundreds of queries/day, not millions). At this scale, RAG's fixed infrastructure costs (vector DB hosting at $50–300/mo, embedding, re-indexing pipelines) outweigh long context's per-query token premium.
- Content changes frequently. RAG needs re-embedding on every update. Long context reads the current version directly from the source.
- Single-document deep analysis. RAG's chunking fragments documents. Long context sees the whole thing, preserving cross-references and structure.
RAG wins when:
- High query volume at scale — per-query token savings compound across millions of requests.
- The corpus far exceeds what any single query needs — you'd be paying for tokens the model ignores.
- The corpus exceeds even a 1M window — RAG isn't optional, it's required.
The honest takeaway: it's not either/or, it's a volume-and-update-frequency calculation. Below roughly 1,000 queries/day on a moderately-sized corpus, long context is often cheaper all-in once you count infrastructure. Above that, RAG's per-query savings start to dominate.
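Here's a back-of-envelope version of that calculation, where every constant is an assumption to replace with your own numbers: a 150K-token cached context vs. 5K tokens of retrieved chunks, $250/mo of RAG infrastructure, $3/M input with a ~10% cache-read rate.

```ts
// Monthly cost: long context (no infra, big cached prompts)
// vs RAG (fixed infra, small prompts). All constants are assumptions.
const INPUT_RATE = 3.0;          // $/1M input tokens
const CACHED_RATE = 0.3;         // $/1M cached input tokens (~10%, see next section)
const RAG_INFRA = 250;           // $/mo vector DB + embedding + re-indexing
const LONG_CTX_TOKENS = 150_000; // full context per query
const RAG_TOKENS = 5_000;        // retrieved chunks per query

const monthly = (qpd: number, tokens: number, rate: number, infra = 0) =>
  infra + qpd * 30 * (tokens / 1_000_000) * rate;

for (const qpd of [100, 300, 1_000]) {
  const longCtx = monthly(qpd, LONG_CTX_TOKENS, CACHED_RATE); // assumes cache hits
  const rag = monthly(qpd, RAG_TOKENS, INPUT_RATE, RAG_INFRA);
  console.log(`${qpd}/day — long context: $${longCtx.toFixed(0)}, RAG: $${rag.toFixed(0)}`);
}
// 100/day  — long context: $135,  RAG: $295
// 300/day  — long context: $405,  RAG: $385
// 1000/day — long context: $1350, RAG: $700
// Crossover ≈ 280 queries/day with these inputs; uncached context crosses far
// sooner, a smaller context or cheaper model far later. Tune every constant.
```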
How to keep long-context costs down
Prompt caching is the biggest lever. If your large context is stable — a knowledge base, a codebase, a doc set — cache it. Cached input drops to ~10% of input price. A 200K-token cached context goes from $0.60 to $0.06 per request on Sonnet 4.6.
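In calculator form, a sketch assuming a flat ~10% cache-read rate; check your provider's actual cache pricing and minimum cacheable lengths:

```ts
// Cached vs uncached input cost for a stable 200K-token context
// (Sonnet 4.6 rates from above; ~10% cache-read rate assumed flat).
const INPUT_RATE = 3.0; // $/1M input tokens
const CACHE_READ = 0.3; // $/1M cached input tokens

const contextTokens = 200_000;
const uncached = (contextTokens / 1_000_000) * INPUT_RATE; // $0.60 per request
const cached = (contextTokens / 1_000_000) * CACHE_READ;   // $0.06 per request

console.log(`uncached $${uncached.toFixed(2)}, cached $${cached.toFixed(2)}`);
// At 30K requests/month: $18,000 vs $1,800. Caching is the biggest lever.
```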
Watch the tier threshold deliberately. If you're on Gemini and your prompts hover near 200K, either trim below the line or switch to a flat-priced model. The cliff is avoidable if you know it exists.
Batch API for async work. 50% off the entire request. Long-context batch jobs — overnight document processing, bulk analysis — are the ideal candidate for this discount.
Measure tokens on the actual model's tokenizer. A document that's "190K tokens" on one tokenizer may be 230K on another. That's the difference between one pricing tier and the next.
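For GPT-family models you can count locally with the js-tiktoken package; a sketch, assuming the target model uses o200k_base (verify the encoding per model). For Claude and Gemini, use the providers' token-count endpoints instead of guessing:

```ts
import { getEncoding } from "js-tiktoken";

// Count tokens with the encoding the target model actually uses.
// o200k_base is assumed here — confirm against your model's docs.
const enc = getEncoding("o200k_base");
const doc = "<load your document text here>"; // placeholder
const tokens = enc.encode(doc).length;

console.log(`${tokens} tokens on o200k_base`);
// The same text counts differently on Claude's or Gemini's tokenizer —
// always measure on the tokenizer you'll actually be billed against.
```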
Don't pay for tokens the model ignores. The "lost in the middle" effect is well-documented — models attend less to mid-context information. Stuffing the window with marginally relevant content costs money and can lower answer quality. Be selective about what goes in.
How we calculated this
All figures use each model's official tokenizer (tiktoken for GPT via o200k_base, Anthropic's count_tokens API for Claude, Google's countTokens for Gemini) and pricing verified against vendor pricing pages on May 14, 2026. Tier thresholds and the entire-prompt re-billing behavior for Gemini are documented in Google's pricing docs. The same computeCost() function powers both this article and the live calculator. Code is open source — github.com/darkknight4563/realaicost.
Related reading: GPT-5.5 costs 2× more than GPT-5.4 for the same job — the pricing math behind OpenAI's latest flagship.
See the cliffs on your actual prompt
Paste your real context, set your real volume. The calculator detects which pricing tier applies and shows the cost across all 16 models — including the tier jumps.
Open the calculator →