Cost optimization

How to Reduce LLM API Costs: 7 Strategies That Actually Work

Most teams overpay for LLM APIs by 2-5× because they treat pricing-page rates as fixed. Every major provider ships discount mechanisms that stack — caching, batching, routing, and prompt trimming can drop a $1,200/month bill to under $300. Here's how, with exact math.

May 19, 2026 · 7 min read

A typical production workload — 30,000 requests/month, 800 input tokens, 400 output tokens — costs $360/month on GPT-5.4 at sticker price. Apply every strategy in this guide and that same workload drops to $68/month. That's an 81% reduction, and none of it requires switching providers or degrading quality.

These aren't theoretical savings. Each strategy uses a documented API feature or a prompt engineering technique with predictable results. We'll show the math for each one using the same computeCost() function that powers the RealAICost calculator.

1. Enable prompt caching (saves 40-60%)

Every major provider now offers prompt caching: the ability to mark a stable prefix (system prompt, tool definitions, few-shot examples) so it's processed once and reused across requests. The cached portion bills at 10% of the normal input rate.

In a typical chatbot or agent setup, 60-90% of input tokens are identical across requests — the system prompt, tool schemas, and retrieval context. Caching these tokens at 90% off fundamentally changes the economics.

Provider	Normal input	Cached input	Discount
Claude Sonnet 4.6	$3.00/M	$0.30/M	90%
Claude Opus 4.7	$5.00/M	$0.50/M	90%
GPT-5.4	$2.50/M	$0.25/M	90%
GPT-5.5	$5.00/M	$0.50/M	90%

The math: Our baseline workload (800 input tokens, 70% cache hit rate) on GPT-5.4: uncached cost is $2.50 × 0.0008 × 30,000 = $60/mo input. With caching: (0.3 × $2.50 + 0.7 × $0.25) × 0.0008 × 30,000 = $22.20/mo input. That's a 63% reduction on the input side alone.

Implementation is straightforward — Anthropic uses a cache_control block in the messages array, OpenAI uses automatic prefix matching, and Google caches via the cachedContent resource. The key is structuring your prompt so the stable parts come first.

2. Use the Batch API for async work (saves 50%)

Every provider offers a batch endpoint that processes requests asynchronously (typically within 24 hours) at 50% off the entire bill — both input and output tokens.

Batch API is ideal for any workload where you don't need a real-time response: nightly document processing, eval runs, data extraction pipelines, bulk classification, content generation queues. It's the single most underused discount in the LLM ecosystem.

The math: Our 30k-request workload at GPT-5.4 sticker price costs $360/mo. Run it through the Batch API: $180/mo. Stack it with caching and the input portion drops further.

Batch + caching combined on the same workload: input goes from $60 to $11.10 (caching at 70% hit, then batch 50% off), output from $300 to $150 (batch only, output isn't cacheable). Total: $161/mo — 55% off sticker.

3. Route to the right model (saves 30-90%)

The biggest cost lever isn't a discount — it's picking the right model for the job. Most production workloads include a mix of task complexities, but teams often route everything to a single flagship model.

Classification, extraction, JSON formatting, and simple Q&A don't need Opus 4.7 or GPT-5.5. Haiku 4.5 handles most structured tasks at $1/$5 per million — that's 5× cheaper than Sonnet and 25× cheaper than Opus on output.

Task type	Recommended model	Output $/M	vs Opus 4.7
Classification, extraction	Haiku 4.5	$5.00	-80%
Summarization, Q&A	Sonnet 4.6 / GPT-5.4	$15.00	-40%
Complex reasoning, code gen	Opus 4.7 / GPT-5.5	$25-30	baseline

A common pattern is a two-model pipeline: a fast classifier (Haiku) decides whether the request needs a flagship model or can be handled by a mid-tier one. The classifier call costs fractions of a cent and saves dollars on every request it routes down.

4. Trim your prompts (saves 10-30%)

Most prompts carry 10-30% waste in filler phrases, redundant instructions, and verbose formatting that models don't need. Removing this waste directly reduces your token count and your bill.

Common patterns that waste tokens:

Politeness padding: "I would really appreciate it if you could please" → "List" (saves ~12 tokens per occurrence)
Redundant intensifiers: "It is extremely important and absolutely critical that" → "You must" (saves ~10 tokens)
Duplicate instructions: Restating the same constraint in different words. Models follow a clear instruction once; repeating it burns tokens without improving compliance.
Excessive few-shot examples: 5+ examples when 2-3 achieve the same accuracy. Each example is 50-200 tokens of input you pay for on every request.

At scale, these add up. On 30k requests/month with an average 80 wasted tokens per prompt at $3/M (Sonnet 4.6): that's $7.20/month in pure waste. On Opus 4.7 at $5/M: $12/month. Multiply by the number of distinct prompts in your system.

TokenAdvisor identifies these patterns automatically — paste your prompt, get specific recommendations with dollar amounts at your volume.

5. Constrain output length (saves 10-40%)

Output tokens cost 3-6× more than input tokens across every major provider. Yet most API calls let the model decide how long to respond, and models default to verbose.

Three techniques to control output costs:

Set max_tokens to a reasonable ceiling. If you need a one-sentence answer, cap at 100 tokens. The model stops when it's done or hits the limit — you never pay for a runaway response.
Specify output format. "Respond with a JSON object containing only 'category' and 'confidence' fields" produces far fewer tokens than "Explain your reasoning and then classify the input."
Use structured output modes. Anthropic's tool use, OpenAI's JSON mode, and Google's response schemas all constrain the output shape, reducing token count and parse failures simultaneously.

The math: Reducing average output from 400 to 200 tokens on Sonnet 4.6 ($15/M output) saves $15 × 0.0002 × 30,000 = $90/month. That's the biggest single-strategy saving after caching.

6. Watch tokenizer differences across providers

The same English text produces different token counts on different providers. This is invisible on the pricing page but shows up directly in your bill.

Concrete example: a 500-word technical document (code samples, JSON, markdown) measures 680 tokens on GPT-5.4 (o200k_base), 710 on Claude Sonnet 4.6, and 650 on Gemini 2.5 Pro. That's a 9% spread on the same text, which translates to a 9% cost difference at identical per-token prices.

The spread widens on certain content types. Code and structured data (JSON, XML, YAML) can show 15-30% variation across tokenizers. Claude Opus 4.7's newer tokenizer produces up to 46% more tokens than Opus 4.6 on technical content — a hidden cost of upgrading models even when the per-token price stays flat.

Action: Before committing to a provider, run your actual production prompts through each tokenizer and compare. The RealAICost calculator does this automatically — paste your prompt and see exact token counts and costs across all providers side by side.

7. Avoid context-window pricing traps

Gemini 2.5 Pro and 3.1 Pro double their per-token rate when your prompt exceeds 200K tokens. The critical detail: the higher rate applies to the entire prompt, not just the overflow. A 201K-token prompt costs 2× what a 199K-token prompt costs.

If you're running RAG pipelines that occasionally pull enough context to cross 200K, you have an unpredictable cost multiplier on a subset of your traffic. The fix: monitor your prompt lengths, set a hard ceiling below 200K, or switch to a flat-priced model (Claude, GPT) for long-context workloads.

For a deep dive on this topic, see our post on the hidden cost of long context windows.

Putting it all together

Here's what our 30k-request baseline looks like after stacking strategies:

Strategy	Monthly cost	Cumulative saving
Sticker price (GPT-5.4)	$360	—
+ Prompt caching (70% hit)	$322	-11%
+ Output trimming (400→250 tokens)	$266	-26%
+ Prompt trimming (15% input reduction)	$254	-29%
+ Model routing (40% to Haiku)	$168	-53%
+ Batch API on eligible traffic (60%)	$97	-73%

Each row builds on the previous ones. The order matters — caching first (because it reduces the base cost that all other calculations multiply against), then output trimming (highest per-token savings), then input trimming, then routing, then batch.

The exact numbers depend on your workload shape. Run your actual prompt through the RealAICost calculator with your real volume to see what each strategy saves for your specific case.

Tools to help

RealAICost — paste your prompt, set your volume, see the real cost across all models with caching and batch discounts built in.
TokenAdvisor — analyzes your prompt for waste patterns and gives specific savings recommendations with dollar amounts.

See what your actual prompt costs

Paste your real prompt, set your volume, toggle caching and batch discounts. See the exact cost across every flagship model — no signup, no tracking.

Open the calculator →

Prices verified against Anthropic, OpenAI, and Google official pricing pages on May 19, 2026. Calculations use the same computeCost() math as the live calculator. RealAICost is open source — code at github.com/darkknight4563/realaicost. Not affiliated with any model provider.