The trap is ignoring the bill until it arrives. Teams build their prototype on Claude Sonnet, stuff every API call with a 2,000-token system prompt, keep the full conversation history in every request, and call external tools on every turn. The demo works beautifully. Then they deploy to a hundred users and the monthly API spend hits four figures before anyone checks the dashboard. The response is always the same: panic, followed by slashing context indiscriminately, followed by complaints that "the AI got dumber."
The AI did not get dumber. You cut the information it needed to be smart because you never designed the cost structure in the first place.
Cost and latency in the Claude API are driven by the same underlying mechanics. Understanding those mechanics gives you the lever to optimize both without sacrificing output quality. The goal is not to spend less -- it is to spend smarter.
Four factors determine the total cost of every API request. Most developers fixate on one of them and ignore the other three.
Input tokens -- everything you send to the model. Your system prompt, the user's message, the conversation history, any injected context or retrieved documents. This is usually the largest line item, and it is the one teams have the most control over. A 1,500-token system prompt costs 1,500 input tokens on every single request. Over 10,000 requests per day, that is 15 million input tokens per day just on system prompts.
Output tokens -- everything the model generates. Long explanations, detailed reasoning, verbose code outputs. Output tokens are typically more expensive per token than input tokens. Controlling output length is the second-highest-leverage optimization you can make.
Model choice -- the single biggest multiplier on your bill. Claude Opus can cost on the order of ten times more per token than Claude Haiku; the exact ratio depends on the model generation and current pricing. If you are using Opus for ticket classification, you are paying a premium for capability you are not using.
Tool usage -- every tool call adds tokens. The tool definition is included in the input. The model's decision to call the tool is output. The tool's response becomes additional input on the next turn. A multi-tool pipeline can double or triple the token count of what looks like a simple request.
Treat every API request like a budget. Input tokens are the fixed costs -- system prompt, context, history. Output tokens are the variable costs -- controlled by max_tokens and prompt instructions. Model choice is the unit price. Before optimizing prompts, know where your tokens are going.
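The budget framing above can be turned into arithmetic. The following is a minimal sketch, not a pricing tool: the `Price` class, the function names, and above all the dollar figures are placeholders invented for illustration, so substitute the current published prices for the model you actually use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    """USD per million tokens. The values used below are placeholders."""
    input_per_mtok: float
    output_per_mtok: float


def request_cost(input_tokens: int, output_tokens: int, price: Price) -> float:
    """Cost of a single request in USD."""
    total = (
        input_tokens * price.input_per_mtok
        + output_tokens * price.output_per_mtok
    )
    return total / 1_000_000


def daily_cost(
    input_tokens: int,
    output_tokens: int,
    price: Price,
    requests_per_day: int,
) -> float:
    """Cost per day in USD if every request has the same shape."""
    return request_cost(input_tokens, output_tokens, price) * requests_per_day


# Hypothetical price point, for illustration only.
EXAMPLE_PRICE = Price(input_per_mtok=1.00, output_per_mtok=5.00)

# The 1,500-token system prompt from the text at 10,000 requests per day.
SYSTEM_PROMPT_TOKENS_PER_DAY = 1_500 * 10_000  # 15 million input tokens
```

Calling `daily_cost(1_500, 0, EXAMPLE_PRICE, 10_000)` isolates what the system prompt alone costs per day at the placeholder price, before a single user message or reply is counted.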
Beyond the obvious four factors, several patterns quietly inflate your bill without appearing on any dashboard.
Inefficient chaining. Multi-step agent workflows where each step triggers a full API call. If step two does not actually need the full output of step one -- just a single field from it -- you are paying for tokens you throw away.
Overlong system prompts. System prompts grow over time. Someone adds a rule for edge case handling. Someone else adds a paragraph about tone. Six months later the system prompt is 3,000 tokens and half of it is irrelevant to most requests. Because the system prompt is included in every call, this waste compounds.
Non-deduplicated context. Repeating the same information across multiple turns of a conversation. If you paste the company's FAQ into every user message instead of putting it in the system prompt once, each copy stays in the history, so by turn ten a single request carries ten copies of the same tokens.
Unclear prompts that cause retries. A vague instruction leads to a wrong answer. The user rephrases. The model tries again. Each attempt costs tokens. A clear prompt that gets the right answer on the first try is cheaper than a short prompt that requires three rounds of correction.
Full conversation history on every request. A twenty-turn conversation means every new request includes all twenty previous messages as input. The cost of each turn is not constant -- it grows linearly with conversation length.
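The growth in that last pattern is easy to underestimate: if each turn's input grows linearly, the total for the whole conversation grows roughly quadratically. Here is a small sketch of the arithmetic, assuming (purely for illustration) that every message is the same size and that each turn adds one user message and one assistant reply. The function names are my own.

```python
def input_tokens_for_turn(
    turn: int, system_tokens: int, tokens_per_message: int
) -> int:
    """Input tokens sent on a 1-indexed turn when full history is included.

    Turn n carries the system prompt, the new user message, and the
    (n - 1) earlier user/assistant pairs.
    """
    prior_messages = 2 * (turn - 1)
    return system_tokens + (prior_messages + 1) * tokens_per_message


def cumulative_input_tokens(
    turns: int, system_tokens: int, tokens_per_message: int
) -> int:
    """Total input tokens billed across the whole conversation."""
    return sum(
        input_tokens_for_turn(t, system_tokens, tokens_per_message)
        for t in range(1, turns + 1)
    )
```

With a 500-token system prompt and 100-token messages, turn 1 sends 600 input tokens and turn 20 sends 4,400. The twenty turns together send 50,000 input tokens, against 12,000 if every turn cost the same as the first.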
Every API response includes usage.input_tokens and usage.output_tokens. If you are not logging these on every call, you have no data to optimize against. Add token logging before you add any optimization. You cannot improve what you do not measure.
Choosing the right model for each task is the highest-leverage cost optimization available. It requires no prompt rewriting and no architectural refactoring. You just change a string.
Take a pipeline with classification, extraction, and summarization steps, all running on Sonnet. Moving the first two to Haiku is a 45% cost reduction from changing two strings. The classification and extraction results are identical -- these tasks do not require the reasoning depth of Sonnet. The summarization task stays on Sonnet because it benefits from the model's ability to compress and rephrase.
The heuristic I use: start every new agent on Haiku. Run your test suite. If the outputs meet your quality bar, keep it on Haiku. If they do not, move to Sonnet. Reserve Opus for tasks where you have empirically proven that Sonnet is insufficient -- deep analysis, complex multi-step reasoning, or safety-critical outputs where a small quality improvement justifies a large cost increase.
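That heuristic is mechanical enough to write down. The sketch below assumes you already have a test suite that can be run against a given model and reduced to a single score. The `run_suite` callable, the tier labels, and the numeric quality bar are all hypothetical stand-ins, and the labels are tier names, not API model IDs.

```python
from typing import Callable, Sequence

# Ordered from cheapest to most capable. Labels only, not real model IDs.
TIERS: Sequence[str] = ("haiku", "sonnet", "opus")


def pick_cheapest_passing_tier(
    run_suite: Callable[[str], float],
    quality_bar: float,
    tiers: Sequence[str] = TIERS,
) -> str:
    """Return the first (cheapest) tier whose suite score meets the bar.

    Falls back to the most capable tier if none of them pass, so the
    caller always gets a usable answer.
    """
    for tier in tiers:
        if run_suite(tier) >= quality_bar:
            return tier
    return tiers[-1]
```

Running this once per agent or endpoint, rather than once for the whole application, is what makes per-task model selection possible.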
Output tokens are more expensive than input tokens, and they are the easiest to control. Two techniques work reliably.
Set max_tokens tightly. If you expect a one-word classification, set max_tokens to 50, not 1024. The limit is a hard cap: generation is cut off when the budget runs out, so a tight limit prevents runaway responses. The model is not told what the limit is, so the cap also acts as a forcing function on you -- if the budget is 50 tokens, the system prompt has to ask for something that fits in 50 tokens, or the answer will be truncated.
Instruct the model on response length in the system prompt. "Respond in exactly 3 bullet points" or "Answer in 2-3 sentences" gives the model explicit constraints that reduce output tokens without reducing output quality.
import anthropic

client = anthropic.Anthropic()


def ask_concise(question: str, max_points: int = 5) -> str:
    """Get a concise, token-efficient response."""
    response = client.messages.create(
        # Model ID as written in the original; confirm it against the
        # current model list before running.
        model="claude-haiku-4-20250514",
        max_tokens=256,
        system=(
            f"Answer in exactly {max_points} bullet points. "
            "Each point must be one sentence. No preamble, no summary."
        ),
        messages=[{"role": "user", "content": question}],
    )
    return response.content[0].text
The combination of max_tokens=256 and an instruction to use bullet points keeps the output under 200 tokens for most questions. Compare that to a default call with max_tokens=4096 and no length guidance, which can easily produce 800+ tokens of output for the same question.
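A tight limit is only safe if you know how long real answers are and can tell when one was cut off. The helper below is a sketch under two assumptions. First, you have already logged output token counts for the endpoint. Second, a headroom factor of 2 over the median is a reasonable starting point; that is a rule of thumb, not an API requirement. The function names and the `floor` parameter are my own. The `"max_tokens"` stop reason is the value the Messages API reports in `response.stop_reason` when generation hits the cap.

```python
import math
import statistics


def suggest_max_tokens(
    observed_output_tokens: list[int],
    headroom: float = 2.0,
    floor: int = 16,
) -> int:
    """Suggest a max_tokens value from logged output lengths.

    Uses the median so one unusually long response does not inflate
    the limit, then multiplies by a headroom factor.
    """
    if not observed_output_tokens:
        raise ValueError("need at least one observed output length")
    typical = statistics.median(observed_output_tokens)
    return max(floor, math.ceil(typical * headroom))


def was_truncated(stop_reason: str) -> bool:
    """True if the response ended because it hit the max_tokens cap."""
    return stop_reason == "max_tokens"
```

Logging `was_truncated(response.stop_reason)` alongside token usage shows whether a limit is too tight: a steady stream of truncated responses means the cap, or the prompt's length instruction, needs revisiting.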
Not every request needs a fresh API call. If the same question gets asked repeatedly -- "What are your business hours?" or "How do I reset my password?" -- the answer does not change between requests. Caching the response and returning it on subsequent identical queries is the cheapest optimization available: zero API calls, zero tokens, zero latency.
The cache does not need to be sophisticated. A dictionary keyed by the hash of the input message works for most applications. Set a TTL (time to live) so stale answers expire. For more complex setups, use Redis or Memcached with the same key structure.
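A dictionary-backed cache along those lines might look like the sketch below. The class name and method names are hypothetical. Two choices go beyond what the text specifies: the key also covers the model and system prompt, so that changing either invalidates old answers, and the message is lowercased and whitespace-normalized before hashing so trivially different phrasings share an entry.

```python
import hashlib
import time
from typing import Callable, Optional


class ResponseCache:
    """In-memory response cache with a time-to-live per entry."""

    def __init__(
        self,
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested
        self._store: dict[str, tuple[float, str]] = {}

    @staticmethod
    def _key(model: str, system: str, message: str) -> str:
        # Normalize case and whitespace so near-identical queries collide.
        normalized = " ".join(message.lower().split())
        raw = "\x1f".join((model, system, normalized))
        return hashlib.sha256(raw.encode("utf-8")).hexdigest()

    def get(self, model: str, system: str, message: str) -> Optional[str]:
        """Return the cached answer, or None if missing or expired."""
        key = self._key(model, system, message)
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, answer = entry
        if self._clock() >= expires_at:
            del self._store[key]  # drop stale entries lazily
            return None
        return answer

    def put(self, model: str, system: str, message: str, answer: str) -> None:
        """Store an answer that expires ttl_seconds from now."""
        key = self._key(model, system, message)
        self._store[key] = (self._clock() + self._ttl, answer)
```

In use, check `cache.get(...)` before calling the API and call `cache.put(...)` after a successful response. This only makes sense for questions whose answers do not depend on the user or on conversation history.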
The principle extends beyond identical queries. If your system prompt includes stable reference data -- a product catalog, a FAQ document, a set of company policies -- consider caching the model's first response to common questions about that data. The reference data changes monthly. The questions change daily. The answers to most questions stay the same between reference data updates.
The Anthropic API also supports prompt caching. You opt in by marking a stable prefix of the request, such as a large system prompt, with a cache_control marker; later requests that begin with the same prefix can reuse the processed representation instead of re-processing it. This reduces both latency and cost on the input token side. If your system prompt is large and stable, prompt caching is one of the highest-return optimizations you can enable.
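As a rough illustration of what enabling it involves, the helper below builds request arguments in which the system prompt is a content block marked with `cache_control`. The helper name and its parameters are hypothetical. The block shape and the `{"type": "ephemeral"}` marker follow the prompt caching documentation as I understand it, and details such as the minimum cacheable prompt length and the pricing of cache writes and reads should be checked against the current docs.

```python
def build_cached_request(
    model: str,
    stable_system_text: str,
    messages: list[dict],
    max_tokens: int,
) -> dict:
    """Build kwargs for client.messages.create with a cacheable system prompt.

    The system prompt is sent as a list of content blocks, not a plain
    string, so the stable text can carry a cache_control marker.
    """
    return {
        "model": model,
        "max_tokens": max_tokens,
        "system": [
            {
                "type": "text",
                "text": stable_system_text,
                # Marks the end of the prefix the API may cache.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": messages,
    }
```

The result is passed as `client.messages.create(**build_cached_request(...))`. To confirm the cache is being hit, log `response.usage.cache_creation_input_tokens` and `response.usage.cache_read_input_tokens` alongside the regular token counts.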
Multi-step workflows are where cost optimization has the biggest impact because the savings multiply across every step.
The pattern I use is dynamic system prompts -- building the system instruction at runtime based on the user's input rather than using a static 2,000-token instruction set for every request:
import anthropic
import re

client = anthropic.Anthropic()


def wants_bullet_points(message: str) -> bool:
    """Check if the user wants a bulleted response."""
    keywords = ["point", "bullet", "list", "steps", "items"]
    return any(kw in message.lower() for kw in keywords)


def extract_point_count(message: str) -> int:
    """Extract the number of points requested, default to 5."""
    match = re.search(r"(\d+)\s*points?", message.lower())
    return int(match.group(1)) if match else 5


def build_system_prompt(user_message: str) -> str:
    """Build a minimal system prompt tailored to this specific request."""
    if wants_bullet_points(user_message):
        count = extract_point_count(user_message)
        return (
            f"Answer in exactly {count} concise bullet points. "
            "Each point: one sentence, no filler."
        )
    return "Answer in 2-3 sentences. Be direct. No preamble."


def ask_optimized(
    user_message: str,
    history: list[dict],
) -> dict:
    """Token-optimized API call with usage tracking."""
    system = build_system_prompt(user_message)
    history.append({"role": "user", "content": user_message})

    # Keep history bounded to the last 10 messages
    trimmed = history[-10:]
    # The Messages API expects the first message to come from the user.
    # An even-sized window over alternating turns that ends on a user
    # message starts on an assistant reply, so drop that reply.
    if trimmed[0]["role"] != "user":
        trimmed = trimmed[1:]

    response = client.messages.create(
        # Model ID as written in the original; confirm it against the
        # current model list before running.
        model="claude-haiku-4-20250514",
        max_tokens=512,
        system=system,
        messages=trimmed,
    )
    reply = response.content[0].text
    history.append({"role": "assistant", "content": reply})

    return {
        "reply": reply,
        "input_tokens": response.usage.input_tokens,
        "output_tokens": response.usage.output_tokens,
    }
The system prompt is no longer a static document -- it is a one- or two-sentence instruction built from the user's input. When the user asks for bullet points, the prompt enforces a strict format. When they ask a general question, the prompt enforces brevity. Either way, the system prompt is under 30 tokens instead of the 500-2,000 tokens that most applications send.
The function also returns token usage alongside the reply. This is not optional -- it is the data you need to track your cost per request and identify optimization opportunities.
Cost optimization is not about spending less. It is about spending deliberately. Use Haiku where Haiku suffices. Set max_tokens to match your actual output needs. Build system prompts dynamically instead of shipping a 2,000-token static instruction on every request. Trim conversation history to a fixed window. And log every token so you can prove your optimizations are working.
Before you optimize anything, you need visibility. Here is the minimum viable cost tracking I add to every project:
import anthropic
import logging
from collections import defaultdict

logger = logging.getLogger(__name__)

# Simple in-memory tracker -- replace with your metrics system
token_tracker = defaultdict(lambda: {"calls": 0, "input": 0, "output": 0})


def tracked_call(
    client: anthropic.Anthropic,
    endpoint: str,
    messages: list[dict],
    # Model ID as written in the original; confirm it against the
    # current model list before running.
    model: str = "claude-haiku-4-20250514",
    max_tokens: int = 512,
    system: str = "",
) -> str:
    """API call with per-endpoint token tracking."""
    kwargs = {
        "model": model,
        "max_tokens": max_tokens,
        "messages": messages,
    }
    # Only send a system prompt when one was provided
    if system:
        kwargs["system"] = system

    response = client.messages.create(**kwargs)
    text = response.content[0].text

    # Track usage by endpoint
    token_tracker[endpoint]["calls"] += 1
    token_tracker[endpoint]["input"] += response.usage.input_tokens
    token_tracker[endpoint]["output"] += response.usage.output_tokens

    logger.info(
        "API call [%s]: %d input, %d output tokens",
        endpoint,
        response.usage.input_tokens,
        response.usage.output_tokens,
    )
    return text


def print_usage_report():
    """Print a summary of token usage by endpoint."""
    print("\n--- Token Usage Report ---")
    total_input = 0
    total_output = 0
    for endpoint, stats in sorted(token_tracker.items()):
        # max(..., 1) guards against division by zero
        avg_input = stats["input"] // max(stats["calls"], 1)
        avg_output = stats["output"] // max(stats["calls"], 1)
        print(f"{endpoint}: {stats['calls']} calls, "
              f"avg {avg_input} in / {avg_output} out")
        total_input += stats["input"]
        total_output += stats["output"]
    print(f"Total: {total_input} input + {total_output} output tokens")
Tag every API call with an endpoint name. At the end of the day, print the report. The endpoint with the highest total input tokens is your biggest optimization target. The endpoint with the highest average output tokens is where you need tighter max_tokens constraints. You cannot make intelligent cost decisions without this data.
Add response.usage.input_tokens and response.usage.output_tokens to your application logs. Tag each entry with the endpoint or function name. Run for one week before making any optimization decisions.
List every place your application calls the Claude API. For each call, ask: does this task require Sonnet-level reasoning? If the answer is "probably not," switch it to Haiku and run your test suite. You will likely find that half your calls can run on a cheaper model without quality loss.
Search your codebase for max_tokens=4096 and other large values copied from examples (the Messages API requires max_tokens, so every call sets one somewhere). For each one, look at the actual output length. Set max_tokens to 2x the typical output length. A classifier needs 50 tokens, not 4,096.
Count the tokens in every system prompt. Any prompt over 500 tokens is worth reviewing. Delete instructions that do not change the output. Merge redundant rules. Move stable context to the system prompt and dynamic context to the user message.
If your application maintains conversation history, cap it at 10 messages. Summarize older messages with Haiku and inject the summary into the system prompt. This single change often cuts input token costs by 30-50% for long conversations.
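The history cap with summarization can be sketched as two small functions. Everything here is illustrative: `compact_history`, `with_summary`, and the injected `summarize` callable are hypothetical names, and the callable stands in for whatever Haiku call you use to produce the summary. The split point is nudged forward so the kept window starts on a user message, which means it can hold slightly fewer than `keep_last` messages.

```python
from typing import Callable

Message = dict[str, str]


def compact_history(
    history: list[Message],
    summarize: Callable[[list[Message]], str],
    keep_last: int = 10,
) -> tuple[str, list[Message]]:
    """Split history into a summary of older turns and a recent window.

    Returns (summary, recent_messages). The summary is empty when the
    history already fits within keep_last messages.
    """
    if len(history) <= keep_last:
        return "", list(history)

    cut = len(history) - keep_last
    # Move the cut forward until the recent window starts with a user turn.
    while cut < len(history) and history[cut]["role"] != "user":
        cut += 1

    older, recent = history[:cut], history[cut:]
    return summarize(older), recent


def with_summary(base_system: str, summary: str) -> str:
    """Append the conversation summary to the system prompt, if any."""
    if not summary:
        return base_system
    return f"{base_system}\n\nSummary of earlier conversation:\n{summary}"
```

The summarization call itself costs tokens, so in practice it is worth caching the summary and regenerating it only when the window slides, not on every request.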
You cannot make intelligent cost decisions without data. Log every token. Tag every call. The dashboard is the foundation of every optimization that follows.