The trap is shipping the demo. Your application works in development. The API responds in under a second. The outputs are clean. You deploy to production and go to bed. At 3 AM, Anthropic's servers hit a traffic spike. Your application gets rate-limited. The retry logic you never wrote is not retrying. The error message your users see is a raw Python traceback. Your on-call engineer wakes up, stares at a 500 Internal Server Error, and has no idea whether the problem is your code, the API, or the network.
The difference between a prototype and a production system is not features. It is error handling.
API errors are not bugs in your code. They are a normal part of operating a system that depends on an external service over the network. The Claude API will return errors. The network will drop packets. The server will be overloaded during peak hours. Your job is not to prevent these failures -- you cannot -- but to handle them so gracefully that your users never notice.
Every error the Claude API returns has an HTTP status code and a JSON body. The status code tells you the category. The body tells you the specifics. Here is the taxonomy I keep in my head:
400 -- Invalid Request. Your input is malformed. Bad JSON, missing required fields, a parameter outside its allowed range. This is your bug. Fix the code, do not retry.
401 -- Authentication Error. The API key is missing, expired, or wrong. Check your environment variables. This never fixes itself -- retrying is pointless.
403 -- Permission Error. The key is valid but does not have permission for the requested resource. You are calling a model your plan does not include, or hitting an endpoint your key is not scoped for.
404 -- Not Found. The endpoint or resource does not exist. You mistyped the model name, or you are calling a deprecated endpoint.
413 -- Request Too Large. The request body exceeds the maximum size in bytes that the API accepts, typically because of large attachments or a very long conversation history. (A prompt that merely exceeds the model's context window is usually reported as a 400 instead.) Trim the input and try again.
429 -- Rate Limited. You are sending requests faster than Anthropic allows. This is the most common production error and the one that most needs a proper retry strategy.
500 -- Internal Server Error. Something broke on Anthropic's side. Retry with backoff.
529 -- Overloaded. The service is under heavy load. Similar to a 500 but with a clearer signal: back off and try later.
Not every error deserves a retry. Most 400-level client errors are your fault -- fix the code or the configuration. A 413 is also yours to fix, by trimming the input before you send it again. 429 and 5xx errors are transient -- retry with exponential backoff. The decision tree is: Is it my fault? Fix it. Is it temporary? Retry it. Is it permanent? Surface it.
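The decision tree above can be written down as a small lookup. This is a minimal sketch, not an official mapping: the `Action` enum and `classify_status` are hypothetical names, and treating 401 and 403 as "surface" (a human has to fix the key or the plan) is one reading of the taxonomy.

```python
from enum import Enum


class Action(Enum):
    FIX = "fix"          # the request itself is wrong: change the code or input
    RETRY = "retry"      # transient: retry with backoff
    SURFACE = "surface"  # will not fix itself: report it to a human


def classify_status(status_code: int) -> Action:
    """Map an HTTP status code to one of the three actions above."""
    if status_code == 429 or status_code >= 500:
        return Action.RETRY
    if status_code in (401, 403):
        # Assumption: credential and permission problems need an operator.
        return Action.SURFACE
    return Action.FIX  # 400, 404, 413 and any other 4xx


for code in (400, 401, 413, 429, 500, 529):
    print(code, classify_status(code).value)
```

Keeping this decision in one function means every retry wrapper in the codebase answers "should I retry?" the same way.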
The naive retry -- catching any exception and immediately re-sending the request -- is worse than no retry at all. If the API is rate-limited and you immediately retry, you are adding to the very traffic that caused the rate limit. You are pouring gasoline on the fire.
The pattern that works in production is exponential backoff with jitter. Each retry waits longer than the last, and a random jitter prevents multiple clients from retrying in lockstep (the thundering herd problem).
import anthropic
import time
import random


def send_with_retry(
    client: anthropic.Anthropic,
    messages: list[dict],
    model: str = "claude-sonnet-4-20250514",
    max_retries: int = 3,
    base_delay: float = 1.0,
) -> str:
    """Send a message to Claude with exponential backoff retry."""
    last_exception = None

    for attempt in range(max_retries):
        try:
            response = client.messages.create(
                model=model,
                max_tokens=1024,
                messages=messages,
            )
            return response.content[0].text

        except anthropic.RateLimitError as e:
            last_exception = e
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            print(f"Rate limited (attempt {attempt + 1}/{max_retries}). "
                  f"Waiting {delay:.1f}s...")
            time.sleep(delay)

        except anthropic.InternalServerError as e:
            last_exception = e
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            print(f"Server error (attempt {attempt + 1}/{max_retries}). "
                  f"Waiting {delay:.1f}s...")
            time.sleep(delay)

        except anthropic.APIStatusError as e:
            # 400-level errors (except 429) are not retryable
            raise

    raise last_exception
The key decisions in this code: RateLimitError and InternalServerError get retried with backoff. Other APIStatusError exceptions (bad request, auth failure, permission denied) are raised immediately because retrying will never fix them. The delay doubles on each attempt (1s, 2s, 4s) with a random jitter of up to one second. After three failed attempts, the last exception is re-raised so the caller can decide what to do.
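To make the schedule concrete, the delay formula can be pulled out into a pure function and inspected without calling the API. This is a small sketch under the same assumptions as the code above (base delay of one second, up to one second of jitter); `backoff_delay` is a hypothetical helper name, and the random generator is injected only so the run is repeatable.

```python
import random
from typing import Optional


def backoff_delay(attempt: int, base_delay: float = 1.0, jitter: float = 1.0,
                  rng: Optional[random.Random] = None) -> float:
    """Delay after a failed attempt (0-indexed): base * 2^attempt plus jitter."""
    rng = rng or random.Random()
    return base_delay * (2 ** attempt) + rng.uniform(0, jitter)


rng = random.Random(0)  # seeded so repeated runs print the same schedule
for attempt in range(3):
    low = 1.0 * (2 ** attempt)
    delay = backoff_delay(attempt, rng=rng)
    print(f"attempt {attempt + 1}: wait {delay:.2f}s "
          f"(between {low:.0f}s and {low + 1:.0f}s)")
```

With three attempts the waits fall in the ranges 1 to 2, 2 to 3, and 4 to 5 seconds, so the worst case adds about ten seconds of waiting before the caller sees the failure.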
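The official anthropic Python SDK retries automatically on connection errors, 429s, and 5xx errors with exponential backoff, twice by default. If you are using the SDK's default client, you get basic retry logic for free. The custom implementation above is for when you need more control -- custom logging, different retry counts per error type, or circuit breaker integration. If you do wrap the client in your own retry loop, pass max_retries=0 when constructing it; otherwise each of your attempts is silently multiplied by the SDK's own retries.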
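The circuit breaker mentioned above is not defined in this chapter, so here is one minimal way it could look. Treat it as an illustrative sketch: the class name, the threshold of five consecutive failures, and the 30-second cool-down are all assumptions, and the clock is injectable only to make the behavior easy to test.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Stop calling a failing dependency until a cool-down has passed."""

    def __init__(self, failure_threshold: int = 5, cooldown: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None  # None means the circuit is closed

    def allow_request(self) -> bool:
        """Return True if a call may be attempted right now."""
        if self.opened_at is None:
            return True
        # After the cool-down, allow trial requests again (half-open state).
        return self.clock() - self.opened_at >= self.cooldown

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()  # open (or re-open) the circuit


breaker = CircuitBreaker(failure_threshold=2, cooldown=30.0)
breaker.record_failure()
breaker.record_failure()
print("call allowed:", breaker.allow_request())
```

A retry wrapper would check `allow_request()` before each attempt and call `record_success()` or `record_failure()` afterwards, so that a sustained outage stops generating traffic instead of retrying forever.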
Rate limits and timeouts are different failures that require different responses. A rate limit means "slow down." A timeout means "the request took too long." Treating them the same way -- which most retry wrappers do -- leads to subtle bugs.
import anthropic
import time

client = anthropic.Anthropic(timeout=30.0)  # 30-second timeout


def ask_claude_resilient(
    user_message: str,
    max_retries: int = 3,
) -> str:
    """Handle rate limits, timeouts, and connection errors separately."""
    for attempt in range(max_retries):
        try:
            response = client.messages.create(
                model="claude-sonnet-4-20250514",
                max_tokens=1024,
                messages=[{"role": "user", "content": user_message}],
            )
            return response.content[0].text

        except anthropic.RateLimitError:
            wait = (attempt + 1) * 2
            print(f"Rate limited. Waiting {wait}s before retry.")
            time.sleep(wait)

        except anthropic.APITimeoutError:
            # Timeout: the request might still be processing.
            # Wait longer before retrying.
            wait = (attempt + 1) * 5
            print(f"Request timed out. Waiting {wait}s before retry.")
            time.sleep(wait)

        except anthropic.APIConnectionError:
            # Network issue: DNS failure, connection refused, etc.
            wait = (attempt + 1) * 3
            print(f"Connection error. Waiting {wait}s before retry.")
            time.sleep(wait)

    raise RuntimeError(
        f"Failed after {max_retries} attempts. "
        "Check network connectivity and API status."
    )
Notice the different wait times. Rate limits get a 2-second base delay because the API told you to slow down -- a short pause is usually enough. Timeouts get a 5-second base delay because the server might be under heavy load, and hammering it with retries makes the problem worse. Connection errors get a 3-second delay because the issue is likely transient but may take a moment to resolve.
A rate limit means "slow down." A timeout means "the server is struggling." A connection error means "the network is broken." Each failure has a different cause and deserves a different response.
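When the API tells you to slow down, it can also tell you for how long. The sketch below assumes that a 429 response carries a `retry-after` header expressed in seconds and that your code can reach the response headers (for example via the `response` attribute on the SDK's status exceptions); verify both against the current API documentation. `retry_after_seconds` and the 60-second cap are hypothetical choices.

```python
import math
from typing import Mapping


def retry_after_seconds(headers: Mapping[str, str], default: float,
                        max_wait: float = 60.0) -> float:
    """Return the wait suggested by a retry-after header, else `default`."""
    value = headers.get("retry-after")
    if value is None:
        return default
    try:
        seconds = float(value)
    except ValueError:
        return default  # e.g. an HTTP-date, which this sketch does not parse
    if not math.isfinite(seconds):
        return default
    # Clamp to a sane range so a bad header cannot stall the caller.
    return min(max(seconds, 0.0), max_wait)


print(retry_after_seconds({"retry-after": "7"}, default=2.0))
print(retry_after_seconds({}, default=2.0))
```

A plain `dict` lookup is case-sensitive, so this works as written with lower-case keys or with a case-insensitive headers object. Honoring the server's hint first and falling back to your own schedule second keeps the fixed 2-second delay as a default, not a guess.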
In a real application, retry logic is just one layer. The full stack looks like this:
Input validation -- catch malformed requests before they hit the API. Check that messages are not empty, that the model name is valid, that max_tokens is within bounds.
Retry with backoff -- handle transient failures (429, 5xx) automatically.
Output validation -- parse and validate the model's response before passing it to the rest of your application. If you asked for JSON, parse the JSON. If you asked for a category from a fixed list, check that the response is in the list.
Fallback responses -- when all retries fail, return a graceful degradation instead of a crash. A human-readable error message. A cached response. A "please try again later" with a request ID the user can reference.
Structured logging -- log every error with the request ID, the error type, the attempt count, and the full error message. Without logs, debugging production failures is archaeology.
import anthropic
import json
import logging
import time
import random

logger = logging.getLogger(__name__)


def reliable_api_call(
    client: anthropic.Anthropic,
    messages: list[dict],
    system: str = "",
    expect_json: bool = False,
    max_retries: int = 3,
) -> dict:
    """Production-grade API call with full error handling stack."""
    # Input validation
    if not messages:
        raise ValueError("Messages list cannot be empty")

    last_error = None

    for attempt in range(max_retries):
        try:
            kwargs = {
                "model": "claude-sonnet-4-20250514",
                "max_tokens": 1024,
                "messages": messages,
            }
            if system:
                kwargs["system"] = system

            response = client.messages.create(**kwargs)
            text = response.content[0].text

            # Output validation
            if expect_json:
                try:
                    parsed = json.loads(text)
                    return {
                        "status": "success",
                        "data": parsed,
                        "tokens": {
                            "input": response.usage.input_tokens,
                            "output": response.usage.output_tokens,
                        },
                    }
                except json.JSONDecodeError:
                    logger.warning(
                        "Invalid JSON response on attempt %d: %s",
                        attempt + 1,
                        text[:200],
                    )
                    # Retry -- the model might produce valid JSON next time
                    last_error = "Invalid JSON in response"
                    continue

            return {
                "status": "success",
                "data": text,
                "tokens": {
                    "input": response.usage.input_tokens,
                    "output": response.usage.output_tokens,
                },
            }

        except anthropic.RateLimitError as e:
            delay = (2 ** attempt) + random.uniform(0, 1)
            logger.warning("Rate limited (attempt %d). Retrying in %.1fs",
                           attempt + 1, delay)
            last_error = str(e)
            time.sleep(delay)

        except anthropic.InternalServerError as e:
            delay = (2 ** attempt) + random.uniform(0, 1)
            logger.error("Server error (attempt %d): %s", attempt + 1, e)
            last_error = str(e)
            time.sleep(delay)

        except anthropic.APIStatusError as e:
            logger.error("Non-retryable API error: %s", e)
            return {"status": "error", "error": str(e), "retryable": False}

    # All retries exhausted
    logger.error("All %d attempts failed. Last error: %s",
                 max_retries, last_error)
    return {
        "status": "error",
        "error": f"Failed after {max_retries} attempts: {last_error}",
        "retryable": True,
    }
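This function returns a dictionary with a status field, not raw text. Every caller can check result["status"] before accessing the data. API status errors come back as error details instead of exceptions, which means the calling code can decide whether to show a fallback, queue a retry, or alert an operator. Two cases still raise: the ValueError from input validation, and connection errors or timeouts (anthropic.APIConnectionError), which this function does not catch. Handle those at the call site or add a handler for them.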
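On the calling side, that dictionary maps naturally onto a fallback decision. The following is a minimal sketch of a caller, assuming the result shape returned by `reliable_api_call` above; `render_result` and both message strings are hypothetical.

```python
RETRY_LATER_MESSAGE = "We could not reach the service. Please try again in a minute."
CANNOT_PROCESS_MESSAGE = "This request could not be processed."


def render_result(result: dict) -> str:
    """Turn a reliable_api_call-style result into user-facing text."""
    if result.get("status") == "success":
        data = result["data"]
        return data if isinstance(data, str) else str(data)
    if result.get("retryable"):
        # Transient failure: the same request may succeed later.
        return RETRY_LATER_MESSAGE
    # Permanent failure: retrying the same request will not help.
    return CANNOT_PROCESS_MESSAGE


print(render_result({"status": "success", "data": "Hello"}))
print(render_result({"status": "error", "error": "429", "retryable": True}))
print(render_result({"status": "error", "error": "400", "retryable": False}))
```

The user never sees the raw error string; that belongs in the logs, keyed by whatever request identifier you record.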
The best error handling is the error that never happens. A few proactive measures significantly reduce the chance of API failures reaching production.
Validate inputs before sending. Check that messages are not empty. Verify that the model name is a string that matches a known model. Confirm that max_tokens is a positive integer within bounds. These checks catch bugs in your code before they become 400 Invalid Request errors that clutter your logs.
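A validation step along these lines might look like the sketch below. The allow-list of models and the `MAX_OUTPUT_TOKENS` ceiling are placeholders, not values taken from the API; replace them with the models and limits that apply to your account.

```python
KNOWN_MODELS = {"claude-sonnet-4-20250514"}  # hypothetical allow-list
MAX_OUTPUT_TOKENS = 8192  # assumed ceiling; check the limit for your model


def validate_request(messages: list, model: str, max_tokens: int) -> None:
    """Raise ValueError for the first problem found; return None if valid."""
    if not messages:
        raise ValueError("messages must not be empty")
    for i, message in enumerate(messages):
        if message.get("role") not in ("user", "assistant"):
            raise ValueError(f"messages[{i}] has an invalid role")
        if not message.get("content"):
            raise ValueError(f"messages[{i}] has empty content")
    if model not in KNOWN_MODELS:
        raise ValueError(f"unknown model: {model!r}")
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(max_tokens, bool) or not isinstance(max_tokens, int):
        raise ValueError("max_tokens must be an integer")
    if not 1 <= max_tokens <= MAX_OUTPUT_TOKENS:
        raise ValueError(f"max_tokens must be between 1 and {MAX_OUTPUT_TOKENS}")


validate_request([{"role": "user", "content": "Hello"}],
                 "claude-sonnet-4-20250514", 1024)
print("request is valid")
```

Failing locally with a specific message is cheaper than a round trip that ends in a generic 400.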
Stay within rate limits by design. If you know your plan allows 60 requests per minute, build a rate limiter into your client that enforces 50 requests per minute. Leave headroom. The cheapest way to handle rate limits is to never hit them.
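One simple way to enforce that headroom is a sliding-window limiter in front of every API call. This is a single-threaded sketch, not a production limiter: the class name is hypothetical, the 50-per-minute default mirrors the example above, and the clock and sleep functions are injectable so the logic can be tested without waiting.

```python
import time
from collections import deque
from typing import Callable, Deque


class SlidingWindowLimiter:
    """Allow at most `max_calls` calls in any `period`-second window."""

    def __init__(self, max_calls: int = 50, period: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_calls = max_calls
        self.period = period
        self.clock = clock
        self.calls: Deque[float] = deque()  # timestamps of recent calls

    def wait_time(self) -> float:
        """Seconds until the next call is allowed (0.0 if allowed now)."""
        now = self.clock()
        # Drop timestamps that have left the window.
        while self.calls and now - self.calls[0] >= self.period:
            self.calls.popleft()
        if len(self.calls) < self.max_calls:
            return 0.0
        return self.period - (now - self.calls[0])

    def acquire(self, sleep: Callable[[float], None] = time.sleep) -> None:
        """Block until a call is allowed, then record it."""
        wait = self.wait_time()
        if wait > 0:
            sleep(wait)
        self.calls.append(self.clock())


limiter = SlidingWindowLimiter(max_calls=50, period=60.0)
limiter.acquire()  # call this immediately before each API request
print("slots used:", len(limiter.calls))
```

If several threads or processes share one API key, the limiter needs a lock or a shared store; this sketch covers only the single-worker case.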
Keep your SDK and model versions current. Deprecated model versions eventually stop working. Outdated SDK versions may not support new features or error codes. Pin your SDK version in requirements.txt, but check for updates monthly.
Monitor before you get paged. Set up a simple counter that tracks API call success rate. If the success rate drops below 95% in any five-minute window, send an alert. This catches problems hours before users start complaining.
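The counter described above can be as small as a deque of timestamped outcomes. This sketch assumes a 95% threshold over a five-minute window, as in the text; the `min_calls` guard is an added assumption so that a single failure out of one call does not trigger an alert, and `SuccessRateMonitor` is a hypothetical name.

```python
import time
from collections import deque
from typing import Callable, Deque, Optional, Tuple


class SuccessRateMonitor:
    """Track API call outcomes over a sliding time window."""

    def __init__(self, window: float = 300.0, threshold: float = 0.95,
                 min_calls: int = 20,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.window = window
        self.threshold = threshold
        self.min_calls = min_calls
        self.clock = clock
        self.events: Deque[Tuple[float, bool]] = deque()  # (timestamp, succeeded)

    def record(self, succeeded: bool) -> None:
        self.events.append((self.clock(), succeeded))

    def success_rate(self) -> Optional[float]:
        """Success rate within the window, or None if there were no calls."""
        cutoff = self.clock() - self.window
        while self.events and self.events[0][0] < cutoff:
            self.events.popleft()
        if not self.events:
            return None
        successes = sum(1 for _, ok in self.events if ok)
        return successes / len(self.events)

    def should_alert(self) -> bool:
        rate = self.success_rate()
        if rate is None or len(self.events) < self.min_calls:
            return False  # not enough data to judge
        return rate < self.threshold


monitor = SuccessRateMonitor()
for outcome in [True] * 18 + [False] * 2:
    monitor.record(outcome)
print("success rate:", monitor.success_rate(), "alert:", monitor.should_alert())
```

In a real deployment the same numbers would usually go to a metrics system instead of living in process memory, but the alert rule is the same.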
I have seen teams treat error handling as a feature they will add in the next sprint. It never makes it to the next sprint. By the time they circle back, they have a production system with bare except Exception: pass blocks scattered across the codebase, swallowing errors silently and making every outage a mystery. Build the error handling first. Build the features on top of it.
Error handling is not a feature you add after launch. It is the load-bearing structure that keeps your application running when the external world misbehaves. Build the retry logic, the output validation, and the fallback responses before you build the happy path. The happy path is easy. The 3 AM failure path is where your engineering quality shows.
Initialize your Anthropic client with an explicit timeout parameter. Thirty seconds is a reasonable default for short requests. The Python SDK's own default is ten minutes, which is long enough for a hung request to tie up a worker well after your user has given up.
Search your codebase for bare except Exception blocks around API calls. Replace each one with specific handlers for RateLimitError, APITimeoutError, APIConnectionError, and APIStatusError. Each error type needs a different response.
Log every API error with at least four fields: timestamp, error type, attempt count, and the first 200 characters of the error message. If you are not logging errors, you are flying blind in production.
For every API call that expects structured output, add a json.loads() call immediately after receiving the response. If it fails, retry the request or return a well-defined fallback. Never pass unparsed model output downstream.
For every user-facing API call, define what happens when all retries fail. A static message, a cached previous response, a "try again in a minute" notice -- anything is better than a stack trace.
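The four-field log record from the checklist above fits in a few lines. This is one possible shape, assuming JSON-formatted log lines; `build_error_record`, `log_api_error`, and the logger name are hypothetical.

```python
import json
import logging
from datetime import datetime, timezone

logger = logging.getLogger("api_errors")


def build_error_record(error: Exception, attempt: int) -> dict:
    """Collect the four minimum fields for one failed API call."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "error_type": type(error).__name__,
        "attempt": attempt,
        "message": str(error)[:200],  # first 200 characters only
    }


def log_api_error(error: Exception, attempt: int) -> dict:
    """Log the record as one JSON line and return it for further use."""
    record = build_error_record(error, attempt)
    logger.error(json.dumps(record))
    return record


logging.basicConfig(level=logging.ERROR)
log_api_error(TimeoutError("request timed out after 30s"), attempt=2)
```

One JSON object per line keeps the records easy to search and aggregate, which is what turns a 3 AM outage from guesswork into a query.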