Part 3 | The API Layer | Chapter 11

Streaming and Real-Time Output

The standard API call makes your users stare at a blank screen while Claude thinks. Streaming makes them watch Claude think — and that changes everything about perceived performance.

The trap is optimizing for total response time when you should be optimizing for time-to-first-token. Standard API calls are synchronous: you send a request, wait for Claude to generate the entire response, and then receive it as a single payload. If the response takes eight seconds to generate, your user stares at a loading spinner for eight seconds. Streaming changes the equation. The first token typically arrives within a fraction of a second, and the rest flow in continuously. The total time is about the same (Claude still needs eight seconds to generate the full response), but the user sees progress almost from the first moment. That perceptual shift is the difference between an application that feels broken and one that feels alive.

Streaming doesn't make Claude faster. It makes your application feel faster — and in user experience, perception is the only metric that matters.

I've seen teams ship products with standard (non-streaming) API calls and then wonder why users complain about "slowness" even when the total response time is under five seconds. The complaint isn't about speed. It's about silence. Humans interpret a blank screen as "something is wrong." They interpret a gradually appearing response as "the system is working." Streaming solves a UX problem, not a performance problem — but solving the UX problem often matters more.

The evidence for this isn't anecdotal: it's how every major AI chat product works. ChatGPT, Claude.ai, Gemini: they all stream by default. Not because streaming is technically necessary (a single response payload would work fine), but because visible progress tested better with users. The token-by-token appearance creates the "typing" illusion that makes an AI response feel conversational rather than computational. If you're building anything a human looks at, streaming is the expected behavior.

How Streaming Works

In a standard API call, Claude generates the entire response internally and sends it back as one JSON object. With streaming, Claude sends the response in small chunks — called events — as each token is generated. Your application receives these events through a persistent connection and can process them immediately.

The mechanism is Server-Sent Events (SSE): a one-way channel where the server pushes data to the client as it becomes available. You open the connection, Claude starts generating, and tokens arrive in your application as fast as the model produces them.
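To make the SSE mechanism concrete, here is a minimal sketch of what decoding the wire format could look like. The sample payload is hand-written and abridged for illustration (a real stream carries more events and fields), and parse_sse and extract_text are hypothetical helper names, not part of the SDK. In practice the SDK does this parsing for you.

```python
import json

# Hand-written sample shaped like streaming events on the wire (abridged;
# an assumption for illustration, not a captured response).
SAMPLE_SSE = (
    'event: content_block_delta\n'
    'data: {"type": "content_block_delta", "index": 0, '
    '"delta": {"type": "text_delta", "text": "Hello"}}\n'
    '\n'
    'event: content_block_delta\n'
    'data: {"type": "content_block_delta", "index": 0, '
    '"delta": {"type": "text_delta", "text": ", world"}}\n'
    '\n'
    'event: message_stop\n'
    'data: {"type": "message_stop"}\n'
    '\n'
)


def parse_sse(raw: str):
    """Yield (event_name, payload_dict) for each SSE frame in raw text."""
    event_name, data_lines = None, []
    for line in raw.splitlines():
        if line == "":  # a blank line terminates one frame
            if data_lines:
                yield event_name, json.loads("\n".join(data_lines))
            event_name, data_lines = None, []
        elif line.startswith("event:"):
            event_name = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data_lines.append(line[len("data:"):].strip())


def extract_text(raw: str) -> str:
    """Concatenate the text carried by content_block_delta frames."""
    parts = []
    for name, payload in parse_sse(raw):
        if name == "content_block_delta" and payload["delta"]["type"] == "text_delta":
            parts.append(payload["delta"]["text"])
    return "".join(parts)


print(extract_text(SAMPLE_SSE))
```

Each frame is an event name plus a JSON payload, separated from the next frame by a blank line. The text you display is the concatenation of the text deltas.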

Here's the fundamental difference in code:

import anthropic

client = anthropic.Anthropic()

# Standard: wait for the complete response
response = client.messages.create(
    model="claude-sonnet-4-5-20250929",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Explain how HTTP caching works."}]
)
print(response.content[0].text)

# Streaming: process tokens as they arrive
print("\n--- Streaming ---\n")
with client.messages.stream(
    model="claude-sonnet-4-5-20250929",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Explain how HTTP caching works."}]
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)
print()  # Final newline

The standard call blocks until the full response is ready. The streaming call yields text chunks as they're generated. The flush=True ensures each chunk is printed immediately rather than buffered — without it, Python's output buffer collects chunks and dumps them in batches, destroying the real-time effect.

The messages.stream() method is a context manager. When you exit the with block, the connection closes and resources are cleaned up. If you need the complete response text or metadata after streaming, call stream.get_final_text() or stream.get_final_message() before the context manager exits.

Framework · The Perception Principle · PP

Users do not measure response time with a stopwatch. They measure it by how long the screen stays empty. Streaming converts dead time (blank screen) into active time (visible progress). An eight-second streaming response can feel faster than a four-second blocking response because the user sees movement within the first fraction of a second.

The Streaming API in Practice

The Python SDK provides two streaming approaches: the high-level messages.stream() context manager and the lower-level event-based API. The high-level approach handles connection management and gives you a clean text iterator. Use it unless you need fine-grained control over individual events.

import anthropic

client = anthropic.Anthropic()

full_response = ""

with client.messages.stream(
    model="claude-sonnet-4-5-20250929",
    max_tokens=1024,
    system="You are a concise technical writer. Use short paragraphs.",
    messages=[{"role": "user", "content": "What are the three most common causes of memory leaks in Python?"}]
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)
        full_response += text

    # Read the final message for metadata before the with block exits
    final_message = stream.get_final_message()

print(f"\n\nStop reason: {final_message.stop_reason}")
print(f"Input tokens: {final_message.usage.input_tokens}")
print(f"Output tokens: {final_message.usage.output_tokens}")

Two things to notice. First, you accumulate the full response yourself by concatenating chunks — the stream gives you fragments, not the complete text. Second, get_final_message() provides the same metadata you get from a standard API call: stop reason, token usage, model info. You need this for logging, billing, and detecting truncated responses.
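For the lower-level event-based approach mentioned earlier, the loop dispatches on each event's type instead of iterating over plain text. The sketch below uses stand-in objects so it runs without a network call. The field names mirror the event shapes as I understand them (content_block_delta carrying a text_delta, message_delta carrying stop_reason and usage), and consume_events is a hypothetical helper. With the real SDK, the events would come from client.messages.create(..., stream=True).

```python
from types import SimpleNamespace as NS


def consume_events(events, on_text=print):
    """Dispatch stream events by type; return (text, stop_reason, output_tokens)."""
    parts, stop_reason, output_tokens = [], None, None
    for event in events:
        if event.type == "content_block_delta" and event.delta.type == "text_delta":
            on_text(event.delta.text)       # display the fragment immediately
            parts.append(event.delta.text)  # and keep it for the full text
        elif event.type == "message_delta":
            # Final metadata arrives near the end of the stream
            stop_reason = event.delta.stop_reason
            output_tokens = event.usage.output_tokens
    return "".join(parts), stop_reason, output_tokens


# Stand-in events (assumed shapes, for illustration only)
fake_events = [
    NS(type="message_start"),
    NS(type="content_block_start"),
    NS(type="content_block_delta", delta=NS(type="text_delta", text="Memory ")),
    NS(type="content_block_delta", delta=NS(type="text_delta", text="leaks")),
    NS(type="content_block_stop"),
    NS(type="message_delta", delta=NS(stop_reason="end_turn"), usage=NS(output_tokens=2)),
    NS(type="message_stop"),
]

text, stop_reason, output_tokens = consume_events(
    fake_events, on_text=lambda t: print(t, end="", flush=True)
)
print(f"\nstop_reason={stop_reason} output_tokens={output_tokens}")
```

The extra bookkeeping is the price of fine-grained control. Unless you need to react to individual event types, the text iterator is the simpler choice.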

Always check stop_reason after streaming

A stream that ends because Claude hit max_tokens looks exactly like a stream that ends naturally — the text just stops. The only way to distinguish them is final_message.stop_reason. If it says "max_tokens", the response was truncated. In a streaming context, this is easy to miss because the user sees the text appear progressively and may not notice it ended mid-sentence.
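One way to make truncation visible is a small check that runs after every stream. This is a sketch: truncation_notice is a hypothetical helper, and the SimpleNamespace objects stand in for the message returned by get_final_message().

```python
from types import SimpleNamespace


def truncation_notice(final_message) -> str:
    """Return a user-facing notice if the response hit max_tokens, else ''."""
    if final_message.stop_reason == "max_tokens":
        return "\n[Response truncated: max_tokens limit reached]"
    return ""


# Stand-ins for final messages (illustration only)
complete = SimpleNamespace(stop_reason="end_turn")
cut_off = SimpleNamespace(stop_reason="max_tokens")

print(repr(truncation_notice(complete)))
print(repr(truncation_notice(cut_off)))
```

Appending the notice to the displayed text, and logging the stop reason, turns a silent failure into a visible one.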

Async Streaming

For applications that need to handle multiple concurrent requests — web servers, batch processors, applications with background tasks — async streaming prevents the streaming loop from blocking other work.

import anthropic
import asyncio

async def stream_response(prompt: str) -> str:
    client = anthropic.AsyncAnthropic()
    full_response = ""

    async with client.messages.stream(
        model="claude-sonnet-4-5-20250929",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}]
    ) as stream:
        async for text in stream.text_stream:
            print(text, end="", flush=True)
            full_response += text

    print()
    return full_response

async def main():
    result = await stream_response("What is the GIL in Python and why does it matter?")
    print(f"\nFull response length: {len(result)} characters")

asyncio.run(main())

The AsyncAnthropic client mirrors the synchronous API exactly, but every blocking call becomes await-able. The async for loop yields text chunks without blocking the event loop, so other coroutines can run between chunks.

I use async streaming in every web application because a synchronous streaming loop blocks the entire server thread for the duration of the response. With async, the server handles other requests between token deliveries. For a single-user CLI tool, synchronous streaming is fine. For anything serving multiple users, async is not optional.

The mental model: each token delivery is a brief I/O event. Between tokens, the event loop is free to handle other work: incoming HTTP requests, database queries, other streaming responses. A server handling ten concurrent streaming responses with async can run them all on a single thread, at little more cost than handling one. With synchronous streaming, you'd need ten threads or processes.
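The interleaving can be demonstrated without calling the API at all. In this sketch, fake_stream is a hypothetical stand-in for a streaming response: it pauses with asyncio.sleep where a real stream would wait on the network. Ten of them run concurrently on one thread.

```python
import asyncio
import time


async def fake_stream(n_chunks: int, delay: float):
    """Hypothetical stand-in for a streaming response with I/O-like pauses."""
    for i in range(n_chunks):
        await asyncio.sleep(delay)  # the event loop is free during this wait
        yield f"tok{i} "


async def consume(stream_id: int) -> str:
    """Accumulate one simulated stream into a single string."""
    parts = []
    async for chunk in fake_stream(n_chunks=5, delay=0.02):
        parts.append(chunk)
    return f"{stream_id}:" + "".join(parts)


async def run_concurrently(n_streams: int):
    """Run n simulated streams at once and time the whole batch."""
    start = time.perf_counter()
    results = await asyncio.gather(*(consume(i) for i in range(n_streams)))
    return results, time.perf_counter() - start


results, elapsed = asyncio.run(run_concurrently(10))
print(f"{len(results)} streams finished in {elapsed:.2f}s on one thread")
```

Run one after another, these ten streams would need about a second (10 streams x 5 chunks x 0.02 s). Run concurrently, the batch finishes in roughly the time of a single stream, because the waits overlap.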

When Not to Stream

Streaming is not universally better. There are specific scenarios where standard API calls are the right choice.

✓ Stream when
  • Users are watching the output appear
  • Response will be long (500+ tokens)
  • Low perceived latency matters
  • Building a chat or writing interface
✕ Don't stream when
  • Output is parsed as structured data (JSON)
  • Response feeds directly into code, not humans
  • You need the complete response before acting
  • Batch processing with no human observer

The JSON case deserves emphasis. If you're using the structured output techniques from the previous chapter, streaming complicates things. You can't parse JSON until you have the complete string, and streaming gives you fragments. You'd have to accumulate all chunks, wait for the stream to end, and then parse — which eliminates every benefit of streaming. For structured output pipelines, use standard API calls.
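A small experiment shows why fragments are useless to a JSON parser. The chunks below are hypothetical, split the way a stream might deliver a compact JSON reply.

```python
import json

# Hypothetical chunks of one JSON reply, as a stream might deliver them
chunks = ['{"sentiment": "pos', 'itive", "confi', 'dence": 0.93}']

buffer = ""
parse_attempts = []
for chunk in chunks:
    buffer += chunk
    try:
        json.loads(buffer)
        parse_attempts.append(True)
    except json.JSONDecodeError:
        parse_attempts.append(False)  # incomplete JSON is not parseable

result = json.loads(buffer)  # only the complete string parses
print(parse_attempts)
print(result["sentiment"])
```

Every attempt before the last chunk fails, so the code ends up waiting for the full response anyway.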

There's a theoretical exception: you could stream JSON and use an incremental JSON parser to extract complete key-value pairs as they arrive. Libraries like ijson exist for this. In practice, I've never found the complexity worthwhile. The response times for JSON outputs are usually fast enough (the responses tend to be compact and structured), and the engineering overhead of incremental parsing adds fragility without meaningful UX improvement. If no human is watching, streaming adds zero value.

The same logic applies to any task where the response is consumed by code rather than displayed to a human. Code doesn't care about perceived latency. It cares about having the complete, parseable output.

Key takeaway

Streaming is a user experience optimization, not a system performance optimization. Use it when humans are watching. Skip it when machines are consuming. Mixing the two — streaming a JSON response that gets parsed by code — gives you the worst of both worlds: the complexity of streaming with none of its perceptual benefits.

Building a Streaming Chat Interface

The real power of streaming shows up in chat interfaces. Here's a complete terminal-based chat with streaming that maintains conversation history:

import anthropic

client = anthropic.Anthropic()
history = []

def stream_chat(user_message: str) -> str:
    history.append({"role": "user", "content": user_message})
    full_response = ""

    with client.messages.stream(
        model="claude-sonnet-4-5-20250929",
        max_tokens=2048,
        system="You are a helpful technical assistant. Be concise.",
        messages=history
    ) as stream:
        for text in stream.text_stream:
            print(text, end="", flush=True)
            full_response += text

    print("\n")
    history.append({"role": "assistant", "content": full_response})
    return full_response

print("Streaming chat (type 'exit' to quit)\n")
while True:
    user_input = input("You: ").strip()
    if user_input.lower() in ("exit", "quit"):
        break
    if not user_input:
        continue  # skip empty input rather than sending an empty message
    print("\nClaude: ", end="")
    stream_chat(user_input)

This combines the multi-turn conversation pattern from the previous chapter with streaming output. The user types a message, sees "Claude: " appear, and then watches the response stream in token by token. The full response is accumulated and appended to the conversation history for context in subsequent turns.

The crucial detail: you must accumulate the full response text and append it to the history as a complete assistant message. Streaming gives you fragments, but the conversation history needs whole messages. If you append each fragment individually, you'll end up with dozens of consecutive assistant messages in your history. That breaks the one-message-per-turn structure the conversation depends on and, depending on how the API handles consecutive same-role messages, can cause an error or unexpected behavior on the next turn.

This is the pattern I use in production: stream to the display layer chunk by chunk, accumulate in a buffer, and when the stream completes, commit the full response to the conversation history as a single message.
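The commit step can be isolated into a helper that only touches the history once the stream has finished. This is a sketch under assumptions: stream_turn and open_stream are hypothetical names, open_stream being any callable that takes the message list and returns an iterable of text chunks (in a real application it would wrap messages.stream()). Rolling back the user turn on failure is one possible policy, not the only one.

```python
def stream_turn(history, user_message, open_stream, display=print):
    """Stream one turn; commit to history only if the stream completes."""
    history.append({"role": "user", "content": user_message})
    buffer = []
    try:
        for chunk in open_stream(history):
            display(chunk)        # display layer: chunk by chunk
            buffer.append(chunk)  # buffer: accumulate fragments
    except Exception:
        history.pop()             # roll back the unanswered user turn
        raise
    full_response = "".join(buffer)
    # Commit the whole reply as a single assistant message
    history.append({"role": "assistant", "content": full_response})
    return full_response


def fake_stream(messages):
    """Hypothetical stand-in for a real streaming call."""
    yield from ["Stream", "ing ", "works."]


history = []
reply = stream_turn(
    history, "Does it work?", fake_stream,
    display=lambda c: print(c, end="", flush=True),
)
print()
print([m["role"] for m in history])
```

Because nothing is committed until the loop ends, a failed stream leaves the history exactly as it was before the turn started.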

Streaming combined with multi-turn history is the foundation of every production chat application built on the Claude API. Master these two patterns and you have the skeleton of any conversational product.

Error Handling in Streaming

Streams can fail mid-response. Network interruptions, server errors, rate limits — any of these can terminate the stream before Claude finishes generating. Unlike standard API calls, where you get an error or a response, streaming can give you a partial response followed by an error.

import anthropic

client = anthropic.Anthropic()

accumulated = ""
try:
    with client.messages.stream(
        model="claude-sonnet-4-5-20250929",
        max_tokens=1024,
        messages=[{"role": "user", "content": "Write a detailed explanation of consensus algorithms."}]
    ) as stream:
        for text in stream.text_stream:
            print(text, end="", flush=True)
            accumulated += text

        # Read the final message before the with block exits
        final = stream.get_final_message()

    print(f"\n\nCompleted. Tokens used: {final.usage.output_tokens}")

except anthropic.APIConnectionError:
    print(f"\n\nConnection lost after receiving: {len(accumulated)} characters")
    print("Partial response saved. Retry with the same prompt.")

except anthropic.APIStatusError as e:
    print(f"\n\nAPI error (status {e.status_code}) after receiving: {len(accumulated)} characters")

The key insight: accumulate the response as you stream it. If the connection drops, you still have whatever was received. Depending on your application, you might display the partial response, cache it for retry, or discard it and try again.

In a chat application, I usually display the partial response with a note: "(Response interrupted. Trying again...)" and then retry the request. The user sees that something went wrong but also sees that the system recovered. In a code generation tool, I discard the partial response entirely — half a function is worse than no function. The right behavior depends on whether a partial output is useful or dangerous in your specific context.

Network interruptions during streaming are more common than most developers expect. Long responses can take 15 to 30 seconds to stream, and that's a lot of time for a mobile connection to hiccup, a corporate proxy to time out, or a cloud load balancer to recycle. Build the error handling before you need it.

Retry strategy for streams

If a stream fails after delivering partial content, do not retry by resending the same messages with the partial content appended as an assistant message. The partial text may end mid-word or mid-sentence, and Claude would try to continue from an awkward break point. Instead, retry the original request from scratch. Streaming is fast enough that regenerating the full response is usually cheaper than stitching together fragments.
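A retry-from-scratch loop might look like the sketch below. The names stream_with_retry and open_stream are hypothetical, and the built-in ConnectionError stands in for the SDK's exception so the example runs offline. In a real application you would likely pass retry_on=(anthropic.APIConnectionError,) instead.

```python
import time


def stream_with_retry(open_stream, display=print, max_attempts=3,
                      base_delay=0.5, retry_on=(ConnectionError,)):
    """Retry a failed stream from scratch, discarding any partial text."""
    for attempt in range(1, max_attempts + 1):
        buffer = []  # fresh buffer: partial text from a failed attempt is dropped
        try:
            for chunk in open_stream():
                display(chunk)
                buffer.append(chunk)
            return "".join(buffer)
        except retry_on:
            if attempt == max_attempts:
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff


attempts = {"count": 0}


def flaky_stream():
    """Hypothetical stream that drops on the first attempt only."""
    attempts["count"] += 1
    yield "Consensus "
    if attempts["count"] == 1:
        raise ConnectionError("simulated drop")
    yield "algorithms agree."


shown = []
text = stream_with_retry(flaky_stream, display=shown.append, base_delay=0.01)
print(text)
print(f"attempts: {attempts['count']}")
```

Note that the display callback has already shown the partial text from the failed attempt, so a real interface needs to clear or mark that output before the retry starts rendering.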

Monday-Morning Moves

Replace one blocking API call with streaming

Find a user-facing API call in your application that currently uses messages.create(). Replace it with messages.stream(). Measure the time-to-first-token versus the previous total response time. The improvement in perceived responsiveness will be immediate and obvious.
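A timing wrapper makes the comparison easy to measure. This sketch works on any iterable of text chunks. Here fake_stream is a hypothetical stand-in with artificial delays, where a real measurement would pass stream.text_stream instead.

```python
import time


def measure_stream(chunks):
    """Return (text, time_to_first_chunk, total_time) in seconds."""
    start = time.perf_counter()
    first = None
    parts = []
    for chunk in chunks:
        if first is None:
            first = time.perf_counter() - start  # time to first chunk
        parts.append(chunk)
    total = time.perf_counter() - start
    return "".join(parts), first, total


def fake_stream():
    """Hypothetical stand-in for stream.text_stream with artificial delays."""
    for word in ["HTTP ", "caching ", "stores ", "responses."]:
        time.sleep(0.05)
        yield word


text, ttft, total = measure_stream(fake_stream())
print(f"first chunk after {ttft:.2f}s, complete after {total:.2f}s")
```

Log both numbers side by side. The gap between them is the dead time that streaming removes from the user's experience.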

Implement the accumulation pattern

Every streaming call should accumulate the full response in a variable while displaying chunks. Once the text is exhausted, call get_final_message() before leaving the with block to get token usage and stop reason. Log both. You need the full text for history and the metadata for monitoring.

Add error handling around your streams

Wrap every streaming call in a try/except that catches APIConnectionError and APIStatusError. Save the accumulated partial response so you can decide what to do with it — retry, display, or discard. A stream that fails silently with a half-rendered response is worse than one that fails loudly.

Decide: stream or standard, per endpoint

Audit your API calls. Mark each one as "human-facing" (stream) or "machine-facing" (standard). Chat interfaces, writing tools, and explanation generators should stream. JSON extraction, classification, and batch processing should not. Apply the right pattern to each.

The best streaming implementation is invisible. The user doesn't think about tokens or events or SSE channels. They just see an assistant that starts talking the instant they ask — and that's exactly the point.