The trap is building the god agent. A single system prompt that handles customer support, data extraction, content generation, classification, and summarization. The prompt grows to 4,000 tokens. The instructions contradict each other. Edge cases multiply. The model starts misbehaving because so many rules compete for its attention that the ones relevant to the current task get lost. I have watched teams spend months debugging behavior that was never broken -- it was just buried under too many competing instructions.
The fix is not a smarter prompt. The fix is smaller agents.
A small task agent that does one thing well is worth more than a general-purpose agent that does ten things unreliably.
Small task agents are focused, specialized components designed to excel at a single task. They do not try to solve everything. They concentrate on one specific job, which makes them simple, predictable, and easy to control. Because they handle only one task at a time, they avoid unnecessary complexity and stay efficient. The design also makes them fast and consistent -- there is no multi-step reasoning or bloated context involved.
In short, small task agents turn narrow, well-defined tasks into quick, reliable automations. And when you need complex behavior, you compose multiple small agents into a pipeline instead of cramming everything into one.
The advantages are not theoretical. They show up in production metrics within the first week.
Latency drops. A small agent with a 200-token system prompt and a focused input returns in under a second. A general-purpose agent with a 3,000-token system prompt and a conversation history takes three to five seconds. For user-facing applications, that is the difference between feeling instant and feeling sluggish.
Consistency improves. When the agent handles only one type of task -- say, extracting dates from unstructured text -- the outputs have less variance across runs. The model is not deciding between twelve possible behaviors; it is doing the one thing you told it to do.
Testing becomes practical. A narrow scope makes unit tests easy to write: feed the agent ten sample inputs and assert on the outputs. If the assertions pass, the agent honors its contract on those cases. Try that with a 4,000-token general-purpose prompt and you will spend more time writing test cases than building features.
Debugging becomes sane. When a small agent produces wrong output, the cause is almost always one of two things: the system prompt is wrong, or the input is wrong. When a general-purpose agent produces wrong output, the cause could be any of fifty interacting instructions.
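The consistency claim is easy to check empirically. Here is a minimal sketch, assuming the agent can be treated as any function from a string to a string; `agreement_rate` and `fake_agent` are illustrative names, not part of any library. It runs the same input several times and reports how often the most common answer came back.

```python
from collections import Counter
from typing import Callable


def agreement_rate(agent: Callable[[str], str], text: str, runs: int = 10) -> float:
    """Fraction of runs that return the most common output for one input."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    outputs = [agent(text) for _ in range(runs)]
    most_common_count = Counter(outputs).most_common(1)[0][1]
    return most_common_count / runs


# Hypothetical stand-in for a real agent call, so the sketch runs offline.
def fake_agent(text: str) -> str:
    return "billing" if "invoice" in text.lower() else "other"
```

A rate near 1.0 on your sample inputs suggests the agent is stable; a lower number on a narrow task usually points at an ambiguous system prompt.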
Every agent should have one job. If you cannot describe what the agent does in a single sentence without using the word "and," the agent is too big. Split it.
With one bounded exception covered later, every small task agent follows the same structure. A system prompt that defines the task, an input that provides the data, and an output that returns the result. No conversation history. No multi-turn reasoning. One call in, one result out.
Here is a field extractor -- an agent that pulls structured data from unstructured text:
    import anthropic
    import json

    client = anthropic.Anthropic()


    def extract_fields(raw_text: str) -> dict:
        """Extract structured contact info from unstructured text."""
        response = client.messages.create(
            # Assumed Haiku model ID; substitute whichever Haiku version is current
            model="claude-3-5-haiku-20241022",
            max_tokens=512,
            system=(
                "You are a field extraction agent. Extract the following fields "
                "from the provided text and return ONLY valid JSON with these keys: "
                "name, email, phone, company, role. "
                "If a field is not found, set its value to null. "
                "Do not include any text outside the JSON object."
            ),
            messages=[
                {"role": "user", "content": raw_text}
            ],
        )
        text = response.content[0].text.strip()
        # Handle models that wrap JSON in markdown code fences
        if text.startswith("```"):
            text = text.split("\n", 1)[1].rsplit("```", 1)[0].strip()
        # json.loads raises json.JSONDecodeError here if the reply is malformed
        return json.loads(text)
Notice the choices. Claude Haiku, not Sonnet -- the task is narrow enough that a smaller, cheaper model handles it well. Max tokens set to 512, not a generous ceiling like 4096 -- the API requires you to pick a limit, and the output is a small JSON object, not an essay. The system prompt is five lines, not fifty. And the output is parsed immediately into a Python dictionary. If the JSON is malformed, the code raises an exception at the point of failure, not three functions later when some downstream consumer tries to read a key that does not exist.
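Parsing catches malformed JSON, but not valid JSON with the wrong keys. A minimal sketch of a stricter parse step follows, assuming the five-key contract from the system prompt above; `parse_contact_json` and `EXPECTED_KEYS` are names introduced here for illustration.

```python
import json

EXPECTED_KEYS = {"name", "email", "phone", "company", "role"}
FENCE = "`" * 3  # markdown code fence marker


def parse_contact_json(text: str) -> dict:
    """Parse model output and enforce the extractor's five-key contract."""
    text = text.strip()
    # Strip a markdown fence if present (assumes the fence wraps the whole reply).
    if text.startswith(FENCE):
        text = text.split("\n", 1)[-1].rsplit(FENCE, 1)[0].strip()
    data = json.loads(text)  # raises json.JSONDecodeError on malformed JSON
    if not isinstance(data, dict):
        raise ValueError(f"expected a JSON object, got {type(data).__name__}")
    missing = EXPECTED_KEYS - set(data)
    extra = set(data) - EXPECTED_KEYS
    if missing or extra:
        raise ValueError(
            f"contract violation: missing={sorted(missing)}, extra={sorted(extra)}"
        )
    return data
```

The point of the extra check is the same as the point of parsing early: a missing `email` key fails here, next to the call that produced it.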
Use Claude Haiku for extraction, classification, and reformatting tasks. Use Claude Sonnet for tasks that require reasoning, nuance, or multi-step logic. Using Sonnet for field extraction is like hiring a senior engineer to sort the mail.
I keep coming back to the same five mini-agent patterns across different projects. Each one has a specific job and a specific system prompt shape.
Sorts inputs into predefined categories. Support tickets into priority levels. Customer feedback into sentiment buckets. Bug reports into component areas.
    import anthropic

    client = anthropic.Anthropic()

    CATEGORIES = ["billing", "technical", "account", "feature_request", "other"]


    def classify_ticket(ticket_text: str) -> str:
        """Classify a support ticket into a predefined category."""
        response = client.messages.create(
            # Assumed Haiku model ID; substitute whichever Haiku version is current
            model="claude-3-5-haiku-20241022",
            max_tokens=50,
            system=(
                "You are a ticket classifier. Classify the following support ticket "
                f"into exactly one of these categories: {', '.join(CATEGORIES)}. "
                "Respond with ONLY the category name, nothing else."
            ),
            messages=[
                {"role": "user", "content": ticket_text}
            ],
        )
        category = response.content[0].text.strip().lower()
        # Anything outside the allowed set falls back to "other"
        if category not in CATEGORIES:
            return "other"
        return category
Max tokens is 50. The agent returns a single word. The fallback to "other" handles the rare case where the model returns something unexpected. Total cost per classification: fractions of a cent.
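The fallback is deliberately blunt, and a reply like `Billing.` or `"feature request"` would land in "other" even though the intent is clear. A small normalization step, sketched below under the assumption that such near-misses should count, can recover them before the fallback applies. `normalize_category` is a hypothetical helper, not part of the original classifier.

```python
import string

CATEGORIES = ["billing", "technical", "account", "feature_request", "other"]


def normalize_category(raw: str, fallback: str = "other") -> str:
    """Map a raw model reply onto the allowed category set."""
    cleaned = raw.strip().lower()
    # Drop surrounding quotes and punctuation such as a trailing period.
    cleaned = cleaned.strip(string.punctuation + string.whitespace)
    # Tolerate "feature request" or "feature-request" for "feature_request".
    cleaned = cleaned.replace(" ", "_").replace("-", "_")
    return cleaned if cleaned in CATEGORIES else fallback
```

Anything that still does not match, such as a full sentence, falls through to the fallback exactly as before.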
Checks text against a set of rules and returns a pass/fail verdict with an explanation. PII detection, compliance checks, brand voice validation -- any task where you need a binary decision with a reason.
    import anthropic
    import json
    import re

    client = anthropic.Anthropic()


    def check_for_pii(content: str) -> dict:
        """Check content for personally identifiable information."""
        response = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=300,
            system=(
                "You are a PII detection agent. Check the provided text for "
                "personally identifiable information including: full names of "
                "private individuals, email addresses, phone numbers, physical "
                "addresses, government IDs, credit card numbers, and passwords. "
                "Return ONLY a JSON object with two keys: "
                '"pii_status" (either "pass" or "fail") and '
                '"explanation" (one sentence describing your finding). '
                "No markdown, no commentary."
            ),
            messages=[
                {"role": "user", "content": content}
            ],
        )
        text = response.content[0].text.strip()
        try:
            return json.loads(text)
        except json.JSONDecodeError:
            # Extract JSON from wrapped response
            match = re.search(r"\{.*\}", text, re.DOTALL)
            if match:
                try:
                    return json.loads(match.group())
                except json.JSONDecodeError:
                    pass
            # Fail closed: an unreadable verdict must not let content through
            return {"pii_status": "fail", "explanation": "Could not parse response"}
The classifier returns a single word. The reviewer returns a JSON verdict. The summarizer returns three sentences. Every mini-agent has a contract, and the contract is small.
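A contract is only useful if something enforces it. The sketch below validates a reviewer reply against the two-key verdict shape and treats anything else as a failure. Failing closed is my assumption about what a safety gate should do, and `parse_verdict` is a name introduced here for illustration.

```python
import json

ALLOWED_STATUSES = ("pass", "fail")


def parse_verdict(text: str) -> dict:
    """Parse a reviewer reply; treat anything off-contract as a failure."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return {"pii_status": "fail", "explanation": "Unparseable reviewer output"}
    if not isinstance(data, dict) or data.get("pii_status") not in ALLOWED_STATUSES:
        return {
            "pii_status": "fail",
            "explanation": "Reviewer output violated its contract",
        }
    explanation = data.get("explanation")
    if not isinstance(explanation, str):
        explanation = ""  # tolerate a missing explanation, but keep the key
    return {"pii_status": data["pii_status"], "explanation": explanation}
```

Downstream code can then rely on both keys being present without defensive checks of its own.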
This is the most sophisticated mini-agent pattern. Instead of answering a question, it decomposes a goal into structured steps. The system prompt instructs Claude to ask clarifying questions before generating the plan, creating an iterative loop:
    import anthropic

    client = anthropic.Anthropic()

    MAX_QUESTIONS = 3  # must match the number of questions the prompt asks for

    PLANNER_SYSTEM = """You are a task planning agent. When given a goal:
    1. Ask exactly 3 clarifying questions, one at a time
    2. After receiving answers, generate a structured plan with phases and tasks
    3. Each phase should have a title, 3-5 concrete tasks, and a deliverable
    Format the final plan with clear headers and numbered tasks.
    Do not generate the plan until you have answers to all 3 questions."""


    def run_planner(goal: str) -> str:
        """Run an interactive task planning session."""
        history = [{"role": "user", "content": f"My goal: {goal}"}]
        # Allow MAX_QUESTIONS clarifying turns plus one final turn for the plan
        for turn in range(MAX_QUESTIONS + 1):
            response = client.messages.create(
                model="claude-sonnet-4-20250514",
                max_tokens=2048,
                system=PLANNER_SYSTEM,
                messages=history,
            )
            reply = response.content[0].text
            history.append({"role": "assistant", "content": reply})
            # Heuristic check that the plan has been generated (contains phase markers)
            if "Phase" in reply and "Task" in reply:
                return reply
            if turn == MAX_QUESTIONS:
                break  # the model is still asking questions; stop instead of looping
            # Get user's answer to the clarifying question
            answer = input(f"Agent: {reply}\nYou: ")
            history.append({"role": "user", "content": answer})
        raise RuntimeError("Planner did not produce a plan within the question limit")
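This is the only mini-agent pattern that uses multi-turn conversation, and it is bounded -- three questions, then the output. The history sent to the model tops out at seven messages (the initial goal plus three question-answer pairs), and the plan itself is the eighth. Because that bound comes from the prompt, it is worth enforcing in code as well, so a model that keeps asking questions cannot loop forever. That bounded growth is what keeps it a mini-agent instead of a general chatbot.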
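To test the bound without a live model or a terminal, the loop can take its collaborators as arguments. This is a sketch under that assumption: `ask_model`, `ask_user`, and `is_plan` are hypothetical callables standing in for the API call, the `input()` prompt, and the plan check.

```python
from typing import Callable


def run_bounded_planner(
    goal: str,
    ask_model: Callable[[list], str],
    ask_user: Callable[[str], str],
    is_plan: Callable[[str], bool],
    max_questions: int = 3,
) -> str:
    """Run the question loop, allowing at most max_questions questions."""
    history = [{"role": "user", "content": f"My goal: {goal}"}]
    for turn in range(max_questions + 1):  # N questions plus the final plan
        reply = ask_model(history)
        history.append({"role": "assistant", "content": reply})
        if is_plan(reply):
            return reply
        if turn == max_questions:
            break  # one question too many; stop instead of looping
        history.append({"role": "user", "content": ask_user(reply)})
    raise RuntimeError(f"no plan after {max_questions} clarifying questions")
```

With a fake model that asks three questions and then answers, the history lengths it sees are 1, 3, 5, and 7 messages, which matches the count described above.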
The real power appears when you chain small agents together. Each agent's output becomes the next agent's input. The pipeline is explicit, debuggable, and each step can be tested independently.
    def process_submission(raw_text: str) -> dict:
        """Pipeline: check PII -> classify -> extract fields."""
        # Step 1: Safety check
        pii_result = check_for_pii(raw_text)
        if pii_result["pii_status"] == "fail":
            return {
                "status": "rejected",
                "reason": pii_result["explanation"],
            }
        # Step 2: Classify the content
        category = classify_ticket(raw_text)
        # Step 3: Extract structured fields
        fields = extract_fields(raw_text)
        return {
            "status": "accepted",
            "category": category,
            "fields": fields,
        }
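Three agents. Three API calls. Each one costs a fraction of what a single call to a general-purpose agent would cost, and each one can be swapped, upgraded, or disabled independently. If the PII checker is too aggressive, you tune its system prompt without touching classification. If you need to add a fourth step -- say, a sentiment scorer -- you add one function and one line to the pipeline. One caveat about this particular wiring: the PII gate rejects exactly the contact details the field extractor looks for, so in a real system those two agents would guard different inputs. The example is about the shape of the pipeline.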
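Here is what adding that fourth step might look like. This is a sketch, not the original pipeline: the agents are passed in as parameters so the wiring can run offline, and `score_sentiment` is a hypothetical agent that does not exist elsewhere in this chapter.

```python
from typing import Callable


def process_submission_with_sentiment(
    raw_text: str,
    check_for_pii: Callable[[str], dict],
    classify_ticket: Callable[[str], str],
    extract_fields: Callable[[str], dict],
    score_sentiment: Callable[[str], str],  # hypothetical fourth agent
) -> dict:
    """Same pipeline shape, with one extra stage appended."""
    pii_result = check_for_pii(raw_text)
    if pii_result["pii_status"] == "fail":
        return {"status": "rejected", "reason": pii_result["explanation"]}
    return {
        "status": "accepted",
        "category": classify_ticket(raw_text),
        "fields": extract_fields(raw_text),
        "sentiment": score_sentiment(raw_text),  # the one new line
    }


# Offline stand-ins for the real agents, used only to exercise the wiring.
result = process_submission_with_sentiment(
    "The export button is broken and I'm frustrated",
    check_for_pii=lambda t: {"pii_status": "pass", "explanation": "No PII found"},
    classify_ticket=lambda t: "technical",
    extract_fields=lambda t: {"name": None},
    score_sentiment=lambda t: "negative",
)
print(result["sentiment"])  # negative
```

Passing the agents in as arguments also makes it possible to test the pipeline's control flow without spending a single API call.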
Build your AI system as a pipeline of small, single-purpose agents instead of one large general-purpose agent. Each agent has one job, one system prompt, and one output contract. The pipeline is the architecture. The agents are the components. This is how you get systems that are testable, debuggable, and cheap to run.
One of the biggest advantages of small agents is that they are easy to test. A general-purpose agent requires complex scenario testing because any input could trigger any behavior. A mini-agent has a contract: given this class of input, produce this shape of output. That contract is a test specification.
The testing pattern I use is straightforward. Create a list of sample inputs, run each one through the agent, and assert on the output shape and content:
    def test_classifier():
        test_cases = [
            ("I can't log into my account", "account"),
            ("My invoice is wrong", "billing"),
            ("The API returns a 500 error", "technical"),
            ("Can you add dark mode?", "feature_request"),
        ]
        for text, expected in test_cases:
            result = classify_ticket(text)
            # Assert on validity, not correctness; mismatches are only logged
            assert result in CATEGORIES, f"Invalid category: {result}"
            print(f"'{text[:40]}...' -> {result} (expected {expected})")
You do not need to assert that the classification is always correct -- models are probabilistic, and edge cases will shift between runs. What you assert is that the output is always valid: a member of the allowed set, parseable JSON, a non-empty string. If the output shape is ever wrong, the system prompt needs revision.
For the content reviewer, the test is even simpler: feed it text with known PII and confirm it returns "fail". Feed it clean text and confirm it returns "pass". If either assertion breaks, you know exactly which agent has the problem and exactly which system prompt to fix.
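The reviewer test can be sketched the same way. The helper below takes the reviewer as a callable, which is an assumption made so the check can run against a stub as well as the real agent, and returns a list of failures instead of stopping at the first one. `check_reviewer_contract` is a name introduced here for illustration.

```python
from typing import Callable, List, Sequence


def check_reviewer_contract(
    check: Callable[[str], dict],
    dirty: Sequence[str],
    clean: Sequence[str],
) -> List[str]:
    """Return readable failure messages; an empty list means every case passed."""
    failures = []
    for expected, samples in (("fail", dirty), ("pass", clean)):
        for text in samples:
            verdict = check(text)
            # Shape first: the verdict must be a dict with exactly the two keys.
            if not isinstance(verdict, dict) or set(verdict) != {"pii_status", "explanation"}:
                failures.append(f"bad shape for {text!r}: {verdict!r}")
            elif verdict["pii_status"] != expected:
                failures.append(
                    f"{text!r}: expected {expected}, got {verdict['pii_status']}"
                )
    return failures
```

Collecting failures this way shows at a glance whether the agent is too strict, too lenient, or returning the wrong shape.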
This is the debugging advantage that pays for itself every week. When a general-purpose agent misbehaves, you search through thousands of tokens of instructions trying to find the conflicting rule. When a mini-agent misbehaves, you read five lines of system prompt and the answer is usually obvious.
Mini-agents are not the right tool for everything. If the task genuinely requires deep reasoning across multiple domains -- say, analyzing a legal contract for both financial risk and regulatory compliance -- splitting it into two agents that cannot see each other's work produces worse results than a single, well-prompted call with a larger model.
The heuristic I use: if the subtasks are independent (classify, then extract, then summarize), use a pipeline. If the subtasks are interdependent (the financial analysis informs the regulatory assessment, which changes the financial analysis), use a single agent with a longer prompt.
Before splitting a task into mini-agents, ask: does agent B need to see agent A's reasoning, or just agent A's output? If it needs the reasoning, keep them together. If it only needs the output, split them.
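One rough way to make the heuristic mechanical is to write down which subtask needs which other subtask's reasoning and look for a loop. A one-way chain suits a pipeline, and a feedback loop suggests keeping the work in one agent. The sketch below is my own framing of the heuristic, not a rule from any library; `has_feedback_loop` and the task names are illustrative.

```python
from typing import Dict, Set


def has_feedback_loop(depends_on: Dict[str, Set[str]]) -> bool:
    """Detect a cycle among subtask dependencies via depth-first search."""
    UNSEEN, ON_PATH, DONE = 0, 1, 2
    state = {task: UNSEEN for task in depends_on}

    def visit(task: str) -> bool:
        state[task] = ON_PATH
        for dep in depends_on.get(task, ()):
            dep_state = state.get(dep, UNSEEN)
            if dep_state == ON_PATH:
                return True  # dep is already on the current path: a loop
            if dep_state == UNSEEN and visit(dep):
                return True
        state[task] = DONE
        return False

    return any(state[task] == UNSEEN and visit(task) for task in list(depends_on))


# Independent chain: each step only needs the previous step's output.
pipeline = {"classify": set(), "extract": {"classify"}, "summarize": {"extract"}}
# Interdependent pair: each analysis informs the other.
contract_review = {"financial": {"regulatory"}, "regulatory": {"financial"}}

print(has_feedback_loop(pipeline))         # False
print(has_feedback_loop(contract_review))  # True
```

The graph is only as good as your honesty in drawing it, but forcing yourself to draw it is most of the value.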
Find the longest system prompt in your codebase. Count the number of distinct tasks it handles. If the count is greater than two, that agent is a candidate for decomposition.
Take one classification task -- ticket routing, sentiment detection, priority assignment -- and build it as a standalone mini-agent with Claude Haiku. Measure the latency and cost difference.
Add a content review agent to any pipeline that processes user-generated text. Run it before any other processing. If PII is detected, reject the input before it reaches your main logic.
For every mini-agent, document the exact output format: "returns a JSON object with keys X, Y, Z" or "returns a single word from this list." Parse and validate the output immediately after the API call. Never pass raw model output to the next stage without validation.
Chain two or three mini-agents together in a function. Feed the output of one into the input of the next. Run ten test inputs through the pipeline and verify each stage independently. This is your production architecture in miniature.
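For the latency measurement mentioned above, a small timing harness is enough to start. This is a minimal sketch that treats the agent as any callable taking a string. It measures wall-clock time only, so cost still has to come from your provider's usage reporting. `median_latency` is a name introduced here.

```python
import statistics
import time
from typing import Callable, Sequence


def median_latency(agent: Callable[[str], object], inputs: Sequence[str]) -> float:
    """Median wall-clock seconds per call across the sample inputs."""
    if not inputs:
        raise ValueError("need at least one input")
    timings = []
    for text in inputs:
        start = time.perf_counter()
        agent(text)
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)
```

Run it once against the mini-agent and once against the general-purpose prompt with the same inputs. The median is less sensitive than the mean to a single slow network round trip.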
The pipeline is the architecture. The agents are the components. Composability is the engineering discipline that makes AI systems production-grade.