Get Started in 60 Seconds
Install the Agent SDK and set your API key:
pip install openai-agents anthropic
export ANTHROPIC_API_KEY="your-key-here"
Create agent.py and paste this:
from agents import Agent, Runner, function_tool
@function_tool
def get_weather(city: str) -> str:
"""Get the current weather for a city."""
return f"The weather in {city} is 22°C and sunny."
agent = Agent(
name="assistant",
model="claude-sonnet-4-6",
instructions="You are a helpful assistant. Use tools when needed.",
tools=[get_weather],
)
result = Runner.run_sync(agent, "What is the weather in London?")
print(result.final_output)
Run it:
python agent.py
# Output: The weather in London is 22°C and sunny.
That is a working agent with tool use. The openai-agents package provides the agents module. The SDK handles the agentic loop (send message, detect tool call, execute tool, return result, repeat) so you write only the tool and the instructions.
The rest of this guide builds a production-ready agent with real tools, guardrails, multi-turn conversations, and observability. By the end, you will have a project management agent that creates tasks, checks status, assigns work, and handles the edge cases that real applications encounter.
Why Use the Agent SDK
Building agents with raw API calls means writing the agentic loop yourself: send a message, check for tool calls, parse them, execute them, send results back, and repeat. The Agent SDK handles all of that, plus conversation management and error handling, so you focus on defining tools and instructions.
| Capability | Raw Anthropic SDK (anthropic-sdk-python) | Agent SDK (openai-agents) | LangChain / LangGraph (langchain) |
|---|---|---|---|
| Direct Messages API access | built-in | wrapped | wrapped |
| Managed agentic loop | you write it | built-in (Runner.run) | built-in (AgentExecutor, LangGraph) |
| Tool definition via Python decorators | n/a | @function_tool | @tool |
| Input and output guardrails as first-class config | n/a | InputGuardrail, OutputGuardrail | via callbacks / parsers |
| Agent handoffs built in | n/a | handoffs=[...] | via LangGraph state machine |
| Lifecycle hooks (on_tool_start, on_tool_end) | n/a | RunHooks | BaseCallbackHandler |
| Vendor lock-in | Anthropic only | multi-provider (Anthropic, OpenAI, others) | multi-provider |
| Dependency count on install | 1 (anthropic) | 2 (openai-agents, anthropic) | 20+ transitive via langchain, langchain-core, langchain-community |
Data source: openai-agents on PyPI, anthropic-sdk-python on GitHub, and langchain on PyPI, as of 2026-04.
The Problem
Every agent shares the same core challenge. You need to give an LLM the ability to take actions in the real world, while keeping control over what those actions are, how they execute, and what happens when they fail.
The raw approach means writing the agentic loop yourself. Send a message to the API. Check if the response contains tool calls. If it does, parse them, execute them, send the results back, and check again. Handle errors at each step.
Manage conversation history. Track token usage. Implement timeouts. Add logging.
That loop is not complex in principle. It is tedious in practice. And the tedium creates bugs.
A missed error handler. A conversation history that grows without bound. A tool call parser that breaks on edge cases.
The Agent SDK eliminates the tedium. It gives you a tested, maintained implementation of the agentic loop, and it lets you define the interesting parts (the tools, the instructions, the guardrails) as simple Python functions and configuration.
Here is how.
The Journey: Build Your AI Agent Step by Step
What We Are Building
The agent we are building is called TaskBot. It manages a simple project board with these capabilities.
It can create tasks with a title, description, priority, and assignee. It can list tasks filtered by status or assignee. It can update task status. It can send notifications when tasks change. And it can provide summaries of project progress.
This is a realistic use case. It touches databases, external APIs, and business logic. It requires multi-turn conversations where the user asks follow-up questions. And it needs guardrails to prevent misuse.
The full code is at the end of this guide. We will build it piece by piece so you understand each decision.
Setting Up the Project
Start by creating a project directory and installing the SDK.
mkdir taskbot && cd taskbot
python -m venv venv
source venv/bin/activate
pip install openai-agents anthropic
The Agent SDK is distributed as the openai-agents package, which supports multiple model providers including Claude through the Anthropic integration. You will also need an Anthropic API key.
export ANTHROPIC_API_KEY="your-key-here"
Create the project structure.
taskbot/
agent.py # Agent definition
tools.py # Tool functions
guardrails.py # Input/output validation
models.py # Data models
database.py # Storage layer
main.py # Entry point
This structure separates concerns. Tools are pure functions that interact with the database. The agent definition is configuration. Guardrails are validation logic. The entry point ties everything together.
Defining the Data Models
Before writing tools, define what you are working with. This example uses Python dataclasses for simplicity, but Pydantic models work well here too.
# models.py
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum
from typing import Optional
class TaskStatus(Enum):
TODO = "todo"
IN_PROGRESS = "in_progress"
REVIEW = "review"
DONE = "done"
class Priority(Enum):
LOW = "low"
MEDIUM = "medium"
HIGH = "high"
CRITICAL = "critical"
@dataclass
class Task:
id: str
title: str
description: str
status: TaskStatus = TaskStatus.TODO
priority: Priority = Priority.MEDIUM
assignee: Optional[str] = None
created_at: datetime = field(default_factory=datetime.now)
updated_at: datetime = field(default_factory=datetime.now)
Building a Simple Storage Layer
For this tutorial, an in-memory database keeps things simple. In production, you would swap this for PostgreSQL, SQLite, or whatever your application uses. The agent does not care about the storage implementation. It only interacts with tools.
# database.py
from datetime import datetime
from models import Priority, Task, TaskStatus
from typing import Optional
import uuid
_tasks: dict[str, Task] = {}
def create_task(title: str, description: str,
                priority: str = "medium",
                assignee: Optional[str] = None) -> Task:
    task_id = str(uuid.uuid4())[:8]
    task = Task(
        id=task_id,
        title=title,
        description=description,
        priority=Priority(priority),  # convert the string to the Priority enum
        assignee=assignee
    )
    _tasks[task_id] = task
    return task
def get_task(task_id: str) -> Optional[Task]:
return _tasks.get(task_id)
def list_tasks(status: Optional[str] = None,
assignee: Optional[str] = None) -> list[Task]:
tasks = list(_tasks.values())
if status:
tasks = [t for t in tasks if t.status.value == status]
if assignee:
tasks = [t for t in tasks if t.assignee == assignee]
return tasks
def update_task_status(task_id: str, new_status: str) -> Optional[Task]:
task = _tasks.get(task_id)
if task:
task.status = TaskStatus(new_status)
task.updated_at = datetime.now()
return task
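Because the agent only ever calls these functions, swapping the dict for real storage is a drop-in change. Here is a minimal sketch of the same create_task backed by SQLite; the schema is an assumption, and production code would manage connections and migrations properly.
# database_sqlite.py - illustrative drop-in variant
import sqlite3
import uuid
from typing import Optional

conn = sqlite3.connect("taskbot.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS tasks ("
    "id TEXT PRIMARY KEY, title TEXT, description TEXT, "
    "status TEXT DEFAULT 'todo', priority TEXT DEFAULT 'medium', "
    "assignee TEXT)"
)

def create_task(title: str, description: str,
                priority: str = "medium",
                assignee: Optional[str] = None) -> str:
    task_id = str(uuid.uuid4())[:8]
    conn.execute(
        "INSERT INTO tasks (id, title, description, priority, assignee) "
        "VALUES (?, ?, ?, ?, ?)",
        (task_id, title, description, priority, assignee),
    )
    conn.commit()
    return task_id  # the in-memory version returns a Task object; simplified here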
Defining Tools
Tools are where the agent meets the real world. Each tool is a Python function decorated with @function_tool. The function's name becomes the tool's name. The docstring becomes the tool's description, which Claude reads to understand when and how to use the tool. Type hints become the parameter schema.
# tools.py
from agents import function_tool
from database import create_task, get_task, list_tasks, update_task_status
from typing import Optional
@function_tool
def create_new_task(title: str, description: str,
priority: str = "medium",
assignee: Optional[str] = None) -> str:
"""Create a new task on the project board.
Args:
title: Short title for the task
description: Detailed description of what needs to be done
priority: One of 'low', 'medium', 'high', or 'critical'
assignee: Name of the person to assign the task to
"""
task = create_task(title, description, priority, assignee)
return (
f"Created task {task.id}: '{task.title}' "
f"(priority: {priority}, assignee: {assignee or 'unassigned'})"
)
@function_tool
def list_project_tasks(status: Optional[str] = None,
assignee: Optional[str] = None) -> str:
"""List tasks on the project board, optionally filtered.
Args:
status: Filter by status ('todo', 'in_progress', 'review', 'done')
assignee: Filter by assignee name
"""
tasks = list_tasks(status, assignee)
if not tasks:
return "No tasks found matching the criteria."
lines = []
for t in tasks:
lines.append(
f"[{t.id}] {t.title} | "
f"Status: {t.status.value} | "
f"Priority: {t.priority} | "
f"Assignee: {t.assignee or 'unassigned'}"
)
return "\n".join(lines)
@function_tool
def update_status(task_id: str, new_status: str) -> str:
"""Update the status of an existing task.
Args:
task_id: The ID of the task to update
new_status: New status ('todo', 'in_progress', 'review', 'done')
"""
task = update_task_status(task_id, new_status)
if task:
return f"Updated task {task_id} to status '{new_status}'."
return f"Task {task_id} not found."
@function_tool
def get_task_details(task_id: str) -> str:
"""Get full details of a specific task.
Args:
task_id: The ID of the task to look up
"""
task = get_task(task_id)
if not task:
return f"Task {task_id} not found."
return (
f"Task {task.id}\n"
f"Title: {task.title}\n"
f"Description: {task.description}\n"
f"Status: {task.status.value}\n"
f"Priority: {task.priority}\n"
f"Assignee: {task.assignee or 'unassigned'}\n"
f"Created: {task.created_at.isoformat()}\n"
f"Updated: {task.updated_at.isoformat()}"
)
@function_tool
def send_notification(recipient: str, message: str) -> str:
"""Send a notification to a team member.
Args:
recipient: Name of the person to notify
message: The notification message
"""
# In production, this would call an email/Slack/Teams API
print(f"[NOTIFICATION to {recipient}]: {message}")
return f"Notification sent to {recipient}."
Notice that each tool returns a string. The SDK sends this string back to Claude as the tool result. Claude uses it to formulate its response. Clear, informative return values make Claude's responses better.
The docstrings matter enormously. Claude reads them to decide which tool to use and how to call it. A vague docstring produces vague tool usage. A precise docstring with documented parameters produces precise tool calls.
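For example, compare a vague docstring with a precise one (both illustrative, not part of TaskBot):
# Vague: Claude cannot tell what "update" covers or which values are legal
@function_tool
def update(task_id: str, value: str) -> str:
    """Update a task."""
    ...

# Precise: names the action, the legal values, and the ID format
@function_tool
def set_status(task_id: str, new_status: str) -> str:
    """Change the status of one existing task.

    Args:
        task_id: The 8-character task ID, for example 'a1b2c3d4'
        new_status: One of 'todo', 'in_progress', 'review', 'done'
    """
    ...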
Creating the Agent
With tools defined, the agent itself is straightforward.
# agent.py
from agents import Agent
from tools import (
create_new_task,
list_project_tasks,
update_status,
get_task_details,
send_notification
)
taskbot = Agent(
name="TaskBot",
model="claude-sonnet-4-6",
instructions="""You are TaskBot, a project management assistant.
You help teams manage their tasks and stay organised.
Guidelines:
- When creating tasks, always confirm the details with the user first
- Use 'medium' priority unless the user specifies otherwise
- When updating task status, notify the assignee
- Provide concise summaries when listing tasks
- If a task ID is not found, suggest listing all tasks
- Always be helpful but never create tasks without explicit user request""",
tools=[
create_new_task,
list_project_tasks,
update_status,
get_task_details,
send_notification
]
)
The instructions field is your system prompt. It shapes how the agent behaves across all conversations. Write it like you are briefing a new team member. Be specific about what the agent should and should not do.
The model field determines which Claude model powers the agent. Use claude-sonnet-4-6 for a good balance of speed and capability. Use claude-opus-4-6 for tasks that require deeper reasoning. Use claude-haiku-4-6 for simple, high-volume tasks where speed matters most. For a broader view of how Claude Code's capabilities stack up against other AI development tools, see our Claude Code vs Cursor comparison.
Understanding the Agentic Loop
When you call Runner.run_sync(taskbot, "Create a task for the API migration"), this is what happens internally.
First, the SDK sends your message to Claude along with the system instructions and the tool definitions. Claude receives everything it needs to understand who it is, what it can do, and what the user wants.
Second, Claude responds. If it decides to call a tool, the response contains a tool use block with the tool name and parameters. If it decides to respond directly, the response contains text.
Third, if there was a tool call, the SDK executes your tool function with the provided parameters. It captures the return value as a string.
Fourth, the SDK sends the tool result back to Claude. Claude now knows what happened and can decide to call another tool, ask a follow-up question, or provide a final response.
This loop continues until Claude produces a response with no tool calls. That response becomes the final_output of the run.
The key insight is that you never write this loop. The SDK handles it. Your job is to define the tools and instructions that shape what happens inside the loop.
| Step | Actor | What happens | Data passed |
|---|---|---|---|
| 1 | Runner | Sends system prompt, tool schemas, and the user message to Claude | Messages array plus tool definitions |
| 2 | Claude | Returns either a final text response or one or more tool_use blocks | Assistant message with stop_reason |
| 3 | Runner | Invokes each Python function matched to a tool_use block and captures the return value | Tool name plus arguments dict |
| 4 | Runner | Appends a tool_result block and re-sends the full conversation to Claude | Tool result string keyed to tool_use_id |
| 5 | Loop | Steps 2 to 4 repeat until Claude returns stop_reason: "end_turn" or the max_turns cap is hit | RunResult.final_output |
Data source: Anthropic tool use docs and Agent SDK overview, as of 2026-04.
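For contrast, here is roughly the loop the SDK replaces, sketched against the raw Anthropic Messages API. It is simplified: tool_schemas and run_tool stand in for the hand-written schema definitions and dispatch code you would otherwise maintain.
import anthropic

client = anthropic.Anthropic()
messages = [{"role": "user", "content": "What is the weather in London?"}]

while True:
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system="You are a helpful assistant.",
        tools=tool_schemas,  # JSON schemas you maintain by hand
        messages=messages,
    )
    messages.append({"role": "assistant", "content": response.content})
    if response.stop_reason != "tool_use":
        break  # final text response; the loop is done
    # Execute every tool call and feed the results back
    results = []
    for block in response.content:
        if block.type == "tool_use":
            output = run_tool(block.name, block.input)  # your dispatch code
            results.append({
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": output,
            })
    messages.append({"role": "user", "content": results})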
Multi-Turn Conversations
A single Runner.run_sync() call handles one turn of conversation. For multi-turn interactions where the user asks follow-up questions, you need to maintain the conversation history.
# main.py
from agents import Runner
from agent import taskbot
async def chat():
history = []
while True:
user_input = input("\nYou: ")
if user_input.lower() in ("quit", "exit"):
break
# Add user message to history
history.append({
"role": "user",
"content": user_input
})
result = await Runner.run(
taskbot,
history
)
# Add assistant response to history
history.append({
"role": "assistant",
"content": result.final_output
})
print(f"\nTaskBot: {result.final_output}")
if __name__ == "__main__":
import asyncio
asyncio.run(chat())
The conversation history is a list of message objects. Each message has a role (user or assistant) and content. The SDK sends the full history with each request, giving Claude the context of the entire conversation.
Be mindful of history length. Each message consumes tokens. For long-running sessions, you may need to summarise older messages or implement a sliding window that keeps only the most recent exchanges.
Adding Guardrails
Guardrails are functions that validate inputs before the agent processes them and outputs before the agent returns them. They are your safety layer.
# guardrails.py
from agents import (
InputGuardrail,
OutputGuardrail,
GuardrailFunctionOutput,
Agent,
Runner
)
BLOCKED_TERMS = [
"delete all", "drop table", "remove everything",
"fire everyone", "terminate all"
]
async def validate_input(ctx, agent, user_input):
"""Block potentially destructive or harmful requests."""
lower_input = user_input.lower() if isinstance(user_input, str) else ""
triggered = any(term in lower_input for term in BLOCKED_TERMS)
return GuardrailFunctionOutput(
output_info={
"blocked_term_found": triggered,
"input_length": len(lower_input)
},
tripwire_triggered=triggered
)
async def validate_output(ctx, agent, output):
"""Ensure the agent does not leak internal details."""
lower_output = output.lower() if isinstance(output, str) else ""
leaks_internals = any(
term in lower_output
for term in ["api_key", "database_url", "internal_secret"]
)
return GuardrailFunctionOutput(
output_info={"leaks_internals": leaks_internals},
tripwire_triggered=leaks_internals
)
Now add the guardrails to the agent definition.
from guardrails import validate_input, validate_output
from agents import InputGuardrail, OutputGuardrail
taskbot = Agent(
name="TaskBot",
model="claude-sonnet-4-6",
instructions="...",
tools=[...],
input_guardrails=[
InputGuardrail(guardrail_function=validate_input)
],
output_guardrails=[
OutputGuardrail(guardrail_function=validate_output)
]
)
When a guardrail trips, the SDK raises an exception that you catch in your application code. The agent never sees the blocked input. The user gets a clear error message.
from agents.exceptions import InputGuardrailTripwireTriggered
try:
result = await Runner.run(taskbot, user_input)
except InputGuardrailTripwireTriggered:
print("That request was blocked by our safety policy.")
We recommend running guardrails on every agent in production. The overhead is minimal, typically a few milliseconds for pattern matching. The protection is significant. A guardrail that catches one destructive request has paid for itself permanently.
Error Handling
Tools fail. APIs time out. Databases go down. Your agent needs to handle these failures gracefully.
The simplest approach is to handle errors within the tool function itself.
@function_tool
def create_new_task(title: str, description: str,
priority: str = "medium",
assignee: Optional[str] = None) -> str:
"""Create a new task on the project board."""
try:
# Validate priority
valid_priorities = ["low", "medium", "high", "critical"]
if priority not in valid_priorities:
return (
f"Invalid priority '{priority}'. "
f"Must be one of: {', '.join(valid_priorities)}"
)
task = create_task(title, description, priority, assignee)
return (
f"Created task {task.id}: '{task.title}' "
f"(priority: {priority})"
)
except Exception as e:
return f"Failed to create task: {str(e)}"
By returning error messages as strings rather than raising exceptions, you let Claude handle the failure conversationally. Claude sees "Failed to create task: database connection timeout" and can tell the user what happened, suggest a retry, or try an alternative approach.
For critical failures that should stop the agent entirely, raise an exception. The SDK will stop the agentic loop and propagate the error to your application code.
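A sketch of that split, using a hypothetical billing client to stand in for a critical dependency:
@function_tool
def charge_customer(customer_id: str, amount: float) -> str:
    """Charge a customer for completed work.

    Args:
        customer_id: The customer's account ID
        amount: Amount to charge in GBP
    """
    try:
        receipt = billing.charge(customer_id, amount)  # hypothetical client
    except billing.CardDeclined as e:
        # Recoverable: return a string so Claude can explain it to the user
        return f"Charge failed: {e}. Ask the user to check their card."
    except billing.AuthError:
        # Unrecoverable: credentials are misconfigured, stop the whole run
        raise
    return f"Charged {amount:.2f}, receipt {receipt.id}."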
You should also set timeouts at the runner level to prevent agents from running indefinitely.
result = await Runner.run(
taskbot,
user_input,
max_turns=10 # Stop after 10 tool call cycles
)
The max_turns parameter prevents infinite loops where the agent keeps calling tools without reaching a conclusion. Ten turns is a reasonable default for most agents. Increase it for agents that need to perform many sequential operations.
Agent Handoffs
Sometimes a single agent cannot handle everything. The Agent SDK supports handoffs, where one agent delegates to another for specialised tasks.
Imagine TaskBot needs to handle both project management and time tracking. Instead of cramming both into one agent, you create two specialised agents and let them hand off to each other.
from agents import Agent
time_tracker = Agent(
name="TimeTracker",
model="claude-sonnet-4-6",
instructions="""You track time spent on tasks.
You can log hours, view time reports, and calculate
utilisation rates. For task management questions,
hand off to the TaskBot agent.""",
tools=[log_time, get_time_report, calculate_utilisation],
handoffs=[] # Will be set after taskbot is defined
)
taskbot = Agent(
name="TaskBot",
model="claude-sonnet-4-6",
instructions="""You manage project tasks.
For time tracking questions, hand off to the
TimeTracker agent.""",
tools=[
create_new_task,
list_project_tasks,
update_status,
get_task_details,
send_notification
],
handoffs=[time_tracker]
)
# Complete the circular reference
time_tracker.handoffs = [taskbot]
When a user asks TaskBot "How many hours did Sarah log this week?", TaskBot recognises this is a time tracking question and hands off to TimeTracker. TimeTracker handles the request with its specialised tools and returns the result.
This pattern keeps each agent focused. Focused agents are easier to test, easier to debug, and produce better results because their instructions and tools are not diluted by unrelated capabilities.
Observability and Monitoring
In production, you need to know what your agent is doing. The SDK provides hooks that let you observe every step of the agentic loop.
from agents import RunHooks, RunContextWrapper, Tool, Agent
from datetime import datetime
class ProductionHooks(RunHooks):
def __init__(self):
self.tool_calls = []
self.start_time = None
self.total_tokens = 0
async def on_agent_start(self, context, agent):
self.start_time = datetime.now()
print(f"[{self.start_time}] Agent '{agent.name}' started")
async def on_tool_start(self, context, agent, tool):
print(f" [TOOL CALL] {tool.name}")
async def on_tool_end(self, context, agent, tool, result):
self.tool_calls.append({
"tool": tool.name,
"timestamp": datetime.now().isoformat(),
"result_length": len(str(result))
})
print(f" [TOOL DONE] {tool.name} ({len(str(result))} chars)")
async def on_agent_end(self, context, agent, output):
duration = (datetime.now() - self.start_time).total_seconds()
print(f"[COMPLETE] {len(self.tool_calls)} tool calls "
f"in {duration:.1f}s")
hooks = ProductionHooks()
result = await Runner.run(
taskbot,
user_input,
run_hooks=hooks
)
These hooks give you structured data about every agent execution. In production, send this data to a logging service for analysis. Common things to track include the number of tool calls per conversation, which tools are used most frequently, average execution time, error rates, and token consumption patterns.
If you are building agents that connect to external services through MCP servers, observability becomes even more important. MCP tool calls cross network boundaries, so you need to track latency and failures at each hop. When these connections involve authentication or sensitive data, follow our MCP authentication and security best practices.
Production Patterns
Several patterns have proven essential in production agents.
Async execution. The SDK supports async natively. Use Runner.run() instead of Runner.run_sync() in production to avoid blocking your application's event loop.
result = await Runner.run(taskbot, user_input)
Rate limiting. If your agent handles multiple users, implement rate limiting to avoid API quota exhaustion.
import asyncio
from collections import defaultdict
from time import time
class RateLimiter:
def __init__(self, max_requests_per_minute=20):
self.max_rpm = max_requests_per_minute
self.requests = defaultdict(list)
async def check(self, user_id: str):
now = time()
# Clean old entries
self.requests[user_id] = [
t for t in self.requests[user_id]
if now - t < 60
]
if len(self.requests[user_id]) >= self.max_rpm:
raise Exception("Rate limit exceeded. Please wait.")
self.requests[user_id].append(now)
limiter = RateLimiter()
async def handle_request(user_id: str, message: str):
await limiter.check(user_id)
result = await Runner.run(taskbot, message)
return result.final_output
Cost tracking. Every agent call consumes tokens. Track usage to avoid surprises on your bill. The same model selection and context management habits that apply to interactive Claude Code sessions apply to agents; our guide on Claude Code cost optimisation covers the full set of techniques.
Worked example: a 5-turn TaskBot conversation that creates two tasks and lists them. Input counts include the system prompt, tool schemas, and full conversation history re-sent on each turn. Output counts cover Claude's text responses and tool_use blocks.
| Turn | Scenario | Input tokens | Output tokens |
|---|---|---|---|
| 1 | User asks to create a task, Claude emits tool_use | 1,800 | 120 |
| 2 | Runner returns tool_result, Claude confirms creation | 2,100 | 90 |
| 3 | User asks to create a second task, Claude emits tool_use | 2,400 | 130 |
| 4 | Runner returns tool_result, Claude confirms creation | 2,700 | 100 |
| 5 | User asks "show me all tasks", Claude calls list_tasks then replies | 3,100 | 260 |
| Total | | 12,100 | 700 |
Applying published Anthropic list prices (claude-sonnet-4-6: $3 per million input tokens, $15 per million output tokens; claude-haiku-4-6: $0.80 per million input tokens, $4 per million output tokens):
| Model | Input cost | Output cost | Total per conversation |
|---|---|---|---|
| claude-sonnet-4-6 | $0.0363 | $0.0105 | $0.0468 |
| claude-haiku-4-6 | $0.0097 | $0.0028 | $0.0125 |
A high-volume TaskBot handling 10,000 conversations per month on Sonnet costs about $468. The same traffic on Haiku costs about $125. Prompt caching cuts repeated-system-prompt input cost by up to 90% on cached reads.
Data source: Anthropic pricing page, Messages API reference, and prompt caching docs, as of 2026-04. Token counts are approximations for a realistic TaskBot run; actual usage varies with tool schemas and system prompt length.
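The CostTracker hook below applies the Sonnet prices to each run. It assumes the run context exposes a usage object with input_tokens and output_tokens attributes; check the exact names against your SDK version.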
class CostTracker(RunHooks):
def __init__(self):
self.total_input_tokens = 0
self.total_output_tokens = 0
async def on_agent_end(self, context, agent, output):
usage = getattr(context, 'usage', None)
if usage:
self.total_input_tokens += usage.input_tokens
self.total_output_tokens += usage.output_tokens
        cost = (
            self.total_input_tokens * 3 / 1_000_000      # $3 per million input tokens
            + self.total_output_tokens * 15 / 1_000_000  # $15 per million output tokens
        )
print(f"Session cost so far: ${cost:.4f}")
Testing Your Agent
Testing agents requires two layers. Unit tests for individual tools and integration tests for the full agent loop.
Unit testing tools. Since tools are regular Python functions (wrapped with a decorator), you can test them directly.
# test_tools.py
import pytest
from database import create_task, list_tasks, _tasks
from models import Priority, TaskStatus
def setup_function():
"""Clear the database before each test."""
_tasks.clear()
def test_create_task():
task = create_task(
"Fix login bug",
"Users cannot log in with SSO",
"high",
"Sarah"
)
assert task.title == "Fix login bug"
assert task.priority == "high"
assert task.assignee == "Sarah"
assert task.id is not None
def test_list_tasks_filter_by_status():
create_task("Task 1", "Description", "medium")
task2 = create_task("Task 2", "Description", "high")
task2.status = TaskStatus.DONE
todo_tasks = list_tasks(status="todo")
assert len(todo_tasks) == 1
assert todo_tasks[0].title == "Task 1"
def test_list_tasks_empty():
tasks = list_tasks()
assert tasks == []
Integration testing the agent. Use the SDK to run the agent against test inputs and verify the outputs.
# test_agent.py
import pytest
from agents import Runner
from agent import taskbot
from database import _tasks
@pytest.fixture(autouse=True)
def clear_db():
_tasks.clear()
yield
_tasks.clear()
@pytest.mark.asyncio
async def test_agent_creates_task():
result = await Runner.run(
taskbot,
"Create a high priority task called 'Deploy v2.0' "
"about deploying the new version to production"
)
assert len(_tasks) == 1
task = list(_tasks.values())[0]
assert "Deploy" in task.title
@pytest.mark.asyncio
async def test_agent_handles_unknown_task():
result = await Runner.run(
taskbot,
"Show me details for task xyz123"
)
assert "not found" in result.final_output.lower()
@pytest.mark.asyncio
async def test_guardrail_blocks_destructive_input():
from agents.exceptions import InputGuardrailTripwireTriggered
with pytest.raises(InputGuardrailTripwireTriggered):
await Runner.run(taskbot, "Delete all tasks immediately")
Integration tests are slower because they make API calls. Run them in a separate test suite and use environment variables to point them at a test API key with lower rate limits.
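One pattern that helps: mark the API-backed tests and skip them automatically when no key is configured. A minimal conftest.py sketch, assuming tests carry @pytest.mark.integration (register the marker in pytest.ini to silence warnings):
# conftest.py
import os
import pytest

def pytest_collection_modifyitems(config, items):
    # Run integration tests only when an API key is available
    if os.environ.get("ANTHROPIC_API_KEY"):
        return
    skip_integration = pytest.mark.skip(reason="ANTHROPIC_API_KEY not set")
    for item in items:
        if "integration" in item.keywords:
            item.add_marker(skip_integration)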
The Complete Working Example
Here is the full TaskBot agent in a single file, ready to run.
#!/usr/bin/env python3
"""TaskBot - A project management agent built with the Claude Agent SDK."""
import asyncio
import uuid
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional
from agents import (
Agent,
Runner,
function_tool,
InputGuardrail,
GuardrailFunctionOutput,
RunHooks,
)
# --- Data Models ---
@dataclass
class Task:
id: str
title: str
description: str
status: str = "todo"
priority: str = "medium"
assignee: Optional[str] = None
created_at: str = field(
default_factory=lambda: datetime.now().isoformat()
)
# --- Database ---
tasks_db: dict[str, Task] = {}
# --- Tools ---
@function_tool
def create_task(title: str, description: str,
priority: str = "medium",
assignee: Optional[str] = None) -> str:
"""Create a new task on the project board.
Args:
title: Short title for the task
description: What needs to be done
priority: low, medium, high, or critical
assignee: Person to assign the task to
"""
valid = ["low", "medium", "high", "critical"]
if priority not in valid:
return f"Invalid priority. Must be one of: {', '.join(valid)}"
task_id = str(uuid.uuid4())[:8]
task = Task(
id=task_id, title=title, description=description,
priority=priority, assignee=assignee
)
tasks_db[task_id] = task
return (
f"Created task {task_id}: '{title}' "
f"(priority: {priority}, "
f"assignee: {assignee or 'unassigned'})"
)
@function_tool
def list_tasks(status: Optional[str] = None,
assignee: Optional[str] = None) -> str:
"""List tasks, optionally filtered by status or assignee.
Args:
status: Filter by todo, in_progress, review, or done
assignee: Filter by assignee name
"""
filtered = list(tasks_db.values())
if status:
filtered = [t for t in filtered if t.status == status]
if assignee:
filtered = [t for t in filtered if t.assignee == assignee]
if not filtered:
return "No tasks found."
lines = []
for t in filtered:
lines.append(
f"[{t.id}] {t.title} | {t.status} | "
f"{t.priority} | {t.assignee or 'unassigned'}"
)
return "\n".join(lines)
@function_tool
def update_task(task_id: str, new_status: str) -> str:
"""Update the status of a task.
Args:
task_id: The task ID
new_status: New status (todo, in_progress, review, done)
"""
valid = ["todo", "in_progress", "review", "done"]
if new_status not in valid:
return f"Invalid status. Must be one of: {', '.join(valid)}"
task = tasks_db.get(task_id)
if not task:
return f"Task {task_id} not found."
old_status = task.status
task.status = new_status
return (
f"Updated task {task_id} from '{old_status}' "
f"to '{new_status}'."
)
@function_tool
def get_task(task_id: str) -> str:
"""Get full details of a task.
Args:
task_id: The task ID to look up
"""
task = tasks_db.get(task_id)
if not task:
return f"Task {task_id} not found."
return (
f"Task: {task.id}\n"
f"Title: {task.title}\n"
f"Description: {task.description}\n"
f"Status: {task.status}\n"
f"Priority: {task.priority}\n"
f"Assignee: {task.assignee or 'unassigned'}\n"
f"Created: {task.created_at}"
)
@function_tool
def notify(recipient: str, message: str) -> str:
"""Send a notification to a team member.
Args:
recipient: Person to notify
message: Notification message
"""
print(f" [NOTIFY {recipient}]: {message}")
return f"Notification sent to {recipient}."
# --- Guardrails ---
async def check_input(ctx, agent, user_input):
blocked = ["delete all", "remove everything", "drop table"]
text = user_input.lower() if isinstance(user_input, str) else ""
triggered = any(term in text for term in blocked)
return GuardrailFunctionOutput(
output_info={"blocked": triggered},
tripwire_triggered=triggered
)
# --- Hooks ---
class AgentLogger(RunHooks):
async def on_tool_start(self, context, agent, tool):
print(f" > Calling {tool.name}...")
async def on_tool_end(self, context, agent, tool, result):
preview = str(result)[:80]
print(f" < {tool.name} returned: {preview}")
# --- Agent ---
taskbot = Agent(
name="TaskBot",
model="claude-sonnet-4-6",
instructions="""You are TaskBot, a project management assistant.
Rules:
- Confirm details before creating tasks
- Default to medium priority unless told otherwise
- Notify assignees when their tasks change status
- Be concise and helpful
- Never create tasks without an explicit request""",
tools=[create_task, list_tasks, update_task, get_task, notify],
input_guardrails=[
InputGuardrail(guardrail_function=check_input)
]
)
# --- Main ---
async def main():
print("TaskBot ready. Type 'quit' to exit.\n")
history = []
hooks = AgentLogger()
while True:
user_input = input("You: ")
if user_input.lower() in ("quit", "exit"):
break
history.append({"role": "user", "content": user_input})
try:
result = await Runner.run(
taskbot, history,
run_hooks=hooks, max_turns=10
)
response = result.final_output
history.append(
{"role": "assistant", "content": response}
)
print(f"\nTaskBot: {response}\n")
except Exception as e:
print(f"\nError: {e}\n")
if __name__ == "__main__":
asyncio.run(main())
Save this as taskbot.py, set your ANTHROPIC_API_KEY, and run it with python taskbot.py. You will have a working project management agent that creates tasks, tracks status, sends notifications, and blocks destructive inputs.
Troubleshooting Common Errors
Building agents involves moving parts that fail in predictable ways. This section covers the errors you will actually encounter, with the exact messages and fixes.
"ANTHROPIC_API_KEY is not set"
This appears when the SDK cannot find your API key. The SDK checks the ANTHROPIC_API_KEY environment variable at runtime.
# Verify the key is set
echo $ANTHROPIC_API_KEY
# If empty, set it
export ANTHROPIC_API_KEY="sk-ant-..."
On macOS and Linux, environment variables set with export only persist for the current shell session. For permanent configuration, add the export to your ~/.bashrc, ~/.zshrc, or use a .env file with python-dotenv.
A common mistake is setting ANTHROPIC_API_KEY in one terminal and running the agent in another. Each terminal session has its own environment.
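If you prefer a .env file, python-dotenv (installed separately with pip install python-dotenv) loads it before the SDK reads the environment:
# At the top of main.py, before anything reads os.environ
from dotenv import load_dotenv

load_dotenv()  # reads ANTHROPIC_API_KEY and friends from ./.env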
"Tool function must return a string"
The SDK expects every tool function to return a str. If your tool returns None (for example, a function that performs an action but has no explicit return statement), the SDK raises this error.
# Wrong: implicit None return
@function_tool
def delete_item(item_id: str):
"""Delete an item."""
database.delete(item_id)
# Correct: always return a string
@function_tool
def delete_item(item_id: str) -> str:
"""Delete an item."""
database.delete(item_id)
return f"Item {item_id} deleted."
"Maximum turns exceeded"
This means the agent hit the max_turns limit without producing a final response. It usually indicates one of two problems. Either the agent is stuck in a loop (calling the same tool repeatedly with the same arguments) or the task genuinely requires more steps than the limit allows.
# Diagnose with hooks
class DebugHooks(RunHooks):
async def on_tool_start(self, context, agent, tool):
print(f"Turn: calling {tool.name}")
result = await Runner.run(
agent, message,
max_turns=10,
run_hooks=DebugHooks()
)
If the logs show the same tool call repeating, the issue is usually a tool that returns ambiguous results. Claude cannot determine whether the action succeeded, so it tries again. Make tool return values explicit about success or failure.
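Concretely, make the return value state the outcome rather than echoing raw data. A sketch with a hypothetical archive tool:
@function_tool
def archive_task(task_id: str) -> str:
    """Archive a completed task."""
    archived = database.archive(task_id)  # hypothetical helper returning bool
    # Ambiguous: `return str(archived)` sends back just 'True' or 'False',
    # which Claude may not connect to the action it requested.
    # Explicit: state what happened so Claude neither retries a success
    # nor reports a failure as done.
    if archived:
        return f"Success: task {task_id} archived."
    return f"Failed: task {task_id} does not exist or is not marked done."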
"InputGuardrailTripwireTriggered" in Production
When a guardrail trips, the SDK raises InputGuardrailTripwireTriggered. If you do not catch this exception, your application crashes. Always wrap Runner.run() calls in try/except blocks in production code.
from agents.exceptions import (
InputGuardrailTripwireTriggered,
OutputGuardrailTripwireTriggered
)
try:
result = await Runner.run(agent, user_input)
except InputGuardrailTripwireTriggered as e:
# Log the blocked input for security review
logger.warning(f"Input blocked: {e}")
return "That request cannot be processed."
except OutputGuardrailTripwireTriggered as e:
logger.warning(f"Output blocked: {e}")
return "The response was filtered for safety."
Rate Limit Errors (429)
The Anthropic API returns HTTP 429 when you exceed your rate limit. The SDK does not automatically retry. You need to handle this yourself.
import asyncio
from anthropic import RateLimitError
async def run_with_retry(agent, message, max_retries=3):
for attempt in range(max_retries):
try:
return await Runner.run(agent, message)
except RateLimitError:
wait_time = 2 ** attempt # 1s, 2s, 4s
print(f"Rate limited. Retrying in {wait_time}s...")
await asyncio.sleep(wait_time)
raise Exception("Max retries exceeded")
JSON Serialisation Errors in Tool Arguments
Claude occasionally sends tool arguments that do not match the expected types. A parameter typed as int might receive a string like "42". The SDK validates types before calling your function, but edge cases exist with complex nested types.
Keep tool parameter types simple. Use str, int, float, bool, and Optional variants. Avoid deeply nested Pydantic models as tool parameters. If you need complex input, accept a JSON string and parse it inside the tool function.
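A sketch of that pattern: the tool accepts one JSON string, documents the expected shape in its docstring, and validates inside the function.
import json

from agents import function_tool
from database import create_task

@function_tool
def bulk_create_tasks(tasks_json: str) -> str:
    """Create several tasks in one call.

    Args:
        tasks_json: JSON array of objects with 'title' and 'description'
            keys, e.g. '[{"title": "Fix bug", "description": "SSO login"}]'
    """
    try:
        items = json.loads(tasks_json)
    except json.JSONDecodeError as e:
        return f"Invalid JSON: {e}"
    created = []
    for item in items:
        if "title" not in item or "description" not in item:
            return "Each object needs 'title' and 'description' keys."
        created.append(create_task(item["title"], item["description"]).id)
    return f"Created {len(created)} tasks: {', '.join(created)}"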
Production Deployment Considerations
Moving from a local script to a production deployment introduces several concerns beyond the code itself.
Environment Configuration
Separate your agent configuration from your code. Model names, API keys, rate limits, and feature flags should come from environment variables or a configuration file, not hardcoded values.
import os
MODEL = os.environ.get("AGENT_MODEL", "claude-sonnet-4-6")
MAX_TURNS = int(os.environ.get("AGENT_MAX_TURNS", "10"))
MAX_RPM = int(os.environ.get("AGENT_MAX_RPM", "20"))
agent = Agent(
name="TaskBot",
model=MODEL,
instructions="...",
tools=[...],
)
This lets you run the same code with claude-haiku-4-6 in development (faster, cheaper) and claude-sonnet-4-6 in production without changing any code.
Conversation History Management
In production, conversation histories grow unbounded. A user who interacts with your agent for an hour can accumulate enough history to exceed Claude's context window. Implement a sliding window or summarisation strategy.
MAX_HISTORY_MESSAGES = 20
async def chat_with_limit(agent, history, user_input):
history.append({"role": "user", "content": user_input})
    # Trim in place so the caller's list stays in sync (the system
    # prompt lives on the agent, so it is never trimmed)
    if len(history) > MAX_HISTORY_MESSAGES:
        history[:] = history[-MAX_HISTORY_MESSAGES:]
result = await Runner.run(agent, history)
history.append({"role": "assistant", "content": result.final_output})
return result.final_output
For more sophisticated approaches, summarise older messages into a single context message that captures the key facts from the conversation so far. This preserves important context while keeping token usage predictable.
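A hedged sketch of that idea: compress everything except the recent tail with a cheap, tool-free summariser agent (the prompt and the keep_last threshold are illustrative):
from agents import Agent, Runner

summariser = Agent(
    name="Summariser",
    model="claude-haiku-4-6",  # a cheap model is fine for compression
    instructions="Summarise this conversation into key facts, "
                 "decisions made, and open task IDs. Be terse.",
)

async def compact_history(history: list, keep_last: int = 10) -> list:
    if len(history) <= keep_last:
        return history
    stale, recent = history[:-keep_last], history[-keep_last:]
    result = await Runner.run(summariser, stale)
    summary = {
        "role": "user",
        "content": f"[Context from earlier in this conversation]\n"
                   f"{result.final_output}",
    }
    return [summary] + recent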
Structured Logging for Debugging
Print statements work during development. In production, use structured logging that includes request IDs, user IDs, and timing information.
import logging
import json
from datetime import datetime
logger = logging.getLogger("agent")
class StructuredHooks(RunHooks):
def __init__(self, request_id: str, user_id: str):
self.request_id = request_id
self.user_id = user_id
self.start_time = None
async def on_agent_start(self, context, agent):
self.start_time = datetime.now()
logger.info(json.dumps({
"event": "agent_start",
"request_id": self.request_id,
"user_id": self.user_id,
"agent": agent.name,
}))
async def on_tool_end(self, context, agent, tool, result):
logger.info(json.dumps({
"event": "tool_complete",
"request_id": self.request_id,
"tool": tool.name,
"result_length": len(str(result)),
"elapsed_ms": (datetime.now() - self.start_time).total_seconds() * 1000,
}))
This structured output integrates with log aggregation services and makes it possible to trace a single request through the entire agent execution pipeline.
Graceful Degradation
When the Anthropic API is down or slow, your agent should fail gracefully rather than hanging or crashing.
import asyncio
async def handle_request(agent, message, timeout_seconds=30):
try:
result = await asyncio.wait_for(
Runner.run(agent, message, max_turns=10),
timeout=timeout_seconds
)
return result.final_output
except asyncio.TimeoutError:
return "The request took too long. Please try again."
except Exception as e:
logger.error(f"Agent error: {e}")
return "Something went wrong. Please try again later."
Set timeouts at both the runner level (max_turns) and the application level (asyncio.wait_for). The max_turns limit prevents infinite tool loops. The timeout prevents the entire request from hanging if the API is slow.
The Lesson
Building an agent is not about the library. It is about the decisions you make when using it. Which tools to expose. What instructions to write.
Where to put guardrails. How to handle failures.
The Agent SDK handles the mechanical parts (the agentic loop, tool execution, conversation management) so you can focus on these decisions. Every hour previously spent debugging a hand-rolled loop is now an hour spent improving tools and instructions.
The patterns in this guide transfer to any agent you build. The tools will be different. The domain will be different. But the structure remains the same: clear tool definitions with precise docstrings, thoughtful instructions, guardrails at the boundaries, observability throughout.
For a comparison of how the Agent SDK stacks up against other frameworks, our guide on Claude Agent SDK vs LangChain covers that in detail. And to extend your agents with external services, the guide on MCP servers and extensions shows how to connect agents to the broader tool ecosystem.
Conclusion
This guide started by describing a first agent built the hard way with raw API calls and manual loops. TaskBot is the same kind of agent, but built in an afternoon instead of a week. It has guardrails, observability, error handling, and multi-turn support.
It is testable. It is maintainable. And the code is readable enough that a new team member can understand it without a walkthrough.
The Agent SDK is not doing anything you could not do yourself. It is doing what you would do yourself, but tested, maintained, and improved by the team that builds Claude. That is the value proposition. Not magic. Leverage.
Start with a single tool. Get it working. Add a second. Add guardrails. Add observability.
Each step is small. Each step makes your agent more capable and more reliable. And every agent you build after the first one is faster, because the patterns are the same.
Build something. Ship it. Watch how your users interact with it. Then improve it. That is the loop that matters more than any agentic loop in any SDK.