Get Started in 60 Seconds

Install the Agent SDK and set your API key:

pip install openai-agents anthropic
export ANTHROPIC_API_KEY="your-key-here"

Create agent.py and paste this:

from agents import Agent, Runner, function_tool

@function_tool
def get_weather(city: str) -> str:
    """Get the current weather for a city."""
    return f"The weather in {city} is 22°C and sunny."

agent = Agent(
    name="assistant",
    model="claude-sonnet-4-6",
    instructions="You are a helpful assistant. Use tools when needed.",
    tools=[get_weather],
)

result = Runner.run_sync(agent, "What is the weather in London?")
print(result.final_output)

Run it:

python agent.py
# Output: The weather in London is 22°C and sunny.

That is a working agent with tool use. The openai-agents package provides the agents module. The SDK handles the agentic loop (send message, detect tool call, execute tool, return result, repeat) so you write only the tool and the instructions.

The rest of this guide builds a production-ready agent with real tools, guardrails, multi-turn conversations, and observability. By the end, you will have a project management agent that creates tasks, checks status, assigns work, and handles the edge cases that real applications encounter.

Why Use the Agent SDK

Building agents with raw API calls means writing the agentic loop yourself: send a message, check for tool calls, parse them, execute them, send results back, and repeat. The Agent SDK handles all of that, plus conversation management and error handling, so you focus on defining tools and instructions.

| Capability | Raw Anthropic SDK (anthropic-sdk-python) | Agent SDK (openai-agents) | LangChain / LangGraph (langchain) |
| --- | --- | --- | --- |
| Direct Messages API access | built-in | wrapped | wrapped |
| Managed agentic loop | you write it | built-in (Runner.run) | built-in (AgentExecutor, LangGraph) |
| Tool definition via Python decorators | n/a | @function_tool | @tool |
| Input and output guardrails as first-class config | n/a | InputGuardrail, OutputGuardrail | via callbacks / parsers |
| Agent handoffs built in | n/a | handoffs=[...] | via LangGraph state machine |
| Lifecycle hooks (on_tool_start, on_tool_end) | n/a | RunHooks | BaseCallbackHandler |
| Vendor lock-in | Anthropic only | multi-provider (Anthropic, OpenAI, others) | multi-provider |
| Dependency count on install | 1 (anthropic) | 2 (openai-agents, anthropic) | 20+ transitive via langchain, langchain-core, langchain-community |

Data source: openai-agents on PyPI, anthropic-sdk-python on GitHub, and langchain on PyPI, as of 2026-04.

The Problem

Every agent shares the same core challenge. You need to give an LLM the ability to take actions in the real world, while keeping control over what those actions are, how they execute, and what happens when they fail.

The raw approach means writing the agentic loop yourself. Send a message to the API. Check if the response contains tool calls. If it does, parse them, execute them, send the results back, and check again. Handle errors at each step.

Manage conversation history. Track token usage. Implement timeouts. Add logging.
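Written out, that loop looks roughly like this. The sketch below assumes the Anthropic Messages API shapes (tool_use content blocks, stop_reason, tool_result messages); the function name, its parameters, and the name-to-function tool map are our own scaffolding, and retries, timeouts, and history trimming are all omitted:

```python
def run_agent_loop(client, model, system, tool_fns, tool_schemas,
                   user_message, max_turns=10):
    """One hand-rolled agentic loop: call the model, run tools, repeat."""
    messages = [{"role": "user", "content": user_message}]
    for _ in range(max_turns):
        response = client.messages.create(
            model=model, max_tokens=1024, system=system,
            messages=messages, tools=tool_schemas,
        )
        if response.stop_reason != "tool_use":
            # No tool calls left: the text blocks are the final answer
            return "".join(b.text for b in response.content
                           if b.type == "text")
        # Echo the assistant turn, then execute each requested tool
        messages.append({"role": "assistant", "content": response.content})
        results = []
        for block in response.content:
            if block.type == "tool_use":
                output = tool_fns[block.name](**block.input)
                results.append({"type": "tool_result",
                                "tool_use_id": block.id,
                                "content": str(output)})
        messages.append({"role": "user", "content": results})
    raise RuntimeError("agent exceeded max_turns without finishing")
```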

That loop is not complex in principle. It is tedious in practice. And the tedium creates bugs.

A missed error handler. A conversation history that grows without bound. A tool call parser that breaks on edge cases.

The Agent SDK eliminates the tedium. It gives you a tested, maintained implementation of the agentic loop, and it lets you define the interesting parts, the tools, the instructions, the guardrails, as simple Python functions and configuration.

Here is how.

The Journey: Build Your AI Agent Step by Step

What We Are Building

The agent we are building is called TaskBot. It manages a simple project board with these capabilities.

It can create tasks with a title, description, priority, and assignee. It can list tasks filtered by status or assignee. It can update task status. It can send notifications when tasks change. And it can provide summaries of project progress.

This is a realistic use case. It touches databases, external APIs, and business logic. It requires multi-turn conversations where the user asks follow-up questions. And it needs guardrails to prevent misuse.

The full code is at the end of this guide. We will build it piece by piece so you understand each decision.

Setting Up the Project

Start by creating a project directory and installing the SDK.

mkdir taskbot && cd taskbot
python -m venv venv
source venv/bin/activate
pip install openai-agents anthropic

The Agent SDK is distributed as the openai-agents package, which supports multiple model providers including Claude through the Anthropic integration. You will also need an Anthropic API key.

export ANTHROPIC_API_KEY="your-key-here"

Create the project structure.

taskbot/
  agent.py          # Agent definition
  tools.py          # Tool functions
  guardrails.py     # Input/output validation
  models.py         # Data models
  database.py       # Storage layer
  main.py           # Entry point

This structure separates concerns. Tools are pure functions that interact with the database. The agent definition is configuration. Guardrails are validation logic. The entry point ties everything together.

Defining the Data Models

Before writing tools, define what you are working with. This example uses Python dataclasses for simplicity, but Pydantic models work well here too.

# models.py
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum
from typing import Optional

class TaskStatus(Enum):
    TODO = "todo"
    IN_PROGRESS = "in_progress"
    REVIEW = "review"
    DONE = "done"

class Priority(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"
    CRITICAL = "critical"

@dataclass
class Task:
    id: str
    title: str
    description: str
    status: TaskStatus = TaskStatus.TODO
    priority: Priority = Priority.MEDIUM
    assignee: Optional[str] = None
    created_at: datetime = field(default_factory=datetime.now)
    updated_at: datetime = field(default_factory=datetime.now)

Building a Simple Storage Layer

For this tutorial, an in-memory database keeps things simple. In production, you would swap this for PostgreSQL, SQLite, or whatever your application uses. The agent does not care about the storage implementation. It only interacts with tools.

# database.py
from datetime import datetime
from models import Task, TaskStatus
from typing import Optional
import uuid

_tasks: dict[str, Task] = {}

def create_task(title: str, description: str,
                priority: str = "medium",
                assignee: Optional[str] = None) -> Task:
    task_id = str(uuid.uuid4())[:8]
    task = Task(
        id=task_id,
        title=title,
        description=description,
        # Stored as the raw string for simplicity; wrap in Priority(priority)
        # here if you want enum validation at the storage layer
        priority=priority,
        assignee=assignee
    )
    _tasks[task_id] = task
    return task

def get_task(task_id: str) -> Optional[Task]:
    return _tasks.get(task_id)

def list_tasks(status: Optional[str] = None,
               assignee: Optional[str] = None) -> list[Task]:
    tasks = list(_tasks.values())
    if status:
        tasks = [t for t in tasks if t.status.value == status]
    if assignee:
        tasks = [t for t in tasks if t.assignee == assignee]
    return tasks

def update_task_status(task_id: str, new_status: str) -> Optional[Task]:
    task = _tasks.get(task_id)
    if task:
        task.status = TaskStatus(new_status)
        task.updated_at = datetime.now()
    return task
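To illustrate that swap, here is a hypothetical SQLite-backed create_task with the same signature (it returns the new task's id rather than a Task object, and the schema is abbreviated); tools.py would not need to change:

```python
# database_sqlite.py (hypothetical alternative storage layer)
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")  # use a file path in production
conn.execute(
    """CREATE TABLE IF NOT EXISTS tasks (
        id TEXT PRIMARY KEY, title TEXT, description TEXT,
        status TEXT DEFAULT 'todo', priority TEXT DEFAULT 'medium',
        assignee TEXT)"""
)

def create_task(title: str, description: str,
                priority: str = "medium", assignee=None) -> str:
    """Same signature as the in-memory version, so the tool layer is untouched."""
    task_id = str(uuid.uuid4())[:8]
    conn.execute(
        "INSERT INTO tasks (id, title, description, priority, assignee) "
        "VALUES (?, ?, ?, ?, ?)",
        (task_id, title, description, priority, assignee),
    )
    conn.commit()
    return task_id
```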

Defining Tools

Tools are where the agent meets the real world. Each tool is a Python function decorated with @function_tool. The function's name becomes the tool's name. The docstring becomes the tool's description, which Claude reads to understand when and how to use the tool. Type hints become the parameter schema.

# tools.py
from agents import function_tool
from database import create_task, get_task, list_tasks, update_task_status
from typing import Optional

@function_tool
def create_new_task(title: str, description: str,
                    priority: str = "medium",
                    assignee: Optional[str] = None) -> str:
    """Create a new task on the project board.

    Args:
        title: Short title for the task
        description: Detailed description of what needs to be done
        priority: One of 'low', 'medium', 'high', or 'critical'
        assignee: Name of the person to assign the task to
    """
    task = create_task(title, description, priority, assignee)
    return (
        f"Created task {task.id}: '{task.title}' "
        f"(priority: {priority}, assignee: {assignee or 'unassigned'})"
    )

@function_tool
def list_project_tasks(status: Optional[str] = None,
                       assignee: Optional[str] = None) -> str:
    """List tasks on the project board, optionally filtered.

    Args:
        status: Filter by status ('todo', 'in_progress', 'review', 'done')
        assignee: Filter by assignee name
    """
    tasks = list_tasks(status, assignee)
    if not tasks:
        return "No tasks found matching the criteria."

    lines = []
    for t in tasks:
        lines.append(
            f"[{t.id}] {t.title} | "
            f"Status: {t.status.value} | "
            f"Priority: {t.priority} | "
            f"Assignee: {t.assignee or 'unassigned'}"
        )
    return "\n".join(lines)

@function_tool
def update_status(task_id: str, new_status: str) -> str:
    """Update the status of an existing task.

    Args:
        task_id: The ID of the task to update
        new_status: New status ('todo', 'in_progress', 'review', 'done')
    """
    task = update_task_status(task_id, new_status)
    if task:
        return f"Updated task {task_id} to status '{new_status}'."
    return f"Task {task_id} not found."

@function_tool
def get_task_details(task_id: str) -> str:
    """Get full details of a specific task.

    Args:
        task_id: The ID of the task to look up
    """
    task = get_task(task_id)
    if not task:
        return f"Task {task_id} not found."

    return (
        f"Task {task.id}\n"
        f"Title: {task.title}\n"
        f"Description: {task.description}\n"
        f"Status: {task.status.value}\n"
        f"Priority: {task.priority}\n"
        f"Assignee: {task.assignee or 'unassigned'}\n"
        f"Created: {task.created_at.isoformat()}\n"
        f"Updated: {task.updated_at.isoformat()}"
    )

@function_tool
def send_notification(recipient: str, message: str) -> str:
    """Send a notification to a team member.

    Args:
        recipient: Name of the person to notify
        message: The notification message
    """
    # In production, this would call an email/Slack/Teams API
    print(f"[NOTIFICATION to {recipient}]: {message}")
    return f"Notification sent to {recipient}."

Notice that each tool returns a string. The SDK sends this string back to Claude as the tool result. Claude uses it to formulate its response. Clear, informative return values make Claude's responses better.

The docstrings matter enormously. Claude reads them to decide which tool to use and how to call it. A vague docstring produces vague tool usage. A precise docstring with documented parameters produces precise tool calls.

Creating the Agent

With tools defined, the agent itself is straightforward.

# agent.py
from agents import Agent
from tools import (
    create_new_task,
    list_project_tasks,
    update_status,
    get_task_details,
    send_notification
)

taskbot = Agent(
    name="TaskBot",
    model="claude-sonnet-4-6",
    instructions="""You are TaskBot, a project management assistant.
    You help teams manage their tasks and stay organised.

    Guidelines:
    - When creating tasks, always confirm the details with the user first
    - Use 'medium' priority unless the user specifies otherwise
    - When updating task status, notify the assignee
    - Provide concise summaries when listing tasks
    - If a task ID is not found, suggest listing all tasks
    - Always be helpful but never create tasks without explicit user request""",
    tools=[
        create_new_task,
        list_project_tasks,
        update_status,
        get_task_details,
        send_notification
    ]
)

The instructions field is your system prompt. It shapes how the agent behaves across all conversations. Write it like you are briefing a new team member. Be specific about what the agent should and should not do.

The model field determines which Claude model powers the agent. Use claude-sonnet-4-6 for a good balance of speed and capability. Use claude-opus-4-6 for tasks that require deeper reasoning. Use claude-haiku-4-6 for simple, high-volume tasks where speed matters most. For a broader view of how Claude Code's capabilities stack up against other AI development tools, see our Claude Code vs Cursor comparison.

Understanding the Agentic Loop

When you call Runner.run_sync(taskbot, "Create a task for the API migration"), this is what happens internally.

First, the SDK sends your message to Claude along with the system instructions and the tool definitions. Claude receives everything it needs to understand who it is, what it can do, and what the user wants.

Second, Claude responds. If it decides to call a tool, the response contains a tool use block with the tool name and parameters. If it decides to respond directly, the response contains text.

Third, if there was a tool call, the SDK executes your tool function with the provided parameters. It captures the return value as a string.

Fourth, the SDK sends the tool result back to Claude. Claude now knows what happened and can decide to call another tool, ask a follow-up question, or provide a final response.

This loop continues until Claude produces a response with no tool calls. That response becomes the final_output of the run.

The key insight is that you never write this loop. The SDK handles it. Your job is to define the tools and instructions that shape what happens inside the loop.

| Step | Actor | What happens | Data passed |
| --- | --- | --- | --- |
| 1 | Runner | Sends system prompt, tool schemas, and the user message to Claude | Messages array plus tool definitions |
| 2 | Claude | Returns either a final text response or one or more tool_use blocks | Assistant message with stop_reason |
| 3 | Runner | Invokes each Python function matched to a tool_use block and captures the return value | Tool name plus arguments dict |
| 4 | Runner | Appends a tool_result block and re-sends the full conversation to Claude | Tool result string keyed to tool_use_id |
| 5 | Loop | Steps 2 to 4 repeat until Claude returns stop_reason: "end_turn" or the max_turns cap is hit | RunResult.final_output |

Data source: Anthropic tool use docs and Agent SDK overview, as of 2026-04.

Multi-Turn Conversations

A single Runner.run_sync() call handles one turn of conversation. For multi-turn interactions where the user asks follow-up questions, you need to maintain the conversation history.

# main.py
from agents import Runner
from agent import taskbot

async def chat():
    history = []

    while True:
        user_input = input("\nYou: ")
        if user_input.lower() in ("quit", "exit"):
            break

        # Add user message to history
        history.append({
            "role": "user",
            "content": user_input
        })

        result = await Runner.run(
            taskbot,
            history
        )

        # Add assistant response to history
        history.append({
            "role": "assistant",
            "content": result.final_output
        })

        print(f"\nTaskBot: {result.final_output}")

if __name__ == "__main__":
    import asyncio
    asyncio.run(chat())

The conversation history is a list of message objects. Each message has a role (user or assistant) and content. The SDK sends the full history with each request, giving Claude the context of the entire conversation.

Be mindful of history length. Each message consumes tokens. For long-running sessions, you may need to summarise older messages or implement a sliding window that keeps only the most recent exchanges.
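A sliding window can be a few lines. This hypothetical helper keeps the opening message, which often carries the most context, plus the most recent exchanges:

```python
def trim_history(history: list[dict], window: int = 10) -> list[dict]:
    """Keep the first message plus the most recent (window - 1) messages."""
    if len(history) <= window:
        return history
    return history[:1] + history[-(window - 1):]
```

Call it on the history before each run so the request stays bounded; for higher fidelity, replace the dropped middle with a single summary message instead.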

Adding Guardrails

Guardrails are functions that validate inputs before the agent processes them and outputs before the agent returns them. They are your safety layer.

# guardrails.py
from agents import (
    InputGuardrail,
    OutputGuardrail,
    GuardrailFunctionOutput,
    Agent,
    Runner
)

BLOCKED_TERMS = [
    "delete all", "drop table", "remove everything",
    "fire everyone", "terminate all"
]

async def validate_input(ctx, agent, user_input):
    """Block potentially destructive or harmful requests."""
    lower_input = user_input.lower() if isinstance(user_input, str) else ""
    triggered = any(term in lower_input for term in BLOCKED_TERMS)

    return GuardrailFunctionOutput(
        output_info={
            "blocked_term_found": triggered,
            "input_length": len(lower_input)
        },
        tripwire_triggered=triggered
    )

async def validate_output(ctx, agent, output):
    """Ensure the agent does not leak internal details."""
    lower_output = output.lower() if isinstance(output, str) else ""
    leaks_internals = any(
        term in lower_output
        for term in ["api_key", "database_url", "internal_secret"]
    )

    return GuardrailFunctionOutput(
        output_info={"leaks_internals": leaks_internals},
        tripwire_triggered=leaks_internals
    )

Now add the guardrails to the agent definition.

from guardrails import validate_input, validate_output
from agents import InputGuardrail, OutputGuardrail

taskbot = Agent(
    name="TaskBot",
    model="claude-sonnet-4-6",
    instructions="...",
    tools=[...],
    input_guardrails=[
        InputGuardrail(guardrail_function=validate_input)
    ],
    output_guardrails=[
        OutputGuardrail(guardrail_function=validate_output)
    ]
)

When a guardrail trips, the SDK raises an exception that you catch in your application code. The agent never sees the blocked input. The user gets a clear error message.

from agents.exceptions import InputGuardrailTripwireTriggered

try:
    result = await Runner.run(taskbot, user_input)
except InputGuardrailTripwireTriggered:
    print("That request was blocked by our safety policy.")

We recommend running guardrails on every agent in production. The overhead is minimal, typically a few milliseconds for pattern matching. The protection is significant. A guardrail that catches one destructive request has paid for itself permanently.

Error Handling

Tools fail. APIs time out. Databases go down. Your agent needs to handle these failures gracefully.

The simplest approach is to handle errors within the tool function itself.

@function_tool
def create_new_task(title: str, description: str,
                    priority: str = "medium",
                    assignee: Optional[str] = None) -> str:
    """Create a new task on the project board."""
    try:
        # Validate priority
        valid_priorities = ["low", "medium", "high", "critical"]
        if priority not in valid_priorities:
            return (
                f"Invalid priority '{priority}'. "
                f"Must be one of: {', '.join(valid_priorities)}"
            )

        task = create_task(title, description, priority, assignee)
        return (
            f"Created task {task.id}: '{task.title}' "
            f"(priority: {priority})"
        )
    except Exception as e:
        return f"Failed to create task: {str(e)}"

By returning error messages as strings rather than raising exceptions, you let Claude handle the failure conversationally. Claude sees "Failed to create task: database connection timeout" and can tell the user what happened, suggest a retry, or try an alternative approach.

For critical failures that should stop the agent entirely, raise an exception. The SDK will stop the agentic loop and propagate the error to your application code.

You should also set timeouts at the runner level to prevent agents from running indefinitely.

result = await Runner.run(
    taskbot,
    user_input,
    max_turns=10  # Stop after 10 tool call cycles
)

The max_turns parameter prevents infinite loops where the agent keeps calling tools without reaching a conclusion. Ten turns is a reasonable default for most agents. Increase it for agents that need to perform many sequential operations.

Agent Handoffs

Sometimes a single agent cannot handle everything. The Agent SDK supports handoffs, where one agent delegates to another for specialised tasks.

Imagine TaskBot needs to handle both project management and time tracking. Instead of cramming both into one agent, you create two specialised agents and let them hand off to each other.

from agents import Agent

time_tracker = Agent(
    name="TimeTracker",
    model="claude-sonnet-4-6",
    instructions="""You track time spent on tasks.
    You can log hours, view time reports, and calculate
    utilisation rates. For task management questions,
    hand off to the TaskBot agent.""",
    tools=[log_time, get_time_report, calculate_utilisation],
    handoffs=[]  # Will be set after taskbot is defined
)

taskbot = Agent(
    name="TaskBot",
    model="claude-sonnet-4-6",
    instructions="""You manage project tasks.
    For time tracking questions, hand off to the
    TimeTracker agent.""",
    tools=[
        create_new_task,
        list_project_tasks,
        update_status,
        get_task_details,
        send_notification
    ],
    handoffs=[time_tracker]
)

# Complete the circular reference
time_tracker.handoffs = [taskbot]

When a user asks TaskBot "How many hours did Sarah log this week?", TaskBot recognises this is a time tracking question and hands off to TimeTracker. TimeTracker handles the request with its specialised tools and returns the result.

This pattern keeps each agent focused. Focused agents are easier to test, easier to debug, and produce better results because their instructions and tools are not diluted by unrelated capabilities.

Observability and Monitoring

In production, you need to know what your agent is doing. The SDK provides hooks that let you observe every step of the agentic loop.

from agents import RunHooks, RunContextWrapper, Tool, Agent
from datetime import datetime

class ProductionHooks(RunHooks):
    def __init__(self):
        self.tool_calls = []
        self.start_time = None
        self.total_tokens = 0

    async def on_agent_start(self, context, agent):
        self.start_time = datetime.now()
        print(f"[{self.start_time}] Agent '{agent.name}' started")

    async def on_tool_start(self, context, agent, tool):
        print(f"  [TOOL CALL] {tool.name}")

    async def on_tool_end(self, context, agent, tool, result):
        self.tool_calls.append({
            "tool": tool.name,
            "timestamp": datetime.now().isoformat(),
            "result_length": len(str(result))
        })
        print(f"  [TOOL DONE] {tool.name} ({len(str(result))} chars)")

    async def on_agent_end(self, context, agent, output):
        duration = (datetime.now() - self.start_time).total_seconds()
        print(f"[COMPLETE] {len(self.tool_calls)} tool calls "
              f"in {duration:.1f}s")

hooks = ProductionHooks()
result = await Runner.run(
    taskbot,
    user_input,
    hooks=hooks
)

These hooks give you structured data about every agent execution. In production, send this data to a logging service for analysis. Common things to track include the number of tool calls per conversation, which tools are used most frequently, average execution time, error rates, and token consumption patterns.
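As a sketch of that, the hook data can be flattened into one structured log line per run. The record shape here is our own choice, not an SDK format, and print stands in for your logging client:

```python
import json
from datetime import datetime, timezone
from typing import Optional

def emit_run_record(agent_name: str, tool_calls: list[dict],
                    duration_s: float, error: Optional[str] = None) -> str:
    """Serialise one agent run as a single JSON log line."""
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "agent": agent_name,
        "tool_call_count": len(tool_calls),
        "tools_used": sorted({c["tool"] for c in tool_calls}),
        "duration_s": round(duration_s, 3),
        "error": error,
    }
    line = json.dumps(record)
    print(line)  # replace with logger.info(line) or a log-shipping client
    return line
```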

If you are building agents that connect to external services through MCP servers, observability becomes even more important. MCP tool calls cross network boundaries, so you need to track latency and failures at each hop. When these connections involve authentication or sensitive data, follow our MCP authentication and security best practices.

Production Patterns

Several patterns have proven essential in production agents.

Async execution. The SDK supports async natively. Use Runner.run() instead of Runner.run_sync() in production to avoid blocking your application's event loop.

result = await Runner.run(taskbot, user_input)

Rate limiting. If your agent handles multiple users, implement rate limiting to avoid API quota exhaustion.

import asyncio
from collections import defaultdict
from time import time

class RateLimiter:
    def __init__(self, max_requests_per_minute=20):
        self.max_rpm = max_requests_per_minute
        self.requests = defaultdict(list)

    async def check(self, user_id: str):
        now = time()
        # Clean old entries
        self.requests[user_id] = [
            t for t in self.requests[user_id]
            if now - t < 60
        ]
        if len(self.requests[user_id]) >= self.max_rpm:
            raise Exception("Rate limit exceeded. Please wait.")
        self.requests[user_id].append(now)

limiter = RateLimiter()

async def handle_request(user_id: str, message: str):
    await limiter.check(user_id)
    result = await Runner.run(taskbot, message)
    return result.final_output

Cost tracking. Every agent call consumes tokens. Track usage to avoid surprises on your bill. The same model selection and context management habits that apply to interactive Claude Code sessions apply to agents; our guide on Claude Code cost optimisation covers the full set of techniques.

Worked example: a 5-turn TaskBot conversation that creates two tasks and lists them. Input counts include the system prompt, tool schemas, and full conversation history re-sent on each turn. Output counts cover Claude's text responses and tool_use blocks.

| Turn | Scenario | Input tokens | Output tokens |
| --- | --- | --- | --- |
| 1 | User asks to create a task, Claude emits tool_use | 1,800 | 120 |
| 2 | Runner returns tool_result, Claude confirms creation | 2,100 | 90 |
| 3 | User asks to create a second task, Claude emits tool_use | 2,400 | 130 |
| 4 | Runner returns tool_result, Claude confirms creation | 2,700 | 100 |
| 5 | User asks "show me all tasks", Claude calls list_tasks then replies | 3,100 | 260 |
| Total | | 12,100 | 700 |

Applying published Anthropic list prices (claude-sonnet-4-6: $3 per million input tokens, $15 per million output tokens; claude-haiku-4-6: $0.80 per million input tokens, $4 per million output tokens):

| Model | Input cost | Output cost | Total per conversation |
| --- | --- | --- | --- |
| claude-sonnet-4-6 | $0.0363 | $0.0105 | $0.0468 |
| claude-haiku-4-6 | $0.0097 | $0.0028 | $0.0125 |

A high-volume TaskBot handling 10,000 conversations per month on Sonnet costs about $468. The same traffic on Haiku costs about $125. Prompt caching cuts repeated-system-prompt input cost by up to 90% on cached reads.

Data source: Anthropic pricing page, Messages API reference, and prompt caching docs, as of 2026-04. Token counts are approximations for a realistic TaskBot run; actual usage varies with tool schemas and system prompt length.
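The arithmetic behind those figures reduces to a few lines (rates from the price list above, in dollars per million tokens):

```python
# $ per million tokens as (input, output), from the price list above
PRICES = {
    "claude-sonnet-4-6": (3.00, 15.00),
    "claude-haiku-4-6": (0.80, 4.00),
}

def conversation_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one conversation at list prices."""
    in_rate, out_rate = PRICES[model]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# The 5-turn worked example: 12,100 input and 700 output tokens
sonnet = conversation_cost("claude-sonnet-4-6", 12_100, 700)  # ≈ $0.0468
haiku = conversation_cost("claude-haiku-4-6", 12_100, 700)    # ≈ $0.0125
```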

A RunHooks subclass can accumulate token usage and report spend as the session runs (rates hardcoded here for claude-sonnet-4-6):

class CostTracker(RunHooks):
    def __init__(self):
        self.total_input_tokens = 0
        self.total_output_tokens = 0

    async def on_agent_end(self, context, agent, output):
        usage = getattr(context, 'usage', None)
        if usage:
            self.total_input_tokens += usage.input_tokens
            self.total_output_tokens += usage.output_tokens

        # $3 per million input tokens, $15 per million output tokens
        cost = (
            self.total_input_tokens * 0.003 / 1000 +
            self.total_output_tokens * 0.015 / 1000
        )
        print(f"Session cost so far: ${cost:.4f}")

Testing Your Agent

Testing agents requires two layers. Unit tests for individual tools and integration tests for the full agent loop.

Unit testing tools. Since tools are regular Python functions (wrapped with a decorator), you can test them directly.

# test_tools.py
import pytest
from database import create_task, list_tasks, _tasks
from models import TaskStatus

def setup_function():
    """Clear the database before each test."""
    _tasks.clear()

def test_create_task():
    task = create_task(
        "Fix login bug",
        "Users cannot log in with SSO",
        "high",
        "Sarah"
    )
    assert task.title == "Fix login bug"
    assert task.priority == "high"
    assert task.assignee == "Sarah"
    assert task.id is not None

def test_list_tasks_filter_by_status():
    create_task("Task 1", "Description", "medium")
    task2 = create_task("Task 2", "Description", "high")
    task2.status = TaskStatus.DONE

    todo_tasks = list_tasks(status="todo")
    assert len(todo_tasks) == 1
    assert todo_tasks[0].title == "Task 1"

def test_list_tasks_empty():
    tasks = list_tasks()
    assert tasks == []

Integration testing the agent. Use the SDK to run the agent against test inputs and verify the outputs.

# test_agent.py  (requires the pytest-asyncio plugin for @pytest.mark.asyncio)
import pytest
from agents import Runner
from agent import taskbot
from database import _tasks

@pytest.fixture(autouse=True)
def clear_db():
    _tasks.clear()
    yield
    _tasks.clear()

@pytest.mark.asyncio
async def test_agent_creates_task():
    result = await Runner.run(
        taskbot,
        "Create a high priority task called 'Deploy v2.0' "
        "about deploying the new version to production"
    )
    assert len(_tasks) == 1
    task = list(_tasks.values())[0]
    assert "Deploy" in task.title

@pytest.mark.asyncio
async def test_agent_handles_unknown_task():
    result = await Runner.run(
        taskbot,
        "Show me details for task xyz123"
    )
    assert "not found" in result.final_output.lower()

@pytest.mark.asyncio
async def test_guardrail_blocks_destructive_input():
    from agents.exceptions import InputGuardrailTripwireTriggered

    with pytest.raises(InputGuardrailTripwireTriggered):
        await Runner.run(taskbot, "Delete all tasks immediately")

Integration tests are slower because they make API calls. Run them in a separate test suite and use environment variables to point them at a test API key with lower rate limits.

The Complete Working Example

Here is the full TaskBot agent in a single file, ready to run.

#!/usr/bin/env python3
"""TaskBot - A project management agent built with the Claude Agent SDK."""

import asyncio
import uuid
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional

from agents import (
    Agent,
    Runner,
    function_tool,
    InputGuardrail,
    GuardrailFunctionOutput,
    RunHooks,
)

# --- Data Models ---

@dataclass
class Task:
    id: str
    title: str
    description: str
    status: str = "todo"
    priority: str = "medium"
    assignee: Optional[str] = None
    created_at: str = field(
        default_factory=lambda: datetime.now().isoformat()
    )

# --- Database ---

tasks_db: dict[str, Task] = {}

# --- Tools ---

@function_tool
def create_task(title: str, description: str,
                priority: str = "medium",
                assignee: Optional[str] = None) -> str:
    """Create a new task on the project board.

    Args:
        title: Short title for the task
        description: What needs to be done
        priority: low, medium, high, or critical
        assignee: Person to assign the task to
    """
    valid = ["low", "medium", "high", "critical"]
    if priority not in valid:
        return f"Invalid priority. Must be one of: {', '.join(valid)}"

    task_id = str(uuid.uuid4())[:8]
    task = Task(
        id=task_id, title=title, description=description,
        priority=priority, assignee=assignee
    )
    tasks_db[task_id] = task
    return (
        f"Created task {task_id}: '{title}' "
        f"(priority: {priority}, "
        f"assignee: {assignee or 'unassigned'})"
    )

@function_tool
def list_tasks(status: Optional[str] = None,
               assignee: Optional[str] = None) -> str:
    """List tasks, optionally filtered by status or assignee.

    Args:
        status: Filter by todo, in_progress, review, or done
        assignee: Filter by assignee name
    """
    filtered = list(tasks_db.values())
    if status:
        filtered = [t for t in filtered if t.status == status]
    if assignee:
        filtered = [t for t in filtered if t.assignee == assignee]

    if not filtered:
        return "No tasks found."

    lines = []
    for t in filtered:
        lines.append(
            f"[{t.id}] {t.title} | {t.status} | "
            f"{t.priority} | {t.assignee or 'unassigned'}"
        )
    return "\n".join(lines)

@function_tool
def update_task(task_id: str, new_status: str) -> str:
    """Update the status of a task.

    Args:
        task_id: The task ID
        new_status: New status (todo, in_progress, review, done)
    """
    valid = ["todo", "in_progress", "review", "done"]
    if new_status not in valid:
        return f"Invalid status. Must be one of: {', '.join(valid)}"

    task = tasks_db.get(task_id)
    if not task:
        return f"Task {task_id} not found."

    old_status = task.status
    task.status = new_status
    return (
        f"Updated task {task_id} from '{old_status}' "
        f"to '{new_status}'."
    )

@function_tool
def get_task(task_id: str) -> str:
    """Get full details of a task.

    Args:
        task_id: The task ID to look up
    """
    task = tasks_db.get(task_id)
    if not task:
        return f"Task {task_id} not found."

    return (
        f"Task: {task.id}\n"
        f"Title: {task.title}\n"
        f"Description: {task.description}\n"
        f"Status: {task.status}\n"
        f"Priority: {task.priority}\n"
        f"Assignee: {task.assignee or 'unassigned'}\n"
        f"Created: {task.created_at}"
    )

@function_tool
def notify(recipient: str, message: str) -> str:
    """Send a notification to a team member.

    Args:
        recipient: Person to notify
        message: Notification message
    """
    print(f"  [NOTIFY {recipient}]: {message}")
    return f"Notification sent to {recipient}."

# --- Guardrails ---

async def check_input(ctx, agent, user_input):
    blocked = ["delete all", "remove everything", "drop table"]
    text = user_input.lower() if isinstance(user_input, str) else ""
    triggered = any(term in text for term in blocked)
    return GuardrailFunctionOutput(
        output_info={"blocked": triggered},
        tripwire_triggered=triggered
    )

# --- Hooks ---

class AgentLogger(RunHooks):
    async def on_tool_start(self, context, agent, tool):
        print(f"  > Calling {tool.name}...")

    async def on_tool_end(self, context, agent, tool, result):
        preview = str(result)[:80]
        print(f"  < {tool.name} returned: {preview}")

# --- Agent ---

taskbot = Agent(
    name="TaskBot",
    model="claude-sonnet-4-6",
    instructions="""You are TaskBot, a project management assistant.

    Rules:
    - Confirm details before creating tasks
    - Default to medium priority unless told otherwise
    - Notify assignees when their tasks change status
    - Be concise and helpful
    - Never create tasks without an explicit request""",
    tools=[create_task, list_tasks, update_task, get_task, notify],
    input_guardrails=[
        InputGuardrail(guardrail_function=check_input)
    ]
)

# --- Main ---

async def main():
    print("TaskBot ready. Type 'quit' to exit.\n")
    history = []
    hooks = AgentLogger()

    while True:
        user_input = input("You: ")
        if user_input.lower() in ("quit", "exit"):
            break

        history.append({"role": "user", "content": user_input})

        try:
            result = await Runner.run(
                taskbot, history,
                hooks=hooks, max_turns=10
            )
            response = result.final_output
            history.append(
                {"role": "assistant", "content": response}
            )
            print(f"\nTaskBot: {response}\n")

        except Exception as e:
            print(f"\nError: {e}\n")

if __name__ == "__main__":
    asyncio.run(main())

Save this as taskbot.py, set your ANTHROPIC_API_KEY, and run it with python taskbot.py. You will have a working project management agent that creates tasks, tracks status, sends notifications, and blocks destructive inputs.
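One practical refinement before moving on: the tool logic above lives inside `@function_tool` wrappers, which makes it awkward to unit-test without invoking the SDK. A common structure (an assumption about project layout, not something the SDK requires) is to keep the logic in a plain function and make the decorated tool a thin shim around it. A sketch, with `create_task_logic` as a hypothetical name:

```python
# Sketch: keep tool logic in a plain, testable function. The @function_tool
# wrapper (omitted here) would just delegate to create_task_logic.
import uuid
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional

tasks_db: dict = {}

@dataclass
class Task:
    id: str
    title: str
    description: str
    status: str = "todo"
    priority: str = "medium"
    assignee: Optional[str] = None
    created_at: str = field(
        default_factory=lambda: datetime.now().isoformat()
    )

def create_task_logic(title: str, description: str,
                      priority: str = "medium",
                      assignee: Optional[str] = None) -> str:
    # Same validation and return strings as the tool, but directly callable
    valid = ["low", "medium", "high", "critical"]
    if priority not in valid:
        return f"Invalid priority. Must be one of: {', '.join(valid)}"

    task_id = str(uuid.uuid4())[:8]
    tasks_db[task_id] = Task(
        id=task_id, title=title, description=description,
        priority=priority, assignee=assignee
    )
    return (
        f"Created task {task_id}: '{title}' "
        f"(priority: {priority}, assignee: {assignee or 'unassigned'})"
    )
```

Now a plain pytest function can exercise the validation and the return strings without ever touching the model.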

Troubleshooting Common Errors

Building agents involves moving parts that fail in predictable ways. This section covers the errors you will actually encounter, with the exact messages and fixes.

"ANTHROPIC_API_KEY is not set"

This appears when the SDK cannot find your API key. The SDK checks the ANTHROPIC_API_KEY environment variable at runtime.

# Verify the key is set
echo $ANTHROPIC_API_KEY

# If empty, set it
export ANTHROPIC_API_KEY="sk-ant-..."

On macOS and Linux, environment variables set with export persist only for the current shell session. For permanent configuration, add the export line to your ~/.bashrc or ~/.zshrc, or use a .env file with python-dotenv.

A common mistake is setting ANTHROPIC_API_KEY in one terminal and running the agent in another. Each terminal session has its own environment.

"Tool function must return a string"

The SDK expects every tool function to return a str. If your tool returns None (for example, a function that performs an action but has no explicit return statement), the SDK raises this error.

# Wrong: implicit None return
@function_tool
def delete_item(item_id: str):
    """Delete an item."""
    database.delete(item_id)

# Correct: always return a string
@function_tool
def delete_item(item_id: str) -> str:
    """Delete an item."""
    database.delete(item_id)
    return f"Item {item_id} deleted."

"Maximum turns exceeded"

This means the agent hit the max_turns limit without producing a final response. It usually indicates one of two problems: either the agent is stuck in a loop (calling the same tool repeatedly with the same arguments), or the task genuinely requires more steps than the limit allows.

# Diagnose with hooks
class DebugHooks(RunHooks):
    async def on_tool_start(self, context, agent, tool):
        print(f"Turn: calling {tool.name}")

result = await Runner.run(
    agent, message,
    max_turns=10,
    hooks=DebugHooks()
)

If the logs show the same tool call repeating, the issue is usually a tool that returns ambiguous results. Claude cannot determine whether the action succeeded, so it tries again. Make tool return values explicit about success or failure.
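As a sketch of the difference, compare these two plain functions (decorators omitted; `archive_project` is a hypothetical tool, not part of TaskBot):

```python
# Ambiguous: whatever happened, the model only ever sees "ok" and may
# call the tool again to be sure.
def archive_project_vague(project_id: str, projects: dict) -> str:
    projects.pop(project_id, None)
    return "ok"

# Explicit: state the outcome and the effect, including failure,
# so the model can stop or recover.
def archive_project(project_id: str, projects: dict) -> str:
    if project_id not in projects:
        return f"Error: project {project_id} not found. No changes made."
    projects[project_id]["archived"] = True
    return (
        f"Success: project {project_id} archived. "
        f"It no longer appears in active listings."
    )
```

The explicit version gives Claude a clear success/failure signal to condition its next step on, which is usually enough to break the retry loop.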

"InputGuardrailTripwireTriggered" in Production

When a guardrail trips, the SDK raises InputGuardrailTripwireTriggered. If you do not catch this exception, your application crashes. Always wrap Runner.run() calls in try/except blocks in production code.

import logging

from agents.exceptions import (
    InputGuardrailTripwireTriggered,
    OutputGuardrailTripwireTriggered
)

logger = logging.getLogger("agent")

async def safe_run(agent, user_input) -> str:
    try:
        result = await Runner.run(agent, user_input)
        return result.final_output
    except InputGuardrailTripwireTriggered as e:
        # Log the blocked input for security review
        logger.warning(f"Input blocked: {e}")
        return "That request cannot be processed."
    except OutputGuardrailTripwireTriggered as e:
        logger.warning(f"Output blocked: {e}")
        return "The response was filtered for safety."

Rate Limit Errors (429)

The Anthropic API returns HTTP 429 when you exceed your rate limit. The SDK does not automatically retry. You need to handle this yourself.

import asyncio
from anthropic import RateLimitError

async def run_with_retry(agent, message, max_retries=3):
    for attempt in range(max_retries):
        try:
            return await Runner.run(agent, message)
        except RateLimitError:
            wait_time = 2 ** attempt  # 1s, 2s, 4s
            print(f"Rate limited. Retrying in {wait_time}s...")
            await asyncio.sleep(wait_time)
    raise Exception("Max retries exceeded")

JSON Serialisation Errors in Tool Arguments

Claude occasionally sends tool arguments that do not match the expected types. A parameter typed as int might receive a string like "42". The SDK validates types before calling your function, but edge cases exist with complex nested types.

Keep tool parameter types simple. Use str, int, float, bool, and Optional variants. Avoid deeply nested Pydantic models as tool parameters. If you need complex input, accept a JSON string and parse it inside the tool function.
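A sketch of that JSON-string pattern (`bulk_update` is a hypothetical tool; the decorator is omitted so the parsing logic stands alone):

```python
# Accept complex input as a JSON string and validate it inside the tool,
# returning explicit error strings instead of raising.
import json

def bulk_update(updates_json: str) -> str:
    """updates_json: JSON list of {"task_id": ..., "status": ...} objects."""
    try:
        updates = json.loads(updates_json)
    except json.JSONDecodeError as e:
        return f"Invalid JSON: {e}"
    if not isinstance(updates, list):
        return "Expected a JSON list of update objects."
    # Only count well-formed entries; a real tool would apply them here
    applied = [u["task_id"] for u in updates
               if isinstance(u, dict) and "task_id" in u and "status" in u]
    return f"Applied {len(applied)} update(s): {', '.join(applied) or 'none'}"
```

Returning the parse error as a string, rather than raising, lets Claude see what was wrong and resend corrected JSON.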

Production Deployment Considerations

Moving from a local script to a production deployment introduces several concerns beyond the code itself.

Environment Configuration

Separate your agent configuration from your code. Model names, API keys, rate limits, and feature flags should come from environment variables or a configuration file, not hardcoded values.

import os

MODEL = os.environ.get("AGENT_MODEL", "claude-sonnet-4-6")
MAX_TURNS = int(os.environ.get("AGENT_MAX_TURNS", "10"))
MAX_RPM = int(os.environ.get("AGENT_MAX_RPM", "20"))

agent = Agent(
    name="TaskBot",
    model=MODEL,
    instructions="...",
    tools=[...],
)

This lets you use claude-haiku-4-6 in development (faster, cheaper) and claude-sonnet-4-6 in production without changing any code.

Conversation History Management

In production, conversation histories grow unbounded. A user who interacts with your agent for an hour can accumulate enough history to exceed Claude's context window. Implement a sliding window or summarisation strategy.

MAX_HISTORY_MESSAGES = 20

async def chat_with_limit(agent, history, user_input):
    history.append({"role": "user", "content": user_input})

    # Trim in place to the last N messages; the agent's instructions
    # live on the Agent object, so trimming does not lose them
    if len(history) > MAX_HISTORY_MESSAGES:
        history[:] = history[-MAX_HISTORY_MESSAGES:]

    result = await Runner.run(agent, history)
    history.append({"role": "assistant", "content": result.final_output})
    return result.final_output

For more sophisticated approaches, summarise older messages into a single context message that captures the key facts from the conversation so far. This preserves important context while keeping token usage predictable.
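A rough sketch of the shape this takes. In a real system, `summarize` would ask the model (or a cheaper one) to write the summary; here it is a naive placeholder that truncates and concatenates, so the compaction logic itself is visible:

```python
# Compact older history into one synthetic message, keeping recent turns
# verbatim. summarize() is a stand-in for a real model-generated summary.
KEEP_RECENT = 10

def summarize(messages: list[dict]) -> str:
    parts = [f"{m['role']}: {m['content'][:60]}" for m in messages]
    return "Summary of earlier conversation:\n" + "\n".join(parts)

def compact_history(history: list[dict]) -> list[dict]:
    if len(history) <= KEEP_RECENT:
        return history
    older, recent = history[:-KEEP_RECENT], history[-KEEP_RECENT:]
    # One synthetic user message carries the compressed context
    return [{"role": "user", "content": summarize(older)}] + recent
```

Run `compact_history` before each `Runner.run` call and token usage stays bounded regardless of session length.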

Structured Logging for Debugging

Print statements work during development. In production, use structured logging that includes request IDs, user IDs, and timing information.

import logging
import json
from datetime import datetime

logger = logging.getLogger("agent")

class StructuredHooks(RunHooks):
    def __init__(self, request_id: str, user_id: str):
        self.request_id = request_id
        self.user_id = user_id
        self.start_time = None

    async def on_agent_start(self, context, agent):
        self.start_time = datetime.now()
        logger.info(json.dumps({
            "event": "agent_start",
            "request_id": self.request_id,
            "user_id": self.user_id,
            "agent": agent.name,
        }))

    async def on_tool_end(self, context, agent, tool, result):
        logger.info(json.dumps({
            "event": "tool_complete",
            "request_id": self.request_id,
            "tool": tool.name,
            "result_length": len(str(result)),
            "elapsed_ms": (datetime.now() - self.start_time).total_seconds() * 1000,
        }))

This structured output integrates with log aggregation services and makes it possible to trace a single request through the entire agent execution pipeline.
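The hooks only emit records; the `"agent"` logger still needs a handler. One illustrative way to wire it up (not SDK-mandated) is a pass-through formatter on stdout, since the hooks already produce JSON:

```python
# Route the "agent" logger's JSON lines to stdout for a log shipper to
# collect. The formatter is just the message: the hooks emit JSON already.
import logging
import sys

def configure_agent_logging(level: int = logging.INFO) -> logging.Logger:
    logger = logging.getLogger("agent")
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(logging.Formatter("%(message)s"))
    logger.addHandler(handler)
    logger.setLevel(level)
    return logger
```

Call this once at startup, before the first request creates its StructuredHooks instance.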

Graceful Degradation

When the Anthropic API is down or slow, your agent should fail gracefully rather than hanging or crashing.

import asyncio

async def handle_request(agent, message, timeout_seconds=30):
    try:
        result = await asyncio.wait_for(
            Runner.run(agent, message, max_turns=10),
            timeout=timeout_seconds
        )
        return result.final_output
    except asyncio.TimeoutError:
        return "The request took too long. Please try again."
    except Exception as e:
        logger.error(f"Agent error: {e}")
        return "Something went wrong. Please try again later."

Set timeouts at both the runner level (max_turns) and the application level (asyncio.wait_for). The max_turns limit prevents infinite tool loops. The timeout prevents the entire request from hanging if the API is slow.

The Lesson

Building an agent is not about the library. It is about the decisions you make when using it. Which tools to expose. What instructions to write.

Where to put guardrails. How to handle failures.

The Agent SDK handles the mechanical parts (the agentic loop, tool execution, conversation management) so you can focus on these decisions. Every hour previously spent debugging a hand-rolled loop is now an hour spent improving tools and instructions.

The patterns in this guide transfer to any agent you build. The tools will be different. The domain will be different. But the structure remains the same: clear tool definitions with precise docstrings, thoughtful instructions, guardrails at the boundaries, observability throughout.

For a comparison of how the Agent SDK stacks up against other frameworks, our guide on Claude Agent SDK vs LangChain covers that in detail. And to extend your agents with external services, the guide on MCP servers and extensions shows how to connect agents to the broader tool ecosystem.

Conclusion

This guide started by describing a first agent built the hard way with raw API calls and manual loops. TaskBot is the same kind of agent, but built in an afternoon instead of a week. It has guardrails, observability, error handling, and multi-turn support.

It is testable. It is maintainable. And the code is readable enough that a new team member can understand it without a walkthrough.

The Agent SDK is not doing anything you could not do yourself. It is doing what you would do yourself, but tested, maintained, and improved by the team that builds Claude. That is the value proposition. Not magic. Leverage.

Start with a single tool. Get it working. Add a second. Add guardrails. Add observability.

Each step is small. Each step makes your agent more capable and more reliable. And every agent you build after the first one is faster, because the patterns are the same.

Build something. Ship it. Watch how your users interact with it. Then improve it. That is the loop that matters more than any agentic loop in any SDK.