"The agent stops halfway." "It calls the same tool five times and goes nowhere." "It takes the completely wrong action." These are the complaints you hear when an AI agent misbehaves and someone has to debug it.
What makes agents hard to debug is that the model's decision process isn't fully transparent. But the failure modes do cluster into recognizable patterns. Here's how to categorize the problem first, then work toward the root cause.
Step 0: Identify Which Category of Failure You Have
"The agent doesn't work" covers wildly different situations. Narrowing it to one of four categories focuses the diagnosis:
Category 1: Agent won't start — Tool configuration, authentication, prompt format
Category 2: Agent stops mid-task — Tool call failures, input format mismatch, timeout
Category 3: Agent loops — Unclear completion criteria, tool responses not informing next action
Category 4: Agent takes wrong actions — Ambiguous system prompt, vague tool descriptions, counterproductive examples
Pick the one that matches your situation. The fix is in a completely different place for each.
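One way to make this triage step concrete is to encode the four categories as a small decision function. The sketch below is illustrative only: `RunSummary`, `triage`, and the `repeat_threshold` default are hypothetical names and values, not part of any SDK, and the order of the checks is a rough heuristic you would tune against your own logs.

```python
from collections import Counter
from dataclasses import dataclass, field
from enum import Enum


class FailureCategory(Enum):
    WONT_START = 1
    STOPS_MID_TASK = 2
    LOOPS = 3
    WRONG_ACTIONS = 4


@dataclass
class RunSummary:
    """Hypothetical summary of one failed agent run, filled in from your own logs."""
    model_responded: bool  # did the very first model call return anything?
    tool_calls: list[tuple[str, str]] = field(default_factory=list)  # (tool name, serialized args)
    finished: bool = False  # did the agent produce a final answer?


def triage(run: RunSummary, repeat_threshold: int = 3) -> FailureCategory:
    """Maps a failed run to one of the four categories (heuristic, not exhaustive)."""
    if not run.model_responded:
        return FailureCategory.WONT_START
    # The same (tool, args) pair repeated many times suggests a loop.
    most_common = Counter(run.tool_calls).most_common(1)
    if most_common and most_common[0][1] >= repeat_threshold:
        return FailureCategory.LOOPS
    if not run.finished:
        return FailureCategory.STOPS_MID_TASK
    # The run completed, so the remaining explanation is a wrong choice of action.
    return FailureCategory.WRONG_ACTIONS
```

The function assumes the run has already been judged a failure; it only decides where to look first.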
Category 1: Agent Won't Start
When the agent fails immediately, start with the simplest possible test and add complexity one step at a time.
import os
from datetime import datetime, timezone

from google import genai
from google.genai import types

client = genai.Client(api_key=os.getenv("GOOGLE_API_KEY"))

# Step 1: Basic connectivity, no tools
response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents="Hello. This is a connectivity test.",
)
print("Connectivity:", response.text[:50])

# Step 2: Add exactly one tool
def get_current_time() -> str:
    """Returns the current time in UTC."""
    # Use a timezone-aware UTC timestamp so the "UTC" label is accurate
    return datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S UTC")

response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents="What time is it?",
    config=types.GenerateContentConfig(
        tools=[get_current_time],
        system_instruction="Use the available tools to answer user requests.",
    ),
)
print("Single tool:", response.candidates[0].content)

If step 1 fails, it's a connectivity or authentication issue. If step 2 fails after step 1 succeeds, it's tool configuration. Adding complexity incrementally prevents you from guessing at the cause.
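Some Category 1 problems can be caught before any request is sent. The following is a minimal sketch of a preflight check, assuming the SDK builds each tool's schema from the Python function's signature and docstring, as the example above relies on. `preflight` and its `key_var` parameter are hypothetical names introduced here for illustration.

```python
import inspect
import os
from typing import Callable


def preflight(tools: list[Callable], key_var: str = "GOOGLE_API_KEY") -> list[str]:
    """Returns a list of human-readable problems; an empty list means nothing obvious is wrong."""
    problems = []
    if not os.getenv(key_var):
        problems.append(f"environment variable {key_var} is not set")
    for tool in tools:
        name = getattr(tool, "__name__", repr(tool))
        # The docstring is the description the model sees for this tool.
        if not inspect.getdoc(tool):
            problems.append(f"{name}: missing docstring")
        signature = inspect.signature(tool)
        # Untyped parameters leave the argument schema underspecified.
        for param in signature.parameters.values():
            if param.annotation is inspect.Parameter.empty:
                problems.append(f"{name}: parameter '{param.name}' has no type hint")
        if signature.return_annotation is inspect.Signature.empty:
            problems.append(f"{name}: missing return type hint")
    return problems
```

Running `preflight([get_current_time])` before the first model call turns a vague startup failure into a specific list of things to fix.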
Category 2: Tool Call Failures
When the agent is calling tools but not getting useful results, instrument the tools themselves.
import json
import logging
from datetime import datetime
from functools import wraps

logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger(__name__)

def debug_tool(func):
    """Decorator that logs tool inputs, outputs, and errors."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        call_id = datetime.now().strftime("%H:%M:%S.%f")
        logger.debug(f"[{call_id}] CALL: {func.__name__}, args={args}, kwargs={kwargs}")
        try:
            result = func(*args, **kwargs)
            logger.debug(f"[{call_id}] OK: {str(result)[:200]}")
            return result
        except Exception as e:
            logger.error(f"[{call_id}] ERROR: {type(e).__name__}: {e}")
            # Return error as string so the agent can reason about it
            return f"Tool error: {type(e).__name__}: {str(e)}"
    return wrapper

@debug_tool
def search_database(query: str, limit: int = 10) -> str:
    """Searches the database and returns matching records.

    Args:
        query: Search query string
        limit: Maximum number of results to return (default: 10)

    Returns:
        JSON string containing matching records, or empty list if none found.
    """
    # perform_actual_search is a placeholder for your own search backend
    results = perform_actual_search(query, limit)
    return json.dumps(results, ensure_ascii=False)

The key design decision in the error handler: return the error as a string rather than raising an exception. When a tool raises, the agent often stops entirely. When a tool returns an error message, the agent can read it and decide what to do next: retry with different parameters, try a different tool, or tell the user what happened.
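Timeouts, the third Category 2 cause listed earlier, can be handled with the same return-a-string convention. The sketch below is one possible approach using the standard library's `concurrent.futures`; `with_timeout` and `slow_lookup` are hypothetical names for illustration. Note the limitation: the worker thread is not killed, it simply stops blocking the agent, so this suits I/O-bound tools better than runaway computations.

```python
import concurrent.futures
import time
from functools import wraps


def with_timeout(seconds: float):
    """Decorator that returns an error string if the tool takes longer than `seconds`."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
            future = executor.submit(func, *args, **kwargs)
            try:
                return future.result(timeout=seconds)
            except concurrent.futures.TimeoutError:
                # Same convention as debug_tool: report the failure as text
                return f"Tool error: TimeoutError: {func.__name__} exceeded {seconds}s"
            finally:
                # Do not wait for a stuck worker; let it finish in the background
                executor.shutdown(wait=False)
        return wrapper
    return decorator


@with_timeout(0.1)
def slow_lookup(query: str) -> str:
    """Hypothetical tool that takes too long to answer."""
    time.sleep(0.5)
    return f"results for {query}"
```

Calling `slow_lookup("status")` returns the timeout message instead of blocking, which gives the agent something to act on.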
Category 3: Loop Detection
Loops are the most disruptive agent failure mode. The agent consumes tokens and time while producing nothing useful.
class AgentLoopDetector:
    """Detects when an agent is stuck in a repetitive loop."""

    def __init__(self, max_repetitions: int = 3, max_steps: int = 20):
        self.call_history: list[tuple[str, str]] = []
        self.step_count = 0
        self.max_repetitions = max_repetitions
        self.max_steps = max_steps

    def check(self, tool_name: str, args: dict) -> bool:
        """Records a tool call and returns True if a loop is detected."""
        self.step_count += 1
        if self.step_count > self.max_steps:
            print(f"⚠️ Step limit ({self.max_steps}) reached")
            return True

        # Sort the args so the same call always yields the same signature
        signature = (tool_name, str(sorted(args.items()) if args else []))
        recent = self.call_history[-self.max_repetitions * 2:]
        count = recent.count(signature)
        self.call_history.append(signature)

        if count >= self.max_repetitions - 1:
            print(f"⚠️ Loop detected: {tool_name} called {count + 1}x with same args")
            print(f"   Recent calls: {recent[-6:]}")
            return True
        return False

detector = AgentLoopDetector(max_repetitions=3, max_steps=15)

def guarded_tool_call(tool_name: str, args: dict):
    if detector.check(tool_name, args):
        raise RuntimeError(f"Agent loop detected at step {detector.step_count}. Halting.")
    # execute_tool is a placeholder for your own tool dispatcher
    return execute_tool(tool_name, args)

Loops usually have one root cause: the system prompt doesn't define what "done" looks like. Adding explicit completion criteria fixes most of them:
# Without completion criteria — loops happen
bad_system = """
Answer the user's question. Use tools when needed.
"""
# With completion criteria — loops become rare
good_system = """
You are a research agent.
## Task
Use available tools to collect information and generate an answer.
## When to stop using tools
Stop and provide your final answer when any of these conditions are met:
1. You have enough information to fully answer the question
2. You have run the search tool 3 times without finding relevant results
3. You have called the same tool with the same arguments twice
## Important rules
- Never run the same search query twice
- If information is incomplete, answer with what you have and state what's missing
- Never end a response with "I couldn't find information"; always suggest a next step
"""

Category 4: Wrong Actions
When the agent consistently takes the wrong action, the tool description is usually the culprit.
Tool descriptions are the agent's primary guide for deciding which tool to use and how. Vague descriptions produce unreliable behavior.
# Vague: the agent will misuse this
def search_knowledge_base(query: str) -> str:
    """Search the knowledge base."""
    ...

# Clear: the agent knows when and how to use this
def search_knowledge_base(query: str, search_type: str = "semantic") -> str:
    """Searches internal documentation and returns relevant articles.

    USE THIS TOOL when:
    - The user asks about product specifications, manuals, or FAQs
    - You need past incident reports or troubleshooting history

    DO NOT USE THIS TOOL when:
    - Real-time external data is needed (use web_search instead)
    - You need database records (use query_database instead)

    Args:
        query: Natural language question or keyword string
        search_type: "semantic" for meaning-based search, "keyword" for exact match

    Returns:
        JSON with a list of matching documents. Each entry contains:
        title, content, relevance_score, and last_updated.
        Returns empty list if nothing is found.
    """
    ...

The "USE THIS TOOL / DO NOT USE THIS TOOL" pattern is directly useful to the model: it reads the description as part of tool selection, so explicit guidance translates into more reliable behavior.
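Description quality can also be checked mechanically before a tool is registered. Here is a rough sketch of a docstring linter built on the standard `inspect` module; `lint_tool_description`, the `min_length` default, and the exact phrases it looks for are assumptions made for illustration, mirroring the pattern above rather than any requirement of the SDK.

```python
import inspect
from typing import Callable


def lint_tool_description(tool: Callable, min_length: int = 80) -> list[str]:
    """Returns warnings about a tool's docstring; an empty list means it passed."""
    doc = inspect.getdoc(tool) or ""
    lowered = doc.lower()
    warnings = []
    if len(doc) < min_length:
        warnings.append(f"description is only {len(doc)} characters long")
    # Strip the negative phrase first, since it contains the positive one.
    positive_only = lowered.replace("do not use this tool when", "")
    if "use this tool when" not in positive_only:
        warnings.append("no 'USE THIS TOOL when' guidance")
    if "do not use this tool when" not in lowered:
        warnings.append("no 'DO NOT USE THIS TOOL when' guidance")
    # Every parameter should be documented as "name: ..." in the Args section.
    for name in inspect.signature(tool).parameters:
        if f"{name}:" not in doc:
            warnings.append(f"parameter '{name}' is not described")
    if "returns" not in lowered:
        warnings.append("return value is not described")
    return warnings
```

Run against the vague version of `search_knowledge_base`, this flags every check; the clear version passes. A linter like this only checks form, so it complements reading the descriptions rather than replacing it.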
Four categories, four different fixes. The biggest time-saver is identifying the category before you start debugging; otherwise much of the effort goes into the wrong place.
Adding loop detection and tool call logging from the start costs maybe an hour of setup. Not having them when a production agent goes sideways costs considerably more.