Debugging
When an agent does not behave as expected, understanding what happened inside the engine becomes essential. This page explains how to trace engine behavior, interpret results, and debug common issues.
Understanding Engine Traces
Parlant uses OpenTelemetry-compatible tracing. Each response generates a trace with spans for major operations.
Trace Hierarchy
```
response_trace
├── preparation
│   ├── load_context_variables
│   ├── load_glossary_terms
│   └── load_capabilities
├── iteration_1
│   ├── guideline_matching
│   │   ├── journey_prediction
│   │   ├── batch_observational
│   │   ├── batch_actionable
│   │   └── relational_resolution
│   └── tool_calling
│       ├── tool_inference
│       └── tool_execution
├── iteration_2
│   └── (same structure)
├── message_generation
│   ├── arq_generation
│   └── message_emission
└── state_persistence
```
Accessing Traces
Console output: Set the log level to DEBUG to see span information:

```bash
PARLANT_LOG_LEVEL=DEBUG parlant serve
```
OTLP export: Configure an OTLP endpoint to send traces to your observability platform (Jaeger, Honeycomb, etc.):
```
# In your configuration
otlp_endpoint = "http://localhost:4317"
```
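If you prefer to wire the exporter up in code, the standard OpenTelemetry Python SDK can be used directly. This is a minimal sketch using generic OpenTelemetry APIs rather than Parlant-specific settings; since the engine emits OpenTelemetry-compatible spans, the assumption here is that they flow through the globally configured tracer provider:

```python
# Sketch: export spans to a local OTLP collector (Jaeger, Honeycomb agent, etc.).
# Requires the opentelemetry-sdk and opentelemetry-exporter-otlp packages.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317"))
)
trace.set_tracer_provider(provider)  # Assumption: the engine picks up the global provider
```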
Session events: Each response stores a trace_id in the agent state, linking session events to their execution trace.
Key Observability Points
Different phases expose different information. Here's what to look for:
Guideline Matching Phase
| What to Check | Where to Find It | What It Tells You |
|---|---|---|
| Which guidelines matched | matched_guidelines in iteration state | Guidelines that will influence the response |
| Match scores | score field on each match (0-10) | Confidence in the match |
| Match rationale | rationale field on each match | Why the LLM decided it matched |
| Which guidelines were evaluated | Trace spans per batch | What was actually sent to LLM |
| Which guidelines were pruned | Journey prediction span | What was excluded before LLM evaluation |
Journey State Phase
| What to Check | Where to Find It | What It Tells You |
|---|---|---|
| Active journeys | journey_paths in response state | Which journeys are in progress |
| Current position | Path array for each journey | The current node position |
| Transitions | Changes in path between iterations | How the journey progressed |
Tool Calling Phase
| What to Check | Where to Find It | What It Tells You |
|---|---|---|
| Tools considered | tool_enabled_guideline_matches | Which tools were candidates |
| Inference decisions | Tool inference span | NEEDS_TO_RUN, DATA_ALREADY_IN_CONTEXT, or CANNOT_RUN |
| Execution results | tool_events in response state | What tools returned |
| Blocked tools | tool_insights | What couldn't run and why |
Message Generation Phase
| What to Check | Where to Find It | What It Tells You |
|---|---|---|
| Guidelines used | Generation context | What instructions the LLM saw |
| ARQ artifacts | Structured output before message | How LLM reasoned about guidelines |
| Applied guidelines | applied_guideline_ids in agent state | Which guidelines affected this response |
Debugging Common Issues
Guideline Not Matching
Symptom: A guideline that was expected to apply did not match.
Diagnostic steps:
1. Check if it was evaluated:
   - Look in the trace: was the guideline in any batch? (A hook sketch for automating this check follows the common causes below.)
   - If no, it was pruned (journey prediction excluded it).
   - If yes, continue to step 2.
2. Check criticality:
   - Low-criticality guidelines may be aggressively pruned.
   - Consider raising the criticality if the guideline is important.
3. Check the condition wording:
   - Is the condition clearly met by the conversation?
   - Try rewording it to be more explicit.
4. Check the match rationale:
   - If the guideline was evaluated, what did the LLM say?
   - The rationale explains why it decided "no match".
5. Check the journey scope:
   - Is the guideline scoped to a journey?
   - Is that journey currently active?
Common causes:
- The condition is overly specific (e.g., "customer says exactly 'I want to return this'")
- The condition uses domain jargon that the LLM cannot interpret correctly
- The journey is not active, causing journey-scoped guidelines to be excluded
- The guideline was pruned by journey prediction
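To automate step 1 above, a hook can flag when a guideline you expected to apply is missing from the matched set. This is a minimal sketch that reuses the state fields shown under Hook-Based Debugging below; the guideline ID is a placeholder you would replace with the real one:

```python
import logging

logger = logging.getLogger(__name__)

EXPECTED_GUIDELINE_ID = "gl_return_policy"  # hypothetical: the guideline you expect to match

class MissingMatchHook:
    async def on_preparation_iteration_end(self, context, iteration: int) -> bool:
        # IDs of all guidelines that matched in this iteration
        matched_ids = {m.guideline.id for m in context.state.ordinary_guideline_matches}
        if EXPECTED_GUIDELINE_ID not in matched_ids:
            logger.debug(
                f"Iteration {iteration}: expected guideline "
                f"{EXPECTED_GUIDELINE_ID} did not match"
            )
        return True  # Continue processing
```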
Guideline Matching When It Shouldn't
Symptom: A guideline matched when the condition is not actually met.
Diagnostic steps:
1. Read the match rationale:
   - What reasoning led to the match?
   - The rationale often reveals interpretation issues.
2. Check for ambiguous conditions:
   - "Customer seems frustrated": what does "seems" mean?
   - "Customer mentions returns": mentioning vs. wanting.
3. Check for overly broad conditions (a sketch for tightening one follows the common causes below):
   - "Customer asks a question": too broad.
   - "Customer asks about products": still broad.
   - "Customer asks about product pricing": better.
Common causes:
- The condition is too vague
- The LLM is being overly generous in its interpretation
- Similar conditions on different guidelines are causing confusion
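Tightening the condition is usually a one-line change where the guideline is defined. The sketch below assumes an agent.create_guideline(condition=..., action=...) call; if your SDK exposes guideline creation differently, treat it as illustrative only:

```python
# Assumed API: agent.create_guideline(condition=..., action=...).
# Adjust to your actual guideline-creation call.

# Too broad: matches nearly any product-related question.
await agent.create_guideline(
    condition="Customer asks about products",
    action="Offer to walk them through the catalog",
)

# Narrower: only matches when pricing is actually the topic.
await agent.create_guideline(
    condition="Customer asks about the price of a specific product",
    action="Quote the listed price and mention any active promotions",
)
```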
Tool Not Called
Symptom: A tool that should have been called was not executed.
Diagnostic steps:
1. Check if the guideline matched:
   - The tool's associated guideline must match first.
   - If the guideline didn't match, the tool won't be considered.
2. Check the inference decision:
   - Look in the tool_inference span.
   - Was it DATA_ALREADY_IN_CONTEXT? The data might already be known.
   - Was it CANNOT_RUN? Required parameters were missing.
3. Check tool_insights (a hook sketch for this follows the common causes below):
   - If the decision was CANNOT_RUN, the insights explain what was missing.
   - The customer may need to provide more information.
4. Check the tool-guideline association:
   - Is the tool actually associated with the guideline?
   - The association must exist for the tool to be considered.
Common causes:
- The guideline did not match (so the tool was never considered)
- The tool was evaluated as DATA_ALREADY_IN_CONTEXT (the information is already available)
- Required parameters are not available from the conversation
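Steps 2 and 3 can also be checked from a hook: the sketch below logs which tools produced events in an iteration and dumps the insights when an expected tool is missing. Field names follow the hook example under Hook-Based Debugging below; the tool name is a placeholder:

```python
import logging

logger = logging.getLogger(__name__)

EXPECTED_TOOL = "check_order_status"  # hypothetical: the tool you expected to run

class ToolDiagnosticHook:
    async def on_preparation_iteration_end(self, context, iteration: int) -> bool:
        # Names of tools that actually produced events in this iteration
        executed = {event.tool_name for event in context.state.tool_events}
        if EXPECTED_TOOL not in executed:
            # tool_insights explains blocked tools (e.g., CANNOT_RUN due to missing parameters)
            logger.debug(
                f"Iteration {iteration}: {EXPECTED_TOOL} did not run; "
                f"executed={executed or 'none'}, "
                f"insights={context.state.tool_insights}"
            )
        return True  # Continue processing
```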
Response Not Following Guideline
Symptom: A guideline matched, but the response does not follow it.
Diagnostic steps:
1. Verify the guideline matched:
   - Check applied_guideline_ids in the agent state.
   - Was the guideline actually in the matched set?
2. Check criticality:
   - HIGH criticality gets the strongest enforcement.
   - MEDIUM gets standard enforcement.
   - LOW might not be strongly enforced.
3. Check the ARQ artifacts:
   - Look at the structured output from generation.
   - Did the LLM acknowledge the guideline?
   - What did it say about how to address it?
4. Check for conflicting guidelines:
   - Another guideline might contradict this one.
   - Check for suppression relationships.
5. Check the composition mode:
   - In STRICT mode, only canned responses work.
   - If no canned response matches, behavior is undefined.
Common causes:
- The guideline criticality is set too low
- A conflicting guideline is taking precedence
- The action is vague (e.g., "be helpful" rather than a specific action; see the sketch after this list)
- There is a composition mode mismatch
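The last cause is worth illustrating: a vague action gives message generation little to enforce, while a concrete one is much easier to follow. The agent.create_guideline call shown here is an assumption about the SDK; adapt it to your actual guideline-creation API:

```python
# Assumed API: agent.create_guideline(condition=..., action=...).

# Vague action: hard for the engine to enforce or verify.
await agent.create_guideline(
    condition="Customer reports a billing error",
    action="Be helpful",
)

# Specific action: gives message generation something concrete to follow.
await agent.create_guideline(
    condition="Customer reports a billing error",
    action="Apologize, confirm the disputed charge amount, and offer to open a billing review",
)
```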
Journey Stuck at Wrong Node
Symptom: The journey is not progressing to the expected node.
Diagnostic steps:
1. Check the current path (a hook sketch for logging it follows the common causes below):
   - journey_paths shows where you are.
   - Is it at the expected node?
2. Check the transition conditions:
   - Edges may have conditions.
   - Is the condition for the next edge met?
3. Check the node selection rationale:
   - The journey node selection batch explains its decisions.
   - Why was the current node selected over the next one?
4. Check for backtracking signals:
   - The customer might have said something that triggered backtracking to an earlier node.
Common causes:
- The transition condition is not met
- The customer response triggered backtracking
- Ambiguous node actions are causing incorrect selection
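To see the journey position on every response without digging through traces, a hook can log it each iteration. This is a minimal sketch in the same style as the hooks above; journey_paths is assumed to live on the same response state object as ordinary_guideline_matches and tool_events, so verify the attribute name against your version:

```python
import logging

logger = logging.getLogger(__name__)

class JourneyDebugHook:
    async def on_preparation_iteration_end(self, context, iteration: int) -> bool:
        # Assumption: journey_paths is exposed on the response state;
        # getattr keeps the hook harmless if the attribute is named differently.
        journey_paths = getattr(context.state, "journey_paths", None)
        logger.debug(f"Iteration {iteration}: journey_paths={journey_paths}")
        return True  # Continue processing
```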
Hook-Based Debugging
Hooks let you inspect and log state at key points:
```python
import logging

# EngineContext is provided by the engine; see Engine Extensions for hook registration.
logger = logging.getLogger(__name__)

class DebugHook:
    async def on_preparation_iteration_end(
        self,
        context: EngineContext,
        iteration: int,
    ) -> bool:
        # Log matched guidelines after each iteration
        for match in context.state.ordinary_guideline_matches:
            logger.debug(
                f"Matched: {match.guideline.id} "
                f"score={match.score} "
                f"rationale={match.rationale}"
            )

        # Log tool events
        for event in context.state.tool_events:
            logger.debug(
                f"Tool: {event.tool_name} "
                f"result={event.result}"
            )

        return True  # Continue processing

    async def on_generating_messages(
        self,
        context: EngineContext,
    ) -> bool:
        # Log what will be sent to message generation
        logger.debug(
            f"Guidelines for generation: "
            f"{len(context.state.ordinary_guideline_matches)}"
        )
        logger.debug(f"Tool insights: {context.state.tool_insights}")

        return True  # Continue processing
```
Useful Hook Points for Debugging
| Hook | What You Can Inspect |
|---|---|
| on_preparation_iteration_start | Context before matching begins |
| on_preparation_iteration_end | All matches and tool results for this iteration |
| on_generating_messages | Final context being sent to generation |
| on_messages_emitted | What was actually generated |
| on_guideline_match | Individual guideline match details |
See Engine Extensions for hook implementation.
Diagnostic Checklist
When something goes wrong, work through this checklist:
□ What was the expected behavior?
□ What actually happened?
□ Check the trace: Did preparation complete normally?
□ Check guideline matching:
  □ Were the expected guidelines evaluated?
  □ Did they match? With what rationale?
  □ Were unexpected guidelines matched?
□ Check tool calling:
  □ Were the expected tools considered?
  □ What were the inference decisions?
  □ Any CANNOT_RUN insights?
□ Check journey state:
  □ Is the right journey active?
  □ Is it at the right node?
  □ Any unexpected transitions?
□ Check message generation:
  □ What guidelines were in context?
  □ What do the ARQ artifacts show?
  □ Was the criticality appropriate?
□ Check relationships:
  □ Is any suppression preventing a match?
  □ Any priority overrides?
Performance Debugging
If responses are too slow:
Identify Bottlenecks
Trace timing shows time per span:
- guideline_matching: 2.3s ← Slow matching
- tool_inference: 0.4s
- tool_execution: 1.1s ← Slow tool
- message_generation: 0.8s
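If spans are not being exported anywhere yet, a rough per-iteration timer can be improvised with hooks. This is a sketch only, and it assumes on_preparation_iteration_start receives the same (context, iteration) arguments as on_preparation_iteration_end:

```python
import logging
import time

logger = logging.getLogger(__name__)

class TimingHook:
    def __init__(self) -> None:
        self._started: dict[int, float] = {}

    async def on_preparation_iteration_start(self, context, iteration: int) -> bool:
        # Assumed signature: mirrors on_preparation_iteration_end
        self._started[iteration] = time.perf_counter()
        return True

    async def on_preparation_iteration_end(self, context, iteration: int) -> bool:
        started = self._started.pop(iteration, None)
        if started is not None:
            logger.debug(f"iteration_{iteration} took {time.perf_counter() - started:.2f}s")
        return True  # Continue processing
```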
Common Performance Issues
| Issue | Symptom | Solution |
|---|---|---|
| Too many guidelines | Extended matching time | Implement better journey scoping |
| Large batches | High latency per batch | Reduce the batch size |
| Slow tools | Extended tool execution time | Optimize the tool implementation or add caching |
| Many iterations | Multiple matching rounds | Review guideline dependencies for optimization |
Optimization Levers
- Journey scoping: Reduce the number of guidelines evaluated by scoping them to specific journeys.
- Batch size: Smaller batches enable more parallelism but require more LLM calls.
- Tool caching: Cache tool results that do not change frequently (see the sketch after this list).
- max_iterations: Lower values produce faster responses but may miss some dynamically triggered guidelines.
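The tool-caching lever can be as simple as a small TTL wrapper around the underlying function. The sketch below is ordinary Python, not a Parlant API, and get_product_catalog is a made-up tool:

```python
import time
from functools import wraps

def ttl_cache(seconds: float):
    """Cache an async function's result per positional-argument tuple for `seconds`."""
    def decorator(fn):
        cache: dict[tuple, tuple[float, object]] = {}

        @wraps(fn)
        async def wrapper(*args):
            now = time.monotonic()
            hit = cache.get(args)
            if hit is not None and now - hit[0] < seconds:
                return hit[1]  # Still fresh: reuse the cached result
            result = await fn(*args)
            cache[args] = (now, result)
            return result

        return wrapper
    return decorator

@ttl_cache(seconds=300)
async def get_product_catalog(category: str) -> list[dict]:
    # Hypothetical slow lookup whose results change infrequently
    ...
```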
What's Next
- Engine Extensions: Implementing custom hooks
- Response Lifecycle: Understanding the full flow
- Guideline Matching: Deep dive into matching behavior