Debugging

When an agent does not behave as expected, understanding what happened inside the engine becomes essential. This page explains how to trace engine behavior, interpret results, and debug common issues.

Understanding Engine Traces

Parlant uses OpenTelemetry-compatible tracing. Each response generates a trace with spans for major operations.

Trace Hierarchy

response_trace
├── preparation
│   ├── load_context_variables
│   ├── load_glossary_terms
│   └── load_capabilities
├── iteration_1
│   ├── guideline_matching
│   │   ├── journey_prediction
│   │   ├── batch_observational
│   │   ├── batch_actionable
│   │   └── relational_resolution
│   └── tool_calling
│       ├── tool_inference
│       └── tool_execution
├── iteration_2
│   └── (same structure)
├── message_generation
│   ├── arq_generation
│   └── message_emission
└── state_persistence

Accessing Traces

Console output: Set log level to DEBUG to see span information:

PARLANT_LOG_LEVEL=DEBUG parlant serve

OTLP export: Configure an OTLP endpoint to send traces to your observability platform (Jaeger, Honeycomb, etc.):

# In your configuration
otlp_endpoint = "http://localhost:4317"
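
Because the tracing is OpenTelemetry-compatible, you can also wire the exporter up yourself (for example, in a custom entry point) using the standard OpenTelemetry SDK. The following is a generic sketch of that SDK, not a Parlant-specific API; whether Parlant's spans flow through a globally installed tracer provider depends on your deployment, so verify against your setup.

# Generic OpenTelemetry SDK setup (sketch; adapt to how your application starts Parlant)
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317"))
)
trace.set_tracer_provider(provider)  # spans emitted via the global provider are exported over OTLP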

Session events: Each response stores a trace_id in the agent state, linking session events to their execution trace.

Key Observability Points

Different phases expose different information. Here's what to look for:

Guideline Matching Phase

| What to Check | Where to Find It | What It Tells You |
|---|---|---|
| Which guidelines matched | matched_guidelines in iteration state | Guidelines that will influence the response |
| Match scores | score field on each match (0-10) | Confidence in the match |
| Match rationale | rationale field on each match | Why the LLM decided it matched |
| Which guidelines were evaluated | Trace spans per batch | What was actually sent to the LLM |
| Which guidelines were pruned | Journey prediction span | What was excluded before LLM evaluation |

Journey State Phase

| What to Check | Where to Find It | What It Tells You |
|---|---|---|
| Active journeys | journey_paths in response state | Which journeys are in progress |
| Current position | Path array for each journey | The current node position |
| Transitions | Changes in path between iterations | How the journey progressed |

Tool Calling Phase

| What to Check | Where to Find It | What It Tells You |
|---|---|---|
| Tools considered | tool_enabled_guideline_matches | Which tools were candidates |
| Inference decisions | Tool inference span | NEEDS_TO_RUN, DATA_ALREADY_IN_CONTEXT, or CANNOT_RUN |
| Execution results | tool_events in response state | What the tools returned |
| Blocked tools | tool_insights | What couldn't run and why |

Message Generation Phase

| What to Check | Where to Find It | What It Tells You |
|---|---|---|
| Guidelines used | Generation context | What instructions the LLM saw |
| ARQ artifacts | Structured output before the message | How the LLM reasoned about guidelines |
| Applied guidelines | applied_guideline_ids in agent state | Which guidelines affected this response |

Debugging Common Issues

Guideline Not Matching

Symptom: A guideline that was expected to apply did not match.

Diagnostic steps:

  1. Check if it was evaluated:
     • Look in the trace: was the guideline in any batch?
     • If NO → it was pruned (journey prediction excluded it)
     • If YES → continue to step 2
  2. Check criticality:
     • Low-criticality guidelines may be aggressively pruned
     • Consider raising the criticality if the guideline is important
  3. Check condition wording:
     • Is the condition clearly met by the conversation?
     • Try rewording it to be more explicit
  4. Check the match rationale:
     • If the guideline was evaluated, what did the LLM say?
     • The rationale explains why it decided "no match" (a hook sketch for this appears after the common causes below)
  5. Check journey scope:
     • Is the guideline scoped to a journey?
     • Is that journey currently active?

Common causes:

  • The condition is overly specific (e.g., "customer says exactly 'I want to return this'")
  • The condition uses domain jargon that the LLM cannot interpret correctly
  • The journey is not active, causing journey-scoped guidelines to be excluded
  • The guideline was pruned by journey prediction
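
To make steps 1 and 4 above concrete, a focused variant of the debug hook shown in Hook-Based Debugging below can watch a single guideline after each preparation iteration. This is a minimal sketch reusing the attributes from that example (ordinary_guideline_matches, guideline.id, score, rationale); WATCHED_GUIDELINE_ID is a placeholder for the guideline you are diagnosing.

import logging

logger = logging.getLogger(__name__)

WATCHED_GUIDELINE_ID = "guideline-id-under-investigation"  # placeholder


class GuidelineMatchDebugHook:
    async def on_preparation_iteration_end(self, context, iteration: int) -> bool:
        matches = context.state.ordinary_guideline_matches
        watched = [m for m in matches if m.guideline.id == WATCHED_GUIDELINE_ID]
        if watched:
            for m in watched:
                logger.debug(f"[iter {iteration}] matched, score={m.score}, rationale={m.rationale}")
        else:
            # Not in the matched set: check the trace to see whether it was
            # evaluated in a batch (no match) or pruned before evaluation.
            logger.debug(f"[iter {iteration}] {WATCHED_GUIDELINE_ID} did not match")
        return True  # continue processing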

Guideline Matching When It Shouldn't

Symptom: A guideline matched when the condition is not actually met.

Diagnostic steps:

  1. Read the match rationale:
     • What reasoning led to the match?
     • It often reveals interpretation issues
  2. Check for ambiguous conditions:
     • "Customer seems frustrated" - what does "seems" mean?
     • "Customer mentions returns" - mentioning vs. wanting
  3. Check for overly broad conditions:
     • "Customer asks a question" - too broad
     • "Customer asks about products" - still broad
     • "Customer asks about product pricing" - better

Common causes:

  • The condition is too vague
  • The LLM is being overly generous in its interpretation
  • Similar conditions on different guidelines are causing confusion

Tool Not Called

Symptom: A tool that should have been called was not executed.

Diagnostic steps:

  1. Check if the guideline matched:
     • The tool's associated guideline must match first
     • If the guideline didn't match, the tool won't be considered
  2. Check the inference decision:
     • Look in the tool_inference span
     • Was it DATA_ALREADY_IN_CONTEXT? The data might already be known
     • Was it CANNOT_RUN? Parameters were missing
  3. Check tool_insights:
     • If CANNOT_RUN, the insights explain what was missing
     • The customer may need to provide more information
  4. Check the tool-guideline association:
     • Is the tool actually associated with the guideline?
     • The association must exist for the tool to be considered

Common causes:

  • The guideline did not match (so the tool was never considered)
  • The tool was evaluated as DATA_ALREADY_IN_CONTEXT (the information is already available)
  • Required parameters are not available from the conversation

Response Not Following Guideline

Symptom: A guideline matched, but the response does not follow it.

Diagnostic steps:

  1. Verify the guideline matched:
     • Check applied_guideline_ids in the agent state
     • Was the guideline actually in the matched set?
  2. Check criticality:
     • HIGH criticality gets the strongest enforcement
     • MEDIUM gets standard enforcement
     • LOW might not be strongly enforced
  3. Check the ARQ artifacts:
     • Look at the structured output from generation
     • Did the LLM acknowledge the guideline?
     • What did it say about how to address it?
  4. Check for conflicting guidelines:
     • Another guideline might contradict this one
     • Check for suppression relationships
  5. Check the composition mode:
     • In STRICT mode, only canned responses are used
     • If no canned response matches, behavior is undefined

Common causes:

  • The guideline criticality is set too low
  • A conflicting guideline is taking precedence
  • The action is vague (e.g., "be helpful" rather than a specific action)
  • There is a composition mode mismatch

Journey Stuck at Wrong Node

Symptom: The journey is not progressing to the expected node.

Diagnostic steps:

  1. Check the current path:
     • journey_paths shows the current position (see the sketch after the common causes below)
     • Is it at the expected node?
  2. Check transition conditions:
     • Edges may have conditions
     • Is the condition for the next edge met?
  3. Check the node selection rationale:
     • The journey node selection batch explains its decisions
     • Why was the current node selected over the next one?
  4. Check for backtracking signals:
     • The customer might have said something that triggered backtracking to an earlier node

Common causes:

  • The transition condition is not met
  • The customer response triggered backtracking
  • Ambiguous node actions are causing incorrect selection
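
As referenced in step 1 above, a small hook can log the journey paths after each iteration so you can watch transitions and backtracking as they happen. This sketch assumes journey_paths is readable from the iteration state in the same way as the guideline matches in the Hook-Based Debugging example below; adjust the attribute access to your version.

import logging

logger = logging.getLogger(__name__)


class JourneyDebugHook:
    async def on_preparation_iteration_end(self, context, iteration: int) -> bool:
        # journey_paths is the field listed in the Journey State table above;
        # log the whole structure and diff it between iterations to see transitions.
        logger.debug(f"[iter {iteration}] journey_paths={context.state.journey_paths}")
        return True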

Hook-Based Debugging

Hooks let you inspect and log state at key points:

import logging

logger = logging.getLogger(__name__)


class DebugHook:
    async def on_preparation_iteration_end(
        self,
        context: EngineContext,  # EngineContext is provided by Parlant's engine hook API
        iteration: int,
    ) -> bool:
        # Log matched guidelines after each iteration
        for match in context.state.ordinary_guideline_matches:
            logger.debug(
                f"Matched: {match.guideline.id} "
                f"score={match.score} "
                f"rationale={match.rationale}"
            )

        # Log tool events
        for event in context.state.tool_events:
            logger.debug(
                f"Tool: {event.tool_name} "
                f"result={event.result}"
            )

        return True  # Continue processing

    async def on_generating_messages(
        self,
        context: EngineContext,
    ) -> bool:
        # Log what will be sent to message generation
        logger.debug(
            f"Guidelines for generation: "
            f"{len(context.state.ordinary_guideline_matches)}"
        )
        logger.debug(f"Tool insights: {context.state.tool_insights}")
        return True

Useful Hook Points for Debugging

| Hook | What You Can Inspect |
|---|---|
| on_preparation_iteration_start | Context before matching begins |
| on_preparation_iteration_end | All matches and tool results for this iteration |
| on_generating_messages | Final context being sent to generation |
| on_messages_emitted | What was actually generated |
| on_guideline_match | Individual guideline match details |

See Engine Extensions for hook implementation.

Diagnostic Checklist

When something goes wrong, work through this checklist:

□ What was the expected behavior?
□ What actually happened?
□ Check the trace: Did preparation complete normally?
□ Check guideline matching:
  □ Were the expected guidelines evaluated?
  □ Did they match? With what rationale?
  □ Were unexpected guidelines matched?
□ Check tool calling:
  □ Were the expected tools considered?
  □ What were the inference decisions?
  □ Any CANNOT_RUN insights?
□ Check journey state:
  □ Is the right journey active?
  □ Is it at the right node?
  □ Any unexpected transitions?
□ Check message generation:
  □ Which guidelines were in the context?
  □ What do the ARQ artifacts show?
  □ Was the criticality appropriate?
□ Check relationships:
  □ Is any suppression preventing a match?
  □ Any priority overrides?

Performance Debugging

If responses are too slow:

Identify Bottlenecks

Trace timing shows time per span:
- guideline_matching: 2.3s ← Slow matching
- tool_inference: 0.4s
- tool_execution: 1.1s ← Slow tool
- message_generation: 0.8s

Common Performance Issues

| Issue | Symptom | Solution |
|---|---|---|
| Too many guidelines | Extended matching time | Implement better journey scoping |
| Large batches | High latency per batch | Reduce the batch size |
| Slow tools | Extended tool execution time | Optimize the tool implementation or add caching |
| Many iterations | Multiple matching rounds | Review guideline dependencies for optimization |

Optimization Levers

  1. Journey scoping: Reduce the number of guidelines evaluated by scoping them to specific journeys.
  2. Batch size: Smaller batches enable more parallelism but require more LLM calls.
  3. Tool caching: Cache tool results that do not change frequently (see the sketch after this list).
  4. max_iterations: Lower values produce faster responses but may miss some dynamically triggered guidelines.
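
A minimal sketch of the tool-caching lever: a plain in-process TTL cache wrapped around a slow tool. The lookup_product_price tool and its signature are hypothetical; a production cache also needs invalidation rules and concurrency handling appropriate to your data.

import time
from functools import wraps


def ttl_cache(ttl_seconds: float):
    """Cache an async tool's results for a short time so repeated calls
    within a conversation do not hit the backend again."""
    def decorator(func):
        cache: dict = {}

        @wraps(func)
        async def wrapper(*args, **kwargs):
            key = (args, tuple(sorted(kwargs.items())))
            now = time.monotonic()
            if key in cache:
                value, stored_at = cache[key]
                if now - stored_at < ttl_seconds:
                    return value  # still fresh; skip the real call
            value = await func(*args, **kwargs)
            cache[key] = (value, now)
            return value

        return wrapper
    return decorator


@ttl_cache(ttl_seconds=300)
async def lookup_product_price(product_id: str) -> dict:
    # Hypothetical slow tool; replace the body with your real implementation.
    ...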

What's Next