Debugging

When an agent does not behave as expected, understanding what happened inside the engine becomes essential. This page explains how to trace engine behavior, interpret results, and debug common issues.

Understanding Engine Traces

Parlant uses OpenTelemetry-compatible tracing. Each response generates a trace with spans for major operations.

Trace Hierarchy

response_trace
├── preparation
│   ├── load_context_variables
│   ├── load_glossary_terms
│   └── load_capabilities
├── iteration_1
│   ├── guideline_matching
│   │   ├── journey_prediction
│   │   ├── batch_observational
│   │   ├── batch_actionable
│   │   └── relational_resolution
│   └── tool_calling
│       ├── tool_inference
│       └── tool_execution
├── iteration_2
│   └── (same structure)
├── message_generation
│   ├── arq_generation
│   └── message_emission
└── state_persistence

Accessing Traces

Console output: Set log level to DEBUG to see span information:

PARLANT_LOG_LEVEL=DEBUG parlant serve

OTLP export: Configure an OTLP endpoint to send traces to your observability platform (Jaeger, Honeycomb, etc.):

# In your configuration
otlp_endpoint = "http://localhost:4317"
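
Because the tracing is OpenTelemetry-compatible, you can also wire the exporter up yourself (for example, in a custom entry point) using the standard OpenTelemetry SDK. The following is a generic sketch of that SDK, not a Parlant-specific API; whether Parlant's spans flow through a globally installed tracer provider depends on your deployment, so verify against your setup.

# Generic OpenTelemetry SDK setup (sketch; adapt to how your application starts Parlant)
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317"))
)
trace.set_tracer_provider(provider)  # spans emitted via the global provider are exported over OTLP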

Session events: Each response stores a trace_id in the agent state, linking session events to their execution trace.

Key Observability Points

Different phases expose different information. Here's what to look for:

Guideline Matching Phase

| What to Check | Where to Find It | What It Tells You |
|---|---|---|
| Which guidelines matched | matched_guidelines in iteration state | Guidelines that will influence the response |
| Match scores | score field on each match (0-10) | Confidence in the match |
| Match rationale | rationale field on each match | Why the LLM decided it matched |
| Which guidelines were evaluated | Trace spans per batch | What was actually sent to the LLM |
| Which guidelines were pruned | Journey prediction span | What was excluded before LLM evaluation |

Journey State Phase

| What to Check | Where to Find It | What It Tells You |
|---|---|---|
| Active journeys | journey_paths in response state | Which journeys are in progress |
| Current position | Path array for each journey | The current node position |
| Transitions | Changes in path between iterations | How the journey progressed |

Tool Calling Phase

| What to Check | Where to Find It | What It Tells You |
|---|---|---|
| Tools considered | tool_enabled_guideline_matches | Which tools were candidates |
| Inference decisions | Tool inference span | NEEDS_TO_RUN, DATA_ALREADY_IN_CONTEXT, or CANNOT_RUN |
| Execution results | tool_events in response state | What the tools returned |
| Blocked tools | tool_insights | What couldn't run and why |

Message Generation Phase

| What to Check | Where to Find It | What It Tells You |
|---|---|---|
| Guidelines used | Generation context | What instructions the LLM saw |
| ARQ artifacts | Structured output before the message | How the LLM reasoned about guidelines |
| Applied guidelines | applied_guideline_ids in agent state | Which guidelines affected this response |

Debugging Common Issues

Guideline Not Matching

Symptom: A guideline that was expected to apply did not match.

Diagnostic steps:

  1. Check if it was evaluated:
     • Look in the trace: was the guideline in any batch?
     • If NO → it was pruned (journey prediction excluded it)
     • If YES → continue to step 2
  2. Check criticality:
     • Low-criticality guidelines may be aggressively pruned
     • Consider raising the criticality if the guideline is important
  3. Check condition wording:
     • Is the condition clearly met by the conversation?
     • Try rewording it to be more explicit
  4. Check the match rationale:
     • If the guideline was evaluated, what did the LLM say?
     • The rationale explains why it decided "no match" (a hook sketch for this appears after the common causes below)
  5. Check journey scope:
     • Is the guideline scoped to a journey?
     • Is that journey currently active?

Common causes:

  • The condition is overly specific (e.g., "customer says exactly 'I want to return this'")
  • The condition uses domain jargon that the LLM cannot interpret correctly
  • The journey is not active, causing journey-scoped guidelines to be excluded
  • The guideline was pruned by journey prediction
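
To make steps 1 and 4 above concrete, a focused variant of the debug hook shown in Hook-Based Debugging below can watch a single guideline after each preparation iteration. This is a minimal sketch reusing the attributes from that example (ordinary_guideline_matches, guideline.id, score, rationale); WATCHED_GUIDELINE_ID is a placeholder for the guideline you are diagnosing.

import logging

logger = logging.getLogger(__name__)

WATCHED_GUIDELINE_ID = "guideline-id-under-investigation"  # placeholder


class GuidelineMatchDebugHook:
    async def on_preparation_iteration_end(self, context, iteration: int) -> bool:
        matches = context.state.ordinary_guideline_matches
        watched = [m for m in matches if m.guideline.id == WATCHED_GUIDELINE_ID]
        if watched:
            for m in watched:
                logger.debug(f"[iter {iteration}] matched, score={m.score}, rationale={m.rationale}")
        else:
            # Not in the matched set: check the trace to see whether it was
            # evaluated in a batch (no match) or pruned before evaluation.
            logger.debug(f"[iter {iteration}] {WATCHED_GUIDELINE_ID} did not match")
        return True  # continue processing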

Guideline Matching When It Shouldn't

Symptom: A guideline matched when the condition is not actually met.

Diagnostic steps:

  1. Read the match rationale:
     • What reasoning led to the match?
     • It often reveals interpretation issues
  2. Check for ambiguous conditions:
     • "Customer seems frustrated" - what does "seems" mean?
     • "Customer mentions returns" - mentioning vs. wanting
  3. Check for overly broad conditions:
     • "Customer asks a question" - too broad
     • "Customer asks about products" - still broad
     • "Customer asks about product pricing" - better

Common causes:

  • The condition is too vague
  • The LLM is being overly generous in its interpretation
  • Similar conditions on different guidelines are causing confusion

Tool Not Called

Symptom: A tool that should have been called was not executed.

Diagnostic steps:

  1. Check if the guideline matched:
     • The tool's associated guideline must match first
     • If the guideline didn't match, the tool won't be considered
  2. Check the inference decision:
     • Look in the tool_inference span
     • Was it DATA_ALREADY_IN_CONTEXT? The data might already be known
     • Was it CANNOT_RUN? Parameters were missing
  3. Check tool_insights:
     • If CANNOT_RUN, the insights explain what was missing
     • The customer may need to provide more information
  4. Check the tool-guideline association:
     • Is the tool actually associated with the guideline?
     • The association must exist for the tool to be considered

Common causes:

  • The guideline did not match (so the tool was never considered)
  • The tool was evaluated as DATA_ALREADY_IN_CONTEXT (the information is already available)
  • Required parameters are not available from the conversation

Response Not Following Guideline

Symptom: A guideline matched, but the response does not follow it.

Diagnostic steps:

  1. Verify the guideline matched:
     • Check applied_guideline_ids in the agent state
     • Was the guideline actually in the matched set?
  2. Check criticality:
     • HIGH criticality gets the strongest enforcement
     • MEDIUM gets standard enforcement
     • LOW might not be strongly enforced
  3. Check the ARQ artifacts:
     • Look at the structured output from generation
     • Did the LLM acknowledge the guideline?
     • What did it say about how to address it?
  4. Check for conflicting guidelines:
     • Another guideline might contradict this one
     • Check for suppression relationships
  5. Check the composition mode:
     • In STRICT mode, only canned responses are used
     • If no canned response matches, behavior is undefined

Common causes:

  • The guideline criticality is set too low
  • A conflicting guideline is taking precedence
  • The action is vague (e.g., "be helpful" rather than a specific action)
  • There is a composition mode mismatch

Journey Stuck at Wrong Node

Symptom: The journey is not progressing to the expected node.

Diagnostic steps:

  1. Check the current path:
     • journey_paths shows the current position (see the sketch after the common causes below)
     • Is it at the expected node?
  2. Check transition conditions:
     • Edges may have conditions
     • Is the condition for the next edge met?
  3. Check the node selection rationale:
     • The journey node selection batch explains its decisions
     • Why was the current node selected over the next one?
  4. Check for backtracking signals:
     • The customer might have said something that triggered backtracking to an earlier node

Common causes:

  • The transition condition is not met
  • The customer response triggered backtracking
  • Ambiguous node actions are causing incorrect selection
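
As referenced in step 1 above, a small hook can log the journey paths after each iteration so you can watch transitions and backtracking as they happen. This sketch assumes journey_paths is readable from the iteration state in the same way as the guideline matches in the Hook-Based Debugging example below; adjust the attribute access to your version.

import logging

logger = logging.getLogger(__name__)


class JourneyDebugHook:
    async def on_preparation_iteration_end(self, context, iteration: int) -> bool:
        # journey_paths is the field listed in the Journey State table above;
        # log the whole structure and diff it between iterations to see transitions.
        logger.debug(f"[iter {iteration}] journey_paths={context.state.journey_paths}")
        return True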

Hook-Based Debugging

Hooks let you inspect and log state at key points:

import logging

logger = logging.getLogger(__name__)


class DebugHook:
    async def on_preparation_iteration_end(
        self,
        context: EngineContext,  # EngineContext is provided by Parlant's engine hook API
        iteration: int,
    ) -> bool:
        # Log matched guidelines after each iteration
        for match in context.state.ordinary_guideline_matches:
            logger.debug(
                f"Matched: {match.guideline.id} "
                f"score={match.score} "
                f"rationale={match.rationale}"
            )

        # Log tool events
        for event in context.state.tool_events:
            logger.debug(
                f"Tool: {event.tool_name} "
                f"result={event.result}"
            )

        return True  # Continue processing

    async def on_generating_messages(
        self,
        context: EngineContext,
    ) -> bool:
        # Log what will be sent to message generation
        logger.debug(
            f"Guidelines for generation: "
            f"{len(context.state.ordinary_guideline_matches)}"
        )
        logger.debug(f"Tool insights: {context.state.tool_insights}")
        return True

Useful Hook Points for Debugging

| Hook | What You Can Inspect |
|---|---|
| on_preparation_iteration_start | Context before matching begins |
| on_preparation_iteration_end | All matches and tool results for this iteration |
| on_generating_messages | Final context being sent to generation |
| on_messages_emitted | What was actually generated |
| on_guideline_match | Individual guideline match details |

See Engine Extensions for hook implementation.

Diagnostic Checklist

When something goes wrong, work through this checklist:

□ What was the expected behavior?
□ What actually happened?
□ Check the trace: Did preparation complete normally?
□ Check guideline matching:
  □ Were the expected guidelines evaluated?
  □ Did they match? With what rationale?
  □ Were unexpected guidelines matched?
□ Check tool calling:
  □ Were the expected tools considered?
  □ What were the inference decisions?
  □ Any CANNOT_RUN insights?
□ Check journey state:
  □ Is the right journey active?
  □ Is it at the right node?
  □ Any unexpected transitions?
□ Check message generation:
  □ Which guidelines were in the context?
  □ What do the ARQ artifacts show?
  □ Was the criticality appropriate?
□ Check relationships:
  □ Is any suppression preventing a match?
  □ Any priority overrides?

Performance Debugging

If responses are too slow:

Identify Bottlenecks

Trace timing shows time per span:
- guideline_matching: 2.3s ← Slow matching
- tool_inference: 0.4s
- tool_execution: 1.1s ← Slow tool
- message_generation: 0.8s

Common Performance Issues

| Issue | Symptom | Solution |
|---|---|---|
| Too many guidelines | Extended matching time | Implement better journey scoping |
| Large batches | High latency per batch | Reduce the batch size |
| Slow tools | Extended tool execution time | Optimize the tool implementation or add caching |
| Many iterations | Multiple matching rounds | Review guideline dependencies for optimization |

Optimization Levers

  1. Journey scoping: Reduce the number of guidelines evaluated by scoping them to specific journeys.
  2. Batch size: Smaller batches enable more parallelism but require more LLM calls.
  3. Tool caching: Cache tool results that do not change frequently (see the sketch after this list).
  4. max_iterations: Lower values produce faster responses but may miss some dynamically triggered guidelines.
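
A minimal sketch of the tool-caching lever: a plain in-process TTL cache wrapped around a slow tool. The lookup_product_price tool and its signature are hypothetical; a production cache also needs invalidation rules and concurrency handling appropriate to your data.

import time
from functools import wraps


def ttl_cache(ttl_seconds: float):
    """Cache an async tool's results for a short time so repeated calls
    within a conversation do not hit the backend again."""
    def decorator(func):
        cache: dict = {}

        @wraps(func)
        async def wrapper(*args, **kwargs):
            key = (args, tuple(sorted(kwargs.items())))
            now = time.monotonic()
            if key in cache:
                value, stored_at = cache[key]
                if now - stored_at < ttl_seconds:
                    return value  # still fresh; skip the real call
            value = await func(*args, **kwargs)
            cache[key] = (value, now)
            return value

        return wrapper
    return decorator


@ttl_cache(ttl_seconds=300)
async def lookup_product_price(product_id: str) -> dict:
    # Hypothetical slow tool; replace the body with your real implementation.
    ...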

What's Next