分布式追踪在 AI 应用里的深度实践——追踪一次完整的 Agent 执行

老张2026/4/30大约 9 分钟

分布式追踪在 AI 应用里的深度实践——追踪一次完整的 Agent 执行

我第一次真正理解分布式追踪的价值，是在排查一个 Agent 的奇怪行为的时候。

用户反馈说，让 Agent 帮忙整理一份竞品分析报告，结果 Agent 执行了大概 15 分钟，最后输出了一份完全不相关的内容——内容是关于餐厅推荐的。明显是跑偏了，但我们不知道哪个步骤出了问题。

日志里有一堆 INFO 日志，但全是各个工具调用的结果，没有上下文关联，无法还原 Agent 的决策过程。我们花了半天时间，逐条读日志，才大概拼出了这个 Agent 在某个搜索工具调用时，因为关键词歧义，检索结果跑偏了，之后的每一步都在错误的方向上越走越远。

这种问题，如果有完整的分布式追踪，5 分钟就能定位。

Agent 追踪和普通服务追踪的根本区别

普通微服务的追踪很直观：一个 HTTP 请求进来，触发几个下游调用，形成一棵调用树。整个结构是固定的，你大概知道会有哪些 Span。

AI Agent 的追踪完全不同，有几个核心差异：

差异一：执行路径是动态的。 Agent 决定下一步做什么，是根据当前状态和工具调用结果实时决策的，不是预定好的。同一个 Agent 处理不同的输入，执行路径可能完全不同：有的请求 3 步完成，有的需要 12 步。追踪系统必须能表达这种动态结构。

差异二：有隐式的"思考"过程。 在工具调用和工具调用之间，Agent 有一个 LLM 推理步骤，决定下一步调用什么工具、参数是什么。这个"思考"过程没有显式的函数调用，如果不主动记录，trace 里就是一个黑盒。

差异三：中间状态非常重要。 对于普通服务，中间状态（某个变量的值）通常不需要记录在 trace 里。对于 Agent，工具调用的返回值（搜索结果、代码执行结果）对理解 Agent 行为至关重要。如果工具返回了错误的信息，Agent 的后续行为就会基于错误信息做出，trace 里没有这个信息就无法诊断。

差异四：跨请求的上下文。 Agent 往往需要维护多轮对话的历史，一个新的用户输入要结合之前 N 轮的对话历史来处理。trace 需要能关联同一个 session 的多次请求。

Agent Trace 的结构设计

一个完整的 Agent 执行 Trace 应该是这样的结构：

[Trace: 用户请求 - "帮我整理竞品分析报告"]
├── [Span: agent.run] (总耗时 12.3s)
│   ├── [Span: agent.iteration.1] (1.2s)
│   │   ├── [Span: llm.think] - 决策：调用 web_search 工具 (0.8s)
│   │   │   ├── gen_ai.request.model = "gpt-4o"
│   │   │   ├── gen_ai.usage.input_tokens = 856
│   │   │   ├── gen_ai.usage.output_tokens = 124
│   │   │   └── agent.thought = "我需要先搜索主要竞品的信息..."
│   │   └── [Span: tool.web_search] (0.4s)
│   │       ├── tool.name = "web_search"
│   │       ├── tool.input = {"query": "AI 写作助手竞品分析 2024"}
│   │       └── tool.output.summary = "找到 15 条结果..."
│   ├── [Span: agent.iteration.2] (2.1s)
│   │   ├── [Span: llm.think] - 决策：调用 extract_info 工具 (1.6s)
│   │   └── [Span: tool.extract_info] (0.5s)
│   │       ├── tool.input = {文档 URL}
│   │       └── tool.output.extracted_fields = [...]
│   └── [Span: agent.final_answer] (0.3s)
│       ├── llm.final_answer = "以下是竞品分析..."
│       └── agent.iterations_count = 6

每个 Span 的职责：

agent.run：整个 Agent 执行的根 Span，记录总耗时、总迭代次数、最终状态
agent.iteration.N：每次迭代的 Span，包含 LLM 推理 + 工具调用
llm.think：LLM 推理的 Span，记录 Token 消耗和"思考"内容
tool.*：工具调用 Span，记录输入输出

LangChain4j Agent 执行的 Trace 注入

LangChain4j 的 Agent 基于 AiServices 和 Tool 机制，追踪注入需要在几个关键点切入：

Step 1：Agent 执行监听器

// 自定义的 Agent 执行追踪拦截器
@Component
public class AgentExecutionTracer {
    
    private final Tracer tracer;
    private final MeterRegistry meterRegistry;
    
    // ThreadLocal 存储每次 Agent 执行的上下文
    private final ThreadLocal<AgentTraceContext> traceContext = new ThreadLocal<>();
    
    /**
     * Agent 执行开始时调用
     */
    public void onAgentStart(String agentId, String sessionId, String userInput) {
        Span rootSpan = tracer.nextSpan()
            .name("agent.run")
            .tag("agent.id", agentId)
            .tag("agent.session_id", sessionId)
            .tag("agent.input_length", String.valueOf(userInput.length()))
            .start();
        
        AgentTraceContext ctx = AgentTraceContext.builder()
            .rootSpan(rootSpan)
            .rootScope(tracer.withSpan(rootSpan))
            .agentId(agentId)
            .sessionId(sessionId)
            .startTime(System.currentTimeMillis())
            .iterationCount(0)
            .build();
        
        traceContext.set(ctx);
        
        log.info("Agent execution started: agentId={}, sessionId={}, traceId={}", 
            agentId, sessionId, rootSpan.context().traceId());
    }
    
    /**
     * 每次 LLM 推理开始时调用（决策阶段）
     */
    public Span onLLMThinkStart(int iterationNumber) {
        AgentTraceContext ctx = traceContext.get();
        ctx.setIterationCount(iterationNumber);
        
        // 先创建 iteration Span
        Span iterationSpan = tracer.nextSpan()
            .name("agent.iteration." + iterationNumber)
            .tag("agent.iteration", String.valueOf(iterationNumber))
            .start();
        
        Span thinkSpan = tracer.nextSpan()
            .name("llm.think")
            .tag("gen_ai.operation.name", "chat")
            .tag("agent.iteration", String.valueOf(iterationNumber))
            .start();
        
        ctx.setCurrentIterationSpan(iterationSpan);
        ctx.setCurrentIterationScope(tracer.withSpan(iterationSpan));
        ctx.setCurrentThinkSpan(thinkSpan);
        ctx.setCurrentThinkScope(tracer.withSpan(thinkSpan));
        
        return thinkSpan;
    }
    
    /**
     * LLM 推理结束时调用
     */
    public void onLLMThinkEnd(int inputTokens, int outputTokens, 
                               String thought, String nextAction) {
        AgentTraceContext ctx = traceContext.get();
        Span thinkSpan = ctx.getCurrentThinkSpan();
        
        thinkSpan.tag("gen_ai.usage.input_tokens", String.valueOf(inputTokens));
        thinkSpan.tag("gen_ai.usage.output_tokens", String.valueOf(outputTokens));
        
        // 记录 Agent 的"思考"内容（生产环境可能要过滤 PII 后再记录）
        if (thought != null && thought.length() < 500) {
            thinkSpan.tag("agent.thought", thought);
        }
        
        thinkSpan.tag("agent.next_action", nextAction);
        
        // 关闭 think scope 但不关闭 iteration scope（还有工具调用）
        ctx.getCurrentThinkScope().close();
        thinkSpan.end();
        
        // 记录 Token 成本到 Prometheus
        meterRegistry.counter("agent.llm.tokens",
            "type", "input",
            "agent_id", ctx.getAgentId()
        ).increment(inputTokens);
        meterRegistry.counter("agent.llm.tokens",
            "type", "output",
            "agent_id", ctx.getAgentId()
        ).increment(outputTokens);
    }
    
    /**
     * 工具调用开始时调用
     */
    public Span onToolCallStart(String toolName, String callId, Object toolInput) {
        AgentTraceContext ctx = traceContext.get();
        
        Span toolSpan = tracer.nextSpan()
            .name("tool." + toolName)
            .tag("tool.name", toolName)
            .tag("tool.call_id", callId)
            .start();
        
        try {
            String inputJson = objectMapper.writeValueAsString(toolInput);
            // 输入超长就截断，避免 trace 太大
            toolSpan.tag("tool.input", inputJson.length() > 1000 
                ? inputJson.substring(0, 1000) + "..." 
                : inputJson);
        } catch (Exception e) {
            toolSpan.tag("tool.input", toolInput.toString());
        }
        
        ctx.setCurrentToolSpan(toolSpan);
        ctx.setCurrentToolScope(tracer.withSpan(toolSpan));
        
        return toolSpan;
    }
    
    /**
     * 工具调用结束时调用
     */
    public void onToolCallEnd(String toolName, Object toolOutput, boolean success) {
        AgentTraceContext ctx = traceContext.get();
        Span toolSpan = ctx.getCurrentToolSpan();
        
        toolSpan.tag("tool.success", String.valueOf(success));
        
        if (success && toolOutput != null) {
            try {
                String outputJson = objectMapper.writeValueAsString(toolOutput);
                // 工具输出通常很长，只记录摘要
                toolSpan.tag("tool.output_length", String.valueOf(outputJson.length()));
                toolSpan.tag("tool.output.summary", 
                    outputJson.length() > 200 
                        ? outputJson.substring(0, 200) + "..." 
                        : outputJson
                );
            } catch (Exception e) {
                toolSpan.tag("tool.output", toolOutput.toString());
            }
        } else if (!success) {
            toolSpan.tag("error", "true");
        }
        
        ctx.getCurrentToolScope().close();
        toolSpan.end();
        
        // 关闭 iteration span
        ctx.getCurrentIterationScope().close();
        ctx.getCurrentIterationSpan().end();
        
        // 记录工具调用成功率
        meterRegistry.counter("agent.tool.calls",
            "tool_name", toolName,
            "success", String.valueOf(success),
            "agent_id", ctx.getAgentId()
        ).increment();
    }
    
    /**
     * Agent 执行结束时调用
     */
    public void onAgentEnd(String finalAnswer, boolean success) {
        AgentTraceContext ctx = traceContext.get();
        Span rootSpan = ctx.getRootSpan();
        
        long totalDuration = System.currentTimeMillis() - ctx.getStartTime();
        
        rootSpan.tag("agent.iterations_total", String.valueOf(ctx.getIterationCount()));
        rootSpan.tag("agent.total_duration_ms", String.valueOf(totalDuration));
        rootSpan.tag("agent.success", String.valueOf(success));
        
        if (success && finalAnswer != null) {
            rootSpan.tag("agent.output_length", String.valueOf(finalAnswer.length()));
        }
        
        ctx.getRootScope().close();
        rootSpan.end();
        
        // 清理 ThreadLocal
        traceContext.remove();
        
        // Prometheus 指标
        meterRegistry.timer("agent.execution.duration",
            "agent_id", ctx.getAgentId(),
            "success", String.valueOf(success)
        ).record(totalDuration, TimeUnit.MILLISECONDS);
    }
}

Step 2：把追踪器注入到 Agent 的工具执行循环

LangChain4j 的 AiServices 不直接暴露执行循环，需要用 ChatMemoryProvider + 自定义 ToolProvider 来切入：

@Configuration
public class AgentConfig {
    
    @Autowired
    private AgentExecutionTracer agentTracer;
    
    @Bean
    public CompetitorAnalysisAgent competitorAnalysisAgent(
        ChatLanguageModel chatModel,
        List<Object> tools) {
        
        // 用装饰器模式包装 ChatLanguageModel，在每次 LLM 调用前后触发追踪
        ChatLanguageModel tracedModel = new TracedChatLanguageModel(chatModel, agentTracer);
        
        return AiServices.builder(CompetitorAnalysisAgent.class)
            .chatLanguageModel(tracedModel)
            .tools(tools)
            .chatMemoryProvider(memId -> MessageWindowChatMemory.withMaxMessages(20))
            // 注册工具调用监听器
            .toolExecutionResultHandler((result, request) -> {
                agentTracer.onToolCallEnd(
                    request.toolName(),
                    result.text(),
                    !result.text().startsWith("Error:")
                );
            })
            .build();
    }
}

// 包装模型，注入追踪逻辑
public class TracedChatLanguageModel implements ChatLanguageModel {
    
    private final ChatLanguageModel delegate;
    private final AgentExecutionTracer tracer;
    private int iterationCount = 0;
    
    @Override
    public Response<AiMessage> generate(List<ChatMessage> messages, ToolSpecification... toolSpecifications) {
        int currentIteration = ++iterationCount;
        Span thinkSpan = tracer.onLLMThinkStart(currentIteration);
        
        try {
            Response<AiMessage> response = delegate.generate(messages, toolSpecifications);
            
            int inputTokens = response.tokenUsage() != null 
                ? response.tokenUsage().inputTokenCount() : 0;
            int outputTokens = response.tokenUsage() != null 
                ? response.tokenUsage().outputTokenCount() : 0;
            
            // 提取 Agent 的思考内容（如果模型支持）
            String thought = extractThought(response.content().text());
            String nextAction = response.content().hasToolExecutionRequests() 
                ? response.content().toolExecutionRequests().get(0).name()
                : "final_answer";
            
            tracer.onLLMThinkEnd(inputTokens, outputTokens, thought, nextAction);
            
            // 如果有工具调用，记录工具调用开始
            if (response.content().hasToolExecutionRequests()) {
                var toolReq = response.content().toolExecutionRequests().get(0);
                tracer.onToolCallStart(
                    toolReq.name(),
                    toolReq.id(),
                    toolReq.arguments()
                );
            }
            
            return response;
        } catch (Exception e) {
            thinkSpan.tag("error", "true");
            thinkSpan.tag("error.message", e.getMessage());
            throw e;
        }
    }
    
    private String extractThought(String content) {
        // 有些模型会在 <thinking> 标签里输出思考过程
        if (content.contains("<thinking>")) {
            int start = content.indexOf("<thinking>") + 10;
            int end = content.indexOf("</thinking>");
            if (end > start) {
                return content.substring(start, end).trim();
            }
        }
        return null;
    }
}