Post 2130: Observability for LLM Applications (What Can You See When the AI Goes Wrong?)

Lao Zhang | 2026/4/30 | about 7 minutes


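Intended audience: engineers responsible for operating LLM applications and monitoring their quality | Reading time: about 19 minutes | Key takeaway: build an LLM observability stack that covers logs, traces, and metrics, so that when something goes wrong you can locate the cause quickly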

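"A user reported that the AI gave a wrong answer, but I have no idea what input the AI received at the time, which documents it retrieved, or how the prompt was assembled."

This is the pain of having no observability. When ordinary software breaks, reading the logs is usually enough to locate the problem. For LLM applications, "something went wrong" is more complicated: the prompt may have been assembled incorrectly, RAG may have retrieved the wrong documents, or the model output may have been parsed incorrectly. Without complete trace information, debugging means guessing inside a black box.

Observability is infrastructure for production-grade AI applications, not an optional feature.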

Observability Challenges in LLM Applications

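/**
 * What makes observability different for LLM applications
 * 
 * ===== Differences from traditional software =====
 * 
 * Traditional software:
 * - Deterministic output (same input → same output)
 * - Failures carry explicit error codes
 * - Performance is mostly about latency and error rate
 * 
 * LLM applications:
 * - Output is stochastic (same input → possibly different output)
 * - Failures are "soft failures" (there is output, but its quality is poor and no exception is thrown)
 * - Performance also covers output quality (satisfaction rate, accuracy)
 * 
 * ===== Metrics specific to LLM applications =====
 * 
 * Token metrics:
 * - Input/output tokens consumed per request
 * - Distribution of token usage (P50/P90/P99)
 * - Cumulative cost trend
 * 
 * Quality metrics:
 * - User satisfaction rate (explicit/implicit feedback)
 * - Hallucination detection rate (if available)
 * - Refusal rate (share of answers where the AI says "I don't know")
 * 
 * RAG metrics:
 * - Retrieval recall
 * - Retrieval latency
 * - Context utilization (how much of the retrieved content was actually cited)
 * 
 * Tracing needs:
 * - Which steps did a user request go through?
 * - How long did each step take?
 * - What were the intermediate states?
 */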
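
Two of the quality and RAG metrics listed above, refusal rate and context utilization, can be computed from per-request records. The following is a minimal sketch under stated assumptions: the `AnswerRecord` type and its fields are hypothetical (they are not part of the article's code), and context utilization is taken here to mean the share of retrieved documents that the answer actually cited, pooled over all requests.

```java
import java.util.List;
import java.util.Set;

final class QualityMetrics {

    /** Hypothetical per-request record; field names are illustrative only. */
    record AnswerRecord(boolean refused, Set<String> retrievedDocIds, Set<String> citedDocIds) {}

    /** Share of answers in which the model declined to answer ("I don't know"). */
    static double refusalRate(List<AnswerRecord> records) {
        if (records.isEmpty()) return 0.0;
        long refused = records.stream().filter(AnswerRecord::refused).count();
        return (double) refused / records.size();
    }

    /** Share of retrieved documents that were cited, pooled over all requests. */
    static double contextUtilization(List<AnswerRecord> records) {
        long retrieved = 0;
        long cited = 0;
        for (AnswerRecord r : records) {
            retrieved += r.retrievedDocIds().size();
            // Count only citations that point at a document that was really retrieved.
            cited += r.citedDocIds().stream().filter(r.retrievedDocIds()::contains).count();
        }
        return retrieved == 0 ? 0.0 : (double) cited / retrieved;
    }

    public static void main(String[] args) {
        List<AnswerRecord> records = List.of(
            new AnswerRecord(false, Set.of("d1", "d2", "d3", "d4"), Set.of("d1", "d2")),
            new AnswerRecord(true, Set.of("d5", "d6"), Set.of()),
            new AnswerRecord(false, Set.of("d7", "d8"), Set.of("d7")),
            new AnswerRecord(false, Set.of("d9", "d10"), Set.of("d9")));
        System.out.println("refusalRate=" + refusalRate(records));               // 1 of 4 answers refused
        System.out.println("contextUtilization=" + contextUtilization(records)); // 4 of 10 documents cited
    }
}
```

How "cited" is detected (explicit citation markers, string overlap, or a judge model) is left open here; the sketch only shows the bookkeeping once that signal exists.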

Distributed Tracing Integration

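import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.SpanKind;
import io.opentelemetry.api.trace.StatusCode;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Context;
import java.util.List;
import lombok.RequiredArgsConstructor;
import lombok.extern.slf4j.Slf4j;
import org.springframework.stereotype.Component;

/**
 * LLM request tracer
 * 
 * Records the complete request trace using the OpenTelemetry standard
 * Can be connected to tracing systems such as Jaeger, Zipkin, and Datadog
 */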
@Component
@RequiredArgsConstructor
@Slf4j
public class LlmTracer {
    
    private final Tracer tracer;  // OpenTelemetry Tracer
    
    /**
     * Create the root span of an LLM request
     */
    public Span startRequestTrace(String userId, String feature, String requestId) {
        
        Span span = tracer.spanBuilder("llm.request")
            .setSpanKind(SpanKind.SERVER)
            .startSpan();
        
        span.setAttribute("user.id", userId);
        span.setAttribute("llm.feature", feature);
        span.setAttribute("request.id", requestId);
        span.setAttribute("llm.start_time", System.currentTimeMillis());
        
        return span;
    }
    
    /**
     * Trace the RAG retrieval step
     */
    public Span traceRetrieval(Span parentSpan, String query, int topK) {
        
        Context parentContext = Context.current().with(parentSpan);
        
        Span retrievalSpan = tracer.spanBuilder("rag.retrieval")
            .setParent(parentContext)
            .startSpan();
        
        retrievalSpan.setAttribute("rag.query", query);
        retrievalSpan.setAttribute("rag.top_k", topK);
        
        return retrievalSpan;
    }
    
    /**
     * Record the retrieval result (docs are assumed to be sorted by score, descending)
     */
    public void recordRetrievalResult(Span span, List<RetrievedDocument> docs, long latencyMs) {
        span.setAttribute("rag.result_count", docs.size());
        span.setAttribute("rag.latency_ms", latencyMs);
        
        if (!docs.isEmpty()) {
            span.setAttribute("rag.top_score", docs.get(0).getScore());
            span.setAttribute("rag.min_score", docs.get(docs.size() - 1).getScore());
        }
        
        span.end();
    }
    
    /**
     * Trace an LLM call
     */
    public Span traceLlmCall(Span parentSpan, String modelId, int inputTokens) {
        
        Context parentContext = Context.current().with(parentSpan);
        
        Span llmSpan = tracer.spanBuilder("llm.model_call")
            .setParent(parentContext)
            .startSpan();
        
        llmSpan.setAttribute("llm.model", modelId);
        llmSpan.setAttribute("llm.input_tokens", inputTokens);
        
        return llmSpan;
    }
    
    /**
     * Record the result of an LLM call
     * 
     * inputTokens is passed in explicitly: the OpenTelemetry API Span is write-only,
     * so an attribute set earlier cannot be read back from it
     */
    public void recordLlmCallResult(
            Span span, int inputTokens, int outputTokens, String finishReason, long latencyMs) {
        span.setAttribute("llm.output_tokens", outputTokens);
        span.setAttribute("llm.finish_reason", finishReason);
        span.setAttribute("llm.latency_ms", latencyMs);
        span.setAttribute("llm.cost_usd", estimateCost(inputTokens, outputTokens));
        span.end();
    }
    
    /**
     * Record an error
     */
    public void recordError(Span span, Throwable error) {
        span.recordException(error);
        span.setStatus(StatusCode.ERROR, error.getMessage());
        span.end();
    }
    
    private double estimateCost(int inputTokens, int outputTokens) {
        return inputTokens * 5.0 / 1_000_000 + outputTokens * 15.0 / 1_000_000;
    }
}
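
The arithmetic behind `estimateCost` is easy to get wrong by a factor of 1,000, so here is a small worked sketch. It assumes the same illustrative prices as the constants above (5 USD per million input tokens, 15 USD per million output tokens; real prices depend on the model) and keeps the running total in integer micro-USD, converting to USD only for display. The class and method names are hypothetical.

```java
final class CostEstimate {

    // Assumed prices: 5 USD per 1M input tokens is exactly 5 micro-USD per token.
    static final long INPUT_MICRO_USD_PER_TOKEN = 5;
    static final long OUTPUT_MICRO_USD_PER_TOKEN = 15;

    /** Cost in micro-USD (USD x 1,000,000), exact integer arithmetic. */
    static long costMicroUsd(int inputTokens, int outputTokens) {
        return inputTokens * INPUT_MICRO_USD_PER_TOKEN
             + outputTokens * OUTPUT_MICRO_USD_PER_TOKEN;
    }

    /** Same cost converted to USD for display. */
    static double costUsd(int inputTokens, int outputTokens) {
        return costMicroUsd(inputTokens, outputTokens) / 1_000_000.0;
    }

    public static void main(String[] args) {
        // 2000 input tokens  -> 2000 * 5  = 10,000 micro-USD
        //  500 output tokens ->  500 * 15 =  7,500 micro-USD
        System.out.println(costMicroUsd(2000, 500)); // 17500
        System.out.println(costUsd(2000, 500));      // 0.0175
    }
}
```

Accumulating integers avoids floating-point drift when millions of small costs are summed in a counter.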

Structured Logging

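import com.fasterxml.jackson.databind.ObjectMapper;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import lombok.RequiredArgsConstructor;
import lombok.extern.slf4j.Slf4j;
import org.springframework.stereotype.Service;

/**
 * Structured logging dedicated to LLM requests
 * 
 * Ordinary logs are not enough for LLM applications:
 * intermediate states such as the prompt, retrieval results, and model output must be recorded too
 * 
 * Log level strategy:
 * - DEBUG: full prompt and output (development/test environments)
 * - INFO: summaries of key steps (production)
 * - WARN: fallbacks, retries, quality anomalies
 * - ERROR: system failures
 */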
@Service
@RequiredArgsConstructor
@Slf4j
public class LlmStructuredLogger {
    
    private final ObjectMapper mapper;
    
    /**
     * Log the start of a request
     */
    public void logRequestStart(
            String requestId, String userId, String feature, String userMessage) {
        
        Map<String, Object> entry = new LinkedHashMap<>();
        entry.put("event", "request.start");
        entry.put("requestId", requestId);
        entry.put("userId", userId);
        entry.put("feature", feature);
        entry.put("inputLength", userMessage.length());
        
        // In production (debug logging disabled), record only a truncated preview
        if (log.isDebugEnabled()) {
            entry.put("userMessage", userMessage);
        } else {
            entry.put("userMessagePreview", truncate(userMessage, 100));
        }
        
        entry.put("timestamp", System.currentTimeMillis());
        
        log.info("LLM_REQUEST: {}", toJson(entry));
    }
    
    /**
     * Log the RAG retrieval result
     */
    public void logRetrievalResult(
            String requestId, List<RetrievedDocument> docs, long latencyMs) {
        
        Map<String, Object> entry = new LinkedHashMap<>();
        entry.put("event", "retrieval.complete");
        entry.put("requestId", requestId);
        entry.put("resultCount", docs.size());
        entry.put("latencyMs", latencyMs);
        
        if (!docs.isEmpty()) {
            entry.put("topScore", docs.get(0).getScore());
            
            // In debug mode, also record a preview of the top-ranked document
            if (log.isDebugEnabled()) {
                entry.put("topDocPreview", truncate(docs.get(0).getContent(), 200));
            }
        }
        
        log.info("RAG_RETRIEVAL: {}", toJson(entry));
    }
    
    /**
     * Log the result of a model call
     */
    public void logModelCallResult(
            String requestId, String modelId,
            int inputTokens, int outputTokens, 
            long latencyMs, boolean isError) {
        
        Map<String, Object> entry = new LinkedHashMap<>();
        entry.put("event", "model.call.complete");
        entry.put("requestId", requestId);
        entry.put("model", modelId);
        entry.put("inputTokens", inputTokens);
        entry.put("outputTokens", outputTokens);
        entry.put("totalTokens", inputTokens + outputTokens);
        entry.put("latencyMs", latencyMs);
        entry.put("isError", isError);
        entry.put("estimatedCostUsd", 
            inputTokens * 5.0 / 1_000_000 + outputTokens * 15.0 / 1_000_000);
        
        if (isError) {
            log.warn("MODEL_CALL_ERROR: {}", toJson(entry));
        } else {
            log.info("MODEL_CALL: {}", toJson(entry));
        }
    }
    
    /**
     * Log a quality issue
     * 
     * Call this when a potential quality problem is detected
     */
    public void logQualityIssue(
            String requestId, String issueType, String details) {
        
        Map<String, Object> entry = new LinkedHashMap<>();
        entry.put("event", "quality.issue");
        entry.put("requestId", requestId);
        entry.put("issueType", issueType);
        entry.put("details", details);
        entry.put("timestamp", System.currentTimeMillis());
        
        log.warn("QUALITY_ISSUE: {}", toJson(entry));
    }
    
    private String truncate(String s, int maxLen) {
        if (s == null) return "";
        return s.length() > maxLen ? s.substring(0, maxLen) + "..." : s;
    }
    
    private String toJson(Map<String, Object> map) {
        try {
            return mapper.writeValueAsString(map);
        } catch (Exception e) {
            return map.toString();
        }
    }
}

Key Metrics Monitoring

import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.DistributionSummary;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import java.util.concurrent.TimeUnit;
import lombok.extern.slf4j.Slf4j;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

/**
 * Metrics collector for LLM applications
 * 
 * Exposes metrics to systems such as Prometheus/Grafana
 */
@Component
@Slf4j
public class LlmMetricsCollector {
    
    private final MeterRegistry meterRegistry;
    
    // Latency distributions (the model-call timer is tagged per model, see recordModelCall)
    private final Timer retrievalLatency;
    private final Timer e2eLatency;
    
    // Counters
    private final Counter totalRequests;
    private final Counter errorRequests;
    private final Counter tokenUsageCounter;
    
    // Distribution summaries (for percentile statistics)
    private final DistributionSummary inputTokensDistribution;
    private final DistributionSummary outputTokensDistribution;
    
    // Explicit constructor, so Lombok's @RequiredArgsConstructor is not used on this class
    public LlmMetricsCollector(MeterRegistry meterRegistry) {
        this.meterRegistry = meterRegistry;
        
        // "llm.model_call.duration" is not registered here without tags: a Prometheus
        // registry expects every meter with the same name to carry the same tag keys
        
        this.retrievalLatency = Timer.builder("rag.retrieval.duration")
            .description("RAG retrieval latency")
            .publishPercentiles(0.5, 0.9, 0.99)
            .register(meterRegistry);
        
        this.e2eLatency = Timer.builder("llm.request.duration")
            .description("End-to-end request latency")
            .publishPercentiles(0.5, 0.9, 0.99)
            .register(meterRegistry);
        
        this.totalRequests = Counter.builder("llm.requests.total")
            .description("Total number of requests")
            .register(meterRegistry);
        
        this.errorRequests = Counter.builder("llm.requests.errors")
            .description("Number of failed requests")
            .register(meterRegistry);
        
        this.tokenUsageCounter = Counter.builder("llm.tokens.total")
            .description("Cumulative token usage")
            .register(meterRegistry);
        
        this.inputTokensDistribution = DistributionSummary.builder("llm.input_tokens")
            .description("Distribution of input tokens per request")
            .publishPercentiles(0.5, 0.9, 0.99)
            .register(meterRegistry);
        
        this.outputTokensDistribution = DistributionSummary.builder("llm.output_tokens")
            .description("Distribution of output tokens per request")
            .publishPercentiles(0.5, 0.9, 0.99)
            .register(meterRegistry);
    }
    
    public void recordModelCall(String modelId, long latencyMs, 
                                 int inputTokens, int outputTokens, boolean isError) {
        
        // Record the measured latency. register() returns the existing timer
        // when one with the same name and tags has already been created.
        Timer.builder("llm.model_call.duration")
            .description("LLM model call latency")
            .tag("model", modelId)
            .tag("error", String.valueOf(isError))
            .publishPercentiles(0.5, 0.9, 0.95, 0.99)
            .register(meterRegistry)
            .record(latencyMs, TimeUnit.MILLISECONDS);
        
        tokenUsageCounter.increment(inputTokens + outputTokens);
        inputTokensDistribution.record(inputTokens);
        outputTokensDistribution.record(outputTokens);
        
        // Cost estimate in micro-USD (USD x 1,000,000), kept as an integer to avoid precision issues.
        // 5 USD per 1M input tokens = 5 micro-USD per token; 15 USD per 1M output tokens = 15 per token.
        long costMicroUsd = inputTokens * 5L + outputTokens * 15L;
        meterRegistry.counter("llm.cost.micro_usd", "model", modelId)
            .increment(costMicroUsd);
    }
    
    public void recordRequest(String feature, boolean isError, long e2eLatencyMs) {
        totalRequests.increment();
        if (isError) errorRequests.increment();
        
        e2eLatency.record(e2eLatencyMs, java.util.concurrent.TimeUnit.MILLISECONDS);
        
        meterRegistry.counter("llm.requests.by_feature", "feature", feature)
            .increment();
    }
    
    public void recordUserFeedback(String feature, boolean positive) {
        meterRegistry.counter("llm.user_feedback",
            "feature", feature,
            "sentiment", positive ? "positive" : "negative"
        ).increment();
    }
    
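    /**
     * Check key health metrics (used for alerting)
     */
    @Scheduled(fixedDelay = 60_000)  // check once per minute
    public void checkHealthMetrics() {
        
        // Error rate over the last 5 minutes
        double errorRate = 0;  // query from Prometheus; simplified here
        
        if (errorRate > 0.05) {  // error rate above 5%
            // SLF4J only understands {} placeholders, so format the number first
            log.warn("LLM application error-rate alert: errorRate={}%",
                String.format("%.2f", errorRate * 100));
        }
        
        // Check P99 latency
        double p99Latency = 0;  // query from Prometheus
        if (p99Latency > 30_000) {  // P99 latency above 30 seconds
            log.warn("LLM application latency alert: p99Latency={}ms", p99Latency);
        }
    }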
}
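
The health check above leaves the error rate as a placeholder to be queried from Prometheus. Where an in-process value is good enough, a sliding-window counter is one way to fill that gap. This is a minimal sketch, not the article's implementation: the class `SlidingWindowErrorRate` is hypothetical, and it assumes events are recorded with non-decreasing timestamps.

```java
import java.util.ArrayDeque;
import java.util.Deque;

final class SlidingWindowErrorRate {

    private record Event(long timestampMs, boolean error) {}

    private final long windowMs;
    private final Deque<Event> events = new ArrayDeque<>();
    private int errors = 0;

    SlidingWindowErrorRate(long windowMs) {
        this.windowMs = windowMs;
    }

    /** Record one finished request. Timestamps are assumed to be non-decreasing. */
    synchronized void record(long timestampMs, boolean error) {
        events.addLast(new Event(timestampMs, error));
        if (error) errors++;
    }

    /** Error rate among the requests recorded within the last windowMs milliseconds. */
    synchronized double errorRate(long nowMs) {
        // Evict events that have fallen out of the window.
        while (!events.isEmpty() && events.peekFirst().timestampMs() <= nowMs - windowMs) {
            if (events.pollFirst().error()) errors--;
        }
        return events.isEmpty() ? 0.0 : (double) errors / events.size();
    }

    public static void main(String[] args) {
        SlidingWindowErrorRate window = new SlidingWindowErrorRate(5 * 60_000L); // 5 minutes
        window.record(0, true);
        window.record(100_000, false);
        window.record(200_000, false);
        window.record(250_000, true);
        double rate = window.errorRate(260_000);
        if (rate > 0.05) { // same 5% threshold as the health check
            System.out.println("error-rate alert: " + String.format("%.2f", rate * 100) + "%");
        }
    }
}
```

With several application instances, each instance only sees its own traffic, so a query against the metrics backend remains the better source for fleet-wide alerts.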

Practical Recommendations

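Observability needs all three pillars: logs, traces, and metrics

Logs tell you "what happened", traces tell you "which steps a request went through", and metrics tell you "how healthy the system is overall". With only logs, you know something went wrong but cannot see the whole picture. With only metrics, you know there is a problem but cannot find a concrete case. With only traces, you can pin down a single failure but cannot tell whether it is widespread. Only the three together give you complete observability.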

Do not log full prompts in production unless compliance auditing requires it

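A full prompt may contain sensitive user information. In production, log only the prompt length and key metadata by default (which knowledge base was used, which template). Turn on detailed logging only for specific users, and only when you need to debug a specific problem. This protects user privacy and also keeps log storage costs from exploding: a full prompt can be several KB, so a million requests a day means several GB of logs per day, which grows into terabytes over a year.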
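
One way to express this policy in code is a small gate that always emits metadata and adds the prompt text only for users on a debug allowlist. This is a sketch under stated assumptions: `PromptLogPolicy`, the allowlist, and the field names `templateId` and `knowledgeBase` are hypothetical and only illustrate the rule described above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

final class PromptLogPolicy {

    // Users for whom detailed prompt logging has been switched on temporarily.
    private final Set<String> debugUserIds;

    PromptLogPolicy(Set<String> debugUserIds) {
        this.debugUserIds = Set.copyOf(debugUserIds);
    }

    /** Build the log entry: metadata always, full prompt only for allowlisted users. */
    Map<String, Object> buildEntry(
            String userId, String templateId, String knowledgeBase, String prompt) {
        Map<String, Object> entry = new LinkedHashMap<>();
        entry.put("event", "prompt.assembled");
        entry.put("templateId", templateId);
        entry.put("knowledgeBase", knowledgeBase);
        entry.put("promptLength", prompt.length());
        // Immutable sets reject null lookups, so guard against a missing user id.
        if (userId != null && debugUserIds.contains(userId)) {
            entry.put("prompt", prompt);
        }
        return entry;
    }

    public static void main(String[] args) {
        PromptLogPolicy policy = new PromptLogPolicy(Set.of("u-debug"));
        System.out.println(policy.buildEntry("u-1", "qa-v2", "kb-docs", "secret prompt"));
        System.out.println(policy.buildEntry("u-debug", "qa-v2", "kb-docs", "secret prompt"));
    }
}
```

In a real service the allowlist would likely come from dynamic configuration with an expiry, so that detailed logging does not stay enabled after the investigation ends.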

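Build a golden path for tracing a problem back to its request

When a user reports "the AI's answer is wrong", you should be able to find the following within 30 seconds: the request_id of that request, what was retrieved for it, and a summary of the full prompt that was sent to the LLM. This requires: (1) every request has a globally unique request_id that runs through all of its logs; (2) user feedback is linked to the request_id (the frontend records the ID of the current request); (3) logs are easy to query by request_id (using Elasticsearch or a similar tool). This path can cut the time to locate a problem from hours to minutes.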
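
The three requirements can be sketched in a few lines. In this minimal, hypothetical example an in-memory map stands in for Elasticsearch, `RequestTrail` and its method names are made up for illustration, and events are plain strings rather than structured log entries.

```java
import java.util.List;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

final class RequestTrail {

    // Stand-in for a log store that is indexed by request_id.
    private final Map<String, List<String>> eventsByRequestId = new ConcurrentHashMap<>();

    /** (1) A globally unique request_id, created once at the start of the request. */
    String newRequestId() {
        return UUID.randomUUID().toString();
    }

    /** Every log event carries the request_id it belongs to. */
    void log(String requestId, String event) {
        eventsByRequestId
            .computeIfAbsent(requestId, id -> new CopyOnWriteArrayList<>())
            .add(event);
    }

    /** (2) Feedback sent by the frontend is attached to the same request_id. */
    void recordFeedback(String requestId, boolean positive) {
        log(requestId, "feedback:" + (positive ? "positive" : "negative"));
    }

    /** (3) Everything known about one request, in the order it was logged. */
    List<String> trace(String requestId) {
        return List.copyOf(eventsByRequestId.getOrDefault(requestId, List.of()));
    }

    public static void main(String[] args) {
        RequestTrail trail = new RequestTrail();
        String requestId = trail.newRequestId();
        trail.log(requestId, "request.start");
        trail.log(requestId, "retrieval.complete");
        trail.log(requestId, "model.call.complete");
        trail.recordFeedback(requestId, false);
        trail.trace(requestId).forEach(System.out::println);
    }
}
```

The same idea carries over to real logging stacks: put the request_id on every log line, return it to the frontend, and send it back with the feedback call.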