Spring AI Observability: Tracing the Full AI Call Chain with Micrometer

Lao Zhang · 2026/4/30 · about 10 minutes


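Who this is for: developers with 1-5 years of Java experience who want to move toward AI engineering
Reading time: about 18 minutes
What you will get out of it:
Master a complete Spring AI + Micrometer observability setup
Learn to track core metrics such as Token consumption, latency, and error rate
Be able to build end-to-end monitoring for AI calls on your own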

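3 a.m.: Lao Wang's Alert Storm

Lao Wang is a Java architect at an online education platform. Their AI Q&A system had been in production for three months, handling about 100,000 calls a day.

At 3 a.m. one night, a flood of alert text messages woke him up: users were reporting that AI answers had slowed down, from an average of 2 seconds to more than 10. He went through the logs again and again and found nothing but ordinary Spring Boot output. No exception stack traces. It was just slow.

"Where exactly is it slow? My business code? The network? Or OpenAI itself?"

He had no idea. After four hours of digging, he finally found the cause: a very long piece of user input had ended up in his Prompt template, pushing the Token count per request from 500 to 5,000. Those four hours were entirely avoidable, if only he had had a decent observability setup in place.

Afterwards he came to me and said something I have remembered ever since: "When an ordinary Java service breaks, I can at least look at thread dumps and slow SQL queries. When an AI service breaks, I am flying blind."

This article is here so that you do not end up like Lao Wang.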

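Why Observability Is Especially Hard for AI Services

For traditional Java services we already have a mature stack: Prometheus + Grafana for metrics, Jaeger/Zipkin for tracing, ELK for log aggregation. AI services, however, differ in several ways:

1. Opaque cost: every HTTP call consumes Tokens, and Tokens translate directly into money. Without instrumentation you do not know how much each call costs.

2. Different latency profile: an ordinary endpoint's latency tends to cluster in a fairly narrow band, whereas AI endpoint latency is strongly correlated with the number of output Tokens, and P99 can be 10 times P50.

3. More kinds of errors: besides HTTP 5xx there are AI-specific errors such as rate limit, context too long, and content policy.

4. Prompt quality is hard to quantify: for the same question, different Prompts can produce wildly different results, yet you cannot tell which Prompt works better.

Spring AI 1.0 does a good job here: it ships with built-in Micrometer integration, so we can handle these problems with a single observability framework.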
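
To make point 3 concrete, here is a minimal sketch of how AI-specific failures could be folded into a small, fixed set of categories suitable for a metric tag. The class name `AiErrorClassifier`, the category names, and the keyword strings are all my own assumptions for illustration; they are not part of Spring AI, and you would need to match them against what your provider actually returns.

```java
import java.util.Locale;

/** Hypothetical helper: maps a failure to a coarse, low-cardinality error category. */
final class AiErrorClassifier {

    enum Category { RATE_LIMIT, CONTEXT_TOO_LONG, CONTENT_POLICY, SERVER_ERROR, OTHER }

    /**
     * Classifies by HTTP status and by keywords in the error message.
     * The keyword strings are assumptions, not a provider contract.
     */
    static Category classify(int httpStatus, String message) {
        String msg = message == null ? "" : message.toLowerCase(Locale.ROOT);
        if (httpStatus == 429 || msg.contains("rate limit")) {
            return Category.RATE_LIMIT;
        }
        if (msg.contains("context length")) {
            return Category.CONTEXT_TOO_LONG;
        }
        if (msg.contains("content policy") || msg.contains("content_filter")) {
            return Category.CONTENT_POLICY;
        }
        if (httpStatus >= 500 && httpStatus <= 599) {
            return Category.SERVER_ERROR;
        }
        return Category.OTHER;
    }

    public static void main(String[] args) {
        System.out.println(classify(429, "Too Many Requests"));
        System.out.println(classify(400, "This model's maximum context length is 128000 tokens"));
        System.out.println(classify(503, null));
    }
}
```

Keeping the category set small matters because every distinct tag value becomes its own time series in Prometheus.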

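Overall Architecture

[Figure: overall architecture of the solution]

[Figure: sequence diagram of the call chain]

Environment Setup

Maven dependencies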

<!-- pom.xml -->
<properties>
    <java.version>17</java.version>
    <spring-ai.version>1.0.0</spring-ai.version>
</properties>

<dependencies>
    <!-- Spring AI core (the OpenAI starter is named spring-ai-starter-model-openai as of 1.0.0) -->
    <dependency>
        <groupId>org.springframework.ai</groupId>
        <artifactId>spring-ai-starter-model-openai</artifactId>
        <version>${spring-ai.version}</version>
    </dependency>

    <!-- Micrometer Prometheus metrics export -->
    <dependency>
        <groupId>io.micrometer</groupId>
        <artifactId>micrometer-registry-prometheus</artifactId>
    </dependency>

    <!-- Micrometer Tracing core (Brave bridge) -->
    <dependency>
        <groupId>io.micrometer</groupId>
        <artifactId>micrometer-tracing-bridge-brave</artifactId>
    </dependency>

    <!-- Span reporting to Zipkin -->
    <dependency>
        <groupId>io.zipkin.reporter2</groupId>
        <artifactId>zipkin-reporter-brave</artifactId>
    </dependency>

    <!-- Spring Boot Actuator -->
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-actuator</artifactId>
    </dependency>

    <!-- AOP support (for the custom aspect) -->
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-aop</artifactId>
    </dependency>
</dependencies>

application.yml configuration

spring:
  ai:
    openai:
      api-key: ${OPENAI_API_KEY}
      chat:
        options:
          model: gpt-4o-mini
          temperature: 0.7
  application:
    name: ai-observability-demo

management:
  endpoints:
    web:
      exposure:
        include: health,metrics,prometheus,info
  prometheus:
    metrics:
      export:
        enabled: true  # Spring Boot 3.x property path for the Prometheus exporter
  metrics:
    tags:
      application: ${spring.application.name}
      env: ${SPRING_PROFILES_ACTIVE:local}
  tracing:
    sampling:
      probability: 1.0  # consider 0.1 in production
  zipkin:
    tracing:
      endpoint: http://localhost:9411/api/v2/spans

logging:
  pattern:
    level: "%5p [${spring.application.name:},%X{traceId:-},%X{spanId:-}]"

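Core Implementation: A Custom AI Observability Aspect

Spring AI has basic Micrometer integration built in, but we want finer-grained control. Below is the complete approach I use in production: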

package com.laozhang.ai.observability;

import io.micrometer.core.instrument.*;
import io.micrometer.tracing.Tracer;
import io.micrometer.tracing.Span;
import lombok.RequiredArgsConstructor;
import lombok.extern.slf4j.Slf4j;
import org.aspectj.lang.ProceedingJoinPoint;
import org.aspectj.lang.annotation.Around;
import org.aspectj.lang.annotation.Aspect;
import org.springframework.ai.chat.model.ChatResponse;
import org.springframework.ai.chat.metadata.Usage;
import org.springframework.stereotype.Component;

import java.util.concurrent.TimeUnit;

/**
 * Observability aspect for AI calls.
 * Automatically collects Token consumption, latency, error counts, and related metrics.
 */
@Aspect
@Component
@Slf4j
@RequiredArgsConstructor
public class AIObservabilityAspect {

    private final MeterRegistry meterRegistry;
    private final Tracer tracer;

    // Metric name constants
    private static final String METRIC_AI_CALLS = "ai.chat.calls";
    private static final String METRIC_AI_TOKENS = "ai.chat.tokens";
    private static final String METRIC_AI_DURATION = "ai.chat.duration";
    private static final String METRIC_AI_ERRORS = "ai.chat.errors";

    /**
     * Intercepts every method annotated with @AIObserved.
     */
    @Around("@annotation(aiObserved)")
    public Object observeAICall(ProceedingJoinPoint joinPoint, 
                                 AIObserved aiObserved) throws Throwable {
        
        String operationName = aiObserved.value().isEmpty() 
            ? joinPoint.getSignature().getName() 
            : aiObserved.value();
        
        // Create the tracing span
        Span span = tracer.nextSpan()
            .name("ai.chat." + operationName)
            .tag("ai.operation", operationName)
            .tag("ai.model", aiObserved.model())
            .start();

        long startTime = System.currentTimeMillis();
        
        // Counter: number of calls
        Counter.builder(METRIC_AI_CALLS)
            .tag("operation", operationName)
            .tag("model", aiObserved.model())
            .description("Total number of AI calls")
            .register(meterRegistry)
            .increment();

        try (var ignored = tracer.withSpan(span)) {
            Object result = joinPoint.proceed();
            
            long duration = System.currentTimeMillis() - startTime;
            
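            // If the return value is a ChatResponse, extract Token usage.
            // Methods returning anything else (for example a String) record no Token metrics.
            if (result instanceof ChatResponse chatResponse) {
                recordTokenMetrics(chatResponse, operationName, aiObserved.model());
            }
            
            // Record latency of the successful call. Histogram buckets are published so that
            // Prometheus has *_bucket series for histogram_quantile() to work on.
            Timer.builder(METRIC_AI_DURATION)
                .tag("operation", operationName)
                .tag("model", aiObserved.model())
                .tag("status", "success")
                .description("AI call duration")
                .publishPercentileHistogram()
                .register(meterRegistry)
                .record(duration, TimeUnit.MILLISECONDS);
            
            span.tag("ai.duration_ms", String.valueOf(duration));
            span.event("ai.call.success");
            
            log.info("[AI observability] operation={}, model={}, duration={}ms", 
                operationName, aiObserved.model(), duration);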
            
            return result;
            
        } catch (Throwable e) {
            long duration = System.currentTimeMillis() - startTime;
            
            // Record the error metric
            Counter.builder(METRIC_AI_ERRORS)
                .tag("operation", operationName)
                .tag("model", aiObserved.model())
                .tag("error_type", e.getClass().getSimpleName())
                .description("Number of failed AI calls")
                .register(meterRegistry)
                .increment();
            
            // Record latency of failed calls too, so the latency distribution under errors can be analyzed
            Timer.builder(METRIC_AI_DURATION)
                .tag("operation", operationName)
                .tag("model", aiObserved.model())
                .tag("status", "error")
                .publishPercentileHistogram()
                .register(meterRegistry)
                .record(duration, TimeUnit.MILLISECONDS);
            
            // span.error() attaches the exception without passing a possibly-null getMessage() as a tag value
            span.error(e);
            span.tag("error_type", e.getClass().getSimpleName());
            span.event("ai.call.error");
            
            log.error("[AI observability] operation={}, error={}, duration={}ms", 
                operationName, e.getMessage(), duration);
            
            throw e;
        } finally {
            span.end();
        }
    }

    /**
     * Records Token-related metrics.
     */
    private void recordTokenMetrics(ChatResponse response, 
                                     String operation, String model) {
        if (response.getMetadata() == null) return;
        
        Usage usage = response.getMetadata().getUsage();
        if (usage == null) return;
        
        // Input Tokens
        if (usage.getPromptTokens() != null) {
            Counter.builder(METRIC_AI_TOKENS)
                .tag("operation", operation)
                .tag("model", model)
                .tag("type", "input")
                .description("AI input Tokens consumed")
                .register(meterRegistry)
                .increment(usage.getPromptTokens());
        }
        
        // Output Tokens (exposed as getCompletionTokens() in Spring AI 1.0)
        if (usage.getCompletionTokens() != null) {
            Counter.builder(METRIC_AI_TOKENS)
                .tag("operation", operation)
                .tag("model", model)
                .tag("type", "output")
                .description("AI output Tokens consumed")
                .register(meterRegistry)
                .increment(usage.getCompletionTokens());
        }
        
        // Total Tokens (convenient for cost calculation)
        if (usage.getTotalTokens() != null) {
            Counter.builder(METRIC_AI_TOKENS)
                .tag("operation", operation)
                .tag("model", model)
                .tag("type", "total")
                .register(meterRegistry)
                .increment(usage.getTotalTokens());
            
            log.debug("[Token stats] operation={}, input={}, output={}, total={}", 
                operation, 
                usage.getPromptTokens(),
                usage.getCompletionTokens(),
                usage.getTotalTokens());
        }
    }
}

Custom Annotation

package com.laozhang.ai.observability;

import java.lang.annotation.*;

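/**
 * Marks a method to be intercepted by the AI observability aspect.
 * 
 * Usage example (return ChatResponse so that Token usage can be recorded):
 * @AIObserved(value = "question-answering", model = "gpt-4o-mini")
 * public ChatResponse answerQuestion(String question) { ... }
 */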
@Target(ElementType.METHOD)
@Retention(RetentionPolicy.RUNTIME)
@Documented
public @interface AIObserved {
    
    /**
     * Operation name, used as a metric tag.
     * Defaults to the method name when left empty.
     */
    String value() default "";
    
    /**
     * Name of the model in use, used as a metric tag.
     */
    String model() default "gpt-4o-mini";
}
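
The fallback rule for the operation name (explicit `value` wins, otherwise the method name) can be tried out without Spring. The sketch below uses a local stand-in annotation called `Observed` and a demo class of my own naming, both hypothetical, and reads the annotation by plain reflection rather than through an aspect.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;

final class OperationNameDemo {

    // Local stand-in for @AIObserved so this sketch compiles on its own (hypothetical name).
    @Target(ElementType.METHOD)
    @Retention(RetentionPolicy.RUNTIME)
    @interface Observed {
        String value() default "";
        String model() default "gpt-4o-mini";
    }

    @Observed(value = "question-answering")
    String answerQuestion(String question) { return question; }

    @Observed(model = "gpt-4o")
    String gradeEssay(String essay) { return essay; }

    /** Looks up a single-String-argument method of this class by name. */
    static Method method(String name) {
        try {
            return OperationNameDemo.class.getDeclaredMethod(name, String.class);
        } catch (NoSuchMethodException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Same rule as in the aspect: an explicit value wins, otherwise fall back to the method name. */
    static String operationName(Method method) {
        Observed observed = method.getAnnotation(Observed.class);
        if (observed == null) {
            throw new IllegalArgumentException("method is not annotated: " + method.getName());
        }
        return observed.value().isEmpty() ? method.getName() : observed.value();
    }

    public static void main(String[] args) {
        System.out.println(operationName(method("answerQuestion"))); // explicit value
        System.out.println(operationName(method("gradeEssay")));     // falls back to method name
    }
}
```

This only works because the annotation has `RetentionPolicy.RUNTIME`; with the default retention it would not be visible to reflection or to the aspect.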

Using It in a Business Service

package com.laozhang.ai.service;

import com.laozhang.ai.observability.AIObserved;
import lombok.RequiredArgsConstructor;
import lombok.extern.slf4j.Slf4j;
import org.springframework.ai.chat.model.ChatModel;
import org.springframework.ai.chat.model.ChatResponse;
import org.springframework.ai.chat.prompt.Prompt;
import org.springframework.ai.chat.prompt.PromptTemplate;
import org.springframework.ai.chat.messages.UserMessage;
import org.springframework.ai.chat.messages.SystemMessage;
import org.springframework.stereotype.Service;

import java.util.List;
import java.util.Map;

/**
 * AI Q&A service for an education product.
 * Shows how to use AI observability from business code.
 */
@Service
@Slf4j
@RequiredArgsConstructor
public class EducationAIService {

    private final ChatModel chatModel;

    /**
     * Q&A - tracked automatically through the @AIObserved aspect.
     * Returns the ChatResponse itself so the aspect can read Token usage from its metadata;
     * callers get the text via response.getResult().getOutput().getText().
     */
    @AIObserved(value = "question-answering", model = "gpt-4o-mini")
    public ChatResponse answerQuestion(String subject, String question) {
        SystemMessage systemMessage = new SystemMessage(
            "You are a professional " + subject + " teacher. Answer the student's question in concise, easy-to-understand language."
        );
        UserMessage userMessage = new UserMessage(question);
        
        Prompt prompt = new Prompt(List.of(systemMessage, userMessage));
        return chatModel.call(prompt);
    }

    /**
     * Essay grading - another tracked operation.
     * Also returns the ChatResponse so that Token usage is recorded by the aspect.
     */
    @AIObserved(value = "essay-grading", model = "gpt-4o")
    public ChatResponse gradeEssay(String essay) {
        String template = """
            Please grade the following essay. Give a score (out of 100) and detailed comments:
            
            Essay:
            {essay}
            
            Please use the following output format:
            Score: XX
            Comments: ...
            Suggestions for improvement: ...
            """;
        
        PromptTemplate promptTemplate = new PromptTemplate(template);
        Prompt prompt = promptTemplate.create(Map.of("essay", essay));
        
        return chatModel.call(prompt);
    }
}

Token Cost Calculator

This is a practical utility I wrote for Lao Wang. It reports the cost of AI calls in real time:

package com.laozhang.ai.cost;

import io.micrometer.core.instrument.MeterRegistry;
import lombok.RequiredArgsConstructor;
import org.springframework.stereotype.Component;

import java.util.HashMap;
import java.util.Map;

/**
 * AI call cost calculator.
 * Computes Token cost in real time from Micrometer metrics.
 */
@Component
@RequiredArgsConstructor
public class AICostCalculator {

    private final MeterRegistry meterRegistry;

    // Price per 1,000 Tokens for each model (USD), 2024 reference prices
    private static final Map<String, ModelPricing> MODEL_PRICING = new HashMap<>();
    
    static {
        // GPT-4o-mini: the cheapest, first choice for everyday Q&A
        MODEL_PRICING.put("gpt-4o-mini", new ModelPricing(0.000150, 0.000600));
        // GPT-4o: the strongest, for complex tasks
        MODEL_PRICING.put("gpt-4o", new ModelPricing(0.005, 0.015));
        // Claude 3.5 Sonnet
        MODEL_PRICING.put("claude-3-5-sonnet", new ModelPricing(0.003, 0.015));
    }

    /**
     * Returns a live cost report for the given operation and model.
     */
    public CostReport getCostReport(String operation, String model) {
        ModelPricing pricing = MODEL_PRICING.getOrDefault(model, 
            new ModelPricing(0.001, 0.002));
        
        // Read the Token counters from Micrometer
        double inputTokens = getCounterValue("ai.chat.tokens", operation, model, "input");
        double outputTokens = getCounterValue("ai.chat.tokens", operation, model, "output");
        double totalCalls = getCounterValue("ai.chat.calls", operation, model, null);
        double errorCalls = getCounterValue("ai.chat.errors", operation, model, null);
        
        // Compute the cost (priced per 1,000 Tokens)
        double inputCost = (inputTokens / 1000.0) * pricing.inputPricePer1k();
        double outputCost = (outputTokens / 1000.0) * pricing.outputPricePer1k();
        double totalCost = inputCost + outputCost;
        
        // Average Tokens per call and error rate (percent)
        double avgTokensPerCall = totalCalls > 0 ? (inputTokens + outputTokens) / totalCalls : 0;
        double errorRate = totalCalls > 0 ? errorCalls / totalCalls * 100 : 0;
        
        return new CostReport(
            operation, model,
            (long) totalCalls, (long) errorCalls,
            (long) inputTokens, (long) outputTokens,
            inputCost, outputCost, totalCost,
            avgTokensPerCall, errorRate
        );
    }

    private double getCounterValue(String metricName, String operation, 
                                    String model, String type) {
        var search = meterRegistry.find(metricName)
            .tag("operation", operation)
            .tag("model", model);
        
        if (type != null) {
            search = search.tag("type", type);
        }
        
        // Sum over every matching counter: ai.chat.errors has one counter per error_type,
        // so reading a single counter would undercount.
        return search.counters().stream()
            .mapToDouble(counter -> counter.count())
            .sum();
    }

    public record ModelPricing(double inputPricePer1k, double outputPricePer1k) {}
    
    public record CostReport(
        String operation,
        String model,
        long totalCalls,
        long errorCalls,
        long inputTokens,
        long outputTokens,
        double inputCostUSD,
        double outputCostUSD,
        double totalCostUSD,
        double avgTokensPerCall,
        double errorRate
    ) {
        @Override
        public String toString() {
            return String.format(
                "[Cost report] operation=%s, model=%s, calls=%d, errorRate=%.1f%%, " +
                "inputTokens=%d, outputTokens=%d, totalCost=$%.4f, avgTokensPerCall=%.0f",
                operation, model, totalCalls, errorRate,
                inputTokens, outputTokens, totalCostUSD, avgTokensPerCall
            );
        }
    }
}
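
To see what the per-1,000-Token formula means in practice, here is a small worked example using the numbers from Lao Wang's incident: 100,000 calls a day and a jump from 500 to 5,000 Tokens per request. As a simplifying assumption of mine, all of those Tokens are treated as input Tokens priced at the gpt-4o-mini input rate from `MODEL_PRICING`; the class name `TokenCostExample` is hypothetical.

```java
import java.util.Locale;

final class TokenCostExample {

    /** Cost in USD for a number of Tokens at a given price per 1,000 Tokens. */
    static double costUsd(long tokens, double pricePer1k) {
        return tokens / 1000.0 * pricePer1k;
    }

    public static void main(String[] args) {
        long callsPerDay = 100_000L;       // daily volume from the story
        double inputPricePer1k = 0.00015;  // gpt-4o-mini input price used in MODEL_PRICING

        // Assumption: every Token in the request is an input Token.
        double before = costUsd(500 * callsPerDay, inputPricePer1k);
        double after = costUsd(5_000 * callsPerDay, inputPricePer1k);

        // Prints: before=$7.50/day, after=$75.00/day
        System.out.println(String.format(Locale.ROOT,
            "before=$%.2f/day, after=$%.2f/day", before, after));
    }
}
```

A tenfold increase in Tokens per request is a tenfold increase in daily spend, which is exactly the kind of change a Token counter makes visible long before the bill arrives.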

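Comparison: With and Without Observability

Scenario	Without observability	With observability
Diagnosing performance problems	Guesswork, possibly 4 hours	Look at the latency distribution, pinpoint in 10 minutes
Cost overruns	You find out when the monthly bill arrives	Real-time alerts as soon as a threshold is crossed
Rising error rate	You find out when users complain	Error rate charts with automatic alerts
Effect of Prompt optimization	Cannot be quantified	Change in Token count is visible at a glance
Capacity planning	Estimated from experience	Decisions based on actual call volume data
Incident review	No visibility into the call chain	Full Trace, fast root cause analysis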

Grafana Dashboard Configuration

The core Prometheus queries, ready to use in Grafana:

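# Calls per minute (rate() is per second, so multiply by 60)
sum(rate(ai_chat_calls_total[1m])) * 60

# P95 latency (milliseconds)
histogram_quantile(0.95, sum by (le) (rate(ai_chat_duration_seconds_bucket[5m]))) * 1000

# Error rate (percent); both sides are aggregated because the error counter
# carries an extra error_type label that the call counter does not have
sum(rate(ai_chat_errors_total[5m])) / sum(rate(ai_chat_calls_total[5m])) * 100

# Token consumption per hour
increase(ai_chat_tokens_total{type="total"}[1h])

# Estimated daily cost (gpt-4o-mini pricing)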
(
  increase(ai_chat_tokens_total{type="input"}[24h]) * 0.00015 / 1000
  +
  increase(ai_chat_tokens_total{type="output"}[24h]) * 0.0006 / 1000
)
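
If you wonder why the code registers `ai.chat.calls` while the queries use `ai_chat_calls_total`: the Prometheus registry rewrites meter names on export. The sketch below is a deliberately simplified illustration of that convention (dots become underscores, counters get a `_total` suffix, timers are exported in seconds). It is not Micrometer's actual implementation, and the class and method names are hypothetical.

```java
final class PrometheusNameSketch {

    /** Simplified: dots become underscores and counters get a _total suffix. */
    static String counterName(String meterName) {
        return meterName.replace('.', '_') + "_total";
    }

    /** Simplified: timers are exported in seconds; histogram buckets end in _bucket. */
    static String timerBucketName(String meterName) {
        return meterName.replace('.', '_') + "_seconds_bucket";
    }

    public static void main(String[] args) {
        System.out.println(counterName("ai.chat.calls"));        // ai_chat_calls_total
        System.out.println(counterName("ai.chat.tokens"));       // ai_chat_tokens_total
        System.out.println(timerBucketName("ai.chat.duration")); // ai_chat_duration_seconds_bucket
    }
}
```

When a query returns nothing, checking the exported names on /actuator/prometheus is usually the quickest first step.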

Alert Rule Configuration

# alerting-rules.yml - suitable for use as a Prometheus rule file
groups:
  - name: ai-service-alerts
    rules:
      # Alert when the error rate exceeds 5%
      - alert: AIHighErrorRate
        expr: |
          sum by (operation) (rate(ai_chat_errors_total[5m])) / sum by (operation) (rate(ai_chat_calls_total[5m])) > 0.05
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "AI service error rate is too high"
          description: "Error rate for operation {{ $labels.operation }} reached {{ $value | humanizePercentage }}"

      # Alert when P95 latency exceeds 10 seconds
      - alert: AIHighLatency
        expr: |
          histogram_quantile(0.95, sum by (le) (rate(ai_chat_duration_seconds_bucket[5m]))) > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "AI service latency is too high"
          description: "P95 latency reached {{ $value }} seconds"

      # Alert when hourly Token usage exceeds 100,000 (cost control)
      - alert: AIHighTokenUsage
        expr: |
          increase(ai_chat_tokens_total{type="total"}[1h]) > 100000
        labels:
          severity: warning
        annotations:
          summary: "AI Token consumption is too high"
          description: "Tokens consumed in the last hour: {{ $value }}"

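Summary

Spend one day setting up observability and you will save yourself countless 3 a.m. nightmares later.

There are three kinds of data to focus on:

Latency: P50/P95/P99. Do not look only at the average; the long tail of AI services is pronounced
Token: record input and output separately, since they map directly to cost
Errors: classify by error type; rate limit and business errors need to be handled separately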
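
To illustrate the first point, here is a minimal sketch with ten made-up latency samples (not measurements from any real system): nine requests around 2 seconds and one slow request at 12 seconds. It uses the nearest-rank percentile definition; the class name `LatencyPercentiles` is hypothetical.

```java
import java.util.Arrays;

final class LatencyPercentiles {

    /** Nearest-rank percentile: the smallest sample with at least p percent of samples at or below it. */
    static long percentile(long[] samples, double p) {
        if (samples.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        long[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }

    static double mean(long[] samples) {
        return Arrays.stream(samples).average().orElse(0.0);
    }

    public static void main(String[] args) {
        // Hypothetical latencies in milliseconds
        long[] ms = {1800, 1900, 2000, 2000, 2100, 2100, 2200, 2300, 2500, 12000};
        System.out.println("mean=" + mean(ms));           // mean=3090.0
        System.out.println("p50=" + percentile(ms, 50));  // p50=2100
        System.out.println("p95=" + percentile(ms, 95));  // p95=12000
    }
}
```

The mean of about 3 seconds describes none of the requests: the typical user waits about 2 seconds, and the unlucky one waits 12. P50 and P95 together tell that story, while the average hides it.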

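The advantage of the Spring AI + Micrometer approach is that it fits seamlessly into the existing Spring Boot ecosystem. If you already run Prometheus + Grafana, the cost of adoption is very low: add a few dependencies, write one aspect, and you can be up and running the same day.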