AI Application Monitoring Dashboard: Visualizing Core AI Service Metrics with Grafana
Audience: Java engineers whose AI application is already in production and who want to build a monitoring system. Reading time: about 16 minutes. Value: build an AI monitoring dashboard from scratch so the service's state is visible at a glance.
A true story first
Xiao Li's AI customer-service system started slowing down one Friday afternoon: user wait times rose from the usual 2 seconds to more than 10 seconds. The problem had been going on for 40 minutes before a user finally complained to support.
The post-mortem found a simple cause: token consumption had suddenly spiked (someone was hammering the API in bulk), the API rate limit kicked in, and response times shot up. The data showed a clear anomaly within 2 minutes of the incident starting, but with no monitoring system in place, nobody knew.
"With monitoring, we could have spotted and handled it in 5 minutes," Xiao Li said.
Monitoring an AI application has its own particularities: it is not just CPU, memory, and endpoint latency, but also token consumption, model quality, vector-retrieval performance, and more. Today we will build that monitoring system end to end.
The three layers of AI monitoring
Many teams only cover L1 (infrastructure monitoring: CPU, memory, endpoint latency), which is far from enough for an AI application. L2 (the AI service layer: token consumption, model latency, retrieval quality) is where most problems show up, and only L3 (business-level metrics) reflects real business value.
Technology stack
| Component | Role | Recommended version |
|---|---|---|
| Micrometer | In-process (JVM) metric collection | 1.12.x |
| Prometheus | Metric storage + alert rules | 2.x |
| Grafana | Visualization dashboards | 10.x |
| Spring Boot Actuator | Exposes the /actuator/prometheus endpoint | 3.x |
Overall architecture: the Spring Boot application collects metrics with Micrometer and exposes them at /actuator/prometheus; Prometheus scrapes that endpoint and evaluates alert rules, forwarding alerts to Alertmanager; Grafana queries Prometheus to render the dashboards.
Step 1: Instrumentation on the Spring Boot side
Dependencies
<!-- Spring Boot Actuator -->
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
<!-- Prometheus Registry -->
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-prometheus</artifactId>
</dependency>

application.yml (note: since Spring Boot 3.x the Prometheus export switch lives under management.prometheus.metrics.export):

management:
  endpoints:
    web:
      exposure:
        include: "prometheus,health,info,metrics"
  prometheus:
    metrics:
      export:
        enabled: true
  metrics:
    tags:
      application: ${spring.application.name}
      environment: ${spring.profiles.active:default}

Instrumenting the core AI metrics
@Component
@Slf4j
public class AiMetricsCollector {

    private final MeterRegistry registry;

    // Metric definitions
    private final Timer chatLatencyTimer;
    private final DistributionSummary inputTokenSummary;
    private final DistributionSummary outputTokenSummary;
    private final Counter cacheMissCounter;
    private final Counter cacheHitCounter;

    public AiMetricsCollector(MeterRegistry registry) {
        this.registry = registry;
        // Latency distribution (publishes P50/P95/P99)
        this.chatLatencyTimer = Timer.builder("ai.chat.latency")
                .description("AI chat response latency")
                .publishPercentiles(0.5, 0.95, 0.99)
                .publishPercentileHistogram()
                .register(registry);
        // Token consumption
        this.inputTokenSummary = DistributionSummary.builder("ai.tokens.input")
                .description("Number of input tokens")
                .register(registry);
        this.outputTokenSummary = DistributionSummary.builder("ai.tokens.output")
                .description("Number of output tokens")
                .register(registry);
        // Cache metrics
        this.cacheMissCounter = Counter.builder("ai.cache.miss").register(registry);
        this.cacheHitCounter = Counter.builder("ai.cache.hit").register(registry);
    }

    public void recordRequest(String userId, String modelName) {
        // Counter.increment() does not accept tags, so counters with dynamic tags are
        // resolved per call; register() returns the existing meter if the id already exists.
        // user_id is masked to keep tag cardinality under control.
        Counter.builder("ai.chat.requests.total")
                .description("Total number of AI chat requests")
                .tags("type", "chat", "user_id", maskUserId(userId), "model", modelName)
                .register(registry)
                .increment();
    }

    public void recordLatency(long latencyMs, String modelName) {
        chatLatencyTimer.record(latencyMs, TimeUnit.MILLISECONDS);
    }

    public void recordTokenUsage(long inputTokens, long outputTokens, String modelName) {
        inputTokenSummary.record(inputTokens);
        outputTokenSummary.record(outputTokens);
    }

    public void recordError(String errorType, String modelName) {
        Counter.builder("ai.chat.errors.total")
                .description("Total number of AI chat errors")
                .tags("error_type", errorType, "model", modelName)
                .register(registry)
                .increment();
    }

    public void recordCacheHit() {
        cacheHitCounter.increment();
    }

    public void recordCacheMiss() {
        cacheMissCounter.increment();
    }

    private String maskUserId(String userId) {
        // Mask the user id, keeping only the first 4 characters
        if (userId == null || userId.length() <= 4) {
            return "****";
        }
        return userId.substring(0, 4) + "****";
    }
}
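The two cache counters only show up on the dashboard if something records them. Below is a minimal sketch of the calling side, using the recordCacheHit()/recordCacheMiss() helpers on the collector above; the CachedChatService class and its plain in-memory map cache are illustrative assumptions, not part of the original setup:

@Service
@RequiredArgsConstructor
public class CachedChatService {

    private final AiMetricsCollector metricsCollector;
    private final ChatClient chatClient;
    // Hypothetical local answer cache keyed by the raw question text
    private final Map<String, String> answerCache = new ConcurrentHashMap<>();

    public String chat(String question) {
        String cached = answerCache.get(question);
        if (cached != null) {
            metricsCollector.recordCacheHit();
            return cached;
        }
        metricsCollector.recordCacheMiss();
        String answer = chatClient.prompt().user(question).call().content();
        answerCache.put(question, answer);
        return answer;
    }
}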
Collecting metrics automatically in an Advisor
@Component
@RequiredArgsConstructor
public class MetricsAdvisor implements CallAroundAdvisor {

    private final AiMetricsCollector metricsCollector;

    @Override
    public AdvisedResponse aroundCall(AdvisedRequest request, CallAroundAdvisorChain chain) {
        String modelName = extractModelName(request); // helper that resolves the model name from the request (omitted)
        long start = System.currentTimeMillis();
        metricsCollector.recordRequest("system", modelName);
        try {
            AdvisedResponse response = chain.nextAroundCall(request);
            long latency = System.currentTimeMillis() - start;
            metricsCollector.recordLatency(latency, modelName);
            // Read token usage from the response metadata
            Usage usage = response.response().getMetadata().getUsage();
            if (usage != null) {
                metricsCollector.recordTokenUsage(
                        usage.getPromptTokens(),
                        usage.getGenerationTokens(),
                        modelName
                );
            }
            return response;
        } catch (Exception e) {
            metricsCollector.recordError(e.getClass().getSimpleName(), modelName);
            throw e;
        }
    }

    @Override
    public String getName() { return "MetricsAdvisor"; }

    @Override
    public int getOrder() { return Ordered.HIGHEST_PRECEDENCE; }
}
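The advisor only takes effect once it is registered on the ChatClient. A minimal wiring sketch follows; the configuration class and bean names are illustrative, and ChatClient.Builder is the builder auto-configured by Spring AI:

@Configuration
public class ChatClientConfig {

    // Register MetricsAdvisor as a default advisor so every ChatClient call is measured.
    @Bean
    public ChatClient chatClient(ChatClient.Builder builder, MetricsAdvisor metricsAdvisor) {
        return builder
                .defaultAdvisors(metricsAdvisor)
                .build();
    }
}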
Step 2: Vector retrieval metrics
Vector retrieval is a key part of any RAG system and deserves dedicated monitoring:
@Service
@Slf4j
public class MonitoredVectorStore implements VectorStore {

    private final VectorStore delegate;
    private final MeterRegistry registry;

    private final Timer searchTimer;
    private final DistributionSummary resultCountSummary;
    private final DistributionSummary similarityScoreSummary;

    public MonitoredVectorStore(VectorStore delegate, MeterRegistry registry) {
        this.delegate = delegate;
        this.registry = registry;
        this.searchTimer = Timer.builder("ai.vector.search.latency")
                .description("Vector search latency")
                .publishPercentiles(0.5, 0.95, 0.99)
                .register(registry);
        this.resultCountSummary = DistributionSummary.builder("ai.vector.search.results")
                .description("Number of documents returned per search")
                .register(registry);
        this.similarityScoreSummary = DistributionSummary.builder("ai.vector.search.similarity")
                .description("Similarity scores of search results")
                .register(registry);
    }

    @Override
    public List<Document> similaritySearch(SearchRequest request) {
        return searchTimer.record(() -> {
            List<Document> results = delegate.similaritySearch(request);
            resultCountSummary.record(results.size());
            // Record the distribution of similarity scores
            results.forEach(doc -> {
                Object score = doc.getMetadata().get("distance");
                if (score instanceof Number) {
                    similarityScoreSummary.record(((Number) score).doubleValue());
                }
            });
            // Count empty results so they can be alerted on; the query length is
            // bucketed to the nearest 10 to keep tag cardinality low
            if (results.isEmpty()) {
                registry.counter("ai.vector.search.empty",
                        "query_length", String.valueOf(request.getQuery().length() / 10 * 10))
                    .increment();
            }
            return results;
        });
    }

    @Override
    public void add(List<Document> documents) {
        long start = System.currentTimeMillis();
        delegate.add(documents);
        registry.timer("ai.vector.add.latency")
                .record(System.currentTimeMillis() - start, TimeUnit.MILLISECONDS);
        registry.counter("ai.vector.documents.added").increment(documents.size());
    }

    // The remaining VectorStore methods (delete, etc.) simply delegate to `delegate` and are omitted here.
}
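The decorator only helps if callers actually receive MonitoredVectorStore rather than the raw store. One way to wire it explicitly is sketched below; the "rawVectorStore" qualifier is an assumption for however your real store is built, and with this style of wiring you would drop @Service from MonitoredVectorStore so the two beans do not compete:

@Configuration
public class VectorStoreConfig {

    // Wrap the underlying store so every injection point of VectorStore
    // goes through the monitored decorator.
    @Bean
    @Primary
    public VectorStore monitoredVectorStore(@Qualifier("rawVectorStore") VectorStore delegate,
                                            MeterRegistry registry) {
        return new MonitoredVectorStore(delegate, registry);
    }
}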
Step 3: Prometheus configuration
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'ai-service'
    static_configs:
      - targets: ['your-app-host:8080']
    metrics_path: '/actuator/prometheus'

rule_files:
  - 'ai_alerts.yml'

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

Alert rules:
# ai_alerts.yml
groups:
  - name: ai_service_alerts
    rules:
      # Response latency alert
      - alert: AiHighLatency
        expr: histogram_quantile(0.95, rate(ai_chat_latency_seconds_bucket[5m])) > 5
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "AI P95 response latency above 5 seconds"
          description: "Current P95 latency: {{ $value }}s"

      # Error rate alert
      - alert: AiHighErrorRate
        expr: rate(ai_chat_errors_total[5m]) / rate(ai_chat_requests_total[5m]) > 0.05
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "AI error rate above 5%"
          description: "Current error rate: {{ $value | humanizePercentage }}"

      # Abnormal token consumption alert
      - alert: AiTokenSpikeDetected
        expr: rate(ai_tokens_input_sum[5m]) > 2 * rate(ai_tokens_input_sum[30m] offset 1h)
        for: 3m
        labels:
          severity: warning
        annotations:
          summary: "Abnormal spike in token consumption"
          description: "The current input-token rate is more than twice the 30-minute rate measured one hour ago"

Step 4: Grafana dashboard design
Recommended dashboard layout
PromQL queries for the key Grafana panels:
# Requests per minute
rate(ai_chat_requests_total[1m]) * 60
# P95 response latency
histogram_quantile(0.95, rate(ai_chat_latency_seconds_bucket[5m]))
# Error rate (percentage)
rate(ai_chat_errors_total[5m]) / rate(ai_chat_requests_total[5m]) * 100
# Token consumption rate (tokens/min)
rate(ai_tokens_input_sum[1m]) * 60 + rate(ai_tokens_output_sum[1m]) * 60
# Estimated cost per hour (GPT-3.5 as an example: input $0.5/1M tokens, output $1.5/1M tokens);
# rate() is per second, so multiply by 3600 to get an hourly figure
(rate(ai_tokens_input_sum[1h]) * 0.5 + rate(ai_tokens_output_sum[1h]) * 1.5) / 1000000 * 3600
# Cache hit rate
rate(ai_cache_hit_total[5m]) / (rate(ai_cache_hit_total[5m]) + rate(ai_cache_miss_total[5m])) * 100

Step 5: Alert notification integration
@Component
@Slf4j
@RequiredArgsConstructor
public class AlertNotificationService {

    // A RestTemplate bean is assumed to be defined elsewhere (e.g. via RestTemplateBuilder)
    private final RestTemplate restTemplate;

    @Value("${alert.dingtalk.webhook:}")
    private String dingtalkWebhook;

    /**
     * Send a DingTalk alert (WeCom or Feishu work the same way).
     */
    public void sendAlert(AlertEvent event) {
        if (dingtalkWebhook.isBlank()) return;
        String markdown = String.format("""
                ## ⚠️ AI Service Alert
                **Severity**: %s
                **Alert**: %s
                **Time**: %s
                **Details**: %s
                > Please handle it promptly
                """,
                event.getSeverity(),
                event.getTitle(),
                LocalDateTime.now().format(DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")),
                event.getDescription()
        );
        Map<String, Object> body = Map.of(
                "msgtype", "markdown",
                "markdown", Map.of("title", event.getTitle(), "text", markdown)
        );
        try {
            restTemplate.postForObject(dingtalkWebhook, body, String.class);
        } catch (Exception e) {
            log.error("Failed to send alert notification", e);
        }
    }
}
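One piece the rules above do not show is how a Prometheus alert actually reaches this service: Alertmanager POSTs a JSON payload to a configured webhook receiver. Below is a minimal receiver sketch based on Alertmanager's standard webhook payload shape; the /alerts/webhook path is an assumption and must match the webhook_configs URL in your alertmanager.yml, and the AlertEvent record is an assumed severity/title/description value object:

@RestController
@RequiredArgsConstructor
public class AlertWebhookController {

    private final AlertNotificationService notificationService;

    // Alertmanager's webhook body contains an "alerts" array; each entry carries
    // "labels" (alertname, severity, ...) and "annotations" (summary, description).
    @PostMapping("/alerts/webhook")
    @SuppressWarnings("unchecked")
    public void onAlert(@RequestBody Map<String, Object> payload) {
        List<Map<String, Object>> alerts =
                (List<Map<String, Object>>) payload.getOrDefault("alerts", List.of());
        for (Map<String, Object> alert : alerts) {
            Map<String, String> labels =
                    (Map<String, String>) alert.getOrDefault("labels", Map.of());
            Map<String, String> annotations =
                    (Map<String, String>) alert.getOrDefault("annotations", Map.of());
            notificationService.sendAlert(new AlertEvent(
                    labels.getOrDefault("severity", "warning"),
                    annotations.getOrDefault("summary", labels.getOrDefault("alertname", "unknown")),
                    annotations.getOrDefault("description", "")
            ));
        }
    }

    // Assumed shape of the AlertEvent consumed by AlertNotificationService.
    public record AlertEvent(String severity, String title, String description) {
        public String getSeverity() { return severity; }
        public String getTitle() { return title; }
        public String getDescription() { return description; }
    }
}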
Reference thresholds for monitoring metrics
| Metric | Healthy | Warning threshold | Critical threshold |
|---|---|---|---|
| P95 response latency | < 2s | 2-5s | > 5s |
| Error rate | < 1% | 1-5% | > 5% |
| Token consumption per hour | baseline ±20% | baseline ±50% | 2× baseline |
| Cache hit rate | > 40% | 20-40% | < 20% |
| Vector search latency | < 100ms | 100-500ms | > 500ms |
| Empty-result rate | < 5% | 5-15% | > 15% |
Starting the monitoring stack with one Docker Compose command
# docker-compose-monitoring.yml
version: '3.8'
services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./ai_alerts.yml:/etc/prometheus/ai_alerts.yml
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin123
    volumes:
      - grafana_data:/var/lib/grafana
    depends_on:
      - prometheus

  alertmanager:
    image: prom/alertmanager:latest
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml

volumes:
  grafana_data:

Summary
Monitoring an AI application is not just carrying over the usual trio (CPU, memory, endpoint latency); it also needs AI-specific metrics: token consumption, vector-retrieval quality, and P95/P99 model response latency.
After Xiao Li added the monitoring dashboard, the next time a similar token spike happened an alert reached his phone within 2 minutes, and handling time dropped from 40 minutes to 5. In his words: "The AI system feels much more reassuring to run now; with data in front of you, you stop worrying."
