Monitoring AI Applications with Prometheus + Grafana: Key Metrics and Alert Rule Design

Lao Zhang · 2026/4/30 · about 9 min read

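I've seen the monitoring dashboards of quite a few teams' AI applications, and most look like this: one CPU line chart, one memory line chart, one HTTP request-count line chart, and nothing else.

That's not a lack of monitoring. It's having monitoring without knowing how to use it.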

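A more extreme case: Prometheus + Grafana are fully set up and alerts are configured, but every alert rule was copied from a template for ordinary web applications. Take "alert when response-time P99 > 500ms": for an endpoint that calls GPT-4, a P99 of 500ms is simply unrealistic. The threshold is always exceeded, so the rule fires constantly, everyone learns to ignore it, and it never provides any real protection.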

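This article assumes you already know what Prometheus and Grafana are and how to install them. We go straight to the monitoring configuration specific to AI applications: which metrics to collect, how to design alert rules, and a Grafana Dashboard configuration snippet you can use directly.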

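1. The Four Core Metric Domains of AI Application Monitoring

Before writing any configuration, let's be clear about what to monitor. The metrics for an AI application fall into four domains:

Request domain: request counts by status, and requests currently in flight
Token domain: token consumption and estimated cost
Latency domain: time to first token (TTFT) and per-phase latency
Quality domain: RAG retrieval relevance and content-filter hits

Each domain serves a different monitoring purpose, so the reasoning behind its alert thresholds differs too. The sections below go through them one by one.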

2. Metric Definitions and Java Instrumentation

2.1 Request-Domain Metrics

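import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;
import org.springframework.stereotype.Component;

import java.util.concurrent.atomic.AtomicInteger;

@Component
public class AiPrometheusMetrics {

    private final MeterRegistry registry;

    // Spring injects the MeterRegistry; the final field needs this constructor to compile
    public AiPrometheusMetrics(MeterRegistry registry) {
        this.registry = registry;
    }

    // ai_request_total{model="gpt-4o", status="success", scene="rag_query"}
    // AI request count, broken down by model, status, and business scene
    private Counter buildRequestCounter(String model, String status, String scene) {
        return Counter.builder("ai_request_total")
                .description("Total number of AI requests")
                .tag("model", model)
                .tag("status", status)   // success / error / timeout / filtered / token_exceeded / model_error
                .tag("scene", scene)     // rag_query / chat / summarize
                .register(registry);
    }

    // ai_request_active{model="gpt-4o"}
    // Number of AI requests currently in flight (used to detect a backlog)
    private Gauge buildActiveRequestGauge(AtomicInteger activeCount, String model) {
        return Gauge.builder("ai_request_active", activeCount, AtomicInteger::get)
                .description("Number of currently active AI requests")
                .tag("model", model)
                .register(registry);
    }
}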
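
The gauge above only reads an AtomicInteger; something still has to move that counter up and down. Here is a minimal sketch of one way to do it, assuming a hypothetical helper named ActiveRequestTracker (not part of Micrometer) that wraps each model call so the count is decremented even when the call throws:

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical helper: maintains the in-flight count that the ai_request_active gauge reads.
class ActiveRequestTracker {

    private final AtomicInteger active = new AtomicInteger();

    // Hand this to Gauge.builder("ai_request_active", tracker.counter(), AtomicInteger::get)
    AtomicInteger counter() {
        return active;
    }

    // Runs the model call and guarantees the decrement, even if the call throws.
    <T> T track(Supplier<T> modelCall) {
        active.incrementAndGet();
        try {
            return modelCall.get();
        } finally {
            active.decrementAndGet();
        }
    }
}
```

With one tracker per model, a call site would look roughly like tracker.track(() -> client.chat(prompt)), where client.chat stands in for whatever SDK call you actually make.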

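The values of the status dimension matter a lot. Don't stop at success/error; break it down further:

success: completed normally
timeout: timed out
filtered: content was filtered by the model's safety policy
token_exceeded: token limit exceeded
model_error: the model service returned an error (as distinct from errors in our own application)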
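
One way to keep the status label restricted to exactly these values is to route every outcome through an enum instead of passing free-form strings. This is a sketch under assumptions: AiRequestStatus and the fromFailure mapping are hypothetical, and real SDKs signal timeouts and model-side errors in their own ways.

```java
import java.util.concurrent.TimeoutException;

// Hypothetical enum: the only place where status label values are spelled out.
enum AiRequestStatus {
    SUCCESS("success"),
    ERROR("error"),                    // an error in our own application
    TIMEOUT("timeout"),
    FILTERED("filtered"),
    TOKEN_EXCEEDED("token_exceeded"),
    MODEL_ERROR("model_error");        // an error reported by the model service

    private final String labelValue;

    AiRequestStatus(String labelValue) {
        this.labelValue = labelValue;
    }

    String labelValue() {
        return labelValue;
    }

    // Illustrative mapping only; adapt the checks to the exceptions your client really throws.
    static AiRequestStatus fromFailure(Throwable failure, boolean raisedByModelService) {
        if (failure == null) {
            return SUCCESS;
        }
        if (failure instanceof TimeoutException) {
            return TIMEOUT;
        }
        return raisedByModelService ? MODEL_ERROR : ERROR;
    }
}
```

Because the label can only take the enum's values, a typo or an exception message can never leak into it and create stray time series.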

2.2 Token-Domain Metrics

// ai_token_usage_total{model="gpt-4o", type="input", scene="rag_query"}
// Cumulative token consumption (a Counter: it only ever increases)
public void recordTokenUsage(String model, String scene, int inputTokens, int outputTokens) {
    registry.counter("ai_token_usage_total",
            "model", model,
            "type", "input",
            "scene", scene
    ).increment(inputTokens);

    registry.counter("ai_token_usage_total",
            "model", model,
            "type", "output",
            "scene", scene
    ).increment(outputTokens);
}

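// ai_token_cost_estimate_total{model="gpt-4o"}
// Estimated cost in US dollars, based on known pricing (needs import java.util.Map)
public void recordCostEstimate(String model, int inputTokens, int outputTokens) {
    // Example prices only; check each provider's official pricing for real values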
    Map<String, double[]> pricing = Map.of(
        "gpt-4o",      new double[]{0.005, 0.015},   // per 1k tokens: input, output
        "gpt-4o-mini", new double[]{0.00015, 0.0006},
        "claude-3-5-sonnet", new double[]{0.003, 0.015}
    );

    double[] prices = pricing.getOrDefault(model, new double[]{0.01, 0.03});
    double cost = (inputTokens / 1000.0 * prices[0]) + (outputTokens / 1000.0 * prices[1]);

    registry.counter("ai_token_cost_estimate_total",
            "model", model
    ).increment(cost);
}

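2.3 Latency-Domain Metrics

// Needs imports: io.micrometer.core.instrument.Timer, java.util.concurrent.TimeUnit

// ai_ttft_seconds{model="gpt-4o", scene="rag_query"}
// Time to first token (TTFT); only meaningful for streaming responses.
// publishPercentileHistogram() is required: without it Micrometer exports no
// ai_ttft_seconds_bucket series, and every histogram_quantile() query below returns nothing.
public void recordTtft(String model, String scene, long ttftMs) {
    Timer.builder("ai_ttft_seconds")
            .tag("model", model)
            .tag("scene", scene)
            .publishPercentileHistogram()
            .register(registry)   // returns the existing Timer if one is already registered
            .record(ttftMs, TimeUnit.MILLISECONDS);
}

// ai_request_duration_seconds{model="gpt-4o", scene="rag_query", phase="retrieval"}
// Per-phase latency; phase is one of retrieval/prompt_build/model_call/postprocess
public void recordPhaseLatency(String model, String scene, String phase, long durationMs) {
    Timer.builder("ai_request_duration_seconds")
            .tag("model", model)
            .tag("scene", scene)
            .tag("phase", phase)
            .publishPercentileHistogram()
            .register(registry)
            .record(durationMs, TimeUnit.MILLISECONDS);
}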
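
recordTtft expects the caller to have measured TTFT already. Below is a minimal sketch of that measurement, assuming a hypothetical TtftStopwatch that is started just before the streaming request is sent and notified on every chunk; only the first chunk fixes the value.

```java
import java.util.OptionalLong;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical helper: captures the elapsed time until the first streamed chunk.
class TtftStopwatch {

    private static final long UNSET = -1L;

    private final long startNanos;
    // Elapsed nanoseconds until the first chunk; UNSET until one arrives.
    private final AtomicLong firstChunkElapsedNanos = new AtomicLong(UNSET);

    TtftStopwatch(long startNanos) {
        this.startNanos = startNanos;
    }

    // Convenience factory for production use.
    static TtftStopwatch start() {
        return new TtftStopwatch(System.nanoTime());
    }

    // Call for every streamed chunk; only the first call is recorded.
    void onChunk(long nowNanos) {
        firstChunkElapsedNanos.compareAndSet(UNSET, nowNanos - startNanos);
    }

    // TTFT in milliseconds, or empty if no chunk has arrived yet.
    OptionalLong ttftMillis() {
        long elapsed = firstChunkElapsedNanos.get();
        return elapsed == UNSET
                ? OptionalLong.empty()
                : OptionalLong.of(elapsed / 1_000_000L);
    }
}
```

When the stream finishes, something like stopwatch.ttftMillis().ifPresent(ms -> metrics.recordTtft(model, scene, ms)) passes the value on. A request that fails before any chunk arrives records no TTFT at all, which keeps failures from distorting the latency distribution.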

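2.4 Quality-Domain Metrics

// ai_semantic_score{collection="product_kb"}
// Distribution of semantic relevance scores from RAG retrieval.
// Micrometer exports a summary as ai_semantic_score_count / _sum / _max,
// so the average is computed in PromQL as rate(_sum) / rate(_count).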
public void recordSemanticScore(String collection, double score) {
    registry.summary("ai_semantic_score",
            "collection", collection
    ).record(score);
}

// ai_content_filter_total{model="gpt-4o", reason="hate_speech"}
// Count of filtered content (can expose abnormal attack or test traffic)
public void recordContentFilter(String model, String reason) {
    registry.counter("ai_content_filter_total",
            "model", model,
            "reason", reason
    ).increment();
}

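3. Designing Prometheus Alert Rules

This is the most valuable part of the article. With poorly designed alert rules you get either an alert storm (alerts ring every day and nobody acts on them) or blind spots so large that you only find out after something serious has happened.

3.1 Overview of the Alert Rules

First look at the overall structure of the rules file, then we go through it rule by rule: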

# ai_alerting_rules.yml
groups:
  - name: ai_application_alerts
    rules:

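      # ====== Token alerts ======

      # Alert 1: average input tokens per request too high (risk of hitting the token limit)
      # Both sides are aggregated to (model, scene); without that, the type and status
      # labels differ between the two sides and the division matches no series.
      - alert: AiTokenUsageHigh
        expr: |
          sum by (model, scene) (rate(ai_token_usage_total{type="input"}[5m]))
          /
          sum by (model, scene) (rate(ai_request_total{status="success"}[5m])) > 3000
        for: 3m
        labels:
          severity: warning
          team: ai-platform
        annotations:
          summary: "Average input tokens for model {{ $labels.model }} are high"
          description: |
            Model {{ $labels.model }} in scene {{ $labels.scene }}:
            average input tokens over the last 5 minutes is {{ $value | humanize }},
            above the 3000 warning line. Possible prompt bloat or context leakage.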

      # Alert 2: daily cost over budget (best paired with a recording rule)
      - alert: AiDailyCostAlert
        expr: |
          sum(increase(ai_token_cost_estimate_total[24h])) > 100
        for: 0m
        labels:
          severity: critical
          team: ai-platform
        annotations:
          summary: "AI cost over the last 24 hours exceeds $100"
          description: "Estimated AI call cost over the past 24 hours is ${{ $value | humanize }}, above the budget threshold."

      # ====== Latency alerts ======

      # Alert 3: TTFT P99 too high
      - alert: AiTtftP99High
        expr: |
          histogram_quantile(0.99,
            sum by (le, model, scene) (rate(ai_ttft_seconds_bucket[5m]))
          ) > 3
        for: 5m
        labels:
          severity: warning
          team: ai-platform
        annotations:
          summary: "TTFT P99 for model {{ $labels.model }} exceeds 3 seconds"
          description: |
            Scene {{ $labels.scene }}: TTFT P99 = {{ $value | humanizeDuration }}.
            User experience is already affected; check the model service status or the upstream network.

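      # Alert 4: request latency spike (evaluated per phase, since the metric carries a phase label)
      - alert: AiLatencySpike
        expr: |
          (
            histogram_quantile(0.99, rate(ai_request_duration_seconds_bucket[5m]))
            /
            histogram_quantile(0.99, rate(ai_request_duration_seconds_bucket[30m] offset 5m))
          ) > 2
        for: 3m
        labels:
          severity: warning
          team: ai-platform
        annotations:
          summary: "AI request latency spike (current P99 is more than 2x the baseline)"
          description: |
            {{ $labels.model }} in scene {{ $labels.scene }} ({{ $labels.phase }} phase) shows a latency spike:
            the current P99 is more than double the baseline taken from the
            30-minute window that ended 5 minutes ago.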

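      # ====== Error-rate alerts ======

      # Alert 5: overall error rate
      # Aggregate away the status label on both sides; otherwise each error series
      # is divided by itself and the ratio is always 1.
      - alert: AiErrorRateHigh
        expr: |
          sum by (model) (rate(ai_request_total{status=~"error|timeout|model_error"}[5m]))
          /
          sum by (model) (rate(ai_request_total[5m])) > 0.05
        for: 3m
        labels:
          severity: critical
          team: ai-platform
        annotations:
          summary: "AI request error rate exceeds 5%"
          description: |
            {{ $labels.model }} error rate = {{ $value | humanizePercentage }}.
            Investigate model service availability immediately.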

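      # Alert 6: abnormal content-filter rate (possible attack or test traffic)
      # rate() is per second, so multiply by 60 to compare against a per-minute threshold.
      - alert: AiContentFilterSpike
        expr: |
          sum(rate(ai_content_filter_total[10m])) * 60 > 5
        for: 2m
        labels:
          severity: warning
          team: ai-security
        annotations:
          summary: "Content filter is triggering abnormally often"
          description: |
            Content filtering triggered {{ $value | humanize }} times per minute over the last 10 minutes.
            Possible abnormal traffic or a prompt injection attack.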

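      # ====== Quality-domain alerts ======

      # Alert 7: RAG semantic relevance dropping
      # The summary is exported as _sum and _count, so the mean is rate(_sum) / rate(_count).
      - alert: AiSemanticScoreDrop
        expr: |
          sum by (collection) (rate(ai_semantic_score_sum[10m]))
          /
          sum by (collection) (rate(ai_semantic_score_count[10m])) < 0.6
        for: 10m
        labels:
          severity: warning
          team: ai-platform
        annotations:
          summary: "RAG retrieval quality is dropping"
          description: |
            Knowledge base {{ $labels.collection }}: average semantic relevance score
            over the last 10 minutes is {{ $value | humanize }}, below the 0.6 warning line.
            Possible causes: Embedding model change, stale knowledge-base vectors, query pattern drift.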

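      # ====== Request-domain alerts ======

      # Alert 8: backlog of active requests
      - alert: AiRequestBackpressure
        expr: ai_request_active > 50
        for: 2m
        labels:
          severity: warning
          team: ai-platform
        annotations:
          summary: "AI request backlog"
          description: |
            Current number of active AI requests is {{ $value }}, above 50.
            Possible cause: slower model responses are causing requests to pile up; check the rate-limit configuration.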

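3.2 Principles for Designing Alert Rules

Principle 1: choose a sensible for duration. AI endpoints are slow by nature, and short spikes are normal. On a rate-based rule, for: 0m produces alert storms, while for: 10m on an outage-level alert reacts too slowly. Rules of thumb: 2-3m for severe errors, 5-10m for quality alerts. A threshold on a slow-moving 24-hour total, such as the cost alert, is the exception and can fire immediately.

Principle 2: ratio alerts beat absolute-value alerts. "P99 latency > 5 seconds" is an absolute-value alert, and the problem is that different models have different baseline latencies. A ratio alert such as "current P99 is more than 2x the historical baseline" is more robust.

Principle 3: layer your cost protection. Besides the 24-hour cost alert, add faster-reacting safeguards: a 1-hour abnormal-consumption alert and a per-request token cap.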
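
The per-request token cap in Principle 3 lives in application code rather than in Prometheus. This is a sketch under assumptions: TokenBudgetGuard and its limit are hypothetical, and the token count is assumed to come from whatever tokenizer your stack already uses.

```java
// Hypothetical guard: rejects oversized prompts before they reach the model.
class TokenBudgetGuard {

    // Status label to record for a rejected request, matching the status values used in this article.
    static final String REJECTED_STATUS = "token_exceeded";

    private final int maxInputTokens;

    TokenBudgetGuard(int maxInputTokens) {
        if (maxInputTokens <= 0) {
            throw new IllegalArgumentException("maxInputTokens must be positive");
        }
        this.maxInputTokens = maxInputTokens;
    }

    // True when the prompt fits under the cap and the call may proceed.
    boolean allows(int inputTokens) {
        return inputTokens <= maxInputTokens;
    }
}
```

A request that fails the check can be counted in ai_request_total with status="token_exceeded", so rejected prompts stay visible on the dashboard instead of disappearing silently.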

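4. Using Recording Rules to Speed Up Queries

AI applications produce a large volume of token metrics, and complex PromQL queries can be very slow in Grafana. Use Recording Rules to precompute them: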

groups:
  - name: ai_recording_rules
    interval: 1m
    rules:
      # Token consumption rate per model, over a 1-minute window
      - record: job:ai_token_rate1m:sum
        expr: |
          sum by (model, type) (
            rate(ai_token_usage_total[1m])
          )

      # Request success rate per model (sliding window)
      - record: job:ai_success_rate5m:avg
        expr: |
          sum by (model, scene) (rate(ai_request_total{status="success"}[5m]))
          /
          sum by (model, scene) (rate(ai_request_total[5m]))

      # TTFT P99 (precomputed to avoid the cost of evaluating histogram_quantile at query time)
      - record: job:ai_ttft_p99_5m:histogram_quantile
        expr: |
          histogram_quantile(0.99, 
            sum by (le, model, scene) (rate(ai_ttft_seconds_bucket[5m]))
          )

5. Grafana Dashboard Configuration Snippet

Below is a Dashboard JSON fragment that can be imported directly, containing the most essential Panels:

{
  "title": "AI Application Overview",
  "uid": "ai-app-overview",
  "panels": [
    {
      "id": 1,
      "title": "Requests (per minute)",
      "type": "timeseries",
      "gridPos": {"x": 0, "y": 0, "w": 8, "h": 6},
      "targets": [
        {
          "expr": "sum by (model) (rate(ai_request_total[1m])) * 60",
          "legendFormat": "{{model}}",
          "refId": "A"
        }
      ],
      "fieldConfig": {
        "defaults": {
          "unit": "reqpm",
          "color": {"mode": "palette-classic"}
        }
      }
    },
    {
      "id": 2,
      "title": "TTFT P50 / P99",
      "type": "timeseries",
      "gridPos": {"x": 8, "y": 0, "w": 8, "h": 6},
      "targets": [
        {
          "expr": "histogram_quantile(0.5, sum by (le, model) (rate(ai_ttft_seconds_bucket[5m]))) * 1000",
          "legendFormat": "{{model}} P50",
          "refId": "A"
        },
        {
          "expr": "histogram_quantile(0.99, sum by (le, model) (rate(ai_ttft_seconds_bucket[5m]))) * 1000",
          "legendFormat": "{{model}} P99",
          "refId": "B"
        }
      ],
      "fieldConfig": {
        "defaults": {
          "unit": "ms",
          "thresholds": {
            "steps": [
              {"color": "green", "value": null},
              {"color": "yellow", "value": 1000},
              {"color": "red", "value": 3000}
            ]
          }
        }
      }
    },
    {
      "id": 3,
      "title": "Token consumption trend",
      "type": "timeseries",
      "gridPos": {"x": 16, "y": 0, "w": 8, "h": 6},
      "targets": [
        {
          "expr": "sum by (model, type) (rate(ai_token_usage_total[5m])) * 60",
          "legendFormat": "{{model}} {{type}} tokens/min",
          "refId": "A"
        }
      ],
      "fieldConfig": {
        "defaults": {"unit": "short"}
      }
    },
    {
      "id": 4,
      "title": "Error rate (by type)",
      "type": "timeseries",
      "gridPos": {"x": 0, "y": 6, "w": 12, "h": 6},
      "targets": [
        {
          "expr": "sum by (model, status) (rate(ai_request_total{status!=\"success\"}[5m])) / ignoring(status) group_left sum by (model) (rate(ai_request_total[5m]))",
          "legendFormat": "{{model}} - {{status}}",
          "refId": "A"
        }
      ],
      "fieldConfig": {
        "defaults": {
          "unit": "percentunit",
          "thresholds": {
            "steps": [
              {"color": "green", "value": null},
              {"color": "yellow", "value": 0.01},
              {"color": "red", "value": 0.05}
            ]
          }
        }
      }
    },
    {
      "id": 5,
      "title": "Estimated cost, last 24 hours (USD)",
      "type": "stat",
      "gridPos": {"x": 12, "y": 6, "w": 6, "h": 6},
      "targets": [
        {
          "expr": "sum(increase(ai_token_cost_estimate_total[24h]))",
          "legendFormat": "Cost (24h)",
          "refId": "A"
        }
      ],
      "fieldConfig": {
        "defaults": {
          "unit": "currencyUSD",
          "thresholds": {
            "steps": [
              {"color": "green", "value": null},
              {"color": "yellow", "value": 50},
              {"color": "red", "value": 100}
            ]
          }
        }
      }
    },
    {
      "id": 6,
      "title": "RAG semantic relevance score",
      "type": "gauge",
      "gridPos": {"x": 18, "y": 6, "w": 6, "h": 6},
      "targets": [
        {
          "expr": "sum(rate(ai_semantic_score_sum[10m])) / sum(rate(ai_semantic_score_count[10m]))",
          "legendFormat": "Average score",
          "refId": "A"
        }
      ],
      "fieldConfig": {
        "defaults": {
          "min": 0,
          "max": 1,
          "thresholds": {
            "steps": [
              {"color": "red", "value": null},
              {"color": "yellow", "value": 0.6},
              {"color": "green", "value": 0.8}
            ]
          }
        }
      }
    }
  ]
}

6. Configuring Alert Notification Channels

Finally, set up alert notifications. For AI applications, configure at least two channels:

# alertmanager.yml
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'model']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default'
  routes:
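    # Critical errors are routed to instant messaging (WeCom / DingTalk)
    - match:
        severity: critical
      receiver: 'im-notification'
      continue: true

    # Cost alerts are routed to the owner's email
    - match:
        alertname: AiDailyCostAlert
      receiver: 'cost-owner-email'

receivers:
  # The top-level route references 'default', so it must be defined
  # or Alertmanager rejects the configuration.
  - name: 'default'

  - name: 'im-notification'
    webhook_configs:
      # Alertmanager posts its own JSON payload; a WeCom robot webhook
      # usually needs a small relay in between that reformats the message.
      - url: 'https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=YOUR_KEY'
        send_resolved: true
        http_config:
          tls_config:
            insecure_skip_verify: false

  - name: 'cost-owner-email'
    email_configs:
      # Requires smtp_smarthost and smtp_from under global.
      # The subject is set through headers and the body through text.
      - to: 'ai-team@yourcompany.com'
        headers:
          Subject: '[AI cost alert] {{ .CommonAnnotations.summary }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'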

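7. An Easy Trap: Cardinality Explosion

When defining Prometheus metrics, one very common mistake is putting high-cardinality fields (such as user ID or request ID) into labels.

// Bad example: user_id as a label creates a practically unbounded number of time series
registry.counter("ai_request_total",
        "model", model,
        "user_id", userId  // never do this!
).increment();

Every distinct combination of label values creates a new time series. If you put user_id or trace_id into a label, the Prometheus TSDB balloons quickly and performance degrades sharply.

High-cardinality fields (user ID, request ID, the literal text of error messages) belong in logs, not in metric labels.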
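
To make the blow-up concrete, the sketch below computes the upper bound on series for one metric as the product of distinct values per label, and shows one way to keep a label bounded with an allowlist. LabelCardinality, its methods, and the numbers in the usage note are illustrative assumptions, not part of Micrometer or Prometheus.

```java
import java.util.Set;

// Hypothetical helper illustrating label cardinality.
class LabelCardinality {

    // Upper bound on time series for one metric: the product of distinct values per label.
    static long maxSeries(long... distinctValuesPerLabel) {
        long total = 1;
        for (long n : distinctValuesPerLabel) {
            total = Math.multiplyExact(total, n); // fails loudly instead of overflowing
        }
        return total;
    }

    // Collapses any value outside the allowlist into "other" so the label stays bounded.
    static String bounded(String value, Set<String> allowed) {
        return value != null && allowed.contains(value) ? value : "other";
    }
}
```

With 3 models, 5 statuses, and 3 scenes, maxSeries(3, 5, 3) gives at most 45 series. Add a user_id label with 100,000 users and maxSeries(3, 5, 3, 100_000) becomes 4,500,000. Passing label values through bounded(...) with a fixed set of known scenes keeps an unexpected value from creating new series.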

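Summary

Monitoring an AI application comes down to four dimensions: the request domain, the token domain, the latency domain, and the quality domain. The key design points for alert rules:

Break the status label down into success/timeout/filtered/token_exceeded/model_error
Base latency alerts on a ratio (relative to a baseline) rather than an absolute value
Layer cost alerts: per-request token cap + hourly anomaly + daily cumulative limit
Give quality alerts a longer for duration than error alerts
Never put high-cardinality fields into labels; put them in logs

In practice this setup covers roughly 80% of production failure scenarios for AI applications. The remaining 20% are business-level semantic problems that need manual, context-aware investigation; metrics can only serve as the trigger signal.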