Hands-On Kubernetes Cluster Monitoring: Wiring Up the Full Prometheus + Grafana + AlertManager Stack
Audience: engineers responsible for K8s cluster stability | Reading time: about 17 minutes | Key takeaway: build a complete K8s monitoring and alerting stack from scratch, and actually put it to work rather than just installing it
I have seen the same story on several teams: Prometheus is installed, Grafana has a few pretty dashboards, yet alerts never fire, or they fire and nobody acts on them, or daily false positives spam the on-call chat group until someone mutes it...
That is not a monitoring system; that is installing something just to say you installed it.
This article is not only about how to install the stack; more importantly, it is about how to actually put it to use.
Architecture overview
K8s cluster
├── kube-state-metrics → collects K8s resource state (Pod/Node/Deployment status)
├── node-exporter → collects node system metrics (CPU/Memory/Disk/Network)
├── cadvisor → collects container resource metrics (built into the kubelet)
└── applications exposing /metrics → application business metrics
Prometheus → scrapes all of the above on a schedule and stores the time series
└── evaluates alerting rules → AlertManager → sends notifications (DingTalk / WeCom / PagerDuty)
Grafana → queries Prometheus and renders dashboards

Installation: kube-prometheus-stack
The Helm chart I recommend is kube-prometheus-stack: it installs Prometheus Operator + Grafana + AlertManager + kube-state-metrics + node-exporter in one go, and ships with a set of prebuilt dashboards and alerting rules:
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  -f prometheus-values.yaml
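If you care about reproducible installs, it helps to pin the chart version instead of taking whatever is latest. The placeholder below is an assumption you would replace with a real version number from the repo:

# List the available chart versions
helm search repo prometheus-community/kube-prometheus-stack --versions

# Then add --version <chart-version> to the helm install command above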
Key settings in prometheus-values.yaml:

# prometheus-values.yaml

# Prometheus settings
prometheus:
  prometheusSpec:
    # Keep data for 30 days
    retention: 30d
    retentionSize: 40GB
    # Persistent storage
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: fast-ssd
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 50Gi
    # Resource requests and limits
    resources:
      requests:
        cpu: 500m
        memory: 2Gi
      limits:
        cpu: 2
        memory: 8Gi
    # Let Prometheus pick up ServiceMonitors/PodMonitors from all namespaces,
    # not only the ones created by this Helm release
    serviceMonitorSelectorNilUsesHelmValues: false
    podMonitorSelectorNilUsesHelmValues: false

# Grafana settings
grafana:
  adminPassword: "your-admin-password"
  persistence:
    enabled: true
    storageClassName: fast-ssd
    size: 10Gi
  # Prebuilt dashboards
  defaultDashboardsEnabled: true
  # Expose Grafana externally (via Ingress)
  ingress:
    enabled: true
    hosts:
      - grafana.example.com

# AlertManager settings
alertmanager:
  alertmanagerSpec:
    storage:
      volumeClaimTemplate:
        spec:
          storageClassName: fast-ssd
          resources:
            requests:
              storage: 5Gi

# node-exporter runs on every node
nodeExporter:
  enabled: true

# kube-state-metrics
kubeStateMetrics:
  enabled: true
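Before moving on, it is worth a quick sanity check that everything actually came up. The pod and service names below assume the default naming for a Helm release called kube-prometheus-stack; adjust them if you used a different release name:

# Everything in the monitoring namespace should reach Running
kubectl get pods -n monitoring

# No Ingress yet? Port-forward Grafana and Prometheus locally
kubectl port-forward -n monitoring svc/kube-prometheus-stack-grafana 3000:80
kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090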
Configuring AlertManager to send alerts to WeCom (WeChat Work)

# alertmanager-config.yaml
apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: production-alertmanager
  namespace: monitoring
spec:
  route:
    receiver: 'wechat-work'
    groupBy: ['alertname', 'namespace']
    groupWait: 30s        # wait 30s after the first alert so alerts from the same window get grouped
    groupInterval: 5m     # wait 5 minutes before notifying again for the same group
    repeatInterval: 4h    # re-send a still-firing alert every 4 hours
    routes:
      # Critical alerts go out immediately, without waiting for groupWait
      - matchers:
          - name: severity
            value: critical
        receiver: 'wechat-work-critical'
        groupWait: 0s
        repeatInterval: 1h
      # Warning (P2) alerts follow the normal flow
      - matchers:
          - name: severity
            value: warning
        receiver: 'wechat-work'
  receivers:
    - name: 'wechat-work'
      webhookConfigs:
        - url: 'http://alertmanager-webhook-adapter/wechat-work'
          sendResolved: true    # also notify when the alert resolves
    - name: 'wechat-work-critical'
      webhookConfigs:
        - url: 'http://alertmanager-webhook-adapter/wechat-work-critical'
          sendResolved: true
  inhibitRules:
    # While a critical alert is firing, suppress warning alerts for the same object
    - sourceMatch:
        - name: severity
          value: critical
      targetMatch:
        - name: severity
          value: warning
      equal: ['namespace', 'pod']
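The webhookConfigs above point at an alertmanager-webhook-adapter service, which is not part of the chart: it is a small shim you run yourself that turns AlertManager's webhook payload into WeCom group-bot messages. As a rough sketch only (the service name, environment variable, and message format are assumptions, and error handling is omitted), it could look like this:

# wecom_adapter.py -- minimal sketch of the webhook adapter assumed above
import os
import httpx
from fastapi import FastAPI, Request

app = FastAPI()

# Assumed: a WeCom group-bot webhook key supplied via environment variable
WECOM_URL = ("https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key="
             + os.environ["WECOM_BOT_KEY"])

@app.post("/wechat-work")
@app.post("/wechat-work-critical")
async def forward(request: Request):
    payload = await request.json()  # AlertManager webhook payload (version "4")
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        summary = alert.get("annotations", {}).get("summary", "")
        lines.append(f"[{alert.get('status', 'firing')}] {labels.get('alertname', '?')}: {summary}")
    text = "\n".join(lines) or "(empty alert batch)"
    # WeCom group bots accept a plain text message body
    async with httpx.AsyncClient() as client:
        await client.post(WECOM_URL, json={"msgtype": "text", "text": {"content": text}})
    return {"forwarded": len(payload.get("alerts", []))}

You would deploy this behind a Service named alertmanager-webhook-adapter (port 80) so that the URLs in the AlertmanagerConfig above resolve.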
Key alerting rules

The prebuilt rules are already quite complete, but you still need to add application-level alerts for your own business:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: application-alerts
  namespace: production
  labels:
    release: kube-prometheus-stack  # make sure Prometheus recognizes this rule
spec:
  groups:
    - name: application.rules
      rules:
        # Pod restarting too often
        - alert: PodRestarting
          expr: |
            increase(kube_pod_container_status_restarts_total[1h]) > 3
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.pod }} restarted {{ $value }} times in the last hour"
            description: "namespace={{ $labels.namespace }}, container={{ $labels.container }}"
        # Pod OOM killed
        - alert: PodOOMKilled
          expr: |
            kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1
          for: 0m
          labels:
            severity: critical
          annotations:
            summary: "Pod {{ $labels.pod }} was OOM killed"
        # Deployment short on ready replicas
        - alert: DeploymentReplicasNotReady
          expr: |
            kube_deployment_status_replicas_ready / kube_deployment_spec_replicas < 0.7
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Deployment {{ $labels.deployment }} has only {{ $value | humanizePercentage }} of its replicas ready"
        # HTTP error rate
        - alert: HighErrorRate
          expr: |
            sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
              /
            sum(rate(http_requests_total[5m])) by (service) > 0.05
          for: 3m
          labels:
            severity: critical
          annotations:
            summary: "Service {{ $labels.service }} 5xx error rate is above 5%, currently {{ $value | humanizePercentage }}"
        # P99 latency
        - alert: HighLatencyP99
          expr: |
            histogram_quantile(0.99,
              sum(rate(http_request_duration_seconds_bucket[5m])) by (service, le)
            ) > 2
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Service {{ $labels.service }} P99 latency is {{ $value }}s, above 2s"
        # Node disk usage
        - alert: NodeDiskUsageHigh
          expr: |
            (1 - node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) > 0.85
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Node {{ $labels.instance }} root filesystem is {{ $value | humanizePercentage }} full"
Pitfall #1: alerts never turn into notifications

Symptom: the alerting rule shows as firing in Prometheus, but AlertManager never receives it and nothing lands in WeCom.
Cause: the AlertManager configuration is wrong, Prometheus is not correctly connected to AlertManager, or the rule was never loaded in the first place.
How to investigate:
# Check AlertManager status
kubectl get pods -n monitoring | grep alertmanager
kubectl logs -n monitoring alertmanager-kube-prometheus-stack-alertmanager-0

# Check whether Prometheus can reach AlertManager
kubectl exec -n monitoring prometheus-kube-prometheus-stack-prometheus-0 -- \
  wget -O- http://kube-prometheus-stack-alertmanager:9093/-/healthy

# Inspect the AlertManager section of the generated Prometheus config
kubectl exec -n monitoring prometheus-kube-prometheus-stack-prometheus-0 -- \
  cat /etc/prometheus/config_out/prometheus.env.yaml | grep alertmanager

In my case the problem was a mismatched label on the PrometheusRule: Prometheus uses a label selector to decide which PrometheusRule objects belong to it, and if the label is wrong the rule is never loaded at all.
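To see which labels your Prometheus instance actually expects, you can read the ruleSelector straight from the Prometheus custom resource (the resource name below follows the default naming for a release called kube-prometheus-stack):

# Show the label selector used to pick up PrometheusRule objects
kubectl get prometheus -n monitoring kube-prometheus-stack-prometheus \
  -o jsonpath='{.spec.ruleSelector}'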
Pitfall #2: alert storm, a single incident triggered 200+ alerts

Symptom: one node went down and triggered alerts for dozens of Pods. With 3-4 rules per Pod, more than 200 alerts were pushed into the chat group at once; the on-call engineer could not keep up and simply muted the group.
Cause: no alert grouping or inhibition rules were configured.
Fix:
- Use AlertManager's groupBy to aggregate alerts (alerts from the same namespace in the same time window get merged into one notification)
- Use inhibitRules for suppression (when a node goes down, suppress the alerts for every Pod on that node: the node failure is the root cause, the Pod alerts are just its downstream symptoms):
inhibitRules:
  # When a node is down, suppress the Pod alerts on that node
  - sourceMatch:
      - name: alertname
        value: NodeDown
    targetMatch:
      - name: alertname
        value: "Pod.*"
        matchType: "=~"
    equal: ['node']
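Note that NodeDown is just the alert name used in this example; it is not one of the rules the chart ships, so the inhibition only works if you define it yourself and if both alerts carry the node label that equal: ['node'] compares (which may require relabeling on your side). A minimal sketch, assuming node-exporter targets are scraped under job="node-exporter":

- alert: NodeDown
  expr: up{job="node-exporter"} == 0
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "node-exporter on {{ $labels.instance }} has been unreachable for 2 minutes"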
Pitfall #3: Grafana panel numbers don't match kubectl top

Symptom: Grafana shows a Pod's CPU usage as 23% while kubectl top pod reports 47%, roughly a factor of two apart.
Cause: the two are calculated differently:
- kubectl top uses metrics-server data: an instantaneous reading, measured against actual CPU cores
- the Grafana dashboard's PromQL is typically based on rate(container_cpu_usage_seconds_total[5m]), which is a 5-minute average, and the denominator (the CPU limit) may be configured incorrectly
Fix: check the PromQL behind the Grafana panel and make sure the denominator is kube_pod_container_resource_requests{resource="cpu"} (the requests), not a hard-coded value. A standard CPU utilization query:
sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (pod)
  /
sum(kube_pod_container_resource_requests{resource="cpu"}) by (pod)

Exposing /metrics from your application
Your application's own business metrics have to be exposed through a Prometheus client library:
# Python (FastAPI) example
from prometheus_client import Counter, Histogram, generate_latest, CONTENT_TYPE_LATEST
from fastapi import FastAPI, Request, Response
import time

app = FastAPI()

# Define the metrics
REQUEST_COUNT = Counter('http_requests_total', 'Total requests', ['method', 'path', 'status'])
REQUEST_DURATION = Histogram('http_request_duration_seconds', 'Request duration', ['path'])

@app.middleware("http")
async def metrics_middleware(request: Request, call_next):
    start = time.time()
    response = await call_next(request)
    duration = time.time() - start
    REQUEST_COUNT.labels(
        method=request.method,
        path=request.url.path,
        status=str(response.status_code)
    ).inc()
    REQUEST_DURATION.labels(path=request.url.path).observe(duration)
    return response

@app.get("/metrics")
def metrics():
    return Response(generate_latest(), media_type=CONTENT_TYPE_LATEST)

Then create a ServiceMonitor so Prometheus automatically discovers the application's metrics endpoint:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app-monitor
  namespace: production
  labels:
    release: kube-prometheus-stack
spec:
  selector:
    matchLabels:
      app: my-app
  endpoints:
    - port: http
      path: /metrics
      interval: 15s
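One detail that is easy to miss: a ServiceMonitor selects Services, not Pods, and the scrape port is matched by name. A minimal Service for the app above might look like this (the port number is an assumption; use whatever your container actually listens on):

apiVersion: v1
kind: Service
metadata:
  name: my-app
  namespace: production
  labels:
    app: my-app            # matched by the ServiceMonitor selector
spec:
  selector:
    app: my-app
  ports:
    - name: http           # must match the port name in the ServiceMonitor endpoint
      port: 8000
      targetPort: 8000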
Standing up the monitoring stack is only step one. Making it genuinely valuable takes ongoing investment: alert rules need constant tuning (drop the false positives, fill in the missed cases), dashboards should be built around your actual business, and the on-call process has to actually be followed.
This is a continuous, iterative effort; there is no configuration you get right in one shot.
