企业级RAG运维手册:从部署到日常维护的完整指南
凌晨3点的电话
2025年9月,杭州某电商平台的Java工程师小陈接到了一个不该在深夜响起的电话。
那是公司值班机器人拨来的:RAG知识库服务返回HTTP 503已持续7分钟,客服机器人全线异常,工单量激增。
小陈从床上爬起来,打开电脑。问题来了——他根本不知道从哪里开始查。
RAG系统是他三个月前搭的:Spring Boot后端、Milvus向量数据库、OpenAI嵌入API、GPT-4o生成答案。开发阶段跑得好好的,测试也没问题。但维护体系是零:没有集中日志、没有监控面板、没有故障手册。
他登到服务器,只看到一堆Error日志,但看不出根因。翻GitHub Issues翻了半小时,重启了服务,503消失了,但不知道为什么出的问题,也不知道还会不会再来。
第二天早上开会,运维总监问:"你能告诉我昨晚故障的根本原因吗?"
小陈沉默了。
这次事故让公司CTO下了一个决定:RAG系统在投入生产前,必须有完整的运维体系,否则不准上线。
小陈花了两周时间,把监控、告警、故障手册全部补全。本文就是他那两周工作的结晶——一份可以直接复用的RAG运维手册。
先说结论(TL;DR)
| 运维维度 | 核心工具 | 关键指标 | 告警阈值 |
|---|---|---|---|
| 应用监控 | Micrometer + Prometheus | 请求延迟P99、错误率 | P99>5s,错误率>5% |
| 向量数据库 | Milvus内置监控 | 查询延迟、内存使用 | 查询P99>500ms,内存>80% |
| 嵌入服务 | 自定义指标 | API调用成功率、延迟 | 成功率<95%,P99>3s |
| LLM服务 | 自定义指标 | Token消耗、错误率 | 错误率>3%,延迟>30s |
| 基础设施 | Node Exporter | CPU、内存、磁盘 | CPU>80%持续5min,磁盘>85% |
运维黄金法则: 无法量化的系统,无法被有效管理。先建监控,再谈优化。
RAG系统核心组件及故障模式
系统架构总览

整体链路:用户请求经Nginx负载均衡进入两个Spring Boot应用实例;应用侧依赖Milvus(向量检索,底层使用etcd与MinIO)、Redis(缓存)和OpenAI API(嵌入与生成);Prometheus + Grafana + AlertManager负责指标采集、看板展示与告警通知。
各组件故障模式分析
| 组件 | 常见故障 | 故障影响 | 检测方法 |
|---|---|---|---|
| 嵌入服务 | API限流、超时、模型版本变更 | 无法处理新查询 | HTTP健康检查 |
| 向量数据库 | OOM、磁盘满、慢查询 | 检索超时、数据丢失 | Metrics + 慢查询日志 |
| LLM服务 | API限流、网络超时、配额耗尽 | 无法生成答案 | 错误率监控 |
| Redis缓存 | 内存满、连接数超限 | 缓存失效,负载升高 | Redis监控 |
| 应用服务 | JVM OOM、线程池满、配置错误 | 服务不可用 | JVM metrics |
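上表中嵌入服务与LLM服务最常见的故障是API限流(429)。工程上的通用缓解手段是带抖动的指数退避重试,下面用Python给出一个最小示意(`with_backoff`、`call`等名称均为示意,并非某个库的现成API):

```python
import random
import time


def with_backoff(call, max_retries=4, base_delay=1.0, factor=2.0,
                 sleep=time.sleep, jitter=random.random):
    """对易被限流的API调用做指数退避重试(示意实现)。

    call: 无参可调用对象,失败时抛异常(如429/503)
    延迟序列约为 base_delay * factor**n,叠加0~1秒随机抖动防止重试雪崩
    sleep/jitter 可注入,便于测试
    """
    delays = []  # 记录每次退避时长,便于日志观测
    for attempt in range(max_retries + 1):
        try:
            return call(), delays
        except Exception:
            if attempt == max_retries:
                raise  # 重试耗尽,向上抛出让调用方降级
            d = base_delay * (factor ** attempt) + jitter()
            delays.append(d)
            sleep(d)


if __name__ == "__main__":
    # 用法示意:前两次失败、第三次成功
    state = {"n": 0}

    def flaky():
        state["n"] += 1
        if state["n"] < 3:
            raise RuntimeError("HTTP 429")
        return "ok"

    result, delays = with_backoff(flaky, sleep=lambda _: None)
    print(result, delays)
```

Java侧可用Resilience4j等库实现同样的退避策略,思路一致。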
部署架构:高可用RAG生产方案
Docker Compose生产配置
# docker-compose.prod.yml
version: '3.8'
services:
# ========== RAG应用层 ==========
rag-app-1:
image: rag-service:latest
container_name: rag-app-1
restart: always
environment:
- SPRING_PROFILES_ACTIVE=prod
- SERVER_PORT=8080
- INSTANCE_ID=app-1
- OPENAI_API_KEY=${OPENAI_API_KEY}
- MILVUS_HOST=milvus-standalone
- MILVUS_PORT=19530
- REDIS_HOST=redis-master
ports:
- "8081:8080"
depends_on:
milvus-standalone:
condition: service_healthy
redis-master:
condition: service_healthy
networks:
- rag-network
deploy:
resources:
limits:
cpus: '2'
memory: 4G
reservations:
memory: 2G
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:8080/actuator/health"]
interval: 30s
timeout: 10s
retries: 3
start_period: 60s
logging:
driver: "json-file"
options:
max-size: "100m"
max-file: "5"
rag-app-2:
image: rag-service:latest
container_name: rag-app-2
restart: always
environment:
- SPRING_PROFILES_ACTIVE=prod
- SERVER_PORT=8080
- INSTANCE_ID=app-2
- OPENAI_API_KEY=${OPENAI_API_KEY}
- MILVUS_HOST=milvus-standalone
- MILVUS_PORT=19530
- REDIS_HOST=redis-master
ports:
- "8082:8080"
    depends_on:
      milvus-standalone:
        condition: service_healthy
      redis-master:
        condition: service_healthy
networks:
- rag-network
deploy:
resources:
limits:
cpus: '2'
memory: 4G
# ========== 负载均衡 ==========
nginx:
image: nginx:1.25-alpine
container_name: rag-nginx
restart: always
ports:
- "80:80"
- "443:443"
volumes:
- ./nginx/nginx.conf:/etc/nginx/nginx.conf:ro
- ./nginx/ssl:/etc/nginx/ssl:ro
depends_on:
- rag-app-1
- rag-app-2
networks:
- rag-network
# ========== 向量数据库 ==========
etcd:
image: quay.io/coreos/etcd:v3.5.5
container_name: milvus-etcd
restart: always
environment:
- ETCD_AUTO_COMPACTION_MODE=revision
- ETCD_AUTO_COMPACTION_RETENTION=1000
- ETCD_QUOTA_BACKEND_BYTES=4294967296
volumes:
- etcd_data:/etcd
command: >
etcd
--advertise-client-urls=http://127.0.0.1:2379
--listen-client-urls http://0.0.0.0:2379
--data-dir /etcd
networks:
- rag-network
minio:
image: minio/minio:RELEASE.2023-03-13T19-46-17Z
container_name: milvus-minio
restart: always
environment:
MINIO_ACCESS_KEY: minioadmin
MINIO_SECRET_KEY: ${MINIO_SECRET_KEY}
volumes:
- minio_data:/minio_data
command: minio server /minio_data --console-address ":9001"
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:9000/minio/health/live"]
interval: 30s
timeout: 20s
retries: 3
networks:
- rag-network
milvus-standalone:
image: milvusdb/milvus:v2.4.0
container_name: milvus-standalone
restart: always
command: ["milvus", "run", "standalone"]
environment:
ETCD_ENDPOINTS: etcd:2379
MINIO_ADDRESS: minio:9000
volumes:
- milvus_data:/var/lib/milvus
- ./milvus/milvus.yaml:/milvus/configs/milvus.yaml:ro
ports:
- "19530:19530"
- "9091:9091" # 监控端口
depends_on:
- etcd
- minio
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:9091/healthz"]
interval: 30s
timeout: 20s
retries: 3
networks:
- rag-network
# ========== 缓存 ==========
redis-master:
image: redis:7.2-alpine
container_name: redis-master
restart: always
command: >
redis-server
--requirepass ${REDIS_PASSWORD}
--maxmemory 2gb
--maxmemory-policy allkeys-lru
--save 900 1
--save 300 10
volumes:
- redis_data:/data
ports:
- "6379:6379"
healthcheck:
test: ["CMD", "redis-cli", "-a", "${REDIS_PASSWORD}", "ping"]
interval: 30s
timeout: 10s
retries: 3
networks:
- rag-network
# ========== 监控 ==========
prometheus:
image: prom/prometheus:v2.47.0
container_name: prometheus
restart: always
volumes:
- ./monitoring/prometheus.yml:/etc/prometheus/prometheus.yml:ro
- ./monitoring/alerts:/etc/prometheus/alerts:ro
- prometheus_data:/prometheus
ports:
- "9090:9090"
command:
- '--config.file=/etc/prometheus/prometheus.yml'
- '--storage.tsdb.retention.time=30d'
- '--web.enable-lifecycle'
networks:
- rag-network
grafana:
image: grafana/grafana:10.1.0
container_name: grafana
restart: always
environment:
- GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASSWORD}
volumes:
- grafana_data:/var/lib/grafana
- ./monitoring/grafana/dashboards:/etc/grafana/provisioning/dashboards:ro
ports:
- "3000:3000"
networks:
- rag-network
alertmanager:
image: prom/alertmanager:v0.26.0
container_name: alertmanager
restart: always
volumes:
- ./monitoring/alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
ports:
- "9093:9093"
networks:
- rag-network
volumes:
etcd_data:
minio_data:
milvus_data:
redis_data:
prometheus_data:
grafana_data:
networks:
rag-network:
    driver: bridge

Nginx负载均衡配置
# nginx/nginx.conf
upstream rag_backend {
least_conn;
server rag-app-1:8080 weight=1 max_fails=3 fail_timeout=30s;
server rag-app-2:8080 weight=1 max_fails=3 fail_timeout=30s;
keepalive 32;
}
server {
listen 80;
server_name rag.your-company.com;
return 301 https://$host$request_uri;
}
server {
listen 443 ssl http2;
server_name rag.your-company.com;
ssl_certificate /etc/nginx/ssl/cert.pem;
ssl_certificate_key /etc/nginx/ssl/key.pem;
# 超时配置(RAG响应可能较慢)
proxy_read_timeout 60s;
proxy_connect_timeout 10s;
proxy_send_timeout 60s;
location /api/ {
proxy_pass http://rag_backend;
proxy_http_version 1.1;
proxy_set_header Connection "";
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
# 健康检查失败时快速失败
proxy_next_upstream error timeout http_503;
proxy_next_upstream_tries 2;
}
# 健康检查端点
location /actuator/health {
proxy_pass http://rag_backend;
access_log off;
}
}

监控体系:全面覆盖的指标收集
Spring Boot监控配置
# application-prod.yml
spring:
application:
name: rag-service
management:
endpoints:
web:
exposure:
include: health,info,metrics,prometheus
endpoint:
health:
show-details: always
show-components: always
metrics:
tags:
application: ${spring.application.name}
instance: ${INSTANCE_ID:unknown}
environment: production
    distribution:
      percentiles-histogram:
        http.server.requests: true
        rag.retrieval.duration: true
        rag.embedding.duration: true
        rag.llm.duration: true
      percentiles:
        http.server.requests: 0.5,0.75,0.95,0.99
        rag.retrieval.duration: 0.5,0.95,0.99
prometheus:
metrics:
export:
enabled: true
step: 15s
logging:
level:
com.example.rag: INFO
org.springframework.ai: DEBUG
pattern:
console: "%d{yyyy-MM-dd HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %msg traceId=%X{traceId} %n"
  file:
    name: /var/log/rag-service/application.log
  logback:
    rollingpolicy:
      max-file-size: 100MB
      max-history: 30

自定义业务监控
// RagMetricsService.java
package com.example.rag.monitoring;
import io.micrometer.core.instrument.*;
import lombok.extern.slf4j.Slf4j;
import org.springframework.stereotype.Service;
import java.time.Duration;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;
/**
* RAG系统业务指标收集服务
*
* 暴露以下指标给Prometheus:
* - rag_queries_total: 总查询次数(按结果分类)
* - rag_retrieval_duration_seconds: 检索耗时分布
* - rag_embedding_duration_seconds: 嵌入耗时分布
* - rag_llm_duration_seconds: LLM耗时分布
* - rag_retrieved_documents: 每次检索返回的文档数
* - rag_active_queries: 当前活跃查询数
* - rag_knowledge_base_size: 知识库文档总数
*/
@Slf4j
@Service
public class RagMetricsService {
private final MeterRegistry registry;
// 计数器
private final Counter querySuccessCounter;
private final Counter queryFailCounter;
private final Counter queryEmptyResultCounter;
private final Counter embeddingApiErrorCounter;
private final Counter llmApiErrorCounter;
private final Counter rerankerFallbackCounter;
// 分布摘要
private final DistributionSummary retrievedDocsSummary;
private final DistributionSummary tokenUsageSummary;
// 活跃请求数(仪表)
private final AtomicInteger activeQueries = new AtomicInteger(0);
// 知识库大小
private final AtomicLong knowledgeBaseSize = new AtomicLong(0);
public RagMetricsService(MeterRegistry registry) {
this.registry = registry;
// 初始化计数器
querySuccessCounter = Counter.builder("rag.queries.total")
.description("RAG查询总次数")
.tag("result", "success")
.register(registry);
queryFailCounter = Counter.builder("rag.queries.total")
.description("RAG查询总次数")
.tag("result", "error")
.register(registry);
queryEmptyResultCounter = Counter.builder("rag.queries.total")
.description("RAG查询总次数")
.tag("result", "empty")
.register(registry);
embeddingApiErrorCounter = Counter.builder("rag.embedding.errors.total")
.description("嵌入API错误次数")
.register(registry);
llmApiErrorCounter = Counter.builder("rag.llm.errors.total")
.description("LLM API错误次数")
.register(registry);
rerankerFallbackCounter = Counter.builder("rag.reranker.fallback.total")
.description("Reranker降级次数")
.register(registry);
// 文档数分布
retrievedDocsSummary = DistributionSummary.builder("rag.retrieved.documents")
.description("每次检索返回的文档数")
.baseUnit("documents")
.register(registry);
// Token使用分布
tokenUsageSummary = DistributionSummary.builder("rag.token.usage")
.description("每次LLM调用的token消耗")
.baseUnit("tokens")
.register(registry);
// 活跃查询数(仪表)
Gauge.builder("rag.active.queries", activeQueries, AtomicInteger::get)
.description("当前正在处理的查询数")
.register(registry);
// 知识库大小(仪表)
Gauge.builder("rag.knowledge.base.size", knowledgeBaseSize, AtomicLong::get)
.description("知识库文档总数")
.register(registry);
}
/**
* 记录完整RAG查询指标
*/
public void recordQuerySuccess(long retrievalMs, long embeddingMs, long llmMs,
int docCount, int tokenCount) {
querySuccessCounter.increment();
retrievedDocsSummary.record(docCount);
tokenUsageSummary.record(tokenCount);
// 记录各阶段延迟
registry.timer("rag.retrieval.duration").record(Duration.ofMillis(retrievalMs));
registry.timer("rag.embedding.duration").record(Duration.ofMillis(embeddingMs));
registry.timer("rag.llm.duration").record(Duration.ofMillis(llmMs));
}
public void recordQueryError(String errorType) {
queryFailCounter.increment();
registry.counter("rag.errors.total", "type", errorType).increment();
}
public void recordQueryEmpty() {
queryEmptyResultCounter.increment();
}
public void recordEmbeddingError() {
embeddingApiErrorCounter.increment();
}
public void recordLlmError() {
llmApiErrorCounter.increment();
}
public void recordRerankerFallback() {
rerankerFallbackCounter.increment();
}
public void incrementActiveQueries() {
activeQueries.incrementAndGet();
}
public void decrementActiveQueries() {
activeQueries.decrementAndGet();
}
public void updateKnowledgeBaseSize(long size) {
knowledgeBaseSize.set(size);
}
}

Prometheus配置
# monitoring/prometheus.yml
global:
scrape_interval: 15s
evaluation_interval: 15s
external_labels:
cluster: 'prod-rag'
alerting:
alertmanagers:
- static_configs:
- targets: ['alertmanager:9093']
rule_files:
- /etc/prometheus/alerts/*.yml
scrape_configs:
# RAG应用实例
- job_name: 'rag-apps'
metrics_path: '/actuator/prometheus'
static_configs:
- targets: ['rag-app-1:8080', 'rag-app-2:8080']
relabel_configs:
- source_labels: [__address__]
target_label: instance
# Milvus向量数据库
- job_name: 'milvus'
static_configs:
- targets: ['milvus-standalone:9091']
metrics_path: '/metrics'
# Redis
- job_name: 'redis'
static_configs:
- targets: ['redis-exporter:9121']
# Node级别指标
- job_name: 'node'
static_configs:
- targets: ['node-exporter:9100']
# JVM指标
- job_name: 'jvm'
metrics_path: '/actuator/prometheus'
static_configs:
- targets: ['rag-app-1:8080', 'rag-app-2:8080']
metric_relabel_configs:
- source_labels: [__name__]
regex: 'jvm_.*'
        action: keep

关键告警规则
# monitoring/alerts/rag_alerts.yml
groups:
- name: rag_application
rules:
# =========== 可用性告警 ===========
- alert: RagServiceDown
expr: up{job="rag-apps"} == 0
for: 1m
labels:
severity: critical
annotations:
summary: "RAG服务实例宕机"
description: "实例 {{ $labels.instance }} 已宕机超过1分钟,立即处理!"
- alert: RagAllInstancesDown
    expr: sum(up{job="rag-apps"}) == 0 or absent(up{job="rag-apps"})
for: 30s
labels:
severity: critical
annotations:
summary: "RAG服务全部实例宕机!"
description: "所有RAG服务实例均不可用,生产事故!"
# =========== 延迟告警 ===========
- alert: RagHighP99Latency
    expr: >
      histogram_quantile(0.99,
        sum by (le, instance) (
          rate(http_server_requests_seconds_bucket{
            job="rag-apps", uri="/api/rag/query"
          }[5m])
        )
      ) > 10
for: 5m
labels:
severity: warning
annotations:
summary: "RAG接口P99延迟超过10秒"
description: "实例 {{ $labels.instance }} P99延迟 {{ $value | humanizeDuration }},检查LLM和检索服务"
- alert: RagEmbeddingHighLatency
expr: histogram_quantile(0.95, rate(rag_embedding_duration_seconds_bucket[5m])) > 3
for: 5m
labels:
severity: warning
annotations:
summary: "嵌入服务P95延迟超过3秒"
description: "可能是OpenAI API限流或网络问题"
# =========== 错误率告警 ===========
- alert: RagHighErrorRate
    expr: >
      sum(rate(rag_queries_total{result="error"}[5m]))
      / sum(rate(rag_queries_total[5m])) > 0.05
for: 5m
labels:
severity: warning
annotations:
summary: "RAG查询错误率超过5%"
description: "当前错误率: {{ $value | humanizePercentage }}"
- alert: RagCriticalErrorRate
    expr: >
      sum(rate(rag_queries_total{result="error"}[5m]))
      / sum(rate(rag_queries_total[5m])) > 0.20
for: 2m
labels:
severity: critical
annotations:
summary: "RAG查询错误率超过20%!"
description: "严重错误,立即处理!当前错误率: {{ $value | humanizePercentage }}"
- alert: RagHighEmptyResultRate
    expr: >
      sum(rate(rag_queries_total{result="empty"}[10m]))
      / sum(rate(rag_queries_total[10m])) > 0.30
for: 10m
labels:
severity: warning
annotations:
summary: "RAG空结果率超过30%"
description: "知识库可能存在问题,检查向量数据库数据完整性"
- name: milvus_alerts
rules:
# =========== Milvus告警 ===========
- alert: MilvusHighQueryLatency
    expr: >
      rate(milvus_querynode_sq_latency_bucket{le="0.5"}[5m])
      / rate(milvus_querynode_sq_latency_count[5m]) < 0.8
for: 5m
labels:
severity: warning
annotations:
summary: "Milvus查询延迟升高"
description: "超过20%的查询延迟>500ms,检查集合索引和内存"
- alert: MilvusHighMemoryUsage
    # 单机部署下用Milvus进程RSS占整机内存的比例近似内存使用率
    expr: >
      process_resident_memory_bytes{job="milvus"}
      / scalar(node_memory_MemTotal_bytes) > 0.8
for: 10m
labels:
severity: warning
annotations:
summary: "Milvus内存使用率超过80%"
- alert: MilvusDiskSpaceLow
expr: (node_filesystem_avail_bytes{mountpoint="/var/lib/milvus"} /
node_filesystem_size_bytes{mountpoint="/var/lib/milvus"}) < 0.2
for: 5m
labels:
severity: warning
annotations:
summary: "Milvus存储空间不足20%"
description: "剩余空间: {{ $value | humanizePercentage }},尽快扩容"
- name: infrastructure_alerts
rules:
- alert: HighCPUUsage
expr: 100 - (avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
for: 10m
labels:
severity: warning
annotations:
summary: "服务器CPU使用率超过80%"
- alert: HighMemoryUsage
expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
for: 5m
labels:
severity: warning
annotations:
summary: "服务器内存使用率超过85%"
- alert: JvmHighHeapUsage
expr: >
jvm_memory_used_bytes{area="heap"}
/ jvm_memory_max_bytes{area="heap"} > 0.85
for: 5m
labels:
severity: warning
annotations:
        summary: "JVM堆内存使用率超过85%,可能发生OOM"

AlertManager通知配置
# monitoring/alertmanager.yml
global:
resolve_timeout: 5m
route:
group_by: ['alertname', 'severity']
group_wait: 30s
group_interval: 5m
repeat_interval: 12h
receiver: 'default'
routes:
- match:
severity: critical
receiver: 'critical-alerts'
repeat_interval: 1h
- match:
severity: warning
receiver: 'warning-alerts'
repeat_interval: 4h
receivers:
- name: 'default'
webhook_configs:
- url: 'http://your-webhook-service/alert'
  - name: 'critical-alerts'
    # 钉钉通知(生产事故)
    # 注意:Alertmanager的webhook_configs不支持title/text等模板字段,
    # 且配置文件不会展开环境变量;钉钉机器人需经prometheus-webhook-dingtalk
    # 之类的适配器转发,消息模板在适配器侧配置
    webhook_configs:
      - url: 'http://dingtalk-webhook:8060/dingtalk/webhook1/send'
        send_resolved: true
pagerduty_configs:
- service_key: ${PAGERDUTY_KEY}
- name: 'warning-alerts'
webhook_configs:
- url: 'https://oapi.dingtalk.com/robot/send?access_token=${DINGTALK_TOKEN}'
        send_resolved: true

故障排查手册
故障排查流程

排查遵循"从外到内"的顺序:先确认影响范围(哪些接口异常、影响多少用户),再看Grafana面板锁定异常组件,然后执行诊断脚本收集证据,最后修复验证并复盘归档。
诊断命令速查手册
#!/bin/bash
# rag_diagnosis.sh - RAG故障快速诊断脚本
echo "=========================================="
echo " RAG系统快速诊断工具"
echo "=========================================="
# 1. 检查服务实例状态
echo ""
echo "【1】检查服务实例健康状态"
echo "--------------------------------------------"
for port in 8081 8082; do
status=$(curl -s -o /dev/null -w "%{http_code}" http://localhost:${port}/actuator/health)
if [ "$status" = "200" ]; then
echo " ✓ 实例 :${port} 健康"
else
echo " ✗ 实例 :${port} 异常 (HTTP ${status})"
fi
done
# 2. 检查Milvus状态
echo ""
echo "【2】检查Milvus向量数据库状态"
echo "--------------------------------------------"
milvus_status=$(curl -s http://localhost:9091/healthz)
echo " Milvus健康状态: ${milvus_status}"
# 查询Milvus集合信息(Milvus 2.4的RESTful API与gRPC同在19530端口)
echo " Milvus集合列表:"
curl -s -X POST "http://localhost:19530/v2/vectordb/collections/list" \
  -H "Content-Type: application/json" -d '{}' | python3 -m json.tool 2>/dev/null | head -20
# 3. 检查Redis状态
echo ""
echo "【3】检查Redis状态"
echo "--------------------------------------------"
redis-cli -a ${REDIS_PASSWORD} ping 2>/dev/null && echo " ✓ Redis连接正常" || echo " ✗ Redis连接失败"
redis-cli -a ${REDIS_PASSWORD} info memory 2>/dev/null | grep -E "used_memory_human|maxmemory_human|mem_fragmentation_ratio"
# 4. 检查外部API连通性
echo ""
echo "【4】检查外部API连通性"
echo "--------------------------------------------"
openai_status=$(curl -s -o /dev/null -w "%{http_code}" -m 5 https://api.openai.com/v1/models \
-H "Authorization: Bearer ${OPENAI_API_KEY}")
echo " OpenAI API: HTTP ${openai_status}"
# 5. 检查JVM状态
echo ""
echo "【5】检查JVM内存状态"
echo "--------------------------------------------"
curl -s http://localhost:8081/actuator/metrics/jvm.memory.used | \
python3 -c "import json,sys; d=json.load(sys.stdin); [print(f' {m[\"statistic\"]}: {m[\"value\"]/1024/1024:.1f}MB') for m in d['measurements']]" 2>/dev/null
# 6. 最近错误日志
echo ""
echo "【6】最近10条错误日志"
echo "--------------------------------------------"
docker logs rag-app-1 --tail 100 2>&1 | grep -i "error\|exception\|warn" | tail -10
# 7. 近5分钟请求统计
echo ""
echo "【7】近5分钟请求统计(Prometheus查询)"
echo "--------------------------------------------"
echo " 请运行: curl 'http://localhost:9090/api/v1/query?query=rate(rag_queries_total[5m])'"
echo ""
echo "=========================================="
echo " 诊断完成"
echo "=========================================="

常见故障解决方案
// TroubleshootingGuide.java - 用注释形式呈现排查手册
/**
* ==========================================
* 故障1:Milvus查询超时(检索延迟>1s)
* ==========================================
*
* 症状:向量检索P99超时,rag_retrieval_duration_seconds P99 > 1s
*
* 可能原因:
* 1. 集合未加载到内存(最常见!)
* 2. 索引参数不合理(nprobe太大)
* 3. Milvus节点内存不足,触发交换
* 4. 向量维度与索引不匹配
*
* 排查命令:
* ```python
* from pymilvus import connections, Collection, utility
* connections.connect("default", host="localhost", port="19530")
*
* # 检查集合加载状态
* col = Collection("knowledge_base")
* print(col.get_load_state()) # 应该是 Loaded
*
* # 如果未加载,执行加载
* col.load()
*
* # 检查索引信息
* print(col.index().params)
* ```
*
* 解决方案:
* 1. 确保集合已加载:col.load()
* 2. 优化nprobe参数(IVF_FLAT索引建议nprobe=32~128)
* 3. 增加Milvus内存配置
*
* ==========================================
* 故障2:嵌入API频繁503/429
* ==========================================
*
* 症状:rag_embedding_errors_total 持续增长
*
* 可能原因:
* 1. OpenAI API限流(Rate Limit Exceeded)
* 2. 网络波动
* 3. API Key失效
*
* 排查步骤:
* curl -s https://api.openai.com/v1/embeddings \
* -H "Authorization: Bearer ${OPENAI_API_KEY}" \
* -H "Content-Type: application/json" \
* -d '{"model":"text-embedding-3-small","input":"test"}'
*
* 解决方案:
* 1. 429限流:降低并发,加指数退避重试
* 2. 网络问题:检查出口IP是否被封,考虑切换API endpoint
* 3. Key失效:更新API Key
*
* ==========================================
* 故障3:JVM OOM(OutOfMemoryError)
* ==========================================
*
* 症状:服务宕机,堆内存使用持续增长
*
* 原因分析:
* 1. 检索到的文档内容太长,加载过多到内存
* 2. 向量数据缓存未限制大小
* 3. 线程池无限增长
*
* 解决方案:
* -Xmx4g -Xms2g # 限制堆内存
* -XX:+HeapDumpOnOutOfMemoryError # OOM时自动dump
* -XX:HeapDumpPath=/tmp/oom.hprof
*
* 代码层面:
* - 限制每次检索的文档长度(truncate to 2000 chars)
* - 使用Caffeine缓存并设置maxSize
* - 线程池使用有界队列
*/
public class TroubleshootingGuide {}

数据备份:向量数据库备份与恢复
Milvus备份方案
#!/bin/bash
# milvus_backup.sh - Milvus数据备份脚本
set -euo pipefail
BACKUP_DATE=$(date +%Y%m%d_%H%M%S)
BACKUP_DIR="/backup/milvus/${BACKUP_DATE}"
MINIO_BUCKET="milvus-backup"
RETENTION_DAYS=7
log() {
echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1"
}
# 1. 创建备份目录
log "创建备份目录: ${BACKUP_DIR}"
mkdir -p "${BACKUP_DIR}"
# 2. 使用Milvus Backup工具备份
log "开始备份Milvus数据..."
docker run --rm \
  --network rag-network \
  -v "${BACKUP_DIR}:/backup" \
  milvusdb/milvus-backup:latest \
  create \
  --config /backup/backup.yaml \
  --name "backup_${BACKUP_DATE}" \
  --colls knowledge_base
# 注:Milvus/MinIO连接信息在milvus-backup的配置文件(backup.yaml)中指定,
# 子命令与参数名随版本有差异,以所用版本的文档为准
log "Milvus备份完成"
# 3. 同时备份MinIO中的原始向量数据
log "备份MinIO数据..."
# mc需要先通过alias注册MinIO地址,之后才能引用myminio
docker run --rm \
  --network rag-network \
  -v "${BACKUP_DIR}:/backup" \
  --entrypoint /bin/sh \
  minio/mc:latest \
  -c "mc alias set myminio http://minio:9000 minioadmin ${MINIO_SECRET_KEY} && \
      mc mirror --overwrite myminio/milvus-bucket /backup/minio_data"
log "MinIO备份完成,大小: $(du -sh ${BACKUP_DIR} | cut -f1)"
# 4. 压缩备份
log "压缩备份文件..."
cd /backup/milvus
tar -czf "milvus_backup_${BACKUP_DATE}.tar.gz" "${BACKUP_DATE}/"
BACKUP_SIZE=$(du -sh "milvus_backup_${BACKUP_DATE}.tar.gz" | cut -f1)
log "压缩完成,大小: ${BACKUP_SIZE}"
# 5. 上传到远程存储(S3/阿里云OSS)
log "上传到远程存储..."
aws s3 cp "milvus_backup_${BACKUP_DATE}.tar.gz" \
"s3://${MINIO_BUCKET}/milvus_backup_${BACKUP_DATE}.tar.gz" \
--storage-class STANDARD_IA
log "上传完成"
# 6. 清理本地过期备份(保留7天)
log "清理过期备份..."
find /backup/milvus -name "*.tar.gz" -mtime +${RETENTION_DAYS} -delete
find /backup/milvus -type d -mtime +${RETENTION_DAYS} -exec rm -rf {} + 2>/dev/null || true
# 7. 验证备份完整性
log "验证备份文件..."
if tar -tzf "milvus_backup_${BACKUP_DATE}.tar.gz" > /dev/null 2>&1; then
log "✓ 备份文件完整性验证通过"
else
log "✗ 备份文件损坏!发送告警..."
# 发送告警...
exit 1
fi
# 8. 发送备份报告
log "发送备份完成通知..."
curl -X POST "https://oapi.dingtalk.com/robot/send?access_token=${DINGTALK_TOKEN}" \
-H 'Content-Type: application/json' \
-d "{
\"msgtype\": \"text\",
\"text\": {
\"content\": \"[RAG备份] Milvus备份成功\\n时间: ${BACKUP_DATE}\\n大小: ${BACKUP_SIZE}\\n保留天数: ${RETENTION_DAYS}天\"
}
}" 2>/dev/null || true
log "=== 备份完成 ==="

恢复流程
#!/bin/bash
# milvus_restore.sh - Milvus数据恢复脚本
BACKUP_DATE=$1 # 参数:要恢复的备份日期,如20251201_030000
if [ -z "${BACKUP_DATE}" ]; then
echo "用法: $0 <backup_date>"
echo "例子: $0 20251201_030000"
exit 1
fi
log() {
echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1"
}
# 1. 下载备份文件
log "从S3下载备份: ${BACKUP_DATE}"
aws s3 cp \
"s3://milvus-backup/milvus_backup_${BACKUP_DATE}.tar.gz" \
"/tmp/milvus_backup_${BACKUP_DATE}.tar.gz"
# 2. 解压
log "解压备份文件..."
tar -xzf "/tmp/milvus_backup_${BACKUP_DATE}.tar.gz" -C /tmp/
# 3. 停止RAG应用(避免写入冲突)
log "停止RAG应用实例..."
docker stop rag-app-1 rag-app-2 || true
# 4. 执行Milvus恢复
log "开始恢复Milvus数据..."
docker run --rm \
  --network rag-network \
  -v "/tmp/${BACKUP_DATE}:/backup" \
  milvusdb/milvus-backup:latest \
  restore \
  --config /backup/backup.yaml \
  --name "backup_${BACKUP_DATE}" \
  --suffix "_restored"
# 注:恢复出的集合名为knowledge_base_restored;
# 参数名随milvus-backup版本有差异,以所用版本的文档为准
log "数据恢复完成"
# 5. 验证恢复结果
log "验证恢复结果..."
python3 << 'EOF'
from pymilvus import connections, Collection, utility
connections.connect("default", host="localhost", port="19530")
if utility.has_collection("knowledge_base_restored"):
col = Collection("knowledge_base_restored")
count = col.num_entities
print(f" 恢复的集合文档数: {count}")
if count > 0:
print(" ✓ 恢复验证通过")
else:
print(" ✗ 恢复失败,文档数为0")
else:
print(" ✗ 集合不存在")
EOF
# 6. 切换集合(原子操作)
log "切换到恢复的集合..."
# 实际操作中,需要更新应用配置或使用集合别名
# 7. 重启应用
log "重启RAG应用..."
docker start rag-app-1 rag-app-2
log "=== 恢复完成 ==="

容量规划
存储容量计算
// CapacityPlanner.java
/**
* RAG系统容量规划计算器
*
* 向量存储计算:
* - 每个向量维度:1536(text-embedding-3-small)
* - 每个维度:4字节(float32)
* - 每个向量:1536 * 4 = 6144字节 ≈ 6KB
*
* 存储估算:
* - 10万文档:100,000 * 6KB = 600MB
* - 100万文档:100万 * 6KB = 6GB
* - 1000万文档:1000万 * 6KB = 60GB
*
* Milvus内存要求(向量需要加载到内存):
* - 建议内存 = 向量存储大小 * 1.5(索引开销)
* - 100万文档:6GB * 1.5 = 9GB内存
*
* 服务器配置建议:
*
* 小规模(10万文档以内):
* - CPU: 4核
* - 内存: 8GB(Milvus 4GB + OS 4GB)
* - 磁盘: 50GB SSD
* - 月费用: ~$100(ECS/EC2)
*
* 中规模(100万文档以内):
* - CPU: 8核
* - 内存: 32GB(Milvus 16GB + OS 16GB)
* - 磁盘: 200GB SSD
* - 月费用: ~$500(ECS/EC2)
*
* 大规模(1000万文档):
* - 建议使用Milvus集群模式
* - 内存: 128GB+
* - 考虑独立QueryNode / DataNode
*/
public class CapacityPlanner {
static final int VECTOR_DIM = 1536;
static final int BYTES_PER_FLOAT = 4;
static final double INDEX_OVERHEAD = 1.5;
static final double SAFETY_MARGIN = 1.3; // 30%安全余量
public static long estimateStorageBytes(long documentCount) {
long vectorBytes = documentCount * VECTOR_DIM * BYTES_PER_FLOAT;
// 加上元数据存储(约每文档1KB)
long metadataBytes = documentCount * 1024;
return vectorBytes + metadataBytes;
}
public static long estimateMemoryBytes(long documentCount) {
long storageBytes = estimateStorageBytes(documentCount);
return (long) (storageBytes * INDEX_OVERHEAD * SAFETY_MARGIN);
}
public static void printCapacityReport(long documentCount) {
long storageGB = estimateStorageBytes(documentCount) / 1024 / 1024 / 1024;
long memoryGB = estimateMemoryBytes(documentCount) / 1024 / 1024 / 1024;
System.out.printf("=== 容量规划报告(%,d 文档)===%n", documentCount);
System.out.printf("存储需求: %d GB%n", Math.max(1, storageGB));
System.out.printf("内存需求: %d GB%n", Math.max(4, memoryGB));
}
}

零停机版本升级

说明:standalone模式下重启Milvus期间检索会短暂不可用,下述流程追求的是"停写、读中断尽量短";严格意义上的零停机升级需要Milvus集群模式滚动升级。
Milvus升级流程
#!/bin/bash
# milvus_upgrade.sh - Milvus零停机升级
TARGET_VERSION=$1 # 目标版本,如v2.4.1
if [ -z "${TARGET_VERSION}" ]; then
  echo "用法: $0 <target_version>"
  exit 1
fi
CURRENT_VERSION=$(docker inspect milvus-standalone --format '{{.Config.Image}}' | cut -d: -f2)
log() {
echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1"
}
log "升级Milvus: ${CURRENT_VERSION} → ${TARGET_VERSION}"
# 步骤1:备份数据(必须!)
log "步骤1/5: 备份数据..."
./milvus_backup.sh
# 步骤2:拉取新版本镜像
log "步骤2/5: 拉取新版本镜像..."
docker pull milvusdb/milvus:${TARGET_VERSION}
# 步骤3:停止写入(RAG只读模式)
log "步骤3/5: 切换RAG应用到只读模式..."
# 通知应用层禁止新文档写入
curl -X POST http://localhost:8081/admin/maintenance/readonly
curl -X POST http://localhost:8082/admin/maintenance/readonly
# 等待现有写入完成
log "等待进行中的写入完成..."
sleep 10
# 步骤4:升级Milvus
log "步骤4/5: 重启Milvus为新版本..."
docker stop milvus-standalone
docker rm milvus-standalone
docker run -d \
--name milvus-standalone \
--network rag-network \
-e ETCD_ENDPOINTS=etcd:2379 \
-e MINIO_ADDRESS=minio:9000 \
-v milvus_data:/var/lib/milvus \
milvusdb/milvus:${TARGET_VERSION} \
milvus run standalone
# 等待Milvus启动
log "等待Milvus启动..."
for i in {1..30}; do
if curl -s http://localhost:9091/healthz | grep -q "OK"; then
log "✓ Milvus启动成功"
break
fi
sleep 5
echo " 等待中... (${i}/30)"
done
# 步骤5:恢复读写
log "步骤5/5: 恢复RAG应用读写模式..."
curl -X POST http://localhost:8081/admin/maintenance/readwrite
curl -X POST http://localhost:8082/admin/maintenance/readwrite
log "✓ Milvus升级完成: ${TARGET_VERSION}"
# 冒烟测试
log "执行冒烟测试..."
TEST_RESULT=$(curl -s -X POST http://localhost:8081/api/rag/query \
-H "Content-Type: application/json" \
-d '{"question":"test"}')
if echo "${TEST_RESULT}" | grep -q "answer"; then
log "✓ 冒烟测试通过"
else
log "✗ 冒烟测试失败!执行回滚..."
# 回滚逻辑...
fi

运维自动化:健康检查脚本
// HealthCheckScheduler.java
package com.example.rag.monitoring;
import lombok.RequiredArgsConstructor;
import lombok.extern.slf4j.Slf4j;
import org.springframework.ai.vectorstore.VectorStore;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthIndicator;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;
import org.springframework.web.reactive.function.client.WebClient;
import java.time.Duration;
import java.time.LocalDateTime;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;
/**
* RAG系统定时健康检查
*
* 每分钟执行一次全面健康检查,并汇报指标
*/
@Slf4j
@Component
@RequiredArgsConstructor
public class HealthCheckScheduler implements HealthIndicator {
private final VectorStore vectorStore;
private final RagMetricsService metricsService;
private final WebClient.Builder webClientBuilder;
@Value("${spring.ai.openai.api-key}")
private String openaiApiKey;
// 最后一次健康检查结果
private final AtomicReference<HealthStatus> lastHealthStatus =
new AtomicReference<>(new HealthStatus());
@Override
public Health health() {
HealthStatus status = lastHealthStatus.get();
if (status.isHealthy()) {
return Health.up()
.withDetails(status.getDetails())
.build();
} else {
return Health.down()
.withDetails(status.getDetails())
.build();
}
}
/**
* 每分钟定时健康检查
*/
@Scheduled(fixedRate = 60000)
public void scheduledHealthCheck() {
log.debug("执行定时健康检查...");
HealthStatus status = new HealthStatus();
// 1. 检查向量数据库连接
checkVectorStore(status);
// 2. 检查OpenAI API
checkOpenAIApi(status);
// 3. 检查知识库数据完整性
checkKnowledgeBaseIntegrity(status);
lastHealthStatus.set(status);
if (!status.isHealthy()) {
log.warn("健康检查发现问题: {}", status.getIssues());
}
}
private void checkVectorStore(HealthStatus status) {
try {
long start = System.currentTimeMillis();
// 执行一次简单查询验证向量库连通性
var results = vectorStore.similaritySearch("health check test query");
long elapsed = System.currentTimeMillis() - start;
status.getDetails().put("vectorStore.latencyMs", elapsed);
status.getDetails().put("vectorStore.status", "UP");
if (elapsed > 1000) {
status.addIssue("向量数据库响应慢: " + elapsed + "ms");
}
} catch (Exception e) {
status.setHealthy(false);
status.addIssue("向量数据库异常: " + e.getMessage());
status.getDetails().put("vectorStore.status", "DOWN");
log.error("向量数据库健康检查失败", e);
}
}
private void checkOpenAIApi(HealthStatus status) {
try {
long start = System.currentTimeMillis();
var webClient = webClientBuilder.build();
// 发送一个最小的嵌入请求验证API可用性
webClient.post()
.uri("https://api.openai.com/v1/embeddings")
.header("Authorization", "Bearer " + openaiApiKey)
.bodyValue(Map.of(
"model", "text-embedding-3-small",
"input", "health"
))
.retrieve()
.bodyToMono(String.class)
.timeout(Duration.ofSeconds(10))
.block();
long elapsed = System.currentTimeMillis() - start;
status.getDetails().put("openaiApi.latencyMs", elapsed);
status.getDetails().put("openaiApi.status", "UP");
} catch (Exception e) {
status.addIssue("OpenAI API异常: " + e.getMessage());
status.getDetails().put("openaiApi.status", "DOWN");
}
}
private void checkKnowledgeBaseIntegrity(HealthStatus status) {
try {
// 这里可以查询Milvus获取实际文档数
// 与预期数量对比,判断是否有数据丢失
status.getDetails().put("knowledgeBase.lastChecked", LocalDateTime.now().toString());
} catch (Exception e) {
log.warn("知识库完整性检查失败: {}", e.getMessage());
}
}
@lombok.Data
static class HealthStatus {
private boolean healthy = true;
private Map<String, Object> details = new HashMap<>();
private java.util.List<String> issues = new java.util.ArrayList<>();
public void addIssue(String issue) {
issues.add(issue);
this.healthy = false;
}
}
}

常见问题解答
Q1:向量数据库重启后,RAG服务要多久才能恢复正常?
Milvus重启后需要把集合重新加载到内存,这个过程叫"集合加载"(load collection)。耗时取决于数据量:100万文档约需2-5分钟,1000万文档可能超过30分钟。生产上建议把"确认集合处于Loaded状态"写进启动后的检查项,并确认所用Milvus版本的自动加载/预热能力,以减少冷启动时间。
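重启后的第一件事就是确认集合处于Loaded状态,可以用下面的小脚本检查并在需要时触发加载(集合名`knowledge_base`与连接参数按实际环境替换;`utility.load_state`的返回形式随pymilvus版本略有差异,此处仅作示意):

```python
import time


def human_duration(seconds: float) -> str:
    """把秒数格式化为人类可读的时长,用于日志输出"""
    if seconds < 60:
        return f"{seconds:.1f}s"
    return f"{int(seconds // 60)}m{int(seconds % 60)}s"


def ensure_loaded(name="knowledge_base", host="localhost", port="19530"):
    """检查集合是否已加载,未加载则触发加载并计时(需要pymilvus)"""
    from pymilvus import Collection, connections, utility  # 延迟导入

    connections.connect("default", host=host, port=port)
    state = utility.load_state(name)
    print(f"集合 {name} 当前状态: {state}")
    if "Loaded" not in str(state):
        start = time.time()
        Collection(name).load()  # 阻塞直到加载完成
        print(f"加载耗时: {human_duration(time.time() - start)}")


if __name__ == "__main__":
    ensure_loaded()
```

可以把这段脚本挂进部署流水线,作为Milvus重启后的自动检查步骤。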
Q2:RAG系统应该保留多久的日志?
应用日志至少保留30天(满足审计要求),访问日志保留90天,告警记录保留1年。日志存储成本不高,但在事故复盘时极为宝贵。建议用ELK或阿里云SLS集中存储,避免日志只在容器内。
Q3:向量数据库磁盘写满了怎么办?
紧急处置:先扩容磁盘(云环境可热扩容),再清理过期或低质量的文档。不要随意删除Milvus数据目录下的文件,可能导致数据损坏。长期方案:设置存储告警阈值(建议80%告警,90%限流),做好容量规划。
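上面"80%告警、90%限流"的两档阈值可以落成一个极简巡检脚本(路径`/var/lib/milvus`为示意,按实际挂载点调整):

```python
import shutil


def usage_percent(path: str) -> float:
    """返回指定挂载点的磁盘使用率(0~100)"""
    total, used, _free = shutil.disk_usage(path)
    return used / total * 100


def classify(percent: float, warn=80, throttle=90) -> str:
    """按上文建议的阈值分级:80%告警,90%限流"""
    if percent >= throttle:
        return "throttle"  # 停止写入新文档,立即扩容
    if percent >= warn:
        return "warn"      # 发告警,安排清理或扩容
    return "ok"


if __name__ == "__main__":
    p = usage_percent("/var/lib/milvus")  # 路径按实际部署调整
    print(f"磁盘使用率 {p:.1f}% -> {classify(p)}")
```

配合crontab每分钟跑一次,结果可推送到钉钉或转成Prometheus指标。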
Q4:如何判断知识库的数据质量下降了?
监控"空结果率"(rag_queries_total{result="empty"})和"用户负反馈率"(如果有反馈功能)。空结果率突然升高通常意味着:知识库数据被误删、向量重新构建后版本不兼容、嵌入模型变更导致向量分布变化。
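日常巡检时可以直接调用Prometheus的HTTP API计算空结果率(`http://localhost:9090`对应前文compose中的端口映射;分子与分母都先`sum`聚合,避免result标签不匹配):

```python
import json
import urllib.parse
import urllib.request

# 近10分钟空结果率的PromQL
EMPTY_RATE_QUERY = (
    'sum(rate(rag_queries_total{result="empty"}[10m]))'
    ' / sum(rate(rag_queries_total[10m]))'
)


def query_prometheus(prom_url: str, promql: str) -> float:
    """执行Prometheus即时查询,返回第一个样本的数值"""
    url = prom_url + "/api/v1/query?" + urllib.parse.urlencode({"query": promql})
    with urllib.request.urlopen(url, timeout=5) as resp:
        data = json.load(resp)
    # 响应结构: {"status":"success","data":{"result":[{"value":[ts, "0.12"]}]}}
    return float(data["data"]["result"][0]["value"][1])


if __name__ == "__main__":
    rate = query_prometheus("http://localhost:9090", EMPTY_RATE_QUERY)
    print(f"近10分钟空结果率: {rate:.1%}")
```

同一个函数换个PromQL就能查错误率、P99延迟等其他指标。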
Q5:多实例部署时,知识库更新如何保持一致?
知识库更新(文档写入Milvus)本身是全局的,所有实例共享同一个Milvus,写入后所有实例立即生效。但如果有本地缓存(如Redis缓存嵌入结果),更新文档后需要清除相关缓存。建议通过Redis Pub/Sub实现缓存失效通知。
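答案里提到的Redis Pub/Sub失效通知,可以用redis-py写成如下骨架(频道名与消息格式为本文示意约定;Java侧用Spring Data Redis的MessageListener实现同样的订阅逻辑):

```python
import json

CHANNEL = "rag:cache:invalidate"  # 约定的失效通知频道(示意命名)


def invalidation_message(doc_id: str) -> str:
    """构造缓存失效消息:告诉各实例哪个文档的缓存该清掉"""
    return json.dumps({"event": "doc_updated", "doc_id": doc_id})


def publish_invalidation(doc_id: str, host="redis-master", password=None):
    """文档写入Milvus成功后调用:广播失效通知(需要redis-py)"""
    import redis
    r = redis.Redis(host=host, password=password)
    r.publish(CHANNEL, invalidation_message(doc_id))


def listen_and_invalidate(local_cache: dict, host="redis-master", password=None):
    """每个应用实例起一个线程跑这个循环,收到通知就清本地缓存"""
    import redis
    pubsub = redis.Redis(host=host, password=password).pubsub()
    pubsub.subscribe(CHANNEL)
    for msg in pubsub.listen():
        if msg["type"] != "message":
            continue
        doc_id = json.loads(msg["data"])["doc_id"]
        local_cache.pop(doc_id, None)  # 清掉该文档相关的缓存条目
```

注意Pub/Sub消息不持久化:实例重启期间错过的通知收不到,因此本地缓存还应设置较短的TTL作为兜底。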
Q6:运维人员没有AI背景,如何快速培训?
重点培训三件事:1)会看Grafana面板,认识关键指标;2)会执行诊断脚本,会看日志;3)有问题先重启再排查(RAG系统大多数问题重启能临时恢复)。详细原理可以后续学习,但紧急情况下能通过SOP处理才是核心。
总结
企业级RAG系统的运维不是"把代码跑起来",而是"保证代码持续稳定地跑"。
行动清单:

1. 部署Prometheus + Grafana + AlertManager,接入应用、Milvus、Redis与主机指标
2. 配置延迟、错误率、空结果率等关键告警,打通钉钉/PagerDuty通知链路
3. 准备一键诊断脚本与故障排查手册,让值班人员按SOP处置
4. 建立Milvus定时备份,并定期做恢复演练
5. 按文档量做容量规划,提前设置磁盘与内存告警阈值
6. 制定版本升级SOP:升级前必须备份,升级后必须冒烟测试
一个没有运维保障的RAG系统,只是在等待下一次凌晨3点的电话。
