SLA Management for AI Applications: Defining and Honoring Availability Commitments for AI Services
date: 2026-10-22
tags: [SLA, availability, service governance, Spring AI, Java]
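Opening Story: From "Hard to Pin Down" to "In Black and White"
In June 2025, the AI platform team at an internet company was having a rough time every single day.
Zhou Xiao, the AI platform team lead, spent more than three hours a week handling complaints from business teams. The complaints were all variations on a theme:
"Your AI customer service bot is down again today!"
"Why did the AI analyze this ticket wrong? What accuracy can you guarantee?"
"Before launch you said the AI would respond within 2 seconds. Why does it keep taking 10?"
Zhou Xiao's frustration was just as real: "An AI service is not a traditional API. The model itself is non-deterministic, and the external API occasionally rate-limits us. Can we really control that?"
Both sides had a point, but collaboration was extremely inefficient.
Technical director Sun Peng called both sides into a meeting and proposed a solution: define an SLA (Service Level Agreement) for the AI service.
In that meeting, the two sides took two hours to agree on every metric:
- Availability: 99.5% per month
- P99 latency: within 3 seconds
- Result quality: at least 92% of results rated "usable" by the business side
- Error budget: about 3.6 hours of allowed downtime per month (0.5% of a 30-day month)
More importantly, they agreed on two-way responsibilities: the AI team commits to meeting the targets, and the business side commits not to escalate downtime that stays within the error budget as an emergency (it goes through the normal bug process instead).
A month later, Zhou Xiao was pleasantly surprised: complaints had dropped by 70%.
Not because the system had improved (one month is too short for much to change), but because both sides had aligned on what "good" means.
Previously the business side felt that "any problem deserves a complaint". Now they read the monthly report: availability was 99.7% this month, beating the 99.5% target, with 40% of the error budget still unspent, and they feel reassured.
This article walks you through building this SLA management system for AI services from scratch.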
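1. What Makes SLAs for AI Applications Different
1.1 Traditional SLA vs. AI SLA
The biggest challenge of an AI SLA: result quality is hard to quantify
A traditional service SLA only has to measure "was a result returned", but an AI service also has to measure "was the returned result useful". That requires defining, together with the business side, what counts as a usable AI output.
1.2 Layered SLA Design
Layer 1: Basic availability SLA (fully under the AI team's control)
- Service uptime: ≥ 99.9% per month
- API response time (excluding AI inference): P99 < 500ms
Layer 2: End-to-end latency SLA (partially controllable)
- AI inference end-to-end P50: < 1s
- AI inference end-to-end P90: < 2s
- AI inference end-to-end P99: < 5s
Note: these depend on the external model service, so the SLA must state the preconditions
Layer 3: Quality SLA (the hardest to commit to)
- Result usability rate: ≥ 92% per month
- Definition: out of every 100 results the business side samples, at least 92 meet business expectations
- Evaluation cycle: evaluated monthly, with quarterly trend analysis
Layer 4: Cost SLA (internal)
- Token cost stays within ±20% of budget
- Peak cost: no single day exceeds 15% of the monthly budget
2. SLA Metric Design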
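As a minimal sketch of what "availability" means once exclusions are agreed, the snippet below computes a request-based availability SLI. It assumes that excluded categories (client errors, planned maintenance) are removed from both the numerator and the denominator; the `Outcome` labels and the `AvailabilitySli` class are hypothetical names used only for illustration.

```java
import java.util.List;
import java.util.Set;

/** Sketch: request-based availability SLI with excluded categories. */
final class AvailabilitySli {

    /** Hypothetical outcome labels; map your own error codes onto them. */
    enum Outcome { OK, SERVER_ERROR, UPSTREAM_TIMEOUT, CLIENT_ERROR_4XX, PLANNED_MAINTENANCE }

    /** Categories both parties agreed not to count against the SLA. */
    private static final Set<Outcome> EXCLUDED =
        Set.of(Outcome.CLIENT_ERROR_4XX, Outcome.PLANNED_MAINTENANCE);

    /** Returns good / counted, where excluded outcomes are dropped from both sides. */
    static double availability(List<Outcome> outcomes) {
        long counted = 0;
        long good = 0;
        for (Outcome o : outcomes) {
            if (EXCLUDED.contains(o)) {
                continue; // not counted for or against the SLA
            }
            counted++;
            if (o == Outcome.OK) {
                good++;
            }
        }
        // With no countable traffic there is nothing to violate
        return counted == 0 ? 1.0 : (double) good / counted;
    }
}
```

For example, four `OK` outcomes, one `SERVER_ERROR`, one `CLIENT_ERROR_4XX`, and one `PLANNED_MAINTENANCE` give 4 / 5 = 0.8, because the last two are excluded. Whether exclusions shrink the denominator is itself something to settle in the SLA text.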
2.1 SLI (Service Level Indicator) Definitions
// SlaMetricsDefinition.java
public class SlaMetricsDefinition {
/**
* 可用性SLI计算
* = 成功请求数 / 总请求数
*
* 成功的定义:HTTP 200且返回有效结果
* 排除:客户端错误(4xx)、计划内维护窗口
*/
public static final SliDefinition AVAILABILITY = SliDefinition.builder()
.name("availability")
.displayName("服务可用性")
.formula("(total_requests - error_requests) / total_requests")
.errorBudgetMonthly(0.005) // 0.5% = 每月约3.6小时
.excludedErrors(List.of("CLIENT_ERROR_4XX", "PLANNED_MAINTENANCE"))
.build();
/**
* 延迟SLI
* = 在阈值内完成的请求数 / 总请求数
*
* P99 < 5s的达标率
*/
public static final SliDefinition LATENCY = SliDefinition.builder()
.name("latency")
.displayName("响应延迟")
.formula("requests_under_5s / total_requests")
.targetRatio(0.99)
.errorBudgetMonthly(0.01)
.build();
/**
* 质量SLI
* = 业务方认可的结果数 / 总结果数(抽样评估)
*
* 评估方式:每周业务方随机抽查50条
*/
public static final SliDefinition QUALITY = SliDefinition.builder()
.name("quality")
.displayName("结果质量")
.formula("approved_results / evaluated_results")
.targetRatio(0.92)
.evaluationCycle("WEEKLY_SAMPLE")
.sampleSize(50)
.build();
@Data
@Builder
public static class SliDefinition {
private String name;
private String displayName;
private String formula;
private double targetRatio; // SLO目标
private double errorBudgetMonthly; // 月错误预算
private List<String> excludedErrors;
private String evaluationCycle;
private int sampleSize;
}
}
2.2 SLO Target Document
// SloConfig.java
@Configuration
@ConfigurationProperties(prefix = "sla")
@Data
public class SloConfig {
private AvailabilitySlo availability = new AvailabilitySlo();
private LatencySlo latency = new LatencySlo();
private QualitySlo quality = new QualitySlo();
private CostSlo cost = new CostSlo();
@Data
public static class AvailabilitySlo {
private double target = 0.995; // 月度99.5%
private double criticalTarget = 0.99; // 周度99%(更宽松)
private int errorBudgetMinutesPerMonth = 216; // 0.5% × 43200 = 216分钟
}
@Data
public static class LatencySlo {
private int p50Ms = 1000; // P50 < 1s
private int p90Ms = 2000; // P90 < 2s
private int p99Ms = 5000; // P99 < 5s
private double p99Ratio = 0.99; // 99%的请求需要在p99Ms内
}
@Data
public static class QualitySlo {
private double approvalRate = 0.92; // 92%可用率
private int weeklyEvalSampleSize = 50; // 每周抽查50条
}
@Data
public static class CostSlo {
private double monthlyBudgetUsd = 5000;
private double dailyBudgetCapPercent = 0.15; // 单日不超过月预算15%
private double warningThresholdPercent = 0.80; // 80%时告警
}
}
3. Implementing SLA Monitoring
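To make the latency targets concrete, here is a small sketch of the two numbers a latency SLO usually relies on: a percentile over raw samples and the fraction of requests that finish within a threshold. It uses the nearest-rank percentile definition, which is one of several common conventions; a monitoring backend that estimates quantiles from histogram buckets will give slightly different values. `LatencySli` is a hypothetical helper, not part of any library.

```java
import java.util.Arrays;

/** Sketch: latency percentile and "within threshold" ratio over raw samples. */
final class LatencySli {

    /** Nearest-rank percentile: the smallest sample with at least p of all samples at or below it. */
    static long percentile(long[] latenciesMs, double p) {
        if (latenciesMs.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        long[] sorted = latenciesMs.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }

    /** Fraction of requests that completed within thresholdMs (inclusive). */
    static double fractionWithin(long[] latenciesMs, long thresholdMs) {
        if (latenciesMs.length == 0) {
            return 1.0;
        }
        long ok = Arrays.stream(latenciesMs).filter(v -> v <= thresholdMs).count();
        return (double) ok / latenciesMs.length;
    }
}
```

With the ten samples 200, 400, 600, 800, 1000, 1200, 1500, 2500, 4000, and 9000 ms, the P50 is 1000 ms, the P99 is 9000 ms, and 90% of requests finish within 5000 ms. A single slow request is enough to break a "P99 < 5s" target at this sample size, which is why the SLO is evaluated over a month of traffic rather than a handful of calls.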
3.1 Collecting Metrics for Prometheus
// SlaMetricsCollector.java
@Component
@RequiredArgsConstructor
public class SlaMetricsCollector {
    private final MeterRegistry registry;

    @PostConstruct
    public void initMetrics() {
        // Counters and timers are created lazily in recordRequest(), so every series of
        // a metric carries the same tag keys (the Prometheus registry requires all meters
        // with the same name to share one set of tag keys).
        // Only the error-budget gauge is registered up front.
        Gauge.builder("ai.sla.error_budget.remaining", this,
                c -> c.calculateRemainingErrorBudget())
            .description("Remaining error budget as a ratio (0-1)")
            .register(registry);
    }

    /**
     * Record one AI call.
     */
    public void recordRequest(AiRequestMetric metric) {
        // Total request count
        registry.counter("ai.sla.requests.total",
            "endpoint", metric.getEndpoint(),
            "model", metric.getModel()
        ).increment();

        // Latency. The histogram settings live on the same builder that carries the tags;
        // register() returns the existing timer if one with this name and tags already exists.
        // Exported to Prometheus as ai_sla_latency_seconds_* (base unit: seconds).
        Timer.builder("ai.sla.latency")
            .description("AI inference latency")
            .tags("endpoint", metric.getEndpoint(), "model", metric.getModel())
            .publishPercentiles(0.5, 0.9, 0.95, 0.99)
            .publishPercentileHistogram()
            .serviceLevelObjectives(
                Duration.ofMillis(1000),  // P50 target
                Duration.ofMillis(2000),  // P90 target
                Duration.ofMillis(5000))  // P99 target
            .register(registry)
            .record(Duration.ofMillis(metric.getLatencyMs()));

        // Error count (tag values must not be null)
        if (!metric.isSuccess()) {
            String errorType = metric.getErrorType() != null ? metric.getErrorType() : "UNKNOWN";
            registry.counter("ai.sla.requests.errors",
                "endpoint", metric.getEndpoint(),
                "error_type", errorType,
                "model", metric.getModel()
            ).increment();
        }

        // Token usage, the basis for cost tracking
        registry.counter("ai.sla.tokens.total",
            "model", metric.getModel(),
            "token_type", "prompt"
        ).increment(metric.getPromptTokens());
        registry.counter("ai.sla.tokens.total",
            "model", metric.getModel(),
            "token_type", "completion"
        ).increment(metric.getCompletionTokens());
    }
private double calculateRemainingErrorBudget() {
// 从Prometheus查询当月已消耗的错误预算
// 这里简化为从Redis读取
return 0.85; // 示例值
}
@Data
@Builder
public static class AiRequestMetric {
private String endpoint;
private String model;
private long latencyMs;
private boolean success;
private String errorType;
private int promptTokens;
private int completionTokens;
private LocalDateTime requestTime;
}
}
3.2 Prometheus Alerting Rules
# prometheus-rules/ai-sla-alerts.yml
groups:
  - name: ai_sla_alerts
    interval: 1m
    rules:
      # ===== Availability alerts =====
      # 5-minute availability drops below 99% (needs an immediate response)
      - alert: AiServiceAvailabilityCritical
        expr: |
          (
            sum(rate(ai_sla_requests_total[5m])) -
            sum(rate(ai_sla_requests_errors_total{error_type!="CLIENT_ERROR_4XX"}[5m]))
          ) / sum(rate(ai_sla_requests_total[5m])) < 0.99
        for: 2m
        labels:
          severity: critical
          team: ai-platform
        annotations:
          summary: "AI service availability critical"
          description: "5-minute availability: {{ $value | humanizePercentage }}, below the 99% threshold"
          runbook: "https://wiki.company.com/ai-sla-runbook#availability"
      # 1-hour availability drops below 99.5% (risk of breaching the SLO)
      - alert: AiServiceAvailabilityWarning
        expr: |
          (
            sum(rate(ai_sla_requests_total[1h])) -
            sum(rate(ai_sla_requests_errors_total{error_type!="CLIENT_ERROR_4XX"}[1h]))
          ) / sum(rate(ai_sla_requests_total[1h])) < 0.995
        for: 10m
        labels:
          severity: warning
          team: ai-platform
        annotations:
          summary: "AI service availability warning"
          description: "1-hour availability: {{ $value | humanizePercentage }}, close to the SLO threshold"
      # ===== Latency alerts =====
      # Micrometer exports the Timer "ai.sla.latency" as ai_sla_latency_seconds_*
      # P99 latency above 5 seconds (SLO breach)
      - alert: AiLatencyP99SloBreached
        expr: |
          histogram_quantile(0.99,
            sum(rate(ai_sla_latency_seconds_bucket[5m])) by (le)
          ) > 5.0
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "P99 latency over target (SLO breach)"
          description: "Current P99 latency: {{ $value | humanizeDuration }}, above the 5s SLO"
      # P90 latency above 2 seconds (early warning)
      - alert: AiLatencyP90Warning
        expr: |
          histogram_quantile(0.90,
            sum(rate(ai_sla_latency_seconds_bucket[5m])) by (le)
          ) > 2.0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P90 latency early warning"
          description: "Current P90 latency: {{ $value | humanizeDuration }}"
      # ===== Error budget alerts =====
      # More than 50% of the error budget consumed (mid-month warning)
      - alert: ErrorBudgetHalfConsumed
        expr: ai_sla_error_budget_remaining < 0.50
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: "More than 50% of the error budget consumed"
          description: "Remaining error budget: {{ $value | humanizePercentage }}, needs attention"
      # More than 80% of the error budget consumed (urgent)
      - alert: ErrorBudgetCritical
        expr: ai_sla_error_budget_remaining < 0.20
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "Error budget almost exhausted"
          description: "Only {{ $value | humanizePercentage }} of the error budget remains"
      # ===== Cost alerts =====
      # Daily token cost above 15% of the monthly budget
      # (the expression prices tokens at $0.6 per million; $750 = 15% of $5000)
      - alert: DailyCostCapExceeded
        expr: |
          sum(increase(ai_sla_tokens_total[24h])) * 0.000001 * 0.6 > 750
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "Daily token cost above the budget cap"
          description: "Estimated cost today exceeds $750 (15% of the monthly budget)"
4. Error Budget: SLO and Error Budget Calculation
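The arithmetic behind an error budget is short enough to sketch in isolation. The helper below is a hypothetical `ErrorBudgetMath` class, not part of the service code; it shows how an SLO target turns into budget minutes, how a burn rate is derived from an observed error rate, and what share of the monthly budget a sustained burn rate consumes over a time window. It assumes a time-based budget over a calendar month.

```java
/** Sketch: error budget and burn rate arithmetic for a monthly SLO. */
final class ErrorBudgetMath {

    /** Allowed "bad" minutes in a month for a given SLO target, e.g. 0.995. */
    static double budgetMinutes(double sloTarget, int daysInMonth) {
        return (1.0 - sloTarget) * daysInMonth * 24 * 60;
    }

    /**
     * Burn rate = observed error rate / error rate the SLO allows.
     * 1.0 means the budget lasts exactly one month; 2.0 means it is gone in half a month.
     */
    static double burnRate(double observedErrorRate, double sloTarget) {
        return observedErrorRate / (1.0 - sloTarget);
    }

    /** Share of the monthly budget consumed by holding burnRate for windowHours. */
    static double budgetFractionConsumed(double burnRate, double windowHours, int daysInMonth) {
        return burnRate * windowHours / (daysInMonth * 24.0);
    }
}
```

For a 99.5% target and a 30-day month, `budgetMinutes(0.995, 30)` is about 216 minutes. An observed error rate of 0.2% gives a burn rate of about 0.4, and a burn rate of 14.4 held for one hour consumes 14.4 / 720 = 2% of the monthly budget, which is where the "14x in 1 hour" paging threshold comes from.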
4.1 Error Budget Service
// ErrorBudgetService.java
@Service
@RequiredArgsConstructor
@Slf4j
public class ErrorBudgetService {
private final PrometheusMeterRegistry prometheusRegistry;
private final StringRedisTemplate redisTemplate;
private final SloConfig sloConfig;
/**
* 计算当月错误预算使用情况
*/
public ErrorBudgetStatus calculateMonthlyBudget() {
LocalDateTime now = LocalDateTime.now();
LocalDateTime monthStart = now.withDayOfMonth(1).withHour(0).withMinute(0)
.withSecond(0).withNano(0);
long monthMinutes = Duration.between(monthStart, now).toMinutes();
long totalMinutesInMonth = YearMonth.now().lengthOfMonth() * 24 * 60;
// 从Prometheus查询当月错误率
double errorRate = queryMonthlyErrorRate(monthStart);
// 计算消耗的错误预算(分钟)
double consumedBudgetMinutes = monthMinutes * errorRate;
// 总错误预算(分钟)
double totalBudgetMinutes = sloConfig.getAvailability().getErrorBudgetMinutesPerMonth();
// 剩余错误预算
double remainingBudgetMinutes = totalBudgetMinutes - consumedBudgetMinutes;
double remainingBudgetPercent = remainingBudgetMinutes / totalBudgetMinutes;
        // Project month-end consumption from the pace so far
        // (guard against division by zero in the first minute of the month)
        double projectedConsumption = monthMinutes > 0
            ? (consumedBudgetMinutes / monthMinutes) * totalMinutesInMonth
            : 0.0;
        boolean isOnTrack = projectedConsumption < totalBudgetMinutes;
ErrorBudgetStatus status = ErrorBudgetStatus.builder()
.periodStart(monthStart)
.periodEnd(monthStart.plusMonths(1))
.totalBudgetMinutes(totalBudgetMinutes)
.consumedBudgetMinutes(consumedBudgetMinutes)
.remainingBudgetMinutes(remainingBudgetMinutes)
.remainingBudgetPercent(remainingBudgetPercent)
.currentErrorRate(errorRate)
.isOnTrack(isOnTrack)
.projectedMonthlyConsumption(projectedConsumption)
.calculatedAt(now)
.build();
// 缓存结果
redisTemplate.opsForValue().set(
"sla:error-budget:monthly",
JsonUtils.toJson(status),
Duration.ofMinutes(5)
);
return status;
}
/**
* 错误预算烧完率(Burn Rate)
* 如果当前速度继续,月底前会烧完预算吗?
*
* Burn Rate > 1 意味着预算会在月底前耗尽
*/
public double calculateBurnRate(Duration window) {
double currentErrorRate = queryErrorRate(window);
double monthlyErrorBudget = 1.0 - sloConfig.getAvailability().getTarget();
// Burn Rate = 当前错误率 / 错误预算允许的错误率
return currentErrorRate / monthlyErrorBudget;
}
/**
* 多窗口Burn Rate检测(Google SRE推荐方法)
* 同时满足快速窗口和慢速窗口的阈值才触发告警
* 减少误报
*/
public BurnRateAlert checkBurnRate() {
// 快速窗口:1小时(检测突发故障)
double burnRate1h = calculateBurnRate(Duration.ofHours(1));
// 中速窗口:6小时(检测持续问题)
double burnRate6h = calculateBurnRate(Duration.ofHours(6));
// 慢速窗口:24小时(检测缓慢退化)
double burnRate24h = calculateBurnRate(Duration.ofHours(24));
BurnRateAlertLevel level = BurnRateAlertLevel.NORMAL;
        // P1: 1h > 14x and 5min > 14x (about 2% of the monthly budget per hour, burning too fast)
        if (burnRate1h > 14 && calculateBurnRate(Duration.ofMinutes(5)) > 14) {
            level = BurnRateAlertLevel.CRITICAL;
        }
        // P2: 6h > 6x and 30min > 6x (the 6x tier pairs a 6h long window with a 30min short window)
        else if (burnRate6h > 6 && calculateBurnRate(Duration.ofMinutes(30)) > 6) {
            level = BurnRateAlertLevel.HIGH;
        }
        // P3: 6h > 3x and 1h > 3x
        else if (burnRate6h > 3 && burnRate1h > 3) {
            level = BurnRateAlertLevel.MEDIUM;
        }
        // P4: 24h > 1x (at exactly 1x the budget lasts to month end; above it, it runs out earlier)
        else if (burnRate24h > 1) {
            level = BurnRateAlertLevel.LOW;
        }
return BurnRateAlert.builder()
.burnRate1h(burnRate1h)
.burnRate6h(burnRate6h)
.burnRate24h(burnRate24h)
.alertLevel(level)
.build();
}
private double queryMonthlyErrorRate(LocalDateTime since) {
// 从Prometheus查询(实际实现需要调用Prometheus HTTP API)
// 这里简化为模拟值
return 0.003; // 0.3%错误率
}
private double queryErrorRate(Duration window) {
// 查询指定时间窗口内的错误率
return 0.002; // 简化示例
}
public enum BurnRateAlertLevel {
NORMAL, LOW, MEDIUM, HIGH, CRITICAL
}
@Data
@Builder
public static class ErrorBudgetStatus {
private LocalDateTime periodStart;
private LocalDateTime periodEnd;
private double totalBudgetMinutes;
private double consumedBudgetMinutes;
private double remainingBudgetMinutes;
private double remainingBudgetPercent;
private double currentErrorRate;
private boolean isOnTrack;
private double projectedMonthlyConsumption;
private LocalDateTime calculatedAt;
public String toSummary() {
return String.format(
"错误预算状态 [%s 至今]\n" +
"总预算:%.1f分钟\n" +
"已消耗:%.1f分钟 (%.1f%%)\n" +
"剩余:%.1f分钟 (%.1f%%)\n" +
"当前错误率:%.4f%%\n" +
"预测结果:%s",
periodStart.toLocalDate(),
totalBudgetMinutes,
consumedBudgetMinutes,
(consumedBudgetMinutes / totalBudgetMinutes) * 100,
remainingBudgetMinutes,
remainingBudgetPercent * 100,
currentErrorRate * 100,
isOnTrack ? "✓ 预计月底前不会超出预算" : "⚠ 按当前趋势将超出预算"
);
}
}
}
5. Handling SLA Violations: Automatic Compensation
5.1 Violation Detection and Automated Response
// SlaViolationHandler.java
@Component
@RequiredArgsConstructor
@Slf4j
public class SlaViolationHandler {
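    private final SloConfig sloConfig;
    private final NotificationService notificationService;
    private final DegradationService degradationService;
    private final CompensationService compensationService;
    // Used by handleLatencyViolation() to scale out the inference service
    private final AutoScalingService autoScalingService;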
/**
* SLA违约事件处理
* 由Prometheus AlertManager或定时任务触发
*/
@EventListener(SlaViolationEvent.class)
public void handleViolation(SlaViolationEvent event) {
log.warn("SLA violation detected: type={}, severity={}, value={}",
event.getViolationType(), event.getSeverity(), event.getCurrentValue());
switch (event.getViolationType()) {
case AVAILABILITY -> handleAvailabilityViolation(event);
case LATENCY -> handleLatencyViolation(event);
case QUALITY -> handleQualityViolation(event);
case COST -> handleCostViolation(event);
}
}
private void handleAvailabilityViolation(SlaViolationEvent event) {
// 1. 立即通知
notificationService.sendUrgentAlert(
AlertMessage.builder()
.title("⚠️ AI服务可用性SLA违约")
.body(String.format("当前可用性:%.2f%%,SLO目标:%.2f%%",
event.getCurrentValue() * 100,
sloConfig.getAvailability().getTarget() * 100))
.level(event.getSeverity())
.oncallTeam("ai-platform")
.build()
);
// 2. 触发降级(如果配置了降级策略)
if (event.getSeverity() == Severity.CRITICAL) {
degradationService.activateFallback(FallbackStrategy.RULE_BASED);
log.info("Fallback strategy activated due to SLA violation");
}
// 3. 记录违约事件
recordViolationIncident(event);
}
private void handleLatencyViolation(SlaViolationEvent event) {
// 1. 告警
notificationService.sendAlert(AlertMessage.builder()
.title("P99延迟超标")
.body(String.format("当前P99: %.0fms,SLO: %dms",
event.getCurrentValue(), sloConfig.getLatency().getP99Ms()))
.level(event.getSeverity())
.build()
);
// 2. 自动扩容(如果支持)
if (event.getCurrentValue() > sloConfig.getLatency().getP99Ms() * 2) {
// 延迟超过SLO 2倍时,触发自动扩容
autoScalingService.scaleUp("ai-inference-service", 2);
}
}
private void handleCostViolation(SlaViolationEvent event) {
// 成本超限:触发降级到更便宜的模型
degradationService.activateFallback(FallbackStrategy.CHEAPER_MODEL);
notificationService.sendAlert(AlertMessage.builder()
.title("💰 AI成本超预算告警")
.body(String.format("今日成本:$%.2f,已超日上限$%.2f",
event.getCurrentValue(),
sloConfig.getCost().getMonthlyBudgetUsd() *
sloConfig.getCost().getDailyBudgetCapPercent()))
.level(Severity.WARNING)
.build()
);
}
private void handleQualityViolation(SlaViolationEvent event) {
// 质量下降时,通知人工审核队列需要加强抽查
notificationService.sendAlert(AlertMessage.builder()
.title("AI结果质量下降告警")
.body(String.format("本周可用率:%.1f%%,SLO: 92%%,需要人工复查",
event.getCurrentValue() * 100))
.level(Severity.WARNING)
.recipients(List.of("ai-quality-team"))
.build()
);
}
private void recordViolationIncident(SlaViolationEvent event) {
// 记录到数据库,用于月报统计
}
}
// 降级服务
@Service
@Slf4j
public class DegradationService {
/**
* 激活降级策略
*/
public void activateFallback(FallbackStrategy strategy) {
switch (strategy) {
case RULE_BASED -> {
// 切换到规则引擎(无AI,快速响应)
FeatureFlags.set("use_ai", false);
log.info("Switched to rule-based fallback");
}
case CHEAPER_MODEL -> {
// 切换到更便宜的模型(如gpt-3.5)
FeatureFlags.set("ai_model", "gpt-3.5-turbo");
log.info("Switched to cheaper model: gpt-3.5-turbo");
}
case CACHE_ONLY -> {
// 只返回缓存结果,不调用AI
FeatureFlags.set("ai_cache_only", true);
log.info("Activated cache-only mode");
}
}
}
public enum FallbackStrategy {
RULE_BASED, CHEAPER_MODEL, CACHE_ONLY
}
}
6. Monthly SLA Report: Automated Generation with Spring Batch
6.1 SLA Report Generation Job
// SlaReportGenerationJob.java
@Configuration
@RequiredArgsConstructor
public class SlaReportJobConfig {
private final JobRepository jobRepository;
private final PlatformTransactionManager transactionManager;
private final SlaDataQueryService slaDataQueryService;
private final SlaReportRenderer reportRenderer;
private final ReportDeliveryService deliveryService;
@Bean
public Job slaMonthlyReportJob() {
        // calculateSloStep() queries the raw metrics itself, so it is the first step
        return new JobBuilder("slaMonthlyReportJob", jobRepository)
            .start(calculateSloStep())
            .next(generateReportStep())
            .next(deliverReportStep())
            .build();
}
@Bean
public Step calculateSloStep() {
return new StepBuilder("calculateSloStep", jobRepository)
.tasklet((contribution, chunkContext) -> {
YearMonth reportMonth = getReportMonth(chunkContext);
// 1. 计算可用性
AvailabilityMetrics availability = slaDataQueryService
.queryAvailability(reportMonth);
// 2. 计算延迟分布
LatencyMetrics latency = slaDataQueryService
.queryLatencyPercentiles(reportMonth);
// 3. 计算质量评分
QualityMetrics quality = slaDataQueryService
.queryQualityEvaluations(reportMonth);
// 4. 计算成本
CostMetrics cost = slaDataQueryService
.queryCostMetrics(reportMonth);
// 5. 计算错误预算
ErrorBudgetMetrics errorBudget = calculateErrorBudget(
availability, reportMonth);
// 保存到ExecutionContext供下一步使用
ExecutionContext ctx = chunkContext.getStepContext()
.getStepExecution().getJobExecution().getExecutionContext();
ctx.put("availability", availability);
ctx.put("latency", latency);
ctx.put("quality", quality);
ctx.put("cost", cost);
ctx.put("errorBudget", errorBudget);
return RepeatStatus.FINISHED;
}, transactionManager)
.build();
}
@Bean
public Step generateReportStep() {
return new StepBuilder("generateReportStep", jobRepository)
.tasklet((contribution, chunkContext) -> {
ExecutionContext ctx = chunkContext.getStepContext()
.getStepExecution().getJobExecution().getExecutionContext();
AvailabilityMetrics availability = (AvailabilityMetrics) ctx.get("availability");
LatencyMetrics latency = (LatencyMetrics) ctx.get("latency");
QualityMetrics quality = (QualityMetrics) ctx.get("quality");
CostMetrics cost = (CostMetrics) ctx.get("cost");
ErrorBudgetMetrics errorBudget = (ErrorBudgetMetrics) ctx.get("errorBudget");
// 生成报告
SlaReport report = buildReport(availability, latency, quality,
cost, errorBudget);
// 渲染为HTML
String htmlReport = reportRenderer.renderHtml(report);
// 渲染为Markdown(用于内部Wiki)
String markdownReport = reportRenderer.renderMarkdown(report);
ctx.putString("htmlReport", htmlReport);
ctx.putString("markdownReport", markdownReport);
return RepeatStatus.FINISHED;
}, transactionManager)
.build();
}
@Bean
public Step deliverReportStep() {
return new StepBuilder("deliverReportStep", jobRepository)
.tasklet((contribution, chunkContext) -> {
ExecutionContext ctx = chunkContext.getStepContext()
.getStepExecution().getJobExecution().getExecutionContext();
String htmlReport = ctx.getString("htmlReport");
String markdownReport = ctx.getString("markdownReport");
// 发送邮件给所有相关方
deliveryService.sendEmail(
htmlReport,
List.of("ai-platform@company.com", "business-team@company.com")
);
// 发布到内部Wiki
deliveryService.publishToWiki(markdownReport);
// 发送Slack摘要
deliveryService.sendSlackSummary(buildSlackSummary(ctx));
return RepeatStatus.FINISHED;
}, transactionManager)
.build();
}
private SlaReport buildReport(AvailabilityMetrics availability,
LatencyMetrics latency, QualityMetrics quality,
CostMetrics cost, ErrorBudgetMetrics errorBudget) {
return SlaReport.builder()
.reportMonth(YearMonth.now().minusMonths(1))
.generatedAt(LocalDateTime.now())
.availabilityResult(SloResult.builder()
.sloName("服务可用性")
.target(0.995)
.actual(availability.getMonthlyAvailability())
.met(availability.getMonthlyAvailability() >= 0.995)
.build())
.latencyResult(SloResult.builder()
.sloName("P99延迟 < 5s")
.target(0.99)
.actual(latency.getP99WithinSloRate())
.met(latency.getP99WithinSloRate() >= 0.99)
.build())
.qualityResult(SloResult.builder()
.sloName("结果可用率")
.target(0.92)
.actual(quality.getApprovalRate())
.met(quality.getApprovalRate() >= 0.92)
.build())
.errorBudget(errorBudget)
.cost(cost)
.incidents(queryMonthlyIncidents())
.improvements(generateImprovementRecommendations(
availability, latency, quality))
.build();
}
private String buildSlackSummary(ExecutionContext ctx) {
AvailabilityMetrics av = (AvailabilityMetrics) ctx.get("availability");
return String.format("""
📊 *AI服务本月SLA报告*
✅ 可用性:%.2f%% (SLO: 99.5%%) %s
✅ P99延迟:%.0fms %s
✅ 结果质量:%.1f%% %s
📋 详细报告已发送至邮件,请查收。
""",
av.getMonthlyAvailability() * 100,
av.getMonthlyAvailability() >= 0.995 ? "✅" : "❌",
0.0, // 简化
"✅",
0.0, // 简化
"✅"
);
}
private YearMonth getReportMonth(ChunkContext ctx) {
// 默认为上个月
String monthParam = (String) ctx.getStepContext()
.getJobParameters().get("reportMonth");
return monthParam != null ?
YearMonth.parse(monthParam) : YearMonth.now().minusMonths(1);
}
private List<IncidentRecord> queryMonthlyIncidents() {
return List.of(); // 查询数据库
}
private List<ImprovementRecommendation> generateImprovementRecommendations(
AvailabilityMetrics av, LatencyMetrics lt, QualityMetrics ql) {
List<ImprovementRecommendation> recs = new ArrayList<>();
if (!av.isSloMet()) {
recs.add(ImprovementRecommendation.of("可用性改进",
"本月可用性未达标,建议:1. 加强外部API超时配置;2. 完善降级策略"));
}
if (!lt.isP99SloMet()) {
recs.add(ImprovementRecommendation.of("延迟优化",
"P99延迟超标,建议:1. 增加结果缓存;2. 异步化非关键AI调用"));
}
return recs;
}
private ErrorBudgetMetrics calculateErrorBudget(AvailabilityMetrics av,
YearMonth month) {
double totalBudgetMinutes = month.lengthOfMonth() * 24 * 60 * 0.005;
double consumedMinutes = month.lengthOfMonth() * 24 * 60 * (1 - av.getMonthlyAvailability());
return ErrorBudgetMetrics.builder()
.totalBudgetMinutes(totalBudgetMinutes)
.consumedMinutes(consumedMinutes)
.remainingMinutes(totalBudgetMinutes - consumedMinutes)
.build();
}
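}
7. SLA Negotiation Framework
7.1 Negotiation Checklist for AI Service SLAs
Before defining an SLA with the business side, settle the following preconditions:
SLA pre-negotiation checklist:
1. Clarify the AI service's dependencies
☐ What is the SLA of the external AI API? (A provider may state 99.9% while your own monitoring shows 95-99%)
☐ Without a fallback path, our SLA cannot exceed the SLA of our external dependencies
2. Define the standard for "usable quality"
☐ What kind of AI output counts as "usable"? (Give 5 concrete examples)
☐ Who evaluates, and how often?
☐ How are disputed results arbitrated?
3. Agree on exclusions
☐ Planned maintenance windows (how many hours per month, how much advance notice?)
☐ External service failures (not counted against our SLA)
☐ Client errors (the business side's own calling mistakes are not counted)
4. Business-side obligations
☐ How far in advance must unusual traffic (promotions, campaigns) be announced?
☐ Use the API according to spec (stay within the QPS limit)
☐ Reasonable bug-report timing (no 2 a.m. demands for an "urgent" fix of a non-critical bug)
5. Compensation terms (optional)
☐ Form of compensation on an SLA breach (extended quota / priority scheduling / extra tokens)
☐ Compensation cap
☐ Trigger condition (how long a breach must last before compensation applies)
7.2 SLA Document Template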
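AI Service SLA Agreement
Version: v2.0
Effective date: 2025-10-01
Signatories: AI Platform Team × Intelligent Customer Service Business Team
## Scope of Service
This agreement applies to the intelligent customer service API (/api/v1/ai/customer-service).
## Service Level Objectives (SLO)
| Metric | SLO target | Measurement period | Exclusions |
|------|---------|---------|--------|
| Service availability | ≥ 99.5% | Monthly | Planned maintenance, external service failures, client errors |
| P99 latency | < 5 seconds | Monthly (99% of requests within target) | External model timeouts (> 30s is treated as a failure) |
| Result usability rate | ≥ 92% | 50 results sampled weekly, aggregated monthly | Disputed cases are decided by a review panel with 2 members from each side |
## Error Budget
Monthly error budget = (1 - 99.5%) × days in month × 24 hours × 60 minutes ≈ 216 minutes/month (for a 30-day month)
When more than 50% of the error budget is consumed, the AI Platform Team proactively notifies the business side.
When more than 80% of the error budget is consumed, both sides start the emergency response process.
## Breach Handling
When the SLO targets are missed for 30 consecutive days:
1. The AI Platform Team issues a written root-cause report
2. Improvement work is prioritized in the following month's schedule
3. Extra token quota is granted at discretion, depending on impact
## Obligations of Both Parties
The AI Platform Team commits to:
- Give 72 hours' notice of planned maintenance (2 hours in emergencies)
- Publish the previous month's SLA compliance report by the 5th of each month
- Start root-cause analysis within 24 hours of an SLA breach
The business side commits to:
- Announce expected traffic (peak QPS) 7 days before promotions and campaigns
- Keep API calls within the agreed QPS limit
- Handle failures that fall within the SLA error budget through the normal bug process, not as emergency complaints
8. Dependency SLAs: Risk Propagation Analysis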
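The core of risk propagation is two pieces of probability arithmetic, sketched below under the assumption that dependencies fail independently (real outages are often correlated, so treat the results as optimistic upper bounds). `DependencyAvailability` and its method names are hypothetical and exist only for this illustration.

```java
/** Sketch: how dependency availability limits the SLA we can promise. */
final class DependencyAvailability {

    /** Serial chain: a request succeeds only if every critical dependency is up. */
    static double serial(double... availabilities) {
        double result = 1.0;
        for (double a : availabilities) {
            result *= a;
        }
        return result;
    }

    /**
     * Dependency with a fallback path: when the primary fails, the fallback
     * (for example a rule engine) still serves a share of those requests.
     */
    static double withFallback(double primary, double fallbackSuccessRate) {
        return primary + (1.0 - primary) * fallbackSuccessRate;
    }
}
```

Two critical dependencies at 99.9% and 99.5% cap the chain at about 99.4%, already below a 99.5% target before our own failures are counted. A dependency measured at 99% with a fallback that succeeds 90% of the time reaches 0.99 + 0.01 × 0.9 = 99.9%, which is why a fallback path is what makes a higher commitment defensible.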
8.1 How External AI Service SLAs Affect Ours
// DependencyRiskCalculator.java
@Service
@Slf4j
public class DependencyRiskCalculator {
    /**
     * Computes the theoretical ceiling of our SLA when we depend on external AI services.
     *
     * Rule: without a fallback, our SLA ≤ the SLA of each critical external dependency.
     * Using several AI services with fallback can improve on that.
     */
    public SlaRiskAnalysis analyzeExternalDependencyRisk(
            List<ExternalServiceConfig> dependencies) {
        // Assumes the services fail independently
        double systemAvailability = 1.0;
        double totalDegradedAvailability = 1.0;
        for (ExternalServiceConfig dep : dependencies) {
            if (!dep.isCritical()) {
                // A non-critical dependency does not fail the request, so it is left out of both products
                continue;
            }
            // Critical dependency, no fallback: any failure fails the whole request
            systemAvailability *= dep.getClaimedSla();
            // With fallback: failed calls still succeed at the fallback success rate
            double withDegradation = dep.getClaimedSla() +
                (1 - dep.getClaimedSla()) * dep.getDegradationAvailability();
            totalDegradedAvailability *= withDegradation;
        }
        return SlaRiskAnalysis.builder()
            .theoreticalMaxSla(systemAvailability)
            .withDegradationMaxSla(totalDegradedAvailability)
            .recommendation(buildRecommendation(systemAvailability, totalDegradedAvailability))
            .build();
    }

    private String buildRecommendation(double withoutDegradation, double withDegradation) {
        StringBuilder rec = new StringBuilder();
        if (withoutDegradation < 0.999) {
            rec.append("⚠ External dependency SLA is low; committing to 99.9% or higher is very risky.\n");
        }
        if (withDegradation > withoutDegradation + 0.005) {
            rec.append("✓ The fallback strategy is effective: theoretical availability rises from %.3f%% to %.3f%%\n"
                .formatted(withoutDegradation * 100, withDegradation * 100));
        }
        return rec.toString();
    }
@Data
@Builder
public static class ExternalServiceConfig {
private String name; // eg: OpenAI API
private double claimedSla; // eg: 0.999 (99.9%)
private double actualSla; // 基于监控的实际值
private boolean critical; // 是否核心依赖
private double degradationAvailability; // 有降级时的可用性(降级成功率)
}
@Data
@Builder
public static class SlaRiskAnalysis {
private double theoreticalMaxSla;
private double withDegradationMaxSla;
private String recommendation;
}
}
9. Continuous SLA Improvement: From Violations to System Improvements
9.1 Incident Postmortem Template
// SlaIncidentReview.java
@Service
@RequiredArgsConstructor
public class SlaIncidentReviewService {
private final ChatClient chatClient;
/**
* 用AI辅助生成事故复盘报告
*/
public String generatePostmortem(IncidentData incident) {
String prompt = String.format("""
基于以下事故数据,生成一份标准的SLA事故复盘报告:
事故时间:%s 至 %s(持续%d分钟)
影响:%s
SLO违约情况:%s
事故时间线:%s
已采取的临时措施:%s
报告格式:
## 事故摘要
## 影响评估
## 根本原因分析(5Why方法)
## 时间线
## 临时修复措施
## 永久修复方案(具体的action item,负责人,截止日期)
## 预防措施(如何避免同类事故)
## SLO影响(消耗了多少错误预算)
语言要求:简洁、客观,不要回避问题,要具体可行。
""",
incident.getStartTime(),
incident.getEndTime(),
incident.getDurationMinutes(),
incident.getImpactDescription(),
incident.getSloViolations(),
incident.getTimeline(),
incident.getMitigationActions()
);
return chatClient.prompt()
.user(prompt)
            .options(OpenAiChatOptions.builder()
                .model("gpt-4o") // Spring AI 1.0 builder method; earlier milestones used withModel(...)
                .build())
.call()
.content();
}
}
10. Implementing the SLA Dashboard
10.1 Real-Time SLA Dashboard API
// SlaDashboardController.java
@RestController
@RequestMapping("/api/v1/sla")
@RequiredArgsConstructor
@Tag(name = "SLA Dashboard", description = "SLA状态查询接口")
public class SlaDashboardController {
private final ErrorBudgetService errorBudgetService;
private final SlaMetricsQueryService metricsQueryService;
@GetMapping("/status")
@Operation(summary = "获取当前SLA状态(实时)")
public ResponseEntity<SlaStatusResponse> getCurrentStatus() {
// 实时计算各SLI
double availability1h = metricsQueryService.queryAvailability(Duration.ofHours(1));
double availabilityDay = metricsQueryService.queryAvailability(Duration.ofDays(1));
        double availabilityMonth = metricsQueryService.queryAvailability(
            LocalDate.now().withDayOfMonth(1).atStartOfDay()); // midnight on the 1st of this month
LatencyPercentiles latency = metricsQueryService.queryLatencyPercentiles(
Duration.ofHours(1));
ErrorBudgetService.ErrorBudgetStatus budget =
errorBudgetService.calculateMonthlyBudget();
ErrorBudgetService.BurnRateAlert burnRate =
errorBudgetService.checkBurnRate();
return ResponseEntity.ok(SlaStatusResponse.builder()
.availability(AvailabilityStatus.builder()
.last1h(availability1h)
.last24h(availabilityDay)
.monthToDate(availabilityMonth)
.sloTarget(0.995)
.isHealthy(availabilityMonth >= 0.995)
.build())
.latency(LatencyStatus.builder()
.p50Ms(latency.getP50())
.p90Ms(latency.getP90())
.p99Ms(latency.getP99())
.p99SloMs(5000)
.isHealthy(latency.getP99() < 5000)
.build())
.errorBudget(ErrorBudgetStatus.builder()
.totalMinutes(budget.getTotalBudgetMinutes())
.consumedMinutes(budget.getConsumedBudgetMinutes())
.remainingPercent(budget.getRemainingBudgetPercent())
.burnRateAlertLevel(burnRate.getAlertLevel().name())
.isHealthy(budget.getRemainingBudgetPercent() > 0.20)
.build())
.overallHealth(determineOverallHealth(availabilityMonth,
latency.getP99(), budget.getRemainingBudgetPercent()))
.generatedAt(LocalDateTime.now())
.build()
);
}
@GetMapping("/report/{yearMonth}")
@Operation(summary = "获取指定月份的SLA报告")
public ResponseEntity<SlaMonthlyReport> getMonthlyReport(
@PathVariable @DateTimeFormat(pattern = "yyyy-MM") YearMonth yearMonth) {
return ResponseEntity.ok(metricsQueryService.queryMonthlyReport(yearMonth));
}
private String determineOverallHealth(double availability, long p99Ms,
double budgetRemaining) {
if (availability < 0.99 || p99Ms > 10000 || budgetRemaining < 0.05) {
return "CRITICAL";
}
if (availability < 0.995 || p99Ms > 5000 || budgetRemaining < 0.20) {
return "WARNING";
}
return "HEALTHY";
}
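}
Performance Data
Resource usage of the SLA monitoring system
| Component | CPU | Memory | Storage per month |
|---|---|---|---|
| Prometheus metric collection | < 0.1 core | < 100MB | 500MB (compressed) |
| Grafana Dashboard | < 0.2 core | < 256MB | - |
| SLA report generation (runs once a month) | 0.5 core for 5 minutes | 512MB | - |
| Redis error budget cache | negligible | < 10MB | - |
| Total | < 0.4 core (sustained) | < 600MB | 500MB/month |
FAQ
Q1: How do you quantify a result-quality SLA for an AI service, and who evaluates it?
The most pragmatic approach: each week the business side manually samples 50 results (about 1-2 hours of effort), and the AI team provides a simple labeling tool (one-click good/bad). Both sides agree on the standard for "what counts as a good result" and write it into an appendix of the SLA document (with at least 10 counterexamples). Automated evaluation can assist, but the manual labels are the final authority.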
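A minimal sketch of how the weekly samples could roll up into the monthly quality figure, assuming each week yields a count of evaluated and approved results. The `QualitySample` class and `WeeklySample` record are hypothetical names for illustration.

```java
import java.util.List;

/** Sketch: aggregate weekly manual samples into a monthly approval rate. */
final class QualitySample {

    /** One week of manual labels: how many results were checked and how many were approved. */
    record WeeklySample(int evaluated, int approved) {}

    /** Pooled approval rate across all weeks (weights each labeled result equally). */
    static double approvalRate(List<WeeklySample> weeks) {
        long evaluated = 0;
        long approved = 0;
        for (WeeklySample w : weeks) {
            evaluated += w.evaluated();
            approved += w.approved();
        }
        if (evaluated == 0) {
            throw new IllegalArgumentException("no evaluated results");
        }
        return (double) approved / evaluated;
    }

    /** True when the pooled rate reaches the agreed target, e.g. 0.92. */
    static boolean meetsSlo(List<WeeklySample> weeks, double target) {
        return approvalRate(weeks) >= target;
    }
}
```

Four weeks with 47, 46, 45, and 48 approvals out of 50 pool to 186 / 200 = 93%, which meets a 92% target even though one individual week (45 / 50 = 90%) falls below it. Pooling over the month smooths out the noise that a 50-item sample carries.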
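Q2: The external AI API is often unstable and we cannot commit to an SLA. What should we do?
State explicitly in the SLA that "service availability excludes periods of uncontrollable external AI service failure", and back each claim with evidence (call logs plus screenshots of the external provider's status page). Also design a fallback strategy: when the external AI is unavailable, switch to a rule engine and count that degraded availability toward the SLA. This lets you commit to a higher SLA (for example 99.9%): during fallback the results are not as good, but the system remains available.
Q3: When the error budget is exhausted, do we stop releasing or keep releasing?
This is a core idea of Google SRE. When the error budget is exhausted, stop every release that could affect stability (not every release) and focus on reliability work. In practice: when less than 20% of the error budget remains, each release needs an additional risk review; when the error budget reaches 0, only releases that fix stability problems are allowed.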
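The release policy in this answer maps directly onto a small decision function. The sketch below encodes the two thresholds from the text (20% and 0); the `ReleaseGate` class and the `Policy` names are hypothetical.

```java
/** Sketch: release policy as a function of the remaining error budget. */
final class ReleaseGate {

    enum Policy {
        NORMAL,                 // release as usual
        EXTRA_RISK_REVIEW,      // every release needs an additional risk review
        RELIABILITY_FIXES_ONLY  // only releases that fix stability problems
    }

    /** remainingBudgetFraction is the unspent share of the monthly budget (1.0 = untouched). */
    static Policy policyFor(double remainingBudgetFraction) {
        if (remainingBudgetFraction <= 0.0) {
            return Policy.RELIABILITY_FIXES_ONLY;
        }
        if (remainingBudgetFraction < 0.20) {
            return Policy.EXTRA_RISK_REVIEW;
        }
        return Policy.NORMAL;
    }
}
```

A deployment pipeline could call `policyFor` with the value behind the remaining-budget gauge and block or flag a release accordingly. Keeping the rule in one function also makes the agreed thresholds easy to audit against the SLA document.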
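Q4: A newly launched AI feature has no historical data yet. How do we set its SLA targets?
Do it in phases: month 1 is an "observation period" with monitoring only and no SLA commitment; months 2-3 are a "trial period" with a relaxed SLA (for example 99% availability); from month 4 the formal SLA applies. This avoids putting excessive pressure on the team while still protecting the business side.
Q5: The business side cannot make sense of the technical metrics in the SLA report. What should we do?
Produce the report in two versions: 1) a business summary (one sentence): "The AI service was healthy overall this month: availability was 99.7%, beating the target, and 92.3% of sampled outputs were rated good by the business side." 2) technical details (for those who need them): every specific metric plus trend charts. Most business stakeholders only need the summary and whether anything is showing a "red light".
Summary
At its core, SLA management is not a technical problem but a collaboration problem.
Zhou Xiao's team shows this in practice: after the SLA was defined, complaints dropped by 70%, not because the service got better, but because both sides aligned on what "good" means. This is a necessary step for an AI platform team on the way to operating professionally.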
