Article 1763: Automated Incident-Handling Runbooks: Letting an Agent Execute Standardized Ops Procedures
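Let me start with something that stuck with me.
At three in the morning, a P1 alert woke up one of our ops engineers. A node in the Redis cluster had been OOM-killed (out of memory) and restarted, a master-replica failover was in progress, and the service saw about 20 seconds of write jitter. The whole handling process (receive the alert → confirm the problem → run the predefined handling steps → verify recovery → write the on-call log) took 45 minutes from start to finish.
Yet the procedure for this failure is fixed. The ops team has an internal Runbook (a standard operating manual) that describes it in detail: 11 steps in total, followed the same way every time.
This is one of the best-suited scenarios for Agent automation.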
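What Runbook automation really is
A Runbook is essentially an ordered sequence of steps, each of which is a concrete operation: run a command, call an API, check a status, send a notification, and so on.
Having an Agent execute these steps delivers several core benefits:
- Eliminates human response latency: execution starts the moment the alert fires, without waiting for someone to wake up.
- Guarantees consistent execution: the standard steps are followed every time, so nothing is skipped because an engineer is tired.
- Generates an execution log automatically: the input and output of every step are recorded, so the post-incident review has evidence to work from.
- Controls human decision points: high-risk operations (such as restarting an instance) require human confirmation, while low-risk operations run automatically.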
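System architecture design
The key design decision is where to place the human decision points (human-in-the-loop). Not every step can be fully automated, and high-risk operations must be confirmed by a person. That boundary has to be drawn clearly.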
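One way to make that boundary explicit is to assign every step a risk class and derive the approval requirement from it, rather than deciding step by step. The following is only a minimal sketch under that assumption: the `RiskLevel` enum and the `ApprovalPolicy` class are hypothetical names introduced for illustration and are not part of the system described in this article.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical risk classes for runbook steps (illustrative only). */
enum RiskLevel { READ_ONLY, LOW, HIGH }

/** Sketch: derive the human-in-the-loop boundary from a step's risk class. */
final class ApprovalPolicy {
    private ApprovalPolicy() {}

    /** Only HIGH-risk steps wait for a human; everything else runs automatically. */
    static boolean requiresApproval(RiskLevel risk) {
        return risk == RiskLevel.HIGH;
    }

    /** Returns the ids of the steps that would be gated, in runbook order. */
    static List<String> stepsNeedingApproval(Map<String, RiskLevel> steps) {
        List<String> gated = new ArrayList<>();
        for (Map.Entry<String, RiskLevel> e : steps.entrySet()) {
            if (requiresApproval(e.getValue())) {
                gated.add(e.getKey());
            }
        }
        return gated;
    }

    public static void main(String[] args) {
        Map<String, RiskLevel> steps = new LinkedHashMap<>();
        steps.put("check_current_status", RiskLevel.READ_ONLY);
        steps.put("flush_expired_keys", RiskLevel.LOW);
        steps.put("emergency_eviction", RiskLevel.HIGH);
        System.out.println(stepsNeedingApproval(steps)); // prints [emergency_eviction]
    }
}
```

Keeping the rule in one place means that reviewing the boundary is a matter of reviewing the risk class of each step, not hunting for individual flags.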
Runbook definition format
We designed a YAML-based specification for defining Runbooks:
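
    # redis-oom-runbook.yaml
    name: Redis node OOM handling
    version: "1.2"
    description: Standard handling procedure when a Redis node hits OOM
    triggerConditions:
      - alertName: "RedisMemoryUsageHigh"
        threshold: "95%"
      - alertName: "RedisOOMKilled"
    steps:
      - id: check_current_status
        name: Check current Redis cluster status
        type: shell_command
        requireApproval: false
        command: "redis-cli -h {{ alert.host }} -p {{ alert.port }} INFO memory"
        timeout: 10s
        onFailure: CONTINUE  # continue even on failure (a failed status check does not block later steps)
      - id: check_replication
        name: Check master-replica replication status
        type: shell_command
        requireApproval: false
        command: "redis-cli -h {{ alert.host }} -p {{ alert.port }} cluster nodes"
        timeout: 10s
        onFailure: STOP
      - id: analyze_memory
        name: LLM analysis of memory usage
        type: llm_analysis
        requireApproval: false
        context:
          - "{{ steps.check_current_status.output }}"
          - "{{ steps.check_replication.output }}"
        question: "Based on the Redis INFO output above, is memory usage normal, and is immediate scale-up or cleanup needed?"
      - id: notify_team
        name: Notify the on-call team
        type: notification
        requireApproval: false
        channel: "dingtalk"
        message: |
          Redis node OOM alert has triggered the automated handling flow
          Affected node: {{ alert.host }}:{{ alert.port }}
          Current status analysis: {{ steps.analyze_memory.output }}
      - id: flush_expired_keys
        name: Purge expired keys (low-risk operation)
        type: shell_command
        requireApproval: false
        # Redis has no dedicated "purge expired keys now" command. SCAN lazily expires
        # the keys it visits, so a full incremental scan is used here to reclaim them.
        command: "redis-cli -h {{ alert.host }} -p {{ alert.port }} --scan --count 1000 | wc -l"
        description: "Walks the keyspace incrementally so Redis reclaims expired keys; does not block normal requests"
        timeout: 30s
      - id: emergency_eviction
        name: Switch eviction policy to allkeys-lru
        type: shell_command
        requireApproval: true  # needs human confirmation because it affects the cache hit rate
        approvalMessage: "About to change the Redis eviction policy to allkeys-lru. Continue?"
        command: "redis-cli -h {{ alert.host }} -p {{ alert.port }} config set maxmemory-policy allkeys-lru"
        # original_policy is not set by any step shown here; it has to be captured
        # beforehand (for example with CONFIG GET maxmemory-policy) for rollback to work.
        rollbackCommand: "redis-cli -h {{ alert.host }} -p {{ alert.port }} config set maxmemory-policy {{ original_policy }}"
      - id: verify_recovery
        name: Verify that memory usage is back to normal
        type: shell_command
        requireApproval: false
        command: "redis-cli -h {{ alert.host }} -p {{ alert.port }} INFO memory | grep used_memory_human"
        timeout: 10s
        condition: "output contains 'used_memory_human'"
      - id: update_incident
        name: Update the incident ticket
        type: http_api
        requireApproval: false
        method: POST
        url: "{{ config.jira_base_url }}/rest/api/2/issue/{{ alert.incidentId }}/comment"
        body: |
          {
            "body": "Automated handling complete. Execution summary:\n{{ execution_summary }}"
          }

This YAML format supports template variables (the {{ }} syntax), dependencies between steps ({{ steps.xxx.output }}), conditional execution, human approval, and rollback commands.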
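Because steps depend on each other through {{ steps.xxx.output }}, a typo in a step id or a reference to a step that runs later only shows up at 3 a.m. unless it is checked when the Runbook is loaded. As an illustration, here is a minimal load-time check. The `StepReferenceValidator` class is a hypothetical helper, and it assumes the caller has already flattened each step's templated fields into a single string, keyed by step id in execution order.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch: reject runbooks whose steps reference outputs that cannot exist yet. */
final class StepReferenceValidator {
    // Matches {{ steps.<id>.output }} with an optional "| filter" tail.
    private static final Pattern STEP_REF =
            Pattern.compile("\\{\\{\\s*steps\\.([A-Za-z0-9_]+)\\.output\\b[^}]*\\}\\}");

    private StepReferenceValidator() {}

    /**
     * @param stepTemplates step id -> that step's template text; iteration order
     *                      must be execution order (e.g. a LinkedHashMap)
     * @return one message per reference to an unknown or not-yet-executed step
     */
    static List<String> validate(Map<String, String> stepTemplates) {
        List<String> problems = new ArrayList<>();
        Set<String> alreadyRun = new HashSet<>();
        for (Map.Entry<String, String> step : stepTemplates.entrySet()) {
            Matcher m = STEP_REF.matcher(step.getValue());
            while (m.find()) {
                String referenced = m.group(1);
                if (!alreadyRun.contains(referenced)) {
                    problems.add(step.getKey()
                            + " references unknown or later step: " + referenced);
                }
            }
            alreadyRun.add(step.getKey());
        }
        return problems;
    }

    public static void main(String[] args) {
        Map<String, String> steps = new LinkedHashMap<>();
        steps.put("check_current_status", "redis-cli INFO memory");
        steps.put("analyze_memory",
                "{{ steps.check_current_status.output }} {{ steps.check_replication.output }}");
        steps.put("check_replication", "redis-cli cluster nodes");
        // prints: analyze_memory references unknown or later step: check_replication
        validate(steps).forEach(System.out::println);
    }
}
```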
Java implementation: the Runbook execution engine

    import java.time.Instant;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.UUID;

    import lombok.Builder;
    import lombok.Data;
    import lombok.extern.slf4j.Slf4j;
    import org.springframework.beans.factory.annotation.Autowired;
    import org.springframework.stereotype.Service;

    @Service
    @Slf4j
    public class RunbookExecutionEngine {

        @Autowired
        private ToolExecutorRegistry toolRegistry;
        @Autowired
        private HumanApprovalService approvalService;
        @Autowired
        private TemplateEngine templateEngine;
        @Autowired
        private ExecutionLogRepository logRepo;

        /** Mutable state of one Runbook run, shared by all steps. */
        @Data
        public static class ExecutionContext {
            private String executionId;
            private NormalizedAlert alert;
            private Map<String, StepResult> stepResults = new HashMap<>();
            private Map<String, Object> variables = new HashMap<>();
            private ExecutionStatus status;
            private Instant startTime;
        }

        /** Outcome of a single step. */
        @Data
        @Builder
        public static class StepResult {
            private String stepId;
            private boolean success;
            private String output;
            private String errorMessage;
            private Instant startTime;
            private Instant endTime;
            private boolean skipped;
            private boolean pendingApproval;
        }

        public ExecutionContext execute(RunbookDefinition runbook, NormalizedAlert alert) {
            ExecutionContext ctx = new ExecutionContext();
            ctx.setExecutionId(UUID.randomUUID().toString());
            ctx.setAlert(alert);
            ctx.setStatus(ExecutionStatus.RUNNING);
            ctx.setStartTime(Instant.now());

            log.info("Starting Runbook: name={}, executionId={}, alert={}",
                    runbook.getName(), ctx.getExecutionId(), alert.getAlertId());

            for (RunbookStep step : runbook.getSteps()) {
                if (!shouldExecuteStep(step, ctx)) {
                    log.info("Skipping step: {}", step.getId());
                    ctx.getStepResults().put(step.getId(),
                            StepResult.builder().stepId(step.getId()).skipped(true).build());
                    continue;
                }

                StepResult result = executeStep(step, ctx);
                ctx.getStepResults().put(step.getId(), result);

                // Persist the step result to the execution log
                logRepo.saveStepResult(ctx.getExecutionId(), result);

                // Failure handling
                if (!result.isSuccess() && !result.isSkipped()) {
                    if (step.getOnFailure() == OnFailureAction.STOP) {
                        log.error("Step failed, stopping execution: stepId={}", step.getId());
                        ctx.setStatus(ExecutionStatus.FAILED);
                        notifyFailure(ctx, step, result);
                        return ctx;
                    }
                    // CONTINUE (also the default when onFailure is not set): record the failure and go on
                    log.warn("Step failed, continuing: stepId={}", step.getId());
                }
            }

            ctx.setStatus(ExecutionStatus.COMPLETED);
            generateExecutionReport(ctx, runbook);
            return ctx;
        }

        private StepResult executeStep(RunbookStep step, ExecutionContext ctx) {
            Instant stepStart = Instant.now();

            // If the step requires human approval, block until a decision arrives
            if (step.isRequireApproval()) {
                String approvalMessage = renderTemplate(step.getApprovalMessage(), ctx);
                boolean approved = approvalService.requestApproval(
                        ctx.getExecutionId(), step.getId(), approvalMessage);
                if (!approved) {
                    return StepResult.builder()
                            .stepId(step.getId())
                            .success(false)
                            .errorMessage("Human approval was rejected")
                            .startTime(stepStart)
                            .endTime(Instant.now())
                            .build();
                }
            }

            // Render template variables
            String renderedCommand = renderTemplate(
                    step.getCommand() != null ? step.getCommand() : "", ctx);

            // Pick the executor that matches the step type
            ToolExecutor executor = toolRegistry.getExecutor(step.getType());
            try {
                String output = executor.execute(step, renderedCommand, ctx);
                return StepResult.builder()
                        .stepId(step.getId())
                        .success(true)
                        .output(output)
                        .startTime(stepStart)
                        .endTime(Instant.now())
                        .build();
            } catch (Exception e) {
                log.error("Step execution failed: stepId={}", step.getId(), e);
                return StepResult.builder()
                        .stepId(step.getId())
                        .success(false)
                        .errorMessage(e.getMessage())
                        .startTime(stepStart)
                        .endTime(Instant.now())
                        .build();
            }
        }

        private String renderTemplate(String template, ExecutionContext ctx) {
            Map<String, Object> variables = new HashMap<>();
            variables.put("alert", ctx.getAlert());
            variables.put("steps", ctx.getStepResults());
            variables.put("config", getSystemConfig());
            // Merge in variables stored on the execution context
            variables.putAll(ctx.getVariables());
            return templateEngine.render(template, variables);
        }

        // shouldExecuteStep, notifyFailure, generateExecutionReport and getSystemConfig are omitted here.
    }

LLM step executor
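Steps of the LLM type are a special case and need their own handling:

    import java.util.ArrayList;
    import java.util.List;

    import org.springframework.beans.factory.annotation.Autowired;
    import org.springframework.stereotype.Component;
    // OpenAiService, ChatCompletionRequest and ChatMessage come from the OpenAI Java client
    // used by the project; ExecutionContext and StepResult are the nested classes of
    // RunbookExecutionEngine.

    @Component
    public class LlmStepExecutor implements ToolExecutor {

        private static final String REF_PREFIX = "{{ steps.";
        private static final String REF_SUFFIX = ".output }}";

        @Autowired
        private OpenAiService openAiService;

        @Override
        public String getType() {
            return "llm_analysis";
        }

        @Override
        public String execute(RunbookStep step, String renderedCommand,
                              ExecutionContext ctx) {
            // Collect the context the model should see
            List<String> contextParts = new ArrayList<>();
            if (step.getContext() != null) {
                for (String contextRef : step.getContext()) {
                    String resolved = resolveContextRef(contextRef, ctx);
                    if (resolved != null) {
                        contextParts.add(resolved);
                    }
                }
            }

            String question = step.getQuestion();
            String userMessage = "Context:\n\n" +
                    String.join("\n---\n", contextParts) +
                    "\n\nQuestion: " + question;

            ChatCompletionRequest request = ChatCompletionRequest.builder()
                    .model("gpt-4o")
                    .messages(List.of(
                            new ChatMessage("system",
                                    "You are a senior SRE. Answer the question concisely and accurately based on the operations data provided. " +
                                    "Keep the answer within 200 characters and lead with the key judgment and recommendation."),
                            new ChatMessage("user", userMessage)
                    ))
                    .temperature(0.1)
                    .maxTokens(500)
                    .build();

            return openAiService.createChatCompletion(request)
                    .getChoices().get(0).getMessage().getContent();
        }

        private String resolveContextRef(String ref, ExecutionContext ctx) {
            // Resolve references of the form {{ steps.xxx.output }}.
            // The step id sits between the prefix (9 chars) and the suffix (10 chars).
            if (ref.length() > REF_PREFIX.length() + REF_SUFFIX.length()
                    && ref.startsWith(REF_PREFIX) && ref.endsWith(REF_SUFFIX)) {
                String stepId = ref.substring(
                        REF_PREFIX.length(), ref.length() - REF_SUFFIX.length());
                StepResult result = ctx.getStepResults().get(stepId);
                if (result != null) {
                    return result.getOutput();
                }
            }
            return ref;
        }
    }

Shell command executor (with safety controls)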
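
    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.List;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.TimeoutException;

    import lombok.extern.slf4j.Slf4j;
    import org.springframework.stereotype.Component;

    @Component
    @Slf4j
    public class ShellCommandExecutor implements ToolExecutor {

        // Blacklist of high-risk commands. Substring matching is a coarse guard, not a sandbox.
        private static final List<String> DANGEROUS_COMMANDS = List.of(
                "rm -rf", "dd if=", "mkfs", "fdisk", "> /dev/",
                "shutdown", "reboot", "halt"
        );
        private static final int MAX_OUTPUT_LINES = 200;

        @Override
        public String getType() {
            return "shell_command";
        }

        @Override
        public String execute(RunbookStep step, String command, ExecutionContext ctx) {
            // Safety check
            for (String dangerous : DANGEROUS_COMMANDS) {
                if (command.contains(dangerous)) {
                    throw new SecurityException(
                            "Command contains a dangerous operation, refusing to run: " + dangerous);
                }
            }

            // Log before executing
            log.info("Running shell command: executionId={}, stepId={}, command={}",
                    ctx.getExecutionId(), step.getId(), command);

            int timeoutSeconds = parseTimeout(step.getTimeout());
            Path outputFile = null;
            try {
                // Redirect output to a temp file. Reading the pipe before waitFor() would block
                // on a hung command and the timeout below would never take effect.
                outputFile = Files.createTempFile("runbook-step-", ".out");
                ProcessBuilder pb = new ProcessBuilder("bash", "-c", command);
                pb.redirectErrorStream(true);
                pb.redirectOutput(outputFile.toFile());
                Process process = pb.start();

                boolean finished = process.waitFor(timeoutSeconds, TimeUnit.SECONDS);
                if (!finished) {
                    process.destroyForcibly();
                    throw new TimeoutException("Command timed out after " + timeoutSeconds + "s");
                }

                String output = readCapped(outputFile);
                int exitCode = process.exitValue();
                if (exitCode != 0) {
                    throw new RuntimeException(
                            String.format("Non-zero exit code: %d, output: %s", exitCode, output));
                }
                return output.trim();
            } catch (SecurityException e) {
                throw e; // propagate security exceptions unchanged
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                throw new RuntimeException("Command execution interrupted", e);
            } catch (Exception e) {
                throw new RuntimeException("Command execution failed: " + e.getMessage(), e);
            } finally {
                if (outputFile != null) {
                    try {
                        Files.deleteIfExists(outputFile);
                    } catch (IOException ignored) {
                        // best-effort cleanup
                    }
                }
            }
        }

        /** Reads at most MAX_OUTPUT_LINES lines and marks the output if more were available. */
        private String readCapped(Path file) throws IOException {
            StringBuilder output = new StringBuilder();
            try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                    Files.newInputStream(file), StandardCharsets.UTF_8))) {
                String line;
                int lineCount = 0;
                while ((line = reader.readLine()) != null) {
                    if (lineCount >= MAX_OUTPUT_LINES) {
                        output.append("[output too long, truncated]");
                        break;
                    }
                    output.append(line).append("\n");
                    lineCount++;
                }
            }
            return output.toString();
        }

        /** Accepts values such as "10s" or "5m"; defaults to 60 seconds. */
        private int parseTimeout(String timeout) {
            if (timeout == null) return 60;
            if (timeout.endsWith("s")) {
                return Integer.parseInt(timeout.replace("s", ""));
            }
            if (timeout.endsWith("m")) {
                return Integer.parseInt(timeout.replace("m", "")) * 60;
            }
            return 60;
        }
    }

Human approval flow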
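When a high-risk step fires at three in the morning, how do you let the engineer approve it quickly?

    import java.util.List;
    import java.util.Map;
    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.TimeoutException;

    import lombok.extern.slf4j.Slf4j;
    import org.springframework.beans.factory.annotation.Autowired;
    import org.springframework.web.bind.annotation.PostMapping;
    import org.springframework.web.bind.annotation.RequestBody;
    import org.springframework.web.bind.annotation.RestController;

    // @RestController (rather than @Service) so that the @PostMapping callback below is registered.
    @RestController
    @Slf4j
    public class DingtalkApprovalService implements HumanApprovalService {

        @Autowired
        private DingtalkService dingtalkService;

        private final Map<String, CompletableFuture<Boolean>> pendingApprovals =
                new ConcurrentHashMap<>();

        @Override
        public boolean requestApproval(String executionId, String stepId, String message) {
            String approvalKey = executionId + ":" + stepId;
            CompletableFuture<Boolean> future = new CompletableFuture<>();
            pendingApprovals.put(approvalKey, future);

            // Send a DingTalk approval message with interactive buttons
            sendDingtalkApprovalCard(executionId, stepId, message, approvalKey);

            try {
                // Wait for the engineer to tap a button in DingTalk (at most 5 minutes)
                Boolean result = future.get(5, TimeUnit.MINUTES);
                return Boolean.TRUE.equals(result);
            } catch (TimeoutException e) {
                log.warn("Approval timed out, rejecting by default: approvalKey={}", approvalKey);
                return false;
            } catch (Exception e) {
                log.error("Error while waiting for approval", e);
                return false;
            } finally {
                pendingApprovals.remove(approvalKey);
            }
        }

        // DingTalk callback endpoint that handles the engineer's approval click
        @PostMapping("/approval/callback")
        public void handleApprovalCallback(@RequestBody ApprovalCallbackDto dto) {
            String approvalKey = dto.getExecutionId() + ":" + dto.getStepId();
            CompletableFuture<Boolean> future = pendingApprovals.get(approvalKey);
            if (future != null) {
                future.complete(dto.getApproved());
                log.info("Approval callback received: key={}, approved={}, operator={}",
                        approvalKey, dto.getApproved(), dto.getOperatorName());
            }
        }

        private void sendDingtalkApprovalCard(String executionId, String stepId,
                                              String message, String approvalKey) {
            // Build the DingTalk interactive card message
            DingtalkCardMessage card = DingtalkCardMessage.builder()
                    .title("🔧 Runbook execution needs your confirmation")
                    .content(message)
                    .buttons(List.of(
                            new CardButton("Approve and run", "approve", approvalKey),
                            new CardButton("Reject and skip", "reject", approvalKey)
                    ))
                    .build();
            dingtalkService.sendCard(card);
        }
    }

Execution log from a real case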
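Below is the record of one actual run of the Redis OOM Runbook:

    [2025-11-15 03:17:42] Alert fired, matched Runbook: Redis node OOM handling
    [2025-11-15 03:17:42] Starting step 1: check_current_status
    [2025-11-15 03:17:43] Step 1 done (1.2s)
      Output: used_memory: 8589934592 (8.00G), maxmemory: 8589934592 (8.00G)
    [2025-11-15 03:17:43] Starting step 2: check_replication
    [2025-11-15 03:17:44] Step 2 done (0.8s)
      Output: master node 192.168.1.10:6379 - OK
              slave node 192.168.1.11:6379 - connected, lag=0s
    [2025-11-15 03:17:44] Starting step 3: analyze_memory (LLM analysis)
    [2025-11-15 03:17:48] Step 3 done (4.1s)
      Analysis: Redis memory usage has reached the maxmemory limit (8GB); replication status is normal.
                Recommend purging expired keys first and checking whether large keys are taking too much memory.
    [2025-11-15 03:17:48] Sent DingTalk notification to the on-call engineer
    [2025-11-15 03:17:49] Starting step 5: flush_expired_keys
    [2025-11-15 03:17:51] Step 5 done (1.8s)
    [2025-11-15 03:17:51] Step 6 requires human approval: emergency_eviction
      Approval message sent to DingTalk
    [2025-11-15 03:19:23] Approval received: approved (operator: Engineer Zhang)
    [2025-11-15 03:19:23] Starting step 6
    [2025-11-15 03:19:24] Step 6 done (0.6s)
    [2025-11-15 03:19:24] Starting step 7: verify_recovery
    [2025-11-15 03:19:25] Step 7 done (0.9s)
      Output: used_memory_human: 6.21G (back to normal)
    [2025-11-15 03:19:25] Runbook finished, total time: 103s
    [2025-11-15 03:19:26] Jira ticket updated: execution summary written

The whole run took 103 seconds, and the engineer only had to tap one confirm button in DingTalk. Compared with the earlier 45 minutes, that is a qualitative improvement.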
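The timestamps in the log also show where those 103 seconds went. The small sketch below recomputes the figures from the log lines above; the class name `ExecutionTimeline` is only illustrative.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

/** Sketch: derive durations from the timestamps printed in the execution log. */
final class ExecutionTimeline {
    private static final DateTimeFormatter FMT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    private ExecutionTimeline() {}

    /** Whole seconds elapsed between two log timestamps. */
    static long secondsBetween(String from, String to) {
        return Duration.between(
                LocalDateTime.parse(from, FMT),
                LocalDateTime.parse(to, FMT)).getSeconds();
    }

    public static void main(String[] args) {
        // Alert fired -> Runbook finished
        long total = secondsBetween("2025-11-15 03:17:42", "2025-11-15 03:19:25");
        // Approval requested -> approval received
        long approvalWait = secondsBetween("2025-11-15 03:17:51", "2025-11-15 03:19:23");
        // prints: total=103s, approvalWait=92s, automated=11s
        System.out.println("total=" + total + "s, approvalWait=" + approvalWait
                + "s, automated=" + (total - approvalWait) + "s");
    }
}
```

By this arithmetic, 92 of the 103 seconds were spent waiting for the approval click, and all the automated steps together took about 11 seconds. In this run the human decision point is the dominant cost.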
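Lessons learned the hard way
Pitfall 1: step timeouts set too short
When a shell command runs certain Redis operations against a large cluster, execution can take far longer than expected. Give every step a generous timeout: it is better to wait than to misjudge a step as failed because it timed out.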
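One way to guard against an overly tight value in a Runbook file is to enforce a minimum timeout in the engine itself. This is a sketch of that idea, not what the system above does: the `StepTimeouts` class and the notion of a configurable floor are assumptions made for illustration, and the accepted formats ("10s", "5m") follow the YAML shown earlier.

```java
import java.time.Duration;

/** Sketch: parse step timeouts and never go below a configured floor. */
final class StepTimeouts {
    static final Duration DEFAULT = Duration.ofSeconds(60);

    private StepTimeouts() {}

    /** Parses "10s" or "5m"; falls back to DEFAULT for null or unrecognised input. */
    static Duration parse(String raw) {
        if (raw == null || raw.trim().length() < 2) {
            return DEFAULT;
        }
        String s = raw.trim();
        String digits = s.substring(0, s.length() - 1);
        if (!digits.chars().allMatch(Character::isDigit)) {
            return DEFAULT;
        }
        long n = Long.parseLong(digits);
        switch (s.charAt(s.length() - 1)) {
            case 's': return Duration.ofSeconds(n);
            case 'm': return Duration.ofMinutes(n);
            default:  return DEFAULT;
        }
    }

    /** The timeout actually used: the configured value, but at least `floor`. */
    static Duration effective(String raw, Duration floor) {
        Duration parsed = parse(raw);
        return parsed.compareTo(floor) < 0 ? floor : parsed;
    }

    public static void main(String[] args) {
        Duration floor = Duration.ofSeconds(30);
        System.out.println(effective("10s", floor).getSeconds()); // prints 30
        System.out.println(effective("5m", floor).getSeconds());  // prints 300
    }
}
```

A floor does not replace sensible per-step values, but it keeps a single "timeout: 2s" typo from turning a slow but healthy step into a false failure.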
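Pitfall 2: template variables with no default value fail to render
One step used {{ steps.check_replication.output }}, but the previous step had failed, so output was null and rendering threw an error. We have since added default-value handling to the template engine: {{ steps.xxx.output | default('no output from previous step') }}.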
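To make the behaviour concrete, here is a minimal sketch of how such a `default` filter can be handled. It is not the project's actual TemplateEngine: the `DefaultingTemplate` class is hypothetical, it supports only dotted paths and this one filter, and it assumes the variables are nested Maps rather than the StepResult objects the real engine passes in.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch: render {{ a.b.c }} and {{ a.b.c | default('text') }} over nested maps. */
final class DefaultingTemplate {
    private static final Pattern EXPR = Pattern.compile(
            "\\{\\{\\s*([A-Za-z0-9_.]+)\\s*(?:\\|\\s*default\\('([^']*)'\\)\\s*)?\\}\\}");

    private DefaultingTemplate() {}

    static String render(String template, Map<String, Object> vars) {
        Matcher m = EXPR.matcher(template);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            Object value = resolve(m.group(1), vars);
            String fallback = m.group(2); // null when no default filter was given
            if (value == null && fallback == null) {
                throw new IllegalArgumentException(
                        "No value and no default for: " + m.group(1));
            }
            String text = value != null ? value.toString() : fallback;
            m.appendReplacement(out, Matcher.quoteReplacement(text));
        }
        m.appendTail(out);
        return out.toString();
    }

    /** Walks a dotted path through nested maps; returns null if any part is missing. */
    private static Object resolve(String path, Map<String, Object> vars) {
        Object current = vars;
        for (String key : path.split("\\.")) {
            if (!(current instanceof Map)) {
                return null;
            }
            current = ((Map<?, ?>) current).get(key);
        }
        return current;
    }

    public static void main(String[] args) {
        Map<String, Object> failedStep = new HashMap<>();
        failedStep.put("output", null); // the previous step failed and produced nothing
        Map<String, Object> steps = new HashMap<>();
        steps.put("check_replication", failedStep);
        Map<String, Object> vars = new HashMap<>();
        vars.put("steps", steps);
        // prints: Replication: no output from previous step
        System.out.println(render(
                "Replication: {{ steps.check_replication.output | default('no output from previous step') }}",
                vars));
    }
}
```

In this sketch a missing value without a default still raises an error, so forgetting the filter is caught loudly instead of leaking an empty string into a command.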
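Pitfall 3: approval messages without enough context
Early approval messages only asked "Execute this step?". An engineer woken up in the middle of the night had no idea what state the incident was in, and could only tap reject and then investigate by hand. We later added a summary of the analysis from the preceding steps to the approval message, and the approval rate rose from 60% to 91%.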
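A message of that kind can be assembled from the outputs already held in the execution context. The sketch below shows one possible shape. The `ApprovalMessageBuilder` class, the "Findings so far" layout, and the 200-character cap per step are all assumptions for illustration, not the format used in production.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch: prepend a short summary of earlier step outputs to an approval question. */
final class ApprovalMessageBuilder {
    private static final int MAX_CHARS_PER_STEP = 200;

    private ApprovalMessageBuilder() {}

    /**
     * @param question     the approval question shown to the engineer
     * @param priorOutputs step id -> output of steps already executed, in order
     */
    static String build(String question, Map<String, String> priorOutputs) {
        StringBuilder sb = new StringBuilder("Findings so far:\n");
        for (Map.Entry<String, String> e : priorOutputs.entrySet()) {
            sb.append("- ").append(e.getKey()).append(": ")
              .append(summarize(e.getValue())).append('\n');
        }
        return sb.append('\n').append(question).toString();
    }

    /** Collapses whitespace to one line and caps the length so the card stays readable. */
    private static String summarize(String output) {
        if (output == null) {
            return "(no output)";
        }
        String oneLine = output.replaceAll("\\s+", " ").trim();
        return oneLine.length() <= MAX_CHARS_PER_STEP
                ? oneLine
                : oneLine.substring(0, MAX_CHARS_PER_STEP) + "...";
    }

    public static void main(String[] args) {
        Map<String, String> prior = new LinkedHashMap<>();
        prior.put("check_current_status", "used_memory: 8.00G, maxmemory: 8.00G");
        prior.put("analyze_memory", "Memory is at the maxmemory limit; replication is normal.");
        System.out.println(build(
                "About to change the Redis eviction policy to allkeys-lru. Continue?", prior));
    }
}
```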
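The real challenge with this kind of system is not the technology but the process of building trust. The first time an Agent is allowed to run operations against production on its own, everyone is nervous. Start with the lowest-risk steps and make the execution records transparent enough that engineers can see exactly what each step is doing. Trust is built up bit by bit.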
