Post #1763: Automated Incident Runbooks, Letting an Agent Execute Standardized Ops Procedures

Lao Zhang · 2026/4/30 · about 10 min read

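Here is something that left a deep impression on me.

At 3 a.m. one night, a P1 alert woke up an ops colleague. A node in the Redis cluster had been OOM-killed (out of memory) and restarted, a master-replica failover was in progress, and the service saw about 20 seconds of write jitter. The whole handling process (receive the alert → confirm the problem → run the predefined handling steps → verify recovery → write the on-call log) took 45 minutes end to end.

Yet the handling procedure for this failure is made up of fixed steps. The ops team has an internal Runbook (a standard operating manual), written in detail, 11 steps in total, and every time the engineer simply follows it.

Scenarios like this are among the best fits for Agent automation.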

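What Runbook automation really is

A Runbook is essentially an ordered sequence of steps, each a concrete operation: run a command, call an API, check a status, send a notification, and so on.

Having an Agent execute these steps brings several core benefits:

Eliminates human response latency: execution starts as soon as the alert fires, without waiting for someone to wake up.
Guarantees consistent execution: the standard steps are followed every time, and none are skipped because an engineer is tired.
Generates execution logs automatically: the input and output of every step are recorded, so the postmortem has evidence to work from.
Controls human decision points: high-risk operations (such as restarting an instance) require human confirmation, while low-risk operations run automatically.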

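System architecture design

The key design decision is where to place the human decision points (human-in-the-loop). Not every step can be fully automated; high-risk operations must be confirmed by a person. This boundary has to be drawn clearly.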

Runbook definition format

We designed a YAML-based specification for defining Runbooks:

# redis-oom-runbook.yaml
name: Redis node OOM handling
version: "1.2"
description: Standard handling procedure when a Redis node hits OOM
triggerConditions:
  - alertName: "RedisMemoryUsageHigh"
    threshold: "95%"
  - alertName: "RedisOOMKilled"

steps:
  - id: check_current_status
    name: Check current Redis cluster status
    type: shell_command
    requireApproval: false
    command: "redis-cli -h {{ alert.host }} -p {{ alert.port }} INFO memory"
    timeout: 10s
    onFailure: CONTINUE  # Continue even on failure (a failed status check does not block later steps)

  - id: check_replication
    name: Check master-replica replication status
    type: shell_command
    requireApproval: false
    command: "redis-cli -h {{ alert.host }} -p {{ alert.port }} cluster nodes"
    timeout: 10s
    onFailure: STOP
    
  - id: analyze_memory
    name: LLM analysis of memory usage
    type: llm_analysis
    requireApproval: false
    context:
      - "{{ steps.check_current_status.output }}"
      - "{{ steps.check_replication.output }}"
    question: "Based on the Redis INFO output above, analyze whether memory usage is normal and whether immediate scale-up or cleanup is needed"

  - id: notify_team
    name: Notify the on-call team
    type: notification
    requireApproval: false
    channel: "dingtalk"
    message: |
      The Redis node OOM alert has triggered the automated handling flow
      Affected node: {{ alert.host }}:{{ alert.port }}
      Current status analysis: {{ steps.analyze_memory.output }}
      
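  - id: flush_expired_keys
    name: Clean up expired keys (low-risk operation)
    type: shell_command
    requireApproval: false
    # DEBUG SLEEP 0 only sleeps for zero seconds and does not force an expiry cycle
    # (DEBUG is also disabled by default for remote clients since Redis 7).
    # SCAN is used instead, on the assumption that iterating keys on a master
    # lazily deletes the expired keys it encounters.
    command: "redis-cli -h {{ alert.host }} -p {{ alert.port }} --scan | wc -l"
    description: "Walks the keyspace with incremental SCAN so expired keys are lazily deleted without blocking normal requests"
    timeout: 30s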
    
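  - id: emergency_eviction
    name: Switch the eviction policy to allkeys-lru
    type: shell_command
    requireApproval: true  # Requires human confirmation because it affects the cache hit rate
    approvalMessage: "About to change the Redis eviction policy to allkeys-lru. Confirm to continue?"
    command: "redis-cli -h {{ alert.host }} -p {{ alert.port }} config set maxmemory-policy allkeys-lru"
    # Note: no step in this file captures original_policy; it has to be put into
    # the execution context variables before this step runs.
    rollbackCommand: "redis-cli -h {{ alert.host }} -p {{ alert.port }} config set maxmemory-policy {{ original_policy }}"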
    
  - id: verify_recovery
    name: Verify that memory usage is back to normal
    type: shell_command
    requireApproval: false
    command: "redis-cli -h {{ alert.host }} -p {{ alert.port }} INFO memory | grep used_memory_human"
    timeout: 10s
    condition: "output contains 'used_memory_human'"

  - id: update_incident
    name: Update the incident ticket
    type: http_api
    requireApproval: false
    method: POST
    url: "{{ config.jira_base_url }}/rest/api/2/issue/{{ alert.incidentId }}/comment"
    body: |
      {
        "body": "Automated handling completed. Execution summary:\n{{ execution_summary }}"
      }

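This YAML format supports template variables (the {{ }} syntax), references to earlier steps' results ({{ steps.xxx.output }}), conditional execution, human approval, and rollback commands.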
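The triggerConditions block implies a lookup from an incoming alert name to a Runbook, which the article does not show. A minimal sketch of that lookup, assuming a hypothetical RunbookMatcher class and a plain map from Runbook name to its alert names (thresholds are ignored here for brevity):

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Hypothetical helper: picks the Runbook whose triggerConditions mention the alert name. */
final class RunbookMatcher {

    private RunbookMatcher() {}

    /**
     * @param triggersByRunbook Runbook name -> alert names listed under triggerConditions
     * @param alertName         name of the alert that just fired
     * @return the first matching Runbook name, or empty when nothing matches
     */
    static Optional<String> match(Map<String, List<String>> triggersByRunbook, String alertName) {
        return triggersByRunbook.entrySet().stream()
            .filter(entry -> entry.getValue().contains(alertName))
            .map(Map.Entry::getKey)
            .findFirst();
    }
}
```

When no Runbook matches, returning an empty Optional lets the caller fall back to paging a human instead of guessing.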

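Java implementation: the Runbook execution engine

import java.time.Instant;
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

import lombok.Builder;
import lombok.Data;
import lombok.extern.slf4j.Slf4j;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Service;

// Project types (RunbookDefinition, RunbookStep, NormalizedAlert, ...) are defined elsewhere
@Service
@Slf4j
public class RunbookExecutionEngine {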
    
    @Autowired
    private ToolExecutorRegistry toolRegistry;
    
    @Autowired
    private HumanApprovalService approvalService;
    
    @Autowired
    private TemplateEngine templateEngine;
    
    @Autowired
    private ExecutionLogRepository logRepo;
    
    @Data
    public static class ExecutionContext {
        private String executionId;
        private NormalizedAlert alert;
        private Map<String, StepResult> stepResults = new HashMap<>();
        private Map<String, Object> variables = new HashMap<>();
        private ExecutionStatus status;
        private Instant startTime;
    }
    
    @Data
    @Builder
    public static class StepResult {
        private String stepId;
        private boolean success;
        private String output;
        private String errorMessage;
        private Instant startTime;
        private Instant endTime;
        private boolean skipped;
        private boolean pendingApproval;
    }
    
    public ExecutionContext execute(RunbookDefinition runbook, NormalizedAlert alert) {
        ExecutionContext ctx = new ExecutionContext();
        ctx.setExecutionId(UUID.randomUUID().toString());
        ctx.setAlert(alert);
        ctx.setStatus(ExecutionStatus.RUNNING);
        ctx.setStartTime(Instant.now());
        
        log.info("Starting Runbook execution: name={}, executionId={}, alert={}", 
            runbook.getName(), ctx.getExecutionId(), alert.getAlertId());
        
        for (RunbookStep step : runbook.getSteps()) {
            if (!shouldExecuteStep(step, ctx)) {
                log.info("Skipping step: {}", step.getId());
                ctx.getStepResults().put(step.getId(), 
                    StepResult.builder().stepId(step.getId()).skipped(true).build());
                continue;
            }
            
            StepResult result = executeStep(step, ctx);
            ctx.getStepResults().put(step.getId(), result);
            
            // Persist the execution log
            logRepo.saveStepResult(ctx.getExecutionId(), result);
            
            // Handle step failure
            if (!result.isSuccess() && !result.isSkipped()) {
                if (step.getOnFailure() == OnFailureAction.STOP) {
                    log.error("Step failed, stopping execution: stepId={}", step.getId());
                    ctx.setStatus(ExecutionStatus.FAILED);
                    notifyFailure(ctx, step, result);
                    return ctx;
                }
                // CONTINUE (or unset): record the failure but keep going
                log.warn("Step failed, continuing execution: stepId={}", step.getId());
            }
        }
        
        ctx.setStatus(ExecutionStatus.COMPLETED);
        generateExecutionReport(ctx, runbook);
        return ctx;
    }
    
    private StepResult executeStep(RunbookStep step, ExecutionContext ctx) {
        Instant stepStart = Instant.now();
        
        // If the step requires human approval, wait for it
        if (step.isRequireApproval()) {
            String approvalMessage = renderTemplate(step.getApprovalMessage(), ctx);
            boolean approved = approvalService.requestApproval(
                ctx.getExecutionId(), step.getId(), approvalMessage);
            
            if (!approved) {
                return StepResult.builder()
                    .stepId(step.getId())
                    .success(false)
                    .errorMessage("Human approval was rejected")
                    .startTime(stepStart)
                    .endTime(Instant.now())
                    .build();
            }
        }
        
        // Render template variables
        String renderedCommand = renderTemplate(
            step.getCommand() != null ? step.getCommand() : "", ctx);
        
        // Pick the executor based on the step type
        ToolExecutor executor = toolRegistry.getExecutor(step.getType());
        
        try {
            String output = executor.execute(step, renderedCommand, ctx);
            
            return StepResult.builder()
                .stepId(step.getId())
                .success(true)
                .output(output)
                .startTime(stepStart)
                .endTime(Instant.now())
                .build();
                
        } catch (Exception e) {
            log.error("Step execution failed: stepId={}", step.getId(), e);
            
            return StepResult.builder()
                .stepId(step.getId())
                .success(false)
                .errorMessage(e.getMessage())
                .startTime(stepStart)
                .endTime(Instant.now())
                .build();
        }
    }
    
    private String renderTemplate(String template, ExecutionContext ctx) {
        Map<String, Object> variables = new HashMap<>();
        variables.put("alert", ctx.getAlert());
        variables.put("steps", ctx.getStepResults());
        variables.put("config", getSystemConfig());
        
        // Merge in the variables held by the execution context
        variables.putAll(ctx.getVariables());
        
        return templateEngine.render(template, variables);
    }
}
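The engine calls shouldExecuteStep, but the article does not show it, nor does it say which output a condition is evaluated against. A minimal sketch of the condition check, assuming the only supported form is output contains '...' (as in the YAML above) and that the caller decides which step output to pass in; StepConditionEvaluator is a hypothetical name:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical evaluator for conditions of the form: output contains 'some text'. */
final class StepConditionEvaluator {

    private static final Pattern CONTAINS =
        Pattern.compile("^\\s*output\\s+contains\\s+'([^']*)'\\s*$");

    private StepConditionEvaluator() {}

    /** Returns true when no condition is set, or when the output contains the quoted text. */
    static boolean evaluate(String condition, String output) {
        if (condition == null || condition.isBlank()) {
            return true; // unconditional step
        }
        Matcher matcher = CONTAINS.matcher(condition);
        if (!matcher.matches()) {
            // Fail loudly rather than silently skipping or running a step
            throw new IllegalArgumentException("Unsupported condition: " + condition);
        }
        return output != null && output.contains(matcher.group(1));
    }
}
```

Rejecting unknown condition syntax keeps a typo in a Runbook from quietly changing which steps run in production.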

LLM step executor

Steps of the LLM type are a special case and need their own handling:

@Component
public class LlmStepExecutor implements ToolExecutor {
    
    @Autowired
    private OpenAiService openAiService;
    
    @Override
    public String getType() {
        return "llm_analysis";
    }
    
    @Override
    public String execute(RunbookStep step, String renderedCommand, 
                           ExecutionContext ctx) {
        // Collect the context information
        List<String> contextParts = new ArrayList<>();
        
        if (step.getContext() != null) {
            for (String contextRef : step.getContext()) {
                String resolved = resolveContextRef(contextRef, ctx);
                if (resolved != null) {
                    contextParts.add(resolved);
                }
            }
        }
        
        String question = step.getQuestion();
        
        String userMessage = "Context:\n\n" + 
            String.join("\n---\n", contextParts) + 
            "\n\nQuestion: " + question;
        
        ChatCompletionRequest request = ChatCompletionRequest.builder()
            .model("gpt-4o")
            .messages(List.of(
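                new ChatMessage("system", 
                    "You are a senior SRE. Answer the question concisely and accurately based on the ops data provided. " +
                    "Keep the answer under 200 words and focus on the key judgments and recommendations."),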
                new ChatMessage("user", userMessage)
            ))
            .temperature(0.1)
            .maxTokens(500)
            .build();
        
        return openAiService.createChatCompletion(request)
            .getChoices().get(0).getMessage().getContent();
    }
    
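    private String resolveContextRef(String ref, ExecutionContext ctx) {
        // Resolve references of the form {{ steps.xxx.output }}
        final String prefix = "{{ steps.";
        final String suffix = ".output }}";
        if (ref.length() > prefix.length() + suffix.length()
                && ref.startsWith(prefix) && ref.endsWith(suffix)) {
            // The suffix is 10 characters long; trimming only 9 would leave a trailing "." on the step id
            String stepId = ref.substring(prefix.length(), ref.length() - suffix.length());
            StepResult result = ctx.getStepResults().get(stepId);
            if (result != null) {
                return result.getOutput();
            }
        }
        return ref;
    }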
}

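Shell command executor (with safety controls)

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

import lombok.extern.slf4j.Slf4j;
import org.springframework.stereotype.Component;

@Component
@Slf4j
public class ShellCommandExecutor implements ToolExecutor {
    
    // Blacklist of high-risk commands (plain substring match: a last line of defense, not a sandbox)
    private static final List<String> DANGEROUS_COMMANDS = List.of(
        "rm -rf", "dd if=", "mkfs", "fdisk", "> /dev/",
        "shutdown", "reboot", "halt"
    );
    private static final int MAX_OUTPUT_LINES = 200;
    
    @Override
    public String getType() {
        return "shell_command";
    }
    
    @Override
    public String execute(RunbookStep step, String command, ExecutionContext ctx) {
        // Safety check
        for (String dangerous : DANGEROUS_COMMANDS) {
            if (command.contains(dangerous)) {
                throw new SecurityException(
                    "Command contains a dangerous operation, refusing to execute: " + dangerous);
            }
        }
        
        // Log before executing
        log.info("Executing shell command: executionId={}, stepId={}, command={}", 
            ctx.getExecutionId(), step.getId(), command);
        
        int timeoutSeconds = parseTimeout(step.getTimeout());
        
        try {
            ProcessBuilder pb = new ProcessBuilder("bash", "-c", command);
            pb.redirectErrorStream(true);
            Process process = pb.start();
            
            // Drain output on another thread. Reading inline would block until the
            // process exits, so the timeout below would never take effect.
            CompletableFuture<String> outputFuture =
                CompletableFuture.supplyAsync(() -> readOutput(process));
            
            boolean finished = process.waitFor(timeoutSeconds, TimeUnit.SECONDS);
            if (!finished) {
                process.destroyForcibly();
                throw new TimeoutException("Command timed out after " + timeoutSeconds + "s");
            }
            
            String output = outputFuture.get(5, TimeUnit.SECONDS);
            int exitCode = process.exitValue();
            if (exitCode != 0) {
                throw new RuntimeException(
                    String.format("Command exited with non-zero code: %d, output: %s", exitCode, output));
            }
            
            return output.trim();
            
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            throw new RuntimeException("Command execution interrupted", e);
        } catch (Exception e) {
            throw new RuntimeException("Command execution failed: " + e.getMessage(), e);
        }
    }
    
    private String readOutput(Process process) {
        StringBuilder output = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            int lineCount = 0;
            // Keep only the first 200 lines, but keep reading so the child never blocks on a full pipe
            while ((line = reader.readLine()) != null) {
                if (lineCount < MAX_OUTPUT_LINES) {
                    output.append(line).append("\n");
                }
                lineCount++;
            }
            if (lineCount > MAX_OUTPUT_LINES) {
                output.append("[output too long, truncated]");
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return output.toString();
    }
    
    private int parseTimeout(String timeout) {
        if (timeout == null) return 60;
        String number = timeout.substring(0, timeout.length() - 1);
        if (timeout.endsWith("s")) {
            return Integer.parseInt(number);
        }
        if (timeout.endsWith("m")) {
            return Integer.parseInt(number) * 60;
        }
        return 60; // unrecognized unit: fall back to 60 seconds
    }
}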
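Because alert fields such as {{ alert.host }} are rendered straight into a bash -c command line, a substring blacklist alone does not stop a crafted value like "10.0.0.1; cat /etc/passwd". One possible complement is to validate those fields before rendering. The sketch below is an assumption layered on top of the article's design; AlertFieldValidator and its rules are hypothetical:

```java
import java.util.regex.Pattern;

/** Hypothetical validator for alert fields that end up inside shell commands. */
final class AlertFieldValidator {

    // Hostnames and IPv4 addresses only: letters, digits, dots and hyphens
    private static final Pattern HOST = Pattern.compile("^[A-Za-z0-9][A-Za-z0-9.-]{0,252}$");

    private AlertFieldValidator() {}

    /** Returns the host unchanged if it contains no shell metacharacters, otherwise throws. */
    static String requireSafeHost(String host) {
        if (host == null || !HOST.matcher(host).matches()) {
            throw new IllegalArgumentException("Unsafe host value: " + host);
        }
        return host;
    }

    /** Parses the port and checks that it is within 1..65535. */
    static int requireValidPort(String port) {
        int value;
        try {
            value = Integer.parseInt(port);
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException("Port is not a number: " + port, e);
        }
        if (value < 1 || value > 65535) {
            throw new IllegalArgumentException("Port out of range: " + value);
        }
        return value;
    }
}
```

An allow-list of expected characters tends to be easier to reason about than a growing list of forbidden commands.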

Human approval flow

When a high-risk step fires at 3 a.m., how do we let the engineer approve it quickly?

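// Must be a controller bean: Spring MVC only maps @PostMapping methods on @Controller/@RestController classes
@RestController
@Slf4j
public class DingtalkApprovalService implements HumanApprovalService {
    
    // Used by sendDingtalkApprovalCard below; the DingtalkService type name is assumed
    @Autowired
    private DingtalkService dingtalkService;
    
    private final Map<String, CompletableFuture<Boolean>> pendingApprovals = 
        new ConcurrentHashMap<>();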
    
    @Override
    public boolean requestApproval(String executionId, String stepId, String message) {
        String approvalKey = executionId + ":" + stepId;
        
        CompletableFuture<Boolean> future = new CompletableFuture<>();
        pendingApprovals.put(approvalKey, future);
        
        // Send the DingTalk approval message with interactive buttons
        sendDingtalkApprovalCard(executionId, stepId, message, approvalKey);
        
        try {
            // Wait for the engineer to tap approve or reject in DingTalk (at most 5 minutes)
            Boolean result = future.get(5, TimeUnit.MINUTES);
            return Boolean.TRUE.equals(result);
        } catch (TimeoutException e) {
            log.warn("Approval timed out, rejecting by default: approvalKey={}", approvalKey);
            return false;
        } catch (Exception e) {
            log.error("Error while waiting for approval", e);
            return false;
        } finally {
            pendingApprovals.remove(approvalKey);
        }
    }
    
    // DingTalk callback endpoint that handles the engineer's approval tap
    @PostMapping("/approval/callback")
    public void handleApprovalCallback(@RequestBody ApprovalCallbackDto dto) {
        String approvalKey = dto.getExecutionId() + ":" + dto.getStepId();
        CompletableFuture<Boolean> future = pendingApprovals.get(approvalKey);
        
        if (future != null) {
            future.complete(dto.getApproved());
            log.info("Received approval callback: key={}, approved={}, operator={}", 
                approvalKey, dto.getApproved(), dto.getOperatorName());
        }
    }
    
    private void sendDingtalkApprovalCard(String executionId, String stepId, 
                                           String message, String approvalKey) {
        // Build the DingTalk interactive card message
        DingtalkCardMessage card = DingtalkCardMessage.builder()
            .title("🔧 Runbook execution needs your confirmation")
            .content(message)
            .buttons(List.of(
                new CardButton("Approve", "approve", approvalKey),
                new CardButton("Reject and skip", "reject", approvalKey)
            ))
            .build();
        
        dingtalkService.sendCard(card);
    }
}

Execution log from a real case

Below is the record of one actual run of the Redis OOM Runbook:

[2025-11-15 03:17:42] Alert fired, matched Runbook: Redis node OOM handling
[2025-11-15 03:17:42] Starting step 1: check_current_status
[2025-11-15 03:17:43] Step 1 completed (1.2s)
  Output: used_memory: 8589934592 (8.00G), maxmemory: 8589934592 (8.00G)

[2025-11-15 03:17:43] Starting step 2: check_replication
[2025-11-15 03:17:44] Step 2 completed (0.8s)
  Output: master node 192.168.1.10:6379 - OK
          slave node 192.168.1.11:6379 - connected, lag=0s

[2025-11-15 03:17:44] Starting step 3: analyze_memory (LLM analysis)
[2025-11-15 03:17:48] Step 3 completed (4.1s)
  Analysis: Redis memory usage has reached the maxmemory limit (8GB); replication is healthy.
            Recommend cleaning up expired keys first and checking whether large keys are using too much memory.

[2025-11-15 03:17:48] Sent DingTalk notification to the on-call engineer
[2025-11-15 03:17:49] Starting step 5: flush_expired_keys
[2025-11-15 03:17:51] Step 5 completed (1.8s)

[2025-11-15 03:17:51] Step 6 requires human approval: emergency_eviction
  Approval message sent to DingTalk
[2025-11-15 03:19:23] Approval received: approved (operator: Engineer Zhang)
[2025-11-15 03:19:23] Starting step 6
[2025-11-15 03:19:24] Step 6 completed (0.6s)

[2025-11-15 03:19:24] Starting step 7: verify_recovery
[2025-11-15 03:19:25] Step 7 completed (0.9s)
  Output: used_memory_human: 6.21G (back to normal)

[2025-11-15 03:19:25] Runbook execution completed, total time: 103s
[2025-11-15 03:19:26] Updated Jira ticket: execution summary written

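The whole run took 103 seconds, and the engineer only had to tap one confirm button in DingTalk. Compared with the 45 minutes it used to take, that is roughly 26 times faster.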

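Lessons from the pitfalls

Pitfall 1: step timeouts set too short

When shell commands run certain Redis operations against a large cluster, they can take far longer than expected. Give each step a generous timeout: it is better to wait than to misjudge a step as failed because it timed out.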
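One way a timeout ends up shorter than intended is a value the parser does not understand: the parseTimeout shown earlier falls back to 60 seconds for something like "2min". A stricter parser that rejects unknown formats makes such mistakes visible when the Runbook is loaded. This is only a sketch; StepTimeout and the accepted units (s, m, h) are assumptions, not part of the original system:

```java
import java.time.Duration;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical strict parser for step timeouts such as "10s", "2m" or "1h". */
final class StepTimeout {

    private static final Pattern FORMAT = Pattern.compile("^(\\d+)([smh])$");

    private StepTimeout() {}

    /** Returns the fallback for a missing value and throws for a malformed one. */
    static Duration parse(String raw, Duration fallback) {
        if (raw == null || raw.isBlank()) {
            return fallback;
        }
        Matcher matcher = FORMAT.matcher(raw.trim());
        if (!matcher.matches()) {
            throw new IllegalArgumentException("Invalid timeout: " + raw);
        }
        long amount = Long.parseLong(matcher.group(1));
        switch (matcher.group(2)) {
            case "s": return Duration.ofSeconds(amount);
            case "m": return Duration.ofMinutes(amount);
            default:  return Duration.ofHours(amount); // "h" is the only remaining unit
        }
    }
}
```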

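Pitfall 2: template rendering fails when a variable has no default

One step used {{ steps.check_replication.output }}, but the previous step had failed, so output was null and rendering threw an error. We have since added default-value handling to the template engine: {{ steps.xxx.output | default('no output from previous step') }}.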
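The article does not show how the template engine implements the default filter. As a rough illustration of the behavior, here is a small regex-based renderer, assuming values are looked up by their flat dotted path (for example "steps.check_replication.output") in a map; DefaultingTemplateRenderer is a hypothetical stand-in for the real TemplateEngine:

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical renderer supporting {{ path }} and {{ path | default('text') }}. */
final class DefaultingTemplateRenderer {

    private static final Pattern PLACEHOLDER = Pattern.compile(
        "\\{\\{\\s*([\\w.]+)\\s*(?:\\|\\s*default\\('([^']*)'\\)\\s*)?\\}\\}");

    private DefaultingTemplateRenderer() {}

    /** Replaces each placeholder with its value, or with the default when the value is missing. */
    static String render(String template, Map<String, String> values) {
        Matcher matcher = PLACEHOLDER.matcher(template);
        StringBuilder out = new StringBuilder();
        while (matcher.find()) {
            String path = matcher.group(1);
            String value = values.get(path);
            if (value == null) {
                String fallback = matcher.group(2);
                if (fallback == null) {
                    throw new IllegalStateException("No value and no default for: " + path);
                }
                value = fallback;
            }
            // quoteReplacement keeps '$' and '\' in command output from being treated as group references
            matcher.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        matcher.appendTail(out);
        return out.toString();
    }
}
```

A placeholder with neither a value nor a default still fails, which keeps a half-rendered command from ever reaching the shell.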

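Pitfall 3: approval messages lacked context

Early approval messages only asked "Execute this step?". An engineer woken at 3 a.m. had no idea what state the incident was in, so the only option was to tap reject and investigate by hand. Later we added a summary of the earlier steps' analysis results to the approval message, and the approval rate rose from 60% to 91%.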
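How the summary is assembled is not shown in the article. One simple approach is to prepend a truncated, single-line digest of each earlier step's output to the approval question. The sketch below assumes a hypothetical ApprovalMessageBuilder and a 200-character cap per step:

```java
import java.util.Map;

/** Hypothetical builder that adds earlier step outputs to an approval question. */
final class ApprovalMessageBuilder {

    private static final int MAX_SUMMARY_CHARS = 200;

    private ApprovalMessageBuilder() {}

    /**
     * @param question        the approvalMessage defined on the step
     * @param previousOutputs step id -> output of steps that already ran, in iteration order
     */
    static String build(String question, Map<String, String> previousOutputs) {
        StringBuilder sb = new StringBuilder("Context from earlier steps:\n");
        for (Map.Entry<String, String> entry : previousOutputs.entrySet()) {
            sb.append("- ").append(entry.getKey()).append(": ")
              .append(summarize(entry.getValue())).append('\n');
        }
        return sb.append('\n').append(question).toString();
    }

    /** Collapses whitespace to one line and caps the length. */
    private static String summarize(String text) {
        if (text == null) {
            return "(no output)";
        }
        String oneLine = text.replaceAll("\\s+", " ").trim();
        return oneLine.length() <= MAX_SUMMARY_CHARS
            ? oneLine
            : oneLine.substring(0, MAX_SUMMARY_CHARS) + "...";
    }
}
```

Passing a LinkedHashMap keeps the steps in execution order, which makes the card easier to read on a phone.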

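The real challenge with this kind of system is not the technology but the process of building trust. The first time an Agent is allowed to act on production automatically, everyone is nervous. Start with the lowest-risk steps and make the execution records transparent enough that engineers can see exactly what each step is doing. Trust is built up bit by bit.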