Post 2456: Engineering AI Agents, the Architectural Evolution from Single Calls to Autonomous Systems

Lao Zhang · 2026/4/30 · about 7 minutes

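Audience: engineers with LLM application experience who want to build autonomous AI systems | Reading time: about 18 minutes | Key takeaway: understand the real challenges of engineering AI agents and avoid the "looks cool but doesn't run reliably" trap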

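A client came to us with a request: build an "automated audit assistant."

Every day it would pull the latest data from the data warehouse on its own, check each financial metric for anomalies, drill into the detailed records behind any anomaly it found, and finally produce an audit report. No manual trigger, fully autonomous.

My first reaction was that this wouldn't be hard: a scheduled job plus an LLM, and we're done, right?

Then I actually built it and ran into a pile of problems.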

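Once the LLM has the data, it has to judge "is this number anomalous?" and then decide whether to dig deeper into that record. But when deciding whether to keep digging, the model would sometimes chase a lead endlessly: a single anomaly turned into dozens of queries, and the load took the database down.

Other times it did the opposite: it classified an obvious anomaly as "normal fluctuation" and skipped right past it.

That is the fundamental difference between an AI agent and an ordinary LLM call: an agent has to decide on its own, at every step, "what do I do next," and once that chain of decisions goes wrong, the error cascades and grows.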

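The essence of an agent: a decision loop

An ordinary LLM call:

Input → LLM → Output (once)

An agent:

Initial input → LLM decision → Action → Observe result → LLM decision → Action → ... → Final output

An agent's loop can run for an arbitrary number of steps, which is the source of both its power and its risk.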
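To make the loop concrete before the full engine appears, here is a minimal sketch with the LLM replaced by a stub policy function. The names `DecisionLoopSketch`, `Decision`, and `run` are hypothetical and exist only for this illustration. The hard step cap is the one safeguard shown here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Minimal agent loop: decide, act, observe, repeat, with a hard step cap. */
final class DecisionLoopSketch {

    /** Either "call the tool with this payload" or "finish with this payload as the answer". */
    record Decision(boolean finished, String payload) {}

    /**
     * Runs the loop until the policy finishes or maxSteps is reached.
     * Returns the final answer, or null when the step cap was hit first.
     */
    static String run(Function<List<String>, Decision> policy,
                      Function<String, String> tool,
                      int maxSteps) {
        List<String> observations = new ArrayList<>();
        for (int step = 0; step < maxSteps; step++) {
            Decision decision = policy.apply(observations);    // stands in for the LLM decision
            if (decision.finished()) {
                return decision.payload();
            }
            observations.add(tool.apply(decision.payload()));  // act, then record the observation
        }
        return null; // step cap reached without a final answer
    }

    public static void main(String[] args) {
        // Stub policy: issue two queries, then finish.
        Function<List<String>, Decision> policy = obs -> obs.size() < 2
            ? new Decision(false, "query-" + obs.size())
            : new Decision(true, "done after " + obs.size() + " observations");
        Function<String, String> tool = input -> "result of " + input;

        System.out.println(run(policy, tool, 10)); // prints "done after 2 observations"
        System.out.println(run(policy, tool, 1));  // prints "null" (cap hit before finishing)
    }
}
```

The single-call case is this loop with `maxSteps` fixed at one and no tool. Everything that makes an agent harder to operate comes from letting the policy decide how many iterations to run.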

The four core components of an agent architecture

Core implementation: the ReAct reasoning loop

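import java.time.Instant;
import java.util.ArrayList;

import lombok.RequiredArgsConstructor;
import lombok.extern.slf4j.Slf4j;
import org.springframework.stereotype.Service;

/**
 * Core reasoning engine for the AI agent.
 * Implements the ReAct (Reason + Act) pattern.
 */
@Service
@Slf4j
@RequiredArgsConstructor  // generates the constructor that injects the final fields below
public class AgentExecutionEngine {

    private final LLMClient llmClient;
    private final ToolRegistry toolRegistry;
    private final AgentStateRepository stateRepo;
    private final AgentAuditLogger auditLogger;

    /**
     * Execute an agent task.
     *
     * @param taskId       task ID
     * @param initialInput initial input
     * @param constraints  execution constraints (maximum steps, timeouts, etc.)
     */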
    public AgentExecutionResult execute(
            String taskId,
            String initialInput,
            ExecutionConstraints constraints) {
        
        AgentState state = AgentState.builder()
            .taskId(taskId)
            .status(AgentStatus.RUNNING)
            .currentInput(initialInput)
            .stepCount(0)
            .observations(new ArrayList<>())
            .build();
        
        stateRepo.save(state);
        
        int maxSteps = constraints.getMaxSteps();
        
        for (int step = 0; step < maxSteps; step++) {
            // Check the overall timeout
            if (isTimeout(state, constraints.getTimeoutMs())) {
                return AgentExecutionResult.timeout(taskId, state);
            }
            
            // ReAct step: let the LLM reason and decide on the next action
            AgentDecision decision = reason(state);
            
            log.info("Agent step {}/{}: taskId={} thought='{}' action={}",
                step + 1, maxSteps, taskId,
                decision.getThought(), decision.getAction());
            
            // Audit log
            auditLogger.logStep(taskId, step, decision);
            
            // If the LLM decides the task is complete, exit the loop
            if (decision.isFinished()) {
                state.setStatus(AgentStatus.COMPLETED);
                state.setFinalAnswer(decision.getFinalAnswer());
                stateRepo.save(state);
                return AgentExecutionResult.success(taskId, decision.getFinalAnswer(), state);
            }
            
            // If human confirmation is required (high-risk operation), pause and wait
            if (decision.requiresHumanApproval()) {
                state.setStatus(AgentStatus.WAITING_HUMAN);
                stateRepo.save(state);
                return AgentExecutionResult.waitingHuman(taskId, decision, state);
            }
            
            // Execute the tool call
            // (isTimeout, executeTool and parseDecision are helpers omitted from this listing)
            ToolCallResult toolResult = executeTool(decision.getToolCall(), constraints);
            
            // Update state and persist it after every step so the run can be resumed
            state.addObservation(Observation.builder()
                .step(step)
                .toolName(decision.getToolCall().getToolName())
                .toolInput(decision.getToolCall().getInput())
                .toolOutput(toolResult.getOutput())
                .executedAt(Instant.now())
                .build());
            state.incrementStep();
            stateRepo.save(state);
        }
        
        // Maximum number of steps exceeded
        return AgentExecutionResult.maxStepsExceeded(taskId, state);
    }

    /**
     * ReAct reasoning: let the LLM analyze the current state and decide the next step.
     */
    private AgentDecision reason(AgentState state) {
        String prompt = buildReActPrompt(state);
        
        LLMResponse response = llmClient.call(LLMRequest.builder()
            .prompt(prompt)
            .temperature(0.0)  // use a low temperature for agent reasoning to reduce randomness
            .build());
        
        return parseDecision(response.getContent());
    }

    private String buildReActPrompt(AgentState state) {
        StringBuilder prompt = new StringBuilder();
        
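        prompt.append("You are an AI analysis assistant responsible for completing the following task:\n");
        prompt.append(state.getCurrentInput()).append("\n\n");
        
        // Tell the model which tools are available
        prompt.append("Available tools:\n");
        toolRegistry.getAvailableTools().forEach(tool -> {
            prompt.append(String.format("- %s: %s\n", tool.getName(), tool.getDescription()));
        });
        
        // Observation history
        if (!state.getObservations().isEmpty()) {
            prompt.append("\nCompleted steps:\n");
            state.getObservations().forEach(obs -> {
                prompt.append(String.format("Step %d: called %s(%s)\nResult: %s\n\n",
                    obs.getStep() + 1,
                    obs.getToolName(),
                    obs.getToolInput(),
                    obs.getToolOutput()
                ));
            });
        }
        
        // Ask for output in ReAct format
        prompt.append("""
            Think and decide the next step using the following format:
            
            Thought: [analyze the current state and decide what needs to happen next]
            Action: [TOOL_CALL: tool_name(arguments) or FINISH: final answer or WAIT_HUMAN: what needs confirmation]
            
            If the task is already complete, use "FINISH: final answer"
            If a sensitive operation needs human confirmation, use "WAIT_HUMAN: description of the operation to confirm"
            """);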
        
        return prompt.toString();
    }
}
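The engine calls `parseDecision`, which the listing does not show. The following is one possible sketch of that step, assuming the model follows the `TOOL_CALL:` / `FINISH:` / `WAIT_HUMAN:` markers requested in the prompt. `ReActDecisionParser`, `Kind`, and `Parsed` are hypothetical names, not the article's `AgentDecision` type. Output that matches no marker is reported as unparseable rather than guessed at, so the engine can retry or fail the step explicitly.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Parses the "Action" part of a ReAct-style model reply into a structured decision. */
final class ReActDecisionParser {

    enum Kind { TOOL_CALL, FINISH, WAIT_HUMAN, UNPARSEABLE }

    /** For TOOL_CALL, name is the tool name and body its raw argument string; otherwise name is null. */
    record Parsed(Kind kind, String name, String body) {}

    // Accept both the ASCII colon and the full-width colon (U+FF1A), since models may emit either.
    // The tool pattern is single-line: arguments run to the last ')' on the same line.
    private static final Pattern TOOL = Pattern.compile(
        "TOOL_CALL\\s*[:\\uFF1A]\\s*([A-Za-z_][A-Za-z0-9_]*)\\((.*)\\)");
    private static final Pattern WAIT = Pattern.compile(
        "WAIT_HUMAN\\s*[:\\uFF1A]\\s*(.*)", Pattern.DOTALL);
    private static final Pattern FINISH = Pattern.compile(
        "FINISH\\s*[:\\uFF1A]\\s*(.*)", Pattern.DOTALL);

    static Parsed parse(String llmOutput) {
        if (llmOutput == null) {
            return new Parsed(Kind.UNPARSEABLE, null, "");
        }
        Matcher m = TOOL.matcher(llmOutput);
        if (m.find()) {
            return new Parsed(Kind.TOOL_CALL, m.group(1), m.group(2).trim());
        }
        m = WAIT.matcher(llmOutput);
        if (m.find()) {
            return new Parsed(Kind.WAIT_HUMAN, null, m.group(1).trim());
        }
        m = FINISH.matcher(llmOutput);
        if (m.find()) {
            return new Parsed(Kind.FINISH, null, m.group(1).trim());
        }
        // No marker found: surface the raw text instead of inventing an action.
        return new Parsed(Kind.UNPARSEABLE, null, llmOutput.trim());
    }

    public static void main(String[] args) {
        Parsed p = parse("Thought: check revenue\nAction: TOOL_CALL: sql_query(SELECT COUNT(*) FROM ledger)");
        System.out.println(p.kind() + " " + p.name() + " " + p.body());
    }
}
```

Free-text parsing like this is fragile. If the LLM client in use supports structured or function-call output, that is usually the sturdier route, and this parser then serves only as a fallback.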

Safety mechanisms for tool registration and invocation

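import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import javax.sql.DataSource;

import jakarta.annotation.PostConstruct;  // javax.annotation.PostConstruct on Spring Boot 2.x
import lombok.RequiredArgsConstructor;
import lombok.extern.slf4j.Slf4j;
import org.springframework.stereotype.Service;

/**
 * Tool registry.
 * Manages every tool the agent may call in one place and controls permissions.
 */
@Service
@Slf4j
@RequiredArgsConstructor
public class ToolRegistry {

    private final Map<String, AgentTool> tools = new ConcurrentHashMap<>();

    // Assumed to be injected; the original listing uses these without declaring them
    private final DataSource dataSource;
    private final List<String> allowedDirectories;
    private final List<String> allowedApiDomains;

    @PostConstruct
    public void registerDefaultTools() {
        // SQL query tool (read-only, no write operations allowed)
        register(new SqlQueryTool(dataSource, SqlPermission.READ_ONLY));
        
        // File read tool (restricted to allowed directories)
        register(new FileReadTool(allowedDirectories));
        
        // HTTP API calls (restricted to a domain whitelist)
        register(new HttpApiTool(allowedApiDomains));
        
        // Code execution tool (sandboxed, tightly restricted):
        // 5000 ms timeout, and only these modules may be imported
        register(new SandboxedCodeExecutionTool(
            5000,
            List.of("math", "json", "datetime")
        ));
        
        // Note: the following dangerous tools are deliberately NOT registered
        // - File write tool (authorized separately when needed)
        // - Database write tool (opened only for an explicit business need)
        // - Email/SMS sending tool (prevents the agent from sending notifications by mistake)
    }

    public void register(AgentTool tool) {
        tools.put(tool.getName(), tool);
    }

    public Collection<AgentTool> getAvailableTools() {
        return tools.values();
    }

    /**
     * Execute a tool call (with safety checks).
     */
    public ToolCallResult executeTool(ToolCall call, ExecutionContext context) {
        AgentTool tool = tools.get(call.getToolName());
        if (tool == null) {
            return ToolCallResult.error("Tool does not exist: " + call.getToolName());
        }
        
        // Check tool permissions
        if (!tool.isAuthorized(context.getTaskId(), context.getOwner())) {
            return ToolCallResult.error("Not authorized to use tool: " + call.getToolName());
        }
        
        // Execute the tool and catch every exception, so a failing tool
        // becomes an observation for the agent instead of crashing the loop
        try {
            return tool.execute(call.getInput(), context);
        } catch (Exception e) {
            log.error("Tool execution failed: tool={} error={}", call.getToolName(), e.getMessage());
            return ToolCallResult.error("Tool execution failed: " + e.getMessage());
        }
    }
}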
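The article does not show how `SqlPermission.READ_ONLY` is enforced. As an illustration only, here is a conservative statement-level check that such a tool could run before touching the database. `ReadOnlySqlGuard` and `isAllowed` are hypothetical names. The check over-rejects on purpose: a `SELECT` whose string literal contains the word "update" is refused. It should complement a read-only database account, not replace it.

```java
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

/** Conservative pre-check for SQL produced by an agent: one read-only statement, no forbidden tables. */
final class ReadOnlySqlGuard {

    private static final Pattern WRITE_KEYWORDS = Pattern.compile(
        "\\b(insert|update|delete|drop|alter|truncate|create|grant|merge)\\b");

    static boolean isAllowed(String sql, List<String> forbiddenTables) {
        if (sql == null) {
            return false;
        }
        String s = sql.trim().toLowerCase(Locale.ROOT);
        if (s.endsWith(";")) {
            s = s.substring(0, s.length() - 1).trim();   // tolerate one trailing semicolon
        }
        if (!s.startsWith("select") && !s.startsWith("with")) {
            return false;                                 // only queries may run
        }
        if (s.contains(";")) {
            return false;                                 // no stacked statements
        }
        if (WRITE_KEYWORDS.matcher(s).find()) {
            return false;                                 // any write/DDL keyword anywhere is refused
        }
        for (String table : forbiddenTables) {
            Pattern p = Pattern.compile("\\b" + Pattern.quote(table.toLowerCase(Locale.ROOT)) + "\\b");
            if (p.matcher(s).find()) {
                return false;                             // mentions a forbidden resource
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> forbidden = List.of("salaries");
        System.out.println(isAllowed("SELECT * FROM ledger WHERE amount > 100", forbidden)); // prints "true"
        System.out.println(isAllowed("SELECT 1; DROP TABLE ledger", forbidden));             // prints "false"
    }
}
```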

The key engineering problem: keeping the agent from running out of control

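import java.util.Arrays;
import java.util.List;

import lombok.Builder;
import lombok.Getter;

/**
 * Agent execution constraints.
 * Prevent the agent from looping forever or consuming too many resources.
 */
@Getter   // the engine reads these through getMaxSteps(), getTimeoutMs(), ...
@Builder
public class ExecutionConstraints {
    
    // Maximum number of steps (guards against endless loops)
    private int maxSteps;
    
    // Overall execution timeout
    private long timeoutMs;
    
    // Timeout for a single tool call
    private long stepTimeoutMs;
    
    // Maximum token consumption (cost control)
    private int maxTokenBudget;
    
    // Keywords marking high-risk operations that need human confirmation
    private List<String> humanApprovalKeywords;
    
    // Tables/files the agent must not access (not set by the presets below)
    private List<String> forbiddenResources;
    
    public static ExecutionConstraints standard() {
        return ExecutionConstraints.builder()
            .maxSteps(20)
            .timeoutMs(300_000)    // 5-minute overall timeout
            .stepTimeoutMs(30_000) // 30-second timeout per step
            .maxTokenBudget(50000)
            .humanApprovalKeywords(Arrays.asList("delete", "truncate", "send", "transfer"))
            .build();
    }
    
    public static ExecutionConstraints conservative() {
        return ExecutionConstraints.builder()
            .maxSteps(10)
            .timeoutMs(120_000)    // 2-minute overall timeout
            .stepTimeoutMs(15_000) // 15-second timeout per step
            .maxTokenBudget(20000)
            .humanApprovalKeywords(Arrays.asList("delete", "truncate", "send", "update", "modify"))
            .build();
    }
}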
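The constraints object only declares limits; the listings do not show where `humanApprovalKeywords` or `maxTokenBudget` are enforced. A minimal sketch of those two checks might look like the following. `ConstraintChecks`, `requiresHumanApproval`, and `TokenBudget` are hypothetical names. Substring matching is coarse ("send" also matches "sender"), so treat it as a first filter rather than a security boundary.

```java
import java.util.List;
import java.util.Locale;

/** Two small enforcement helpers for the limits declared in the execution constraints. */
final class ConstraintChecks {

    /** True when the proposed action mentions any keyword that should trigger human review. */
    static boolean requiresHumanApproval(String actionText, List<String> keywords) {
        String lower = actionText.toLowerCase(Locale.ROOT);
        return keywords.stream()
            .anyMatch(k -> lower.contains(k.toLowerCase(Locale.ROOT)));
    }

    /** Tracks cumulative token usage for one task against a fixed budget. */
    static final class TokenBudget {
        private final int limit;
        private int used;

        TokenBudget(int limit) {
            this.limit = limit;
        }

        /** Records usage and reports whether the task is still within budget. */
        boolean consume(int tokens) {
            used += tokens;
            return used <= limit;
        }

        int remaining() {
            return Math.max(0, limit - used);
        }
    }

    public static void main(String[] args) {
        List<String> keywords = List.of("delete", "send");
        System.out.println(requiresHumanApproval("TOOL_CALL: sql_query(DELETE FROM t)", keywords)); // prints "true"

        TokenBudget budget = new TokenBudget(100);
        System.out.println(budget.consume(60)); // prints "true"
        System.out.println(budget.consume(50)); // prints "false" (110 > 100)
    }
}
```

In the engine loop, the natural place for both checks is right after `reason(state)` returns: stop with a budget-exceeded result when `consume` returns false, and pause for approval when the keyword check fires, even if the model itself did not ask for `WAIT_HUMAN`.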

State persistence: an agent that can be interrupted and resumed

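import java.time.Duration;
import java.util.Optional;

import lombok.RequiredArgsConstructor;
import org.springframework.data.redis.core.RedisTemplate;
import org.springframework.stereotype.Service;

/**
 * Agent state persistence.
 * Lets an agent continue from where it left off after an interruption.
 */
@Service
@RequiredArgsConstructor
public class AgentStateRepository {

    private final RedisTemplate<String, AgentState> redis;
    private static final Duration STATE_TTL = Duration.ofHours(24);

    public void save(AgentState state) {
        String key = "agent:state:" + state.getTaskId();
        redis.opsForValue().set(key, state, STATE_TTL);
    }

    public Optional<AgentState> load(String taskId) {
        String key = "agent:state:" + taskId;
        return Optional.ofNullable(redis.opsForValue().get(key));
    }

    /**
     * Resume an interrupted agent task.
     * Used to continue after a human review passes, or to recover from an error.
     */
    public AgentExecutionResult resume(
            String taskId,
            AgentExecutionEngine engine,
            HumanApprovalResult approvalResult) {
        
        AgentState state = load(taskId)
            .orElseThrow(() -> new AgentNotFoundException(taskId));
        
        if (state.getStatus() != AgentStatus.WAITING_HUMAN) {
            throw new IllegalStateException("Task is not waiting for human review: " + taskId);
        }
        
        // Inject the human review result into the state and persist the
        // transition before continuing, so a crash here does not lose it
        state.setLastHumanApproval(approvalResult);
        state.setStatus(AgentStatus.RUNNING);
        save(state);
        
        // Continue execution (continueExecution is not part of the engine listing above)
        return engine.continueExecution(state);
    }
}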
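The state transition behind resumption can be tried without Redis. The sketch below swaps in an in-memory map for the repository and adds a `REJECTED` status for the case where the reviewer says no. Both are assumptions made for illustration and do not appear in the article. All type names here (`ResumeSketch`, `State`, `InMemoryStore`) are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Pause/resume state transition, with an in-memory map standing in for Redis. */
final class ResumeSketch {

    enum Status { RUNNING, WAITING_HUMAN, COMPLETED, REJECTED }

    static final class State {
        final String taskId;
        Status status = Status.RUNNING;
        int stepCount = 0;   // preserved across the pause so the step cap still applies

        State(String taskId) {
            this.taskId = taskId;
        }
    }

    static final class InMemoryStore {
        private final Map<String, State> states = new HashMap<>();

        void save(State state) {
            states.put(state.taskId, state);
        }

        Optional<State> load(String taskId) {
            return Optional.ofNullable(states.get(taskId));
        }
    }

    /** Applies the human decision to a paused task and persists the transition before any further work. */
    static State resume(InMemoryStore store, String taskId, boolean approved) {
        State state = store.load(taskId)
            .orElseThrow(() -> new IllegalArgumentException("unknown task: " + taskId));
        if (state.status != Status.WAITING_HUMAN) {
            throw new IllegalStateException("task is not waiting for approval: " + taskId);
        }
        state.status = approved ? Status.RUNNING : Status.REJECTED;
        store.save(state);
        return state;
    }

    public static void main(String[] args) {
        InMemoryStore store = new InMemoryStore();
        State paused = new State("audit-1");
        paused.stepCount = 7;
        paused.status = Status.WAITING_HUMAN;
        store.save(paused);

        State resumed = resume(store, "audit-1", true);
        System.out.println(resumed.status + " at step " + resumed.stepCount); // prints "RUNNING at step 7"
    }
}
```

Two details matter when porting this to the real repository. The loop that continues execution should start from the saved `stepCount` rather than zero, otherwise every approval hands the agent a fresh step budget. A second `resume` call for the same task should also fail fast, which the status check above provides.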

Monitoring the agent system

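import java.util.concurrent.TimeUnit;

import io.micrometer.core.instrument.MeterRegistry;
import lombok.RequiredArgsConstructor;
import org.springframework.stereotype.Service;

/**
 * Agent execution monitoring.
 */
@Service
@RequiredArgsConstructor
public class AgentMonitoringService {

    private final MeterRegistry registry;

    public void recordExecution(AgentExecutionResult result) {
        // Distribution of step counts
        registry.summary("agent.steps",
            "task_type", result.getTaskType(),
            "status", result.getStatus().name()
        ).record(result.getTotalSteps());
        
        // Total duration
        registry.timer("agent.duration",
            "task_type", result.getTaskType()
        ).record(result.getDurationMs(), TimeUnit.MILLISECONDS);
        
        // Token consumption (a direct proxy for cost)
        registry.counter("agent.tokens",
            "task_type", result.getTaskType()
        ).increment(result.getTotalTokens());
        
        // Distribution of failure reasons
        // (a run paused for human review is not a failure, so it is excluded)
        if (result.getStatus() != AgentStatus.COMPLETED
                && result.getStatus() != AgentStatus.WAITING_HUMAN) {
            registry.counter("agent.failures",
                "reason", result.getFailureReason()
            ).increment();
        }
        
        // Human interventions (divide by total runs for the intervention rate,
        // which reflects how autonomous the agent really is)
        if (result.hadHumanIntervention()) {
            registry.counter("agent.human_interventions",
                "task_type", result.getTaskType()
            ).increment();
        }
    }
}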

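When to use an agent, and when not to

Scenarios where an agent fits:

The steps of the task are not known in advance and must be decided flexibly from intermediate results
The tool set is rich and several capabilities need to be combined
Longer execution times are acceptable (background, asynchronous tasks)
Occasional errors are tolerable and a retry mechanism exists

Scenarios where an agent does not fit:

The process is completely fixed (workflow orchestration is more stable)
Real-time response is required (a user is waiting)
High accuracy is required (errors compound along the chain)
Irreversible operations are involved (deleting data, sending messages)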
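The point about compounding errors can be checked with a little arithmetic. Under the simplifying assumption that every step is correct independently with the same probability p, a chain of n steps is fully correct with probability p^n. Real agent steps are neither independent nor equally reliable, so read this as a rough intuition, not a model of any particular system. The class name `CompoundingAccuracy` is hypothetical.

```java
/** Back-of-the-envelope estimate of how per-step accuracy compounds over a chain of steps. */
final class CompoundingAccuracy {

    /** Probability that all steps are correct, assuming independent steps with equal accuracy. */
    static double chainAccuracy(double perStepAccuracy, int steps) {
        return Math.pow(perStepAccuracy, steps);
    }

    public static void main(String[] args) {
        for (int steps : new int[] {1, 5, 10, 20}) {
            System.out.printf("steps=%d chain accuracy=%.3f%n", steps, chainAccuracy(0.95, steps));
        }
    }
}
```

With 95% per-step accuracy, the chance of a fully correct run drops to roughly 0.60 after 10 steps and roughly 0.36 after 20. This is why a 20-step cap is also an accuracy decision and not only a cost control.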

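For most enterprise scenarios today, a "semi-autonomous" agent is the more practical choice: the AI does the analysis and makes recommendations, and a human makes the final decision and carries it out. Fully autonomous agents are only suitable for background tasks where the risk is controlled and the effects can be rolled back.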