AI Agent Development Pitfalls: The Traps I Fell Into in Production

Lao Zhang · 2026/4/19 · About 21 min read · Tags: Agent, Pitfalls, Production Experience, Spring AI, Java, Tool Calling

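Prologue: The Night the Bill Exploded

On a Friday afternoon in October 2025, Lao Zhou, the tech lead at an internet company, sent me a voice message.

His voice was calm, but what he said made me break out in a cold sweat: "Lao Zhang, between 3 and 5 p.m. today our Agent service called an external weather API 4,273 times and blew through the billing quota. Ops killed the service outright. We are doing the postmortem now and have no idea what happened."

We later found the cause. Lao Zhou's team had built a "travel planning Agent" that could call a weather API, a map API, and a hotel API. One user typed in this request: "Plan a two-week trip covering 10 cities for me."

The Agent entered a loop:

Call the weather API for city A
Decide it needs to compare city A with city B, and call again
Decide it needs the weather for each of the next 14 days, and call 14 times
Decide it needs the weather for backup cities...

It only stopped when the API quota was exhausted and the system forced a halt. Over those 2 hours it made 4,273 API calls and ran up an unexpected bill of about ¥3,200.

This is the most typical kind of AI Agent production incident: runaway tool calls. Today I will take the 6 most common pitfalls in Agent development apart one by one, each with complete Java code to guard against it.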

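Pitfall 1: Missing Idempotency in Tool Calls

Problem

An Agent may issue duplicate calls to the same tool because of network timeouts, unstable model reasoning, and similar causes. If the tool does something like "send an SMS", "charge a payment", or "create an order", a duplicate call is disastrous.

Real case: a team built a "smart expense Agent" that submits an expense claim automatically after the user uploads an invoice. During testing, the same invoice was submitted 3 times. The first call was slow (about 8 seconds), the model misjudged it as a failure, and it called the tool again.

Mitigation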

package com.example.agent.tools;

import org.springframework.data.redis.core.StringRedisTemplate;
import org.springframework.stereotype.Component;

import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

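/**
 * Tool idempotency guard.
 * Uses a Redis key with a TTL so that a tool call with the same parameters runs only once within the TTL window.
 *
 * Usage: at the start of a tool method, call idempotent.execute(key, ttl, () -> business logic)
 */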
@Component
public class ToolIdempotencyGuard {

    private final StringRedisTemplate redisTemplate;

    private static final String KEY_PREFIX = "tool_idempotency:";

    public ToolIdempotencyGuard(StringRedisTemplate redisTemplate) {
        this.redisTemplate = redisTemplate;
    }

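    /**
     * Runs a tool with idempotency protection.
     *
     * @param idempotencyKey idempotency key (usually tool name + parameters)
     * @param ttlSeconds     TTL in seconds (repeat calls with the same key inside this window get the cached result)
     * @param action         the actual tool logic
     * @return the tool's result
     */
    public String execute(String idempotencyKey, int ttlSeconds, Supplier<String> action) {
        String redisKey = KEY_PREFIX + idempotencyKey;
        final String inProgress = "__IN_PROGRESS__";

        // Claim the key atomically (SET NX EX). A plain get-then-set leaves a window
        // in which two concurrent duplicates both see "not executed" and both run.
        Boolean claimed = redisTemplate.opsForValue()
            .setIfAbsent(redisKey, inProgress, ttlSeconds, TimeUnit.SECONDS);

        if (!Boolean.TRUE.equals(claimed)) {
            String cachedResult = redisTemplate.opsForValue().get(redisKey);
            if (cachedResult == null || inProgress.equals(cachedResult)) {
                return "[idempotent] An identical call is still in progress. Do not retry; wait for its result.";
            }
            return "[idempotent return] " + cachedResult;
        }

        try {
            // Run the tool logic, then replace the marker with the real result
            String result = action.get();
            redisTemplate.opsForValue().set(redisKey, result, ttlSeconds, TimeUnit.SECONDS);
            return result;
        } catch (RuntimeException e) {
            // Release the claim so a corrected call can go through
            redisTemplate.delete(redisKey);
            throw e;
        }
    }

    /**
     * Builds a standard idempotency key: tool name + a digest of the parameters.
     * A name-based UUID (128-bit) is used instead of String.hashCode(), whose 32 bits collide too easily.
     */
    public static String buildKey(String toolName, Object... params) {
        StringBuilder sb = new StringBuilder(toolName);
        for (Object param : params) {
            sb.append(":").append(param);
        }
        byte[] bytes = sb.toString().getBytes(java.nio.charset.StandardCharsets.UTF_8);
        return toolName + ":" + java.util.UUID.nameUUIDFromBytes(bytes);
    }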
}

package com.example.agent.tools;

import org.springframework.ai.tool.annotation.Tool;
import org.springframework.stereotype.Component;

/**
 * Expense tools (with idempotency protection)
 */
@Component
public class ExpenseTools {

    private final ToolIdempotencyGuard idempotencyGuard;
    private final ExpenseService expenseService; // business service

    public ExpenseTools(ToolIdempotencyGuard idempotencyGuard,
                         ExpenseService expenseService) {
        this.idempotencyGuard = idempotencyGuard;
        this.expenseService = expenseService;
    }

    @Tool(description = "Submit an expense claim. Parameters: invoiceId = invoice ID, amount = amount in fen (1/100 yuan), description = note")
    public String submitExpense(String invoiceId, Long amount, String description) {
        // Idempotency key: based on the invoice ID (the same invoice can be submitted only once per 10 minutes)
        String key = ToolIdempotencyGuard.buildKey("submitExpense", invoiceId);

        return idempotencyGuard.execute(key, 600, () -> {
            String result = expenseService.submit(invoiceId, amount, description);
            return String.format("Expense claim submitted, document number: %s, amount: %.2f yuan",
                result, amount / 100.0);
        });
    }

    @Tool(description = "Send a notification SMS. Parameters: userId = user ID, message = message content")
    public String sendNotification(String userId, String message) {
        // Same user + same message is sent only once per 5 minutes
        String key = ToolIdempotencyGuard.buildKey("sendNotification", userId, message);

        return idempotencyGuard.execute(key, 300, () -> {
            expenseService.sendSms(userId, message);
            return "Notification sent";
        });
    }
}
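
The core idea of the guard, claim the key first and only then run the side effect, can be tried without Redis. The following is a minimal in-memory sketch under stated assumptions: `InMemoryIdempotencyGuard` is a hypothetical class name, a `ConcurrentHashMap` stands in for Redis, and there is no TTL, so it is only suitable for illustrating the claim step, not for production use.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Supplier;

/** In-memory stand-in for the Redis guard (hypothetical): claim the key first, then run. */
class InMemoryIdempotencyGuard {
    private static final String IN_PROGRESS = "__IN_PROGRESS__";
    private final ConcurrentMap<String, String> store = new ConcurrentHashMap<>();

    String execute(String key, Supplier<String> action) {
        // putIfAbsent is atomic: exactly one caller sees null and wins the claim.
        String existing = store.putIfAbsent(key, IN_PROGRESS);
        if (existing != null) {
            return IN_PROGRESS.equals(existing)
                ? "[duplicate] identical call still running"
                : "[idempotent return] " + existing;
        }
        try {
            String result = action.get();
            store.put(key, result); // replace the marker with the real result
            return result;
        } catch (RuntimeException e) {
            store.remove(key); // release the claim so a later call can retry
            throw e;
        }
    }
}
```

Calling `execute` twice with the same key runs the supplier once; the second call gets the first result back with the "[idempotent return]" prefix.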

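Pitfall 2: Context Window Overflow

Problem

An Agent's conversation history grows with every tool call. Each round of tool calling appends to the message list:

The model's tool call request (the ToolCall message)
The tool's execution result (the ToolResult message)

A complex task can go through 20+ rounds of tool calls and accumulate tens of thousands of tokens, eventually exceeding the model's context window and causing a failure or truncation.

Real data: in Lao Zhou's travel planning Agent, each weather API round added about 400 tokens of history, so 100 rounds came to 40,000 tokens. That already overflows an 8K or 32K window and takes up roughly a third of GPT-4o's 128K window. At the 4,273 calls seen in the incident, the same rate would mean about 1.7 million tokens, far beyond 128K.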
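
The arithmetic behind these figures is easy to put into a helper. This is a rough sketch, assuming a fixed per-round cost; `ContextBudget` and its method names are hypothetical, and the 4,000-token reserve used in the usage note is only an illustrative value.

```java
/** Back-of-the-envelope estimate of how many tool rounds fit in a context window (hypothetical helper). */
class ContextBudget {

    /** Rounds that fit once the fixed overhead (system prompt, tool schemas, reply) is reserved. */
    static int maxRounds(int contextWindowTokens, int reservedTokens, int tokensPerRound) {
        if (tokensPerRound <= 0) {
            throw new IllegalArgumentException("tokensPerRound must be positive");
        }
        return Math.max(0, (contextWindowTokens - reservedTokens) / tokensPerRound);
    }

    /** History size after a given number of rounds. */
    static long historyTokens(int rounds, int tokensPerRound) {
        return (long) rounds * tokensPerRound;
    }
}
```

With 400 tokens per round, `historyTokens(100, 400)` gives 40,000, and `maxRounds(128_000, 4_000, 400)` gives 310 rounds before a 128K window is full.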

Mitigation

package com.example.agent.context;

import org.springframework.ai.chat.messages.Message;
import org.springframework.ai.chat.messages.SystemMessage;
import org.springframework.stereotype.Component;

import java.util.ArrayList;
import java.util.List;

/**
 * Agent context window manager.
 * Keeps the conversation history from growing without bound and overflowing the context.
 */
@Component
public class AgentContextManager {

    // Token budget for one Agent task
    private static final int SYSTEM_PROMPT_BUDGET = 500;   // system prompt
    private static final int TASK_BUDGET = 500;             // task description
    private static final int TOOLS_BUDGET = 1000;           // tool schema descriptions
    private static final int HISTORY_BUDGET = 4000;         // conversation history (the limit that matters most)
    private static final int RESPONSE_BUDGET = 2000;        // reserved for model output

    // Total budget: 8000 tokens (conservative; GPT-4o supports 128K).
    // It can be larger in practice, but an oversized history dilutes the model's attention.
    private static final int TOTAL_HISTORY_BUDGET = HISTORY_BUDGET;

    // Estimated tokens per message (rough on purpose, to avoid the cost of exact counting)
    private static final int AVG_TOKENS_PER_MESSAGE = 150;

    /**
     * Trims the Agent's message history to stay within the token budget.
     * Strategy: keep the system message + the latest N rounds of tool calls.
     */
    public List<Message> trimHistory(List<Message> history) {
        if (history.isEmpty()) return history;

        int estimatedTokens = history.size() * AVG_TOKENS_PER_MESSAGE;

        if (estimatedTokens <= TOTAL_HISTORY_BUDGET) {
            return history; // within budget, nothing to trim
        }

        // How many messages can be kept
        int maxMessages = TOTAL_HISTORY_BUDGET / AVG_TOKENS_PER_MESSAGE;

        // Keep the latest maxMessages messages (tool calls must be kept in pairs)
        int startIndex = Math.max(0, history.size() - maxMessages);

        // Do not start on a tool result: its matching tool call would be missing
        while (startIndex < history.size() &&
               isToolResultMessage(history.get(startIndex))) {
            startIndex++;
        }

        List<Message> trimmed = new ArrayList<>();
        // Always keep the first SystemMessage (the task description)
        if (history.get(0) instanceof SystemMessage) {
            trimmed.add(history.get(0));
        }
        // Add a truncation notice
        trimmed.add(new SystemMessage(
            "[Note: some earlier history was truncated because of the context length limit. What follows is the most recent activity.]"));
        // Add the latest history
        trimmed.addAll(history.subList(startIndex, history.size()));

        return trimmed;
    }

    private boolean isToolResultMessage(Message msg) {
        // Check the message type rather than the class name, which breaks on renames
        return msg.getMessageType() == org.springframework.ai.chat.messages.MessageType.TOOL;
    }
}
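
The pairing rule in the trimming strategy (never let the kept window begin with a tool result whose tool call was cut off) can be checked in isolation. This is a framework-free sketch: `HistoryTrimmer`, `Role`, and `Msg` are hypothetical stand-ins for the Spring AI message types, and the message count is used directly instead of a token estimate.

```java
import java.util.ArrayList;
import java.util.List;

/** Framework-free illustration of pair-preserving history trimming (hypothetical types). */
class HistoryTrimmer {
    enum Role { SYSTEM, USER, ASSISTANT, TOOL_RESULT }

    record Msg(Role role, String text) {}

    static List<Msg> trim(List<Msg> history, int maxMessages) {
        if (history.size() <= maxMessages) {
            return history; // nothing to trim
        }
        int start = history.size() - maxMessages;
        // A tool result without the assistant tool call before it is an orphan: skip past it.
        while (start < history.size() && history.get(start).role() == Role.TOOL_RESULT) {
            start++;
        }
        List<Msg> out = new ArrayList<>();
        if (history.get(0).role() == Role.SYSTEM) {
            out.add(history.get(0)); // always keep the leading system message
        }
        out.addAll(history.subList(start, history.size()));
        return out;
    }
}
```

The window may end up one message shorter than `maxMessages` when an orphaned result is dropped, which seems a fair price for a history the provider will accept.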

Turn Counter: Preventing Infinite Loops

package com.example.agent.context;

import org.springframework.stereotype.Component;

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Agent turn counter.
 * Keeps an Agent from entering an infinite loop.
 */
@Component
public class AgentTurnCounter {

    private final ConcurrentHashMap<String, AtomicInteger> turnCounters =
        new ConcurrentHashMap<>();

    // Maximum tool-call turns allowed for a single task
    private static final int MAX_TURNS_DEFAULT = 15;
    private static final int MAX_TURNS_COMPLEX = 30;  // complex tasks

    /**
     * Records one tool call and checks whether the limit has been exceeded.
     * @return true if within the limit, false once the limit is exceeded
     */
    public boolean incrementAndCheck(String agentTaskId) {
        return incrementAndCheck(agentTaskId, MAX_TURNS_DEFAULT);
    }

    public boolean incrementAndCheck(String agentTaskId, int maxTurns) {
        AtomicInteger counter = turnCounters.computeIfAbsent(
            agentTaskId, k -> new AtomicInteger(0));

        int turns = counter.incrementAndGet();
        return turns <= maxTurns;
    }

    public int getCurrentTurns(String agentTaskId) {
        AtomicInteger counter = turnCounters.get(agentTaskId);
        return counter == null ? 0 : counter.get();
    }

    public void reset(String agentTaskId) {
        turnCounters.remove(agentTaskId);
    }
}

Pitfall 3: Hallucinated Parameters

Mitigation: Parameter Validation Layer

package com.example.agent.tools;

import org.springframework.stereotype.Component;

import java.util.Map;
import java.util.Set;
import java.util.function.Predicate;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Tool parameter validation layer.
 * Validates parameters before a tool runs, so hallucinated parameters do not cause errors.
 */
@Component
public class ToolParameterValidator {

    // Allowed values for enum-style parameters
    private final Map<String, Set<String>> enumValidValues = new ConcurrentHashMap<>();
    // Custom validators
    private final Map<String, Predicate<String>> customValidators = new ConcurrentHashMap<>();

    /**
     * Registers the allowed values of an enum-style parameter
     */
    public void registerEnumValues(String paramKey, Set<String> validValues) {
        enumValidValues.put(paramKey, validValues);
    }

    /**
     * Registers a custom validator
     */
    public void registerValidator(String paramKey, Predicate<String> validator) {
        customValidators.put(paramKey, validator);
    }

    /**
     * Validates a parameter and returns the result
     */
    public ValidationResult validate(String paramKey, String value) {
        // 0. A missing value fails early (Set.of(...).contains(null) would throw)
        if (value == null) {
            return ValidationResult.fail(
                String.format("Parameter '%s' is required but was missing", paramKey));
        }

        // 1. Enum value check
        Set<String> validValues = enumValidValues.get(paramKey);
        if (validValues != null && !validValues.isEmpty()) {
            if (!validValues.contains(value)) {
                return ValidationResult.fail(
                    String.format("Value '%s' is not valid for parameter '%s'. Allowed values: %s",
                        value, paramKey, String.join(", ", validValues))
                );
            }
        }

        // 2. Custom check
        Predicate<String> validator = customValidators.get(paramKey);
        if (validator != null && !validator.test(value)) {
            return ValidationResult.fail(
                String.format("Value '%s' does not match the required format for parameter '%s'", value, paramKey)
            );
        }

        return ValidationResult.ok();
    }

    public record ValidationResult(boolean valid, String errorMessage) {
        public static ValidationResult ok() {
            return new ValidationResult(true, null);
        }
        public static ValidationResult fail(String msg) {
            return new ValidationResult(false, msg);
        }
    }
}

package com.example.agent.tools;

import jakarta.annotation.PostConstruct;
import org.springframework.ai.tool.annotation.Tool;
import org.springframework.stereotype.Component;

import java.util.Set;

/**
 * CRM query tools (with parameter validation)
 */
@Component
public class CrmQueryTools {

    private final ToolParameterValidator validator;
    private final CrmService crmService;

    public CrmQueryTools(ToolParameterValidator validator, CrmService crmService) {
        this.validator = validator;
        this.crmService = crmService;
    }

    @PostConstruct
    public void initValidators() {
        // Register the allowed branch codes (load from a database or config in practice)
        validator.registerEnumValues("branchCode", Set.of(
            "BEIJING-CORP-01", "SHANGHAI-CORP-01", "GUANGZHOU-CORP-01",
            "SHENZHEN-CORP-01", "CHENGDU-CORP-01"
        ));

        // Register the month format check (YYYY-MM)
        validator.registerValidator("yearMonth",
            s -> s != null && s.matches("\\d{4}-(0[1-9]|1[0-2])"));
    }

    @Tool(description = """
            Query sales data for a branch.
            branchCode: branch code, must be one of:
            BEIJING-CORP-01 (Beijing), SHANGHAI-CORP-01 (Shanghai),
            GUANGZHOU-CORP-01 (Guangzhou), SHENZHEN-CORP-01 (Shenzhen), CHENGDU-CORP-01 (Chengdu)
            startMonth: start month in YYYY-MM format, e.g. 2025-10
            endMonth: end month in YYYY-MM format
            """)
    public String querySalesData(String branchCode, String startMonth, String endMonth) {
        // Validate parameters
        ToolParameterValidator.ValidationResult branchCheck =
            validator.validate("branchCode", branchCode);
        if (!branchCheck.valid()) {
            // Return a meaningful error so the Agent can correct itself
            return "Parameter error: " + branchCheck.errorMessage() +
                   "\nCall again with a valid branch code.";
        }

        ToolParameterValidator.ValidationResult startCheck =
            validator.validate("yearMonth", startMonth);
        if (!startCheck.valid()) {
            return "Parameter error (startMonth): " + startCheck.errorMessage();
        }

        // endMonth uses the same format, so validate it the same way
        ToolParameterValidator.ValidationResult endCheck =
            validator.validate("yearMonth", endMonth);
        if (!endCheck.valid()) {
            return "Parameter error (endMonth): " + endCheck.errorMessage();
        }

        // Run the actual query
        return crmService.querySales(branchCode, startMonth, endMonth);
    }
}
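
Format checks catch malformed months, but a model can also hand over two well-formed months in the wrong order. A possible extra check, sketched here with `java.time.YearMonth`, returns an error message the Agent can act on. `MonthRangeCheck` is a hypothetical helper and is not part of the validator above.

```java
import java.time.YearMonth;
import java.time.format.DateTimeParseException;
import java.util.Optional;

/** Hypothetical helper: checks that two YYYY-MM strings parse and form a valid range. */
class MonthRangeCheck {

    /** Returns an error message for the model, or empty if the range is usable. */
    static Optional<String> check(String startMonth, String endMonth) {
        YearMonth start;
        YearMonth end;
        try {
            start = YearMonth.parse(startMonth); // ISO format, e.g. 2025-10
            end = YearMonth.parse(endMonth);
        } catch (DateTimeParseException | NullPointerException e) {
            return Optional.of("startMonth and endMonth must use the YYYY-MM format, e.g. 2025-10");
        }
        if (start.isAfter(end)) {
            return Optional.of("startMonth " + start + " is after endMonth " + end
                + "; swap or correct them");
        }
        return Optional.empty();
    }
}
```

A tool method could call `MonthRangeCheck.check(startMonth, endMonth)` after the per-field validation and return the message, if present, instead of running the query.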

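Pitfall 4: Timeout Storms

Problem

When an external API tool responds slowly, it sets off a chain reaction:

The tool call times out (30 seconds by default)
The Agent waits out the timeout, then retries
Multiple concurrent Agent instances retry at the same time
The external API is overwhelmed and responds even more slowly
More timeouts, more retries...

Mitigation: Full Resilience4j Configuration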

package com.example.agent.resilience;

import io.github.resilience4j.circuitbreaker.annotation.CircuitBreaker;
import io.github.resilience4j.retry.annotation.Retry;
import io.github.resilience4j.timelimiter.annotation.TimeLimiter;
import lombok.extern.slf4j.Slf4j;
import org.springframework.ai.tool.annotation.Tool;
import org.springframework.stereotype.Component;

import java.util.concurrent.CompletableFuture;

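/**
 * External API tool with full fault-tolerance protection.
 * Combines Resilience4j circuit breaker + retry + time limiter.
 */
@Slf4j
@Component
public class WeatherApiTool {

    private final WeatherApiClient weatherApiClient;

    public WeatherApiTool(WeatherApiClient weatherApiClient) {
        this.weatherApiClient = weatherApiClient;
    }

    // With Resilience4j's default aspect order, Retry wraps CircuitBreaker, which wraps TimeLimiter.
    // The fallback therefore sits on @Retry, the outermost layer: a fallback on @CircuitBreaker
    // would swallow the exception before Retry ever saw a failure.
    // Check that your Spring AI version awaits a CompletableFuture tool result; if it does not,
    // move this method to a separate bean and have the @Tool method call it and join().
    @Tool(description = "Query the weather for a city. city: city name (in Chinese), e.g. 北京, 上海")
    @CircuitBreaker(name = "weather-api")
    @Retry(name = "weather-api", fallbackMethod = "weatherFallback")
    @TimeLimiter(name = "weather-api")
    public CompletableFuture<String> queryWeather(String city) {
        return CompletableFuture.supplyAsync(() -> {
            log.info("[WeatherTool] Querying weather for city: {}", city);
            return weatherApiClient.query(city);
        });
    }

    /**
     * Fallback used when the weather API is unavailable
     * (retries exhausted, call timed out, or circuit open).
     */
    public CompletableFuture<String> weatherFallback(String city, Exception ex) {
        log.warn("[WeatherTool] Weather API unavailable, returning fallback response: {}", ex.getMessage());
        return CompletableFuture.completedFuture(
            String.format("The weather service is temporarily unavailable (%s). Tell the user to try again later, or estimate from historical weather data.",
                ex.getClass().getSimpleName())
        );
    }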
}

# Resilience4j configuration in application.yml
resilience4j:
  # Circuit breaker
  circuitbreaker:
    instances:
      weather-api:
        sliding-window-size: 10              # evaluate the last 10 calls
        failure-rate-threshold: 50           # open the circuit above a 50% failure rate
        wait-duration-in-open-state: 30s     # stay open for 30s before going half-open
        permitted-number-of-calls-in-half-open-state: 3

  # Retry
  retry:
    instances:
      weather-api:
        max-attempts: 2                      # at most 1 retry (2 attempts in total)
        wait-duration: 1s
        retry-exceptions:
          - java.net.SocketTimeoutException
          - java.io.IOException
        ignore-exceptions:
          - com.example.agent.exception.InvalidParameterException

  # Time limiter
  timelimiter:
    instances:
      weather-api:
        timeout-duration: 5s               # at most 5 seconds per call
        cancel-running-future: true
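
The retry settings above use a fixed 1-second wait. When many Agent instances fail at the same moment, a fixed wait makes them all retry at the same moment too, which is exactly the storm described in the problem statement. A common remedy is capped exponential backoff with random jitter. The sketch below shows only the delay calculation in plain Java; `Backoff` is a hypothetical helper, and whether to wire it in by hand or through Resilience4j's own interval settings is a decision left to the reader.

```java
import java.util.Random;

/** Hypothetical helper: capped exponential backoff with "full jitter". */
class Backoff {

    /** Returns a delay drawn uniformly from [0, min(capMillis, baseMillis * 2^attempt)]. */
    static long delayMillis(int attempt, long baseMillis, long capMillis, Random random) {
        if (attempt < 0) {
            throw new IllegalArgumentException("attempt must be >= 0");
        }
        long exponential = baseMillis << Math.min(attempt, 20); // bound the shift to avoid overflow
        long ceiling = Math.min(capMillis, exponential);
        return (long) (random.nextDouble() * (ceiling + 1));
    }
}
```

Because each instance draws its own random delay, retries spread out over the interval instead of arriving together.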

Pitfall 5: Concurrent Resource Contention

Mitigation: Agent-Level Resource Isolation

package com.example.agent.executor;

import lombok.extern.slf4j.Slf4j;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor;

import java.util.concurrent.Executor;

/**
 * Agent executor configuration.
 * Prevents concurrent resource contention through thread pool isolation and semaphore-based limiting.
 */
@Slf4j
@Configuration
public class AgentExecutorConfig {

    /**
     * Dedicated Agent thread pool (isolated from the web request pool)
     */
    @Bean("agentExecutor")
    public Executor agentExecutor() {
        ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
        executor.setCorePoolSize(10);        // core threads
        executor.setMaxPoolSize(20);         // max threads
        executor.setQueueCapacity(50);       // queue capacity (tasks beyond it are rejected)
        executor.setThreadNamePrefix("agent-");
        executor.setRejectedExecutionHandler((r, pool) -> {
            log.warn("[AgentExecutor] Thread pool is full, rejecting new Agent task. " +
                     "Active threads: {}, queue size: {}",
                pool.getActiveCount(), pool.getQueue().size());
            throw new AgentBusyException("System busy, please retry later");
        });
        executor.initialize();
        return executor;
    }

    /**
     * Dedicated document-processing thread pool (IO-bound; fewer threads, deeper queue)
     */
    @Bean("documentProcessingExecutor")
    public Executor documentProcessingExecutor() {
        ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
        executor.setCorePoolSize(5);
        executor.setMaxPoolSize(10);
        executor.setQueueCapacity(100);
        executor.setThreadNamePrefix("doc-process-");
        executor.initialize();
        return executor;
    }
}

package com.example.agent.executor;

import org.springframework.stereotype.Component;

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

/**
 * External API call limiter.
 * Caps concurrent calls so that parallel Agents do not trip the external API's rate limits.
 */
@Component
public class ApiRateLimiter {

    // One concurrency semaphore per external API
    private final ConcurrentHashMap<String, Semaphore> semaphores =
        new ConcurrentHashMap<>();

    /**
     * Registers the concurrency limit of an API
     * @param apiName       API name
     * @param maxConcurrent maximum concurrent calls
     */
    public void register(String apiName, int maxConcurrent) {
        semaphores.put(apiName, new Semaphore(maxConcurrent, true));
    }

    /**
     * Calls an API under the concurrency limit
     */
    public <T> T callWithLimit(String apiName, java.util.concurrent.Callable<T> callable)
            throws Exception {
        Semaphore semaphore = semaphores.get(apiName);
        if (semaphore == null) {
            return callable.call(); // no limit registered, call directly
        }

        boolean acquired = semaphore.tryAcquire(5, java.util.concurrent.TimeUnit.SECONDS);
        if (!acquired) {
            throw new AgentBusyException(
                String.format("API[%s] has reached its concurrency limit, please retry later", apiName));
        }

        try {
            return callable.call();
        } finally {
            semaphore.release();
        }
    }
}
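
A semaphore caps how many calls are in flight at once, but not how many are made per second: fast calls can still exceed a provider's request-rate quota. If that matters, a token bucket is one common complement. The following is a minimal sketch under stated assumptions: `TokenBucket` is a hypothetical class, and the clock is injected as a `LongSupplier` (for example `System::nanoTime`) so the behavior can be tested deterministically.

```java
import java.util.function.LongSupplier;

/** Hypothetical token bucket: allows bursts up to capacity, then refillPerSecond calls per second. */
class TokenBucket {
    private final long capacity;
    private final double refillPerSecond;
    private final LongSupplier nanoClock;
    private double tokens;
    private long lastRefillNanos;

    TokenBucket(long capacity, double refillPerSecond, LongSupplier nanoClock) {
        this.capacity = capacity;
        this.refillPerSecond = refillPerSecond;
        this.nanoClock = nanoClock;
        this.tokens = capacity; // start full
        this.lastRefillNanos = nanoClock.getAsLong();
    }

    /** Takes one token if available; returns false when the caller should back off. */
    synchronized boolean tryAcquire() {
        long now = nanoClock.getAsLong();
        double elapsedSeconds = (now - lastRefillNanos) / 1_000_000_000.0;
        tokens = Math.min(capacity, tokens + elapsedSeconds * refillPerSecond);
        lastRefillNanos = now;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;
    }
}
```

In production code the bucket would be created with `System::nanoTime` and checked before the semaphore is acquired, so that rejected calls never occupy a concurrency slot.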

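Pitfall 6: Phantom Actions

Problem

This is the most dangerous kind of pitfall: the AI performs an operation the user never authorized.

Real case: a smart email assistant Agent was told "tidy up my inbox", and interpreted that as:

Delete all read emails (the user never asked for deletion!)
Archive all emails
Auto-reply to some emails (!!!)

The user only meant "categorize and label", but the Agent carried out irreversible operations without asking for confirmation first.

Mitigation: Operation Risk Tiers + Second Confirmation for Dangerous Operations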

package com.example.agent.safety;

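/**
 * Operation risk levels.
 * Each level gets a different execution strategy.
 */
public enum OperationRisk {
    /**
     * Read-only operations: query, search, fetch information.
     * Strategy: execute directly.
     */
    READ_ONLY,

    /**
     * Low-risk writes: create a draft, add a label, write temporary data.
     * Strategy: execute and write an audit log entry.
     */
    LOW_RISK,

    /**
     * Medium-risk writes: send a message, create a record.
     * Strategy: tell the user what is about to be done before executing.
     */
    MEDIUM_RISK,

    /**
     * High-risk operations: delete, modify critical data, external payment.
     * Strategy: pause and wait for explicit user confirmation.
     */
    HIGH_RISK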
}

package com.example.agent.safety;

import org.springframework.stereotype.Component;

/**
 * Agent safety gatekeeper.
 * Intercepts high-risk operations and requires user confirmation.
 */
@Component
public class AgentSafetyGuard {

    private final AuditLogService auditLogService;
    private final UserConfirmationService confirmationService;

    public AgentSafetyGuard(AuditLogService auditLogService,
                              UserConfirmationService confirmationService) {
        this.auditLogService = auditLogService;
        this.confirmationService = confirmationService;
    }

    /**
     * Runs a tool with a safety check.
     *
     * @param agentTaskId   task ID (used to correlate the confirmation)
     * @param toolName      tool name
     * @param operationDesc natural-language description of the operation (shown to the user)
     * @param risk          operation risk level
     * @param action        the operation to run
     * @return the result, or a confirmation request if confirmation is needed
     */
    public String executeWithSafety(String agentTaskId,
                                     String toolName,
                                     String operationDesc,
                                     OperationRisk risk,
                                     java.util.function.Supplier<String> action) {
        switch (risk) {
            case READ_ONLY -> {
                return action.get();
            }

            case LOW_RISK -> {
                String result = action.get();
                auditLogService.log(agentTaskId, toolName, operationDesc, result);
                return result;
            }

            case MEDIUM_RISK -> {
                // Execute and notify the user
                String result = action.get();
                auditLogService.log(agentTaskId, toolName, operationDesc, result);
                // NOTE: nothing is pushed to the user here yet. Wire in a real notification
                // (e.g. over WebSocket) before relying on the "[user notified]" marker below.
                return result + "\n[user notified]";
            }

            case HIGH_RISK -> {
                // Pause and wait for confirmation
                boolean confirmed = confirmationService.requestConfirmation(
                    agentTaskId,
                    String.format("The Agent is about to perform the following operation. Please confirm:\n\n%s\n\n" +
                                  "Reply \"yes\" to proceed or \"no\" to cancel",
                        operationDesc)
                );

                if (!confirmed) {
                    auditLogService.logBlocked(agentTaskId, toolName, operationDesc);
                    return "Operation cancelled (rejected by the user)";
                }

                String result = action.get();
                auditLogService.log(agentTaskId, toolName, operationDesc, result);
                return result;
            }

            default -> throw new IllegalArgumentException("Unknown risk level: " + risk);
        }
    }
}

package com.example.agent.tools;

import com.example.agent.safety.AgentSafetyGuard;
import com.example.agent.safety.OperationRisk;
import org.springframework.ai.tool.annotation.Tool;
import org.springframework.stereotype.Component;

/**
 * Email tools (with risk tiers)
 */
@Component
public class EmailTools {

    private final AgentSafetyGuard safetyGuard;
    private final EmailService emailService;

    public EmailTools(AgentSafetyGuard safetyGuard, EmailService emailService) {
        this.safetyGuard = safetyGuard;
        this.emailService = emailService;
    }

    @Tool(description = "List emails in a folder. folder: folder name, limit: number of emails to return")
    public String listEmails(String folder, int limit) {
        // Read-only, execute directly
        return safetyGuard.executeWithSafety(
            "email-agent",
            "listEmails",
            String.format("List the latest %d emails in folder %s", limit, folder),
            OperationRisk.READ_ONLY,
            () -> emailService.list(folder, limit)
        );
    }

    @Tool(description = "Add a label to an email. emailId: email ID, label: label name")
    public String addLabel(String emailId, String label) {
        // Low risk, execute + log
        return safetyGuard.executeWithSafety(
            "email-agent",
            "addLabel",
            String.format("Add label [%s] to email [%s]", label, emailId),
            OperationRisk.LOW_RISK,
            () -> emailService.addLabel(emailId, label)
        );
    }

    @Tool(description = "Send a reply to an email. emailId: original email ID, content: reply content")
    public String sendReply(String emailId, String content) {
        // Medium risk, execute and notify
        return safetyGuard.executeWithSafety(
            "email-agent",
            "sendReply",
            String.format("Reply to email [%s], content: %s", emailId, content.substring(0, Math.min(50, content.length()))),
            OperationRisk.MEDIUM_RISK,
            () -> emailService.sendReply(emailId, content)
        );
    }

    @Tool(description = "Delete an email. emailId: email ID")
    public String deleteEmail(String emailId) {
        // High risk! User confirmation is mandatory
        return safetyGuard.executeWithSafety(
            "email-agent",
            "deleteEmail",
            String.format("Permanently delete email [%s]; this cannot be undone", emailId),
            OperationRisk.HIGH_RISK,
            () -> emailService.delete(emailId)
        );
    }
}

7. Monitoring: MDC Tracing + Duplicate Tool Call Detection

package com.example.agent.monitoring;

import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import lombok.extern.slf4j.Slf4j;
import org.slf4j.MDC;
import org.springframework.stereotype.Component;

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Agent execution monitor.
 * Features: MDC tracing, turn counting, duplicate tool call detection, execution time recording
 */
@Slf4j
@Component
public class AgentExecutionMonitor {

    private final MeterRegistry meterRegistry;

    // Tool call history per task, used to detect duplicate calls
    private final ConcurrentHashMap<String, ConcurrentHashMap<String, AtomicInteger>>
        toolCallHistory = new ConcurrentHashMap<>();

    // Alert threshold for duplicate calls
    private static final int DUPLICATE_CALL_THRESHOLD = 3;

    public AgentExecutionMonitor(MeterRegistry meterRegistry) {
        this.meterRegistry = meterRegistry;
    }

    /**
     * Records a tool call and returns whether the duplicate-call alert fired
     */
    public boolean recordToolCall(String agentTaskId, String toolName, String paramHash) {
        // Set MDC so the log lines can be traced
        MDC.put("agentTaskId", agentTaskId);
        MDC.put("toolName", toolName);

        String callKey = toolName + ":" + paramHash;
        ConcurrentHashMap<String, AtomicInteger> taskHistory =
            toolCallHistory.computeIfAbsent(agentTaskId, k -> new ConcurrentHashMap<>());

        int callCount = taskHistory.computeIfAbsent(callKey, k -> new AtomicInteger(0))
            .incrementAndGet();

        // Micrometer counter. The task ID is deliberately not a tag: one time series
        // per task would grow without bound. The task ID is in the logs via MDC.
        meterRegistry.counter("agent.tool.calls",
            "tool", toolName
        ).increment();

        // Duplicate call detection
        if (callCount >= DUPLICATE_CALL_THRESHOLD) {
            log.warn("[AgentMonitor] Duplicate tool call detected! taskId={}, tool={}, params={}, count={}",
                agentTaskId, toolName, paramHash, callCount);

            meterRegistry.counter("agent.tool.duplicate_calls",
                "tool", toolName
            ).increment();

            return true; // alert fired
        }

        log.debug("[AgentMonitor] Tool call: taskId={}, tool={}, callCount={}",
            agentTaskId, toolName, callCount);

        return false;
    }

    /**
     * Clears the task history (call when the task finishes)
     */
    public void cleanupTask(String agentTaskId) {
        toolCallHistory.remove(agentTaskId);
        MDC.clear();
    }
}

8. Putting It Together: The Complete Agent Service

package com.example.agent.service;

import com.example.agent.context.AgentContextManager;
import com.example.agent.context.AgentTurnCounter;
import com.example.agent.monitoring.AgentExecutionMonitor;
import lombok.extern.slf4j.Slf4j;
import org.slf4j.MDC;
import org.springframework.ai.chat.client.ChatClient;
import org.springframework.ai.chat.messages.Message;
import org.springframework.ai.chat.messages.UserMessage;
import org.springframework.stereotype.Service;

import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

/**
 * Production-grade Agent execution service.
 * Combines: turn counting, context management, safety checks, monitoring
 */
@Slf4j
@Service
public class ProductionAgentService {

    private final ChatClient chatClient;
    private final AgentTurnCounter turnCounter;
    private final AgentContextManager contextManager;
    private final AgentExecutionMonitor monitor;

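    // Maximum number of turns for one Agent task
    private static final int MAX_TURNS = 15;

    private static final String AGENT_SYSTEM_PROMPT = """
            You are an internal enterprise assistant Agent.
            
            Important constraints:
            1. Call each tool once, and only when it is really needed; avoid duplicate calls
            2. If a tool returns an error, analyze the cause carefully, fix the parameters, and try again, with at most 1 retry
            3. If 2 tool calls in a row fail, stop and tell the user the task cannot be completed
            4. Give the final answer as soon as you have enough information; do not over-collect
            5. For irreversible operations such as deleting or sending, tell the user explicitly first
            """;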

    public ProductionAgentService(ChatClient.Builder builder,
                                    AgentTurnCounter turnCounter,
                                    AgentContextManager contextManager,
                                    AgentExecutionMonitor monitor) {
        this.chatClient = builder.build();
        this.turnCounter = turnCounter;
        this.contextManager = contextManager;
        this.monitor = monitor;
    }

    /**
     * Runs an Agent task
     */
    public AgentResult execute(String userTask) {
        String taskId = UUID.randomUUID().toString().substring(0, 8);
        MDC.put("agentTaskId", taskId);

        log.info("[Agent] Task started: taskId={}, task={}", taskId, userTask);

        List<Message> history = new ArrayList<>();
        history.add(new UserMessage(userTask));

        try {
            // Agent loop (bounded by the turn limit)
            for (int turn = 0; turn < MAX_TURNS; turn++) {
                // Check the turn limit
                if (!turnCounter.incrementAndCheck(taskId)) {
                    log.warn("[Agent] Maximum turn limit reached: taskId={}, maxTurns={}",
                        taskId, MAX_TURNS);
                    return AgentResult.maxTurnsExceeded(taskId, turn);
                }

                // Trim history (prevents context overflow)
                List<Message> trimmedHistory = contextManager.trimHistory(history);

                // Call the model (it may return tool call requests or a final answer)
                var response = chatClient.prompt()
                    .system(AGENT_SYSTEM_PROMPT)
                    .messages(trimmedHistory)
                    .call()
                    .chatResponse();

                // Done if there are no tool call requests
                if (isTaskComplete(response)) {
                    String finalAnswer = response.getResult().getOutput().getText();
                    log.info("[Agent] Task finished: taskId={}, turns={}", taskId, turn + 1);
                    return AgentResult.success(taskId, finalAnswer, turn + 1);
                }

                // Tool calls are executed by Spring AI itself. With its default internal tool
                // execution, the response above is normally already the final answer, so this
                // point is only reached if tool execution is handed back to the caller.
                // Record tool calls here and append the new messages to history;
                // otherwise the next turn resends the same prompt.
                log.debug("[Agent] Turn {}: executing tool calls", turn + 1);
            }

            return AgentResult.maxTurnsExceeded(taskId, MAX_TURNS);

        } catch (Exception e) {
            log.error("[Agent] Task failed: taskId={}", taskId, e);
            return AgentResult.error(taskId, e.getMessage());
        } finally {
            turnCounter.reset(taskId);
            monitor.cleanupTask(taskId);
            MDC.clear();
        }
    }

    private boolean isTaskComplete(
            org.springframework.ai.chat.model.ChatResponse response) {
        // Check whether any tool calls are still pending
        var output = response.getResult().getOutput();
        return output.getToolCalls() == null || output.getToolCalls().isEmpty();
    }

    public record AgentResult(String taskId, String answer, int turnsUsed,
                               AgentStatus status, String errorMessage) {
        public static AgentResult success(String id, String answer, int turns) {
            return new AgentResult(id, answer, turns, AgentStatus.SUCCESS, null);
        }
        public static AgentResult maxTurnsExceeded(String id, int turns) {
            return new AgentResult(id,
                "The task was stopped because it exceeded the maximum number of steps. Try splitting it into smaller subtasks.",
                turns, AgentStatus.MAX_TURNS_EXCEEDED, null);
        }
        public static AgentResult error(String id, String msg) {
            return new AgentResult(id, "Task failed: " + msg,
                0, AgentStatus.ERROR, msg);
        }
    }

    public enum AgentStatus {
        SUCCESS, MAX_TURNS_EXCEEDED, USER_CANCELLED, ERROR
    }
}

9. Fault Tolerance: Resilience4j in Agent Scenarios

package com.example.agent.resilience;

import io.github.resilience4j.bulkhead.annotation.Bulkhead;
import io.github.resilience4j.circuitbreaker.annotation.CircuitBreaker;
import org.springframework.ai.chat.client.ChatClient;
import org.springframework.stereotype.Service;

/**
 * LLM call wrapper with full fault-tolerance protection
 */
@Service
public class ResilientLlmService {

    private final ChatClient chatClient;

    public ResilientLlmService(ChatClient.Builder builder) {
        this.chatClient = builder.build();
    }

    /**
     * LLM call with circuit breaker + bulkhead isolation.
     * Bulkhead: caps the number of concurrent LLM calls so threads are not exhausted.
     */
    @CircuitBreaker(name = "llm-service", fallbackMethod = "llmFallback")
    @Bulkhead(name = "llm-service", type = Bulkhead.Type.SEMAPHORE)
    public String callLlm(String systemPrompt, String userMessage) {
        return chatClient.prompt()
            .system(systemPrompt)
            .user(userMessage)
            .call()
            .content();
    }

    public String llmFallback(String systemPrompt, String userMessage,
                               Exception ex) {
        return "The AI service is temporarily unavailable (" + ex.getClass().getSimpleName() + "). " +
               "Please retry later or contact technical support.";
    }
}

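10. Production Agent Development Checklist

Before going live, confirm each item on this list:

Basic safety

Tool idempotency: every write tool has idempotency protection (Redis key + TTL)
Turn limit: a maximum number of turns is set (15-30 recommended)
Context management: history trimming is implemented to prevent token overflow
Parameter validation: enum-style parameters are checked against allowed values to block hallucinated parameters

External call protection

Timeouts: every external API tool has a timeout (5-10 seconds recommended)
Circuit breaker: heavily used external services have a Resilience4j circuit breaker
Retry policy: no more than 2 attempts, to avoid retry storms
Concurrency limit: external API calls are capped by a semaphore

Operation safety

Risk tiers: tools are classified as READ_ONLY/LOW/MEDIUM/HIGH_RISK
High-risk confirmation: high-risk operations such as deletes and payments require a second confirmation from the user
Audit log: every tool call is logged (tool name + parameters + result)

Monitoring and alerting

MDC tracing: each task has a unique taskId that runs through all its logs
Duplicate call detection: 3 tool calls with identical parameters trigger an alert
Cost monitoring: token consumption is recorded per Agent task
Failure alerting: an alert fires when the task failure rate crosses a threshold

Degradation strategy

Tool fallback: an unavailable tool returns a fallback response instead of a raw error to the user
Task fallback: when the Agent exceeds the turn limit, the user gets a friendly message
Bulkhead isolation: the Agent thread pool is isolated from the web thread pool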

[Figure: Mermaid diagram of the Agent production safeguard architecture]

FAQ

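Q1: What is a reasonable upper limit for the turn counter?

It depends on the task type:

Simple information lookup (1-2 tools): 5-8 turns
Moderately complex tasks (3-5 tools chained): 15-20 turns
Complex workflows (multi-step decisions): 30 turns

Set it too low and tasks get cut off frequently; set it too high and a runaway loop does more damage. Start with 15 turns and adjust based on production data.

Q2: What if a high-risk operation is waiting for confirmation and the user never replies?

Set a timeout (for example 120 seconds). When it expires, cancel the operation automatically and notify the user. Also implement an asynchronous confirmation mechanism (push the confirmation request over WebSocket) so the HTTP request itself does not time out.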
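
One way to get "cancel on timeout" without blocking a request thread is a `CompletableFuture` that completes itself with a refusal when time runs out. The sketch below is an illustration only: `ConfirmationGate` and its methods are hypothetical names, and how the reply reaches `respond` (WebSocket handler, REST callback) is left open.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;

/** Hypothetical gate: a pending confirmation resolves to false if nobody answers in time. */
class ConfirmationGate {
    private final Map<String, CompletableFuture<Boolean>> pending = new ConcurrentHashMap<>();

    /** Registers a confirmation request; the returned future yields true only on explicit approval. */
    CompletableFuture<Boolean> request(String taskId, long timeout, TimeUnit unit) {
        CompletableFuture<Boolean> answer = new CompletableFuture<Boolean>()
            .completeOnTimeout(false, timeout, unit); // a timeout counts as "no"
        pending.put(taskId, answer);
        // Clean up once the request is resolved either way
        answer.whenComplete((ok, err) -> pending.remove(taskId, answer));
        return answer;
    }

    /** Called when the user replies; returns false if the request is unknown or already resolved. */
    boolean respond(String taskId, boolean confirmed) {
        CompletableFuture<Boolean> answer = pending.get(taskId);
        return answer != null && answer.complete(confirmed);
    }
}
```

The caller would use `request(taskId, 120, TimeUnit.SECONDS)` and run the high-risk action only if the future yields `true`. A late reply after the timeout is ignored because `respond` returns `false`.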

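Q3: Is 3 a suitable threshold for the duplicate call alert?

3 is a conservative value. For scenarios that genuinely need repeated queries (such as loading data in batches), you can set a per-tool threshold when the tool is registered. I suggest a threshold of 1 for write tools (any repetition triggers an alert) and 5 for read tools.

Q4: How do you tell a "reasonable retry" from a "runaway loop"?

The key indicator is repeated calls with identical parameters. A reasonable retry usually happens after the parameters were corrected (so the parameters differ), while a runaway loop calls again and again with exactly the same parameters. Including the parameter hash in the duplicate-detection key is enough to tell the two apart.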
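
Both answers can be combined in one small detector: key the counter on tool name plus parameters, and look up the threshold per tool. The sketch below is one possible shape, with `DuplicateCallDetector` as a hypothetical class. It follows the same `>=` convention as the monitor's `DUPLICATE_CALL_THRESHOLD`, so the alert fires on the Nth identical call; under that convention "any repetition alerts" corresponds to a threshold of 2.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical detector: counts identical (tool, params) calls with per-tool thresholds. */
class DuplicateCallDetector {
    private final Map<String, Integer> perToolThreshold;
    private final int defaultThreshold;
    private final Map<String, AtomicInteger> counts = new ConcurrentHashMap<>();

    DuplicateCallDetector(Map<String, Integer> perToolThreshold, int defaultThreshold) {
        this.perToolThreshold = Map.copyOf(perToolThreshold);
        this.defaultThreshold = defaultThreshold;
    }

    /** Returns true once this exact (tool, params) pair has been seen threshold times or more. */
    boolean record(String toolName, String params) {
        // Parameters are part of the key, so a corrected retry starts a fresh count
        String key = toolName + ":" + params;
        int seen = counts.computeIfAbsent(key, k -> new AtomicInteger()).incrementAndGet();
        return seen >= perToolThreshold.getOrDefault(toolName, defaultThreshold);
    }
}
```

A retry with corrected parameters never trips the alert, while the same call repeated verbatim does, sooner for write tools than for read tools.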

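Q5: Should production and test Agents be configured differently?

I recommend different settings:

Test: turn limit = 50, timeout = 30s, high-risk confirmation off (easier to test)
Production: turn limit = 15, timeout = 8s, all safety checks on

Use Spring Profiles (@Profile("prod")) to keep the configurations separate.

Q6: Does this approach require a particular Spring AI version?

It needs Spring AI 1.0.0+ (the GA release). The Tool Calling API changed across the milestone (M) versions before 1.0.0, and some annotations and class names differ. I recommend going straight to the 1.0.0 release.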
