Retries and Idempotency for Agents: How to Retry Gracefully When an AI Call Fails

老张 | 2026/4/30 | about a 10-minute read

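Last year, a production incident gave me a much deeper appreciation of what Agent retries really involve.

We had built an automated operations Agent whose job was to send coupons to users every morning. One day at 3 a.m., the network hiccuped while it was calling the coupon-issuing API, and the call came back as a timeout.

The Agent's retry logic was entirely "standard": on failure, retry, up to 3 times.

Guess what happened? Every retry went through on the server, and so had the original request; the client just never saw the confirmations.

By 6 a.m. customer service was already taking complaints: some users had received 4 identical coupons and could not even stack them, because the coupon system does not allow coupons of the same type to be used together.

The direct consequence of this bug was that we had to urgently revoke several thousand over-issued coupons and handle them one by one by hand, which took a full day.

When I reviewed the incident carefully afterwards, the root problem was not retrying as such. It was that we had never thought through, at design time, "which errors can be retried and which cannot" and "how to guarantee that a retry produces no side effects".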

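1. Classifying AI Call Failures

Before talking about retry strategy, classify the errors first. Many people skip this step, but it is the foundation of the whole retry design.

Category 1: Errors that are safe to retry

These errors share two traits: the failure is transient, and retrying cannot produce a duplicate side effect.

Network timeouts where the request never reached the server (for example, a connect timeout)
Server temporarily overloaded (503 Service Unavailable)
Rate limiting (429 Too Many Requests)
Brief connection failures

For these errors, simply retry, but wait between attempts (backoff).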

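Category 2: Errors that must not be retried

These errors share one trait: retrying is pointless, or it would produce an unwanted result.

Authentication or authorization failure (401, 403): a retry fails the same way
Malformed request (400): the parameters themselves are wrong, so a retry is still wrong
Resource not found (404): retrying is pointless
Business-logic errors, such as insufficient balance or insufficient stock

For these errors, fail immediately, do not retry, and hand the problem straight to the caller.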

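Category 3: Errors that can be retried only with idempotency protection

This is the category where it is easiest to get burned. The operation succeeded, but you never received the confirmation, as in the coupon case above.

A network timeout that occurs after the server has already processed the request
A dropped connection, where you cannot tell whether the server processed the request
A lost response

These errors can be retried safely only in combination with an idempotency mechanism.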
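
The three categories above can be sketched as a small classifier. This is only an illustration under my own assumptions: FailureCategory, FailureClassifier, and the requestSent flag are hypothetical names, -1 stands for "no response received", and business-logic errors (which usually arrive as application error codes rather than HTTP statuses) are left out.

```java
import java.util.Set;

/** Hypothetical three-way classification of a failed tool call. */
enum FailureCategory { SAFE_TO_RETRY, DO_NOT_RETRY, RETRY_ONLY_IF_IDEMPOTENT }

final class FailureClassifier {
    // Statuses the article lists as transient (category 1).
    private static final Set<Integer> TRANSIENT = Set.of(429, 503);
    // Statuses the article lists as permanent (category 2).
    private static final Set<Integer> PERMANENT = Set.of(400, 401, 403, 404);

    private FailureClassifier() {}

    /**
     * @param httpStatus  HTTP status code, or -1 if no response was received
     * @param requestSent true if the request may already have reached the server
     */
    static FailureCategory classify(int httpStatus, boolean requestSent) {
        if (httpStatus == -1) {
            // No response: if the request may have been sent, the outcome is unknown (category 3).
            return requestSent
                ? FailureCategory.RETRY_ONLY_IF_IDEMPOTENT
                : FailureCategory.SAFE_TO_RETRY;
        }
        if (TRANSIENT.contains(httpStatus)) {
            return FailureCategory.SAFE_TO_RETRY;
        }
        if (PERMANENT.contains(httpStatus)) {
            return FailureCategory.DO_NOT_RETRY;
        }
        // Unknown status: be conservative and do not retry.
        return FailureCategory.DO_NOT_RETRY;
    }
}
```

The point of the requestSent flag is that the same "timeout" lands in different categories depending on whether the server may already have acted on the request.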

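2. What Idempotency Is, and Why Agents Especially Need It

Idempotency is defined as follows: executing the same operation once or many times produces exactly the same result.

Query operations are naturally idempotent; you can run them as many times as you like.

The trouble is with write operations: issuing a coupon, placing an order, sending an email, deleting a file... Running these once and running them twice do not give the same outcome.

Why do Agents need to pay particular attention to idempotency?

Because an Agent's call chain is longer, and the failure probabilities of the individual steps compound. Take an Agent task that calls 5 tools and assume each call succeeds 98% of the time, independently of the others: the whole task then succeeds only 0.98^5 ≈ 90% of the time. The longer the chain, the more likely a failure somewhere along the way, and the more likely a retry is needed.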
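
To make the arithmetic concrete, here is a tiny sketch. ChainReliability is a hypothetical helper name, and the calculation assumes the steps fail independently, which real systems only approximate.

```java
final class ChainReliability {
    private ChainReliability() {}

    /** Probability that all steps succeed when each succeeds independently with the given rate. */
    static double successRate(double perStepSuccess, int steps) {
        return Math.pow(perStepSuccess, steps);
    }
}
```

Under that assumption, successRate(0.98, 5) is about 0.904, and stretching the chain to 20 calls drops it to about 0.668.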

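3. Designing the Idempotency Key

The standard solution to idempotency is the idempotency key.

The core idea: generate a unique ID when the request is issued, and have the server use that ID to recognize duplicate requests. If a request with the same ID has already been processed, return the cached result directly instead of executing it again.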

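import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.databind.ObjectMapper;
import lombok.Builder;
import lombok.Data;
import lombok.extern.slf4j.Slf4j;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.data.redis.core.RedisTemplate;
import org.springframework.stereotype.Component;

import java.time.Duration;
import java.util.UUID;
import java.util.function.Supplier;

/**
 * Idempotent request wrapper.
 * Every Agent tool call should carry an idempotency key.
 */
@Data
@Builder
public class IdempotentRequest<T> {

    /**
     * Idempotency key: built from task ID + tool name + call index.
     * The same tool call within the same task must always get the same key.
     */
    private String idempotencyKey;

    /**
     * The actual request payload.
     */
    private T payload;

    /**
     * Factory method that builds the idempotency key.
     * @param taskId unique ID of the Agent task
     * @param toolName name of the tool
     * @param callIndex index of this call within the task
     */
    public static String generateKey(String taskId, String toolName, int callIndex) {
        return String.format("%s_%s_%d", taskId, toolName, callIndex);
    }
}

/**
 * Tool-call executor with idempotency protection.
 *
 * Note: this layer only deduplicates calls whose result was received and cached.
 * If the wrapped operation is a remote call that times out after the server has
 * already processed it, nothing is cached here, so the same key must also be
 * passed to the downstream service and checked there.
 */
@Component
@Slf4j
public class IdempotentToolExecutor {

    @Autowired
    private RedisTemplate<String, String> redisTemplate;

    private static final String IDEMPOTENCY_PREFIX = "idem:";
    private static final Duration IDEMPOTENCY_TTL = Duration.ofHours(24);
    private static final Duration LOCK_TTL = Duration.ofSeconds(30);
    // ObjectMapper is thread-safe once configured, so share a single instance
    private static final ObjectMapper MAPPER = new ObjectMapper();

    /**
     * Executes an operation with idempotency protection.
     * @param idempotencyKey the idempotency key
     * @param operation the actual operation to run (lambda)
     * @param resultType type of the return value
     */
    public <T> T executeIdempotent(
            String idempotencyKey,
            Supplier<T> operation,
            Class<T> resultType) {

        String redisKey = IDEMPOTENCY_PREFIX + idempotencyKey;
        String lockKey = redisKey + ":lock";

        // 1. Check whether a cached result already exists
        String cachedResult = redisTemplate.opsForValue().get(redisKey);
        if (cachedResult != null) {
            log.info("Idempotency hit, returning cached result, key={}", idempotencyKey);
            return deserialize(cachedResult, resultType);
        }

        // 2. Take a lock with SETNX to prevent concurrent duplicate execution.
        //    The random token lets us release only a lock that we still own.
        String lockToken = UUID.randomUUID().toString();
        Boolean acquired = redisTemplate.opsForValue()
            .setIfAbsent(lockKey, lockToken, LOCK_TTL);

        if (!Boolean.TRUE.equals(acquired)) {
            // Another thread is executing; the caller should wait and retry
            throw new ConcurrentExecutionException("Operation in progress, please retry later");
        }

        try {
            // 3. Double check: another thread may have finished just before we took the lock
            cachedResult = redisTemplate.opsForValue().get(redisKey);
            if (cachedResult != null) {
                return deserialize(cachedResult, resultType);
            }

            // 4. Run the actual operation
            T result = operation.get();

            // 5. Write the result to the cache
            String serialized = serialize(result);
            redisTemplate.opsForValue().set(redisKey, serialized, IDEMPOTENCY_TTL);

            log.info("Operation succeeded and result cached, key={}", idempotencyKey);
            return result;

        } finally {
            // Release the lock, but only if it is still ours (it may have expired and been
            // re-acquired by someone else). Best effort: a Lua script would make this atomic.
            if (lockToken.equals(redisTemplate.opsForValue().get(lockKey))) {
                redisTemplate.delete(lockKey);
            }
        }
    }

    private <T> String serialize(T obj) {
        try {
            return MAPPER.writeValueAsString(obj);
        } catch (JsonProcessingException e) {
            throw new RuntimeException("Serialization failed", e);
        }
    }

    private <T> T deserialize(String json, Class<T> type) {
        try {
            return MAPPER.readValue(json, type);
        } catch (JsonProcessingException e) {
            throw new RuntimeException("Deserialization failed", e);
        }
    }
}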
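
To see the deduplication behaviour in isolation, without Redis or Spring, here is a minimal in-memory sketch of the same idea. InMemoryIdempotencyStore is a hypothetical name, not part of the system described here; it assumes a single process and operations that return a non-null result (computeIfAbsent stores nothing for null, so a null result would be executed again).

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Supplier;

/** Hypothetical in-memory stand-in for a server-side idempotency store. */
final class InMemoryIdempotencyStore {
    private final ConcurrentMap<String, Object> results = new ConcurrentHashMap<>();

    /**
     * Runs the operation at most once per key. Later calls with the same key get the
     * stored result back. computeIfAbsent makes "check, then run" atomic per key.
     * If the operation throws, nothing is stored, so the call can be retried.
     */
    @SuppressWarnings("unchecked")
    <T> T execute(String idempotencyKey, Supplier<T> operation) {
        return (T) results.computeIfAbsent(idempotencyKey, key -> operation.get());
    }
}
```

Calling execute twice with the key "task42_sendCoupon_0" runs the coupon-sending lambda only once; the second call just gets the first result back, which is exactly what the retry in the coupon incident needed.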

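4. Engineering the Retry Strategy

With idempotency protection in place, the next step is to design the retry strategy.

Choosing a backoff algorithm

Do not retry at a fixed interval. When the server is already overloaded, fixed-interval retries turn the overload into an avalanche: all the retries arrive at the same moment, and the server has an even harder time recovering.

Use exponential backoff plus random jitter: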

import lombok.Builder;
import lombok.Data;
import lombok.extern.slf4j.Slf4j;
import org.springframework.stereotype.Component;

import java.net.SocketTimeoutException;
import java.util.Set;
import java.util.concurrent.Callable;

/**
 * Retry configuration
 */
@Data
@Builder
public class RetryConfig {
    /** Maximum number of retries (not counting the initial attempt) */
    private int maxAttempts;
    /** Initial wait time (milliseconds) */
    private long initialDelayMs;
    /** Backoff multiplier */
    private double backoffMultiplier;
    /** Maximum wait time (milliseconds) */
    private long maxDelayMs;
    /** Random jitter ratio, 0.0-1.0 */
    private double jitterFactor;

    public static RetryConfig defaultConfig() {
        return RetryConfig.builder()
            .maxAttempts(3)
            .initialDelayMs(1000)
            .backoffMultiplier(2.0)
            .maxDelayMs(30000)
            .jitterFactor(0.3)
            .build();
    }

    /**
     * Computes the wait time before the n-th retry (n starts at 1)
     */
    public long calculateDelay(int attemptNumber) {
        // Exponential backoff
        double exponentialDelay = initialDelayMs * Math.pow(backoffMultiplier, attemptNumber - 1);
        // Cap at the maximum delay
        double cappedDelay = Math.min(exponentialDelay, maxDelayMs);
        // Add random jitter in [-jitterFactor, +jitterFactor] of the capped delay
        double jitter = cappedDelay * jitterFactor * (Math.random() * 2 - 1);
        return (long) Math.max(0, cappedDelay + jitter);
    }
}

/**
 * Smart retry executor.
 * Core capability: decide whether to retry based on the exception type.
 * Apart from SocketTimeoutException, the exception classes below are assumed to come
 * from your HTTP client or to be defined by the application.
 */
@Component
@Slf4j
public class SmartRetryExecutor {

    // Retryable exception types
    private static final Set<Class<? extends Exception>> RETRYABLE_EXCEPTIONS = Set.of(
        SocketTimeoutException.class,
        ConnectTimeoutException.class,
        ServiceUnavailableException.class,
        RateLimitException.class,
        // Thrown by IdempotentToolExecutor when another caller holds the lock
        ConcurrentExecutionException.class
    );

    // Non-retryable exception types (fail immediately)
    private static final Set<Class<? extends Exception>> NON_RETRYABLE_EXCEPTIONS = Set.of(
        AuthenticationException.class,
        InvalidRequestException.class,
        ResourceNotFoundException.class,
        InsufficientBalanceException.class
    );

    /**
     * Executes an operation with retry logic
     */
    public <T> T executeWithRetry(
            Callable<T> operation,
            RetryConfig config,
            String operationName) throws Exception {

        int attempt = 0;
        Exception lastException = null;

        while (attempt <= config.getMaxAttempts()) {
            try {
                if (attempt > 0) {
                    long delay = config.calculateDelay(attempt);
                    log.info("Retry #{} of {}, waiting {} ms", attempt, operationName, delay);
                    Thread.sleep(delay);
                }

                T result = operation.call();

                if (attempt > 0) {
                    log.info("Retry succeeded: {}, after {} retries", operationName, attempt);
                }
                return result;

            } catch (InterruptedException e) {
                // Restore the interrupt flag and stop retrying
                Thread.currentThread().interrupt();
                throw e;
            } catch (Exception e) {
                lastException = e;

                // Decide whether the error is retryable
                if (isNonRetryable(e)) {
                    log.warn("Non-retryable error: {} - {}", operationName, e.getMessage());
                    throw e;  // Rethrow immediately, no retry
                }

                if (!isRetryable(e)) {
                    log.warn("Unknown error type, not retrying by default: {} - {}", operationName, e.getClass().getName());
                    throw e;
                }

                attempt++;
                if (attempt > config.getMaxAttempts()) {
                    log.error("Maximum retries reached: {}, operation failed", operationName);
                    break;
                }

                log.warn("Retryable error (retry {}/{}): {} - {}",
                    attempt, config.getMaxAttempts(), operationName, e.getMessage());
            }
        }

        throw new MaxRetryExceededException(
            String.format("Operation %s still failed after %d retries", operationName, config.getMaxAttempts()),
            lastException);
    }

    private boolean isRetryable(Exception e) {
        return RETRYABLE_EXCEPTIONS.stream()
            .anyMatch(retryableType -> retryableType.isAssignableFrom(e.getClass()));
    }

    private boolean isNonRetryable(Exception e) {
        return NON_RETRYABLE_EXCEPTIONS.stream()
            .anyMatch(nonRetryableType -> nonRetryableType.isAssignableFrom(e.getClass()));
    }
}
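
To get a feel for the numbers that an exponential schedule like the one above produces, here is a small standalone sketch. BackoffSchedule is a hypothetical helper; it mirrors the capped exponential formula but returns the jitter range as bounds instead of drawing a random value, so the output is deterministic.

```java
final class BackoffSchedule {
    private BackoffSchedule() {}

    /** Un-jittered delay before the given retry (1-based), capped at maxDelayMs. */
    static long baseDelayMs(long initialDelayMs, double multiplier, long maxDelayMs, int retryNumber) {
        double raw = initialDelayMs * Math.pow(multiplier, retryNumber - 1);
        return (long) Math.min(raw, maxDelayMs);
    }

    /** Lower and upper bound of the delay once a +/- jitterFactor spread is applied. */
    static long[] jitterBounds(long baseDelayMs, double jitterFactor) {
        return new long[] {
            Math.round(baseDelayMs * (1 - jitterFactor)),
            Math.round(baseDelayMs * (1 + jitterFactor))
        };
    }
}
```

With an initial delay of 1000 ms, a multiplier of 2.0, a 30000 ms cap and a jitter factor of 0.3, the three retries wait nominally 1000, 2000 and 4000 ms, and the third one lands somewhere between 2800 and 5200 ms. That is roughly 7 seconds of waiting in total before the caller gives up.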

5. Combining Idempotency and Retries

In real use, idempotency protection and the retry strategy work together:

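import lombok.extern.slf4j.Slf4j;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;

import java.util.concurrent.Callable;

/**
 * Complete wrapper for Agent tool calls.
 * Combines idempotency protection + smart retries.
 */
@Component
@Slf4j
public class AgentToolCaller {

    @Autowired
    private IdempotentToolExecutor idempotentExecutor;

    @Autowired
    private SmartRetryExecutor retryExecutor;

    /** Carries a checked exception across the Supplier boundary. */
    private static final class WrappedToolException extends RuntimeException {
        WrappedToolException(Exception cause) {
            super(cause);
        }
    }

    /**
     * Calls a write-type tool (needs idempotency protection)
     */
    public <T> T callWriteTool(
            String taskId,
            String toolName,
            int callIndex,
            Callable<T> toolOperation,
            Class<T> resultType) throws Exception {

        // Generate the idempotency key
        String idempotencyKey = IdempotentRequest.generateKey(taskId, toolName, callIndex);

        RetryConfig retryConfig = RetryConfig.defaultConfig();

        return retryExecutor.executeWithRetry(
            () -> {
                try {
                    return idempotentExecutor.executeIdempotent(
                        idempotencyKey,
                        () -> {
                            try {
                                return toolOperation.call();
                            } catch (Exception e) {
                                // Supplier cannot throw checked exceptions, so wrap temporarily
                                throw new WrappedToolException(e);
                            }
                        },
                        resultType
                    );
                } catch (WrappedToolException wrapped) {
                    // Unwrap so that SmartRetryExecutor sees the real exception type;
                    // a plain RuntimeException would be treated as "unknown" and never retried
                    throw (Exception) wrapped.getCause();
                }
            },
            retryConfig,
            toolName
        );
    }

    /**
     * Calls a read-type tool (retries only, no idempotency protection needed)
     */
    public <T> T callReadTool(
            String toolName,
            Callable<T> toolOperation) throws Exception {

        RetryConfig retryConfig = RetryConfig.builder()
            .maxAttempts(5)          // Reads can be retried a few more times
            .initialDelayMs(500)
            .backoffMultiplier(1.5)
            .maxDelayMs(10000)
            .jitterFactor(0.2)
            .build();

        return retryExecutor.executeWithRetry(toolOperation, retryConfig, toolName);
    }
}

// Usage example (agentToolCaller and couponService are injected fields of the enclosing tool class)
@Tool("Send a coupon to the specified user")
public CouponSendResult sendCoupon(
        String userId,
        String couponTemplateId,
        // Task context passed in by the caller
        String taskId,
        int callIndex) throws Exception {

    return agentToolCaller.callWriteTool(
        taskId,
        "sendCoupon",
        callIndex,
        () -> couponService.send(userId, couponTemplateId),
        CouponSendResult.class
    );
}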

6. Retry Decision Tree
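
One way to write down the decision flow implied by the earlier sections is as a small function. This is my own sketch, not a definitive rule set: RetryDecisionTree and its Decision values are hypothetical names, and it conservatively treats every transient failure of a write as ambiguous (category 3), so a write without an idempotency key is never retried.

```java
final class RetryDecisionTree {
    enum Decision { RETRY_WITH_BACKOFF, FAIL_FAST, GIVE_UP }

    private RetryDecisionTree() {}

    /**
     * @param transientError    true for timeouts, 503, 429 and similar temporary failures
     * @param writeOperation    true if the call has side effects
     * @param hasIdempotencyKey true if the call is protected by an idempotency key
     * @param retriesSoFar      number of retries already performed
     * @param maxRetries        retry budget for this operation type
     */
    static Decision decide(boolean transientError, boolean writeOperation,
                           boolean hasIdempotencyKey, int retriesSoFar, int maxRetries) {
        if (!transientError) {
            return Decision.FAIL_FAST;          // category 2: retrying cannot help
        }
        if (writeOperation && !hasIdempotencyKey) {
            return Decision.FAIL_FAST;          // category 3 without protection: unsafe
        }
        if (retriesSoFar >= maxRetries) {
            return Decision.GIVE_UP;            // retry budget exhausted
        }
        return Decision.RETRY_WITH_BACKOFF;     // safe to retry after waiting
    }
}
```

Read from top to bottom, the order matters: whether the error is retryable at all comes first, whether a retry is safe comes second, and the retry budget is checked only after both.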

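7. A Few Easily Overlooked Details

Detail 1: Retry counts should depend on the operation type

Read operations can be retried boldly (5 to 10 times); write operations should be conservative (2 to 3 times). Even with idempotency protection, every retry of a write adds system overhead, so do not retry indefinitely.

Detail 2: Handle 429 errors specially

Rate-limit responses usually tell you in a response header how long to wait: Retry-After: 30. Honor that value instead of computing your own backoff; otherwise you may get your IP blocked.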

private long getRetryAfterMs(Exception e) {
    if (e instanceof RateLimitException rle) {
        // Taken from the Retry-After response header
        return rle.getRetryAfterSeconds() * 1000L;
    }
    return -1;  // -1 means: fall back to the default backoff strategy
}
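
How the header value gets into the exception is not shown above, so here is a hedged sketch of the parsing step. RetryAfterParser is a hypothetical helper. Per the HTTP specification, Retry-After carries either a number of seconds or an HTTP date, so the sketch handles both and returns -1 when the header is missing or unparseable, matching the convention used by getRetryAfterMs.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

final class RetryAfterParser {
    private RetryAfterParser() {}

    /** Returns the wait in milliseconds, or -1 if the header is missing or unparseable. */
    static long parseMs(String headerValue, Instant now) {
        if (headerValue == null || headerValue.isBlank()) {
            return -1;
        }
        String value = headerValue.trim();
        try {
            // Form 1: delay in seconds, e.g. "30"
            return Math.max(0, Long.parseLong(value)) * 1000L;
        } catch (NumberFormatException notSeconds) {
            // Not a number: fall through and try the HTTP-date form
        }
        try {
            // Form 2: HTTP date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT"
            Instant until = ZonedDateTime.parse(value, DateTimeFormatter.RFC_1123_DATE_TIME).toInstant();
            return Math.max(0, Duration.between(now, until).toMillis());
        } catch (DateTimeParseException e) {
            return -1;
        }
    }

    /** Prefers the server-provided wait; falls back to local backoff when it is absent (-1). */
    static long chooseDelayMs(long retryAfterMs, long backoffMs) {
        return retryAfterMs >= 0 ? retryAfterMs : backoffMs;
    }
}
```

Passing the current time in as a parameter is a deliberate choice here: it keeps the date branch easy to test.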

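Detail 3: Retries need circuit-breaker protection

When a tool fails more times than a threshold within a short window, the circuit should open: stop retrying and let the Agent take a degraded path: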

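import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import org.springframework.stereotype.Component;

import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;

@Component
public class CircuitBreakerWrappedTool {

    // One circuit breaker per tool, created lazily on first use
    private final Map<String, CircuitBreaker> circuitBreakers = new ConcurrentHashMap<>();

    public <T> T callWithCircuitBreaker(String toolName, Callable<T> operation) throws Exception {
        CircuitBreaker cb = circuitBreakers.computeIfAbsent(toolName,
            name -> CircuitBreaker.ofDefaults(name));

        // Fails fast while the breaker is open
        return cb.executeCallable(operation);
    }
}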
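
For readers who have not used a circuit-breaker library, the mechanism itself fits in a few lines. The following is a dependency-free sketch for illustration only; SimpleCircuitBreaker is a hypothetical class, far simpler than a production library: it counts consecutive failures, rejects calls for a fixed window once a threshold is reached, and takes the current time as a parameter so the behaviour is easy to test.

```java
import java.util.concurrent.Callable;

final class SimpleCircuitBreaker {
    private final int failureThreshold;
    private final long openMillis;
    private int consecutiveFailures = 0;
    private long openedAt = -1;   // -1 means the circuit is closed

    SimpleCircuitBreaker(int failureThreshold, long openMillis) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
    }

    /** True while calls should be rejected without touching the tool. */
    synchronized boolean isOpen(long nowMs) {
        return openedAt >= 0 && nowMs - openedAt < openMillis;
    }

    <T> T call(Callable<T> operation, long nowMs) throws Exception {
        if (isOpen(nowMs)) {
            throw new IllegalStateException("circuit open, take the degraded path");
        }
        try {
            T result = operation.call();
            onSuccess();
            return result;
        } catch (Exception e) {
            onFailure(nowMs);
            throw e;
        }
    }

    private synchronized void onSuccess() {
        consecutiveFailures = 0;
        openedAt = -1;
    }

    private synchronized void onFailure(long nowMs) {
        consecutiveFailures++;
        if (consecutiveFailures >= failureThreshold) {
            openedAt = nowMs;   // open (or re-open) the circuit
        }
    }
}
```

Once the open window has elapsed, the next call is let through as a trial: if it succeeds the breaker resets, and if it fails the breaker opens again immediately.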

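Detail 4: Idempotency-key lifetime

Idempotency keys cannot be kept forever, or Redis will eventually run out of memory. 24 hours is a reasonable TTL: for Agent tasks, a retry more than 24 hours later is almost always an abnormal situation and should not happen automatically.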

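8. Back to the Coupon Incident

If we had had this retry-plus-idempotency mechanism in place back then, what would have happened?

The coupon-issuing API would have been protected by an idempotency key
When the Agent retried, the idempotency layer would have seen that the operation for this key had already completed and returned the first result directly
Each user would have received exactly one coupon, and the Agent task would have completed normally

We later reworked the system and did two things at once:

All write tools got idempotency-key protection
All tool calls now go through SmartRetryExecutor, which decides from the error type whether to retry

In the 3 months after the rework went live, we did not see a single duplicate operation caused by retries.

Key takeaway: in Agent engineering, the retry strategy and the idempotency mechanism are twin designs, and neither works without the other. With retries alone, write operations are unsafe; with idempotency alone, failures cannot recover automatically. Only the combination lets an Agent run reliably in a messy network environment.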