Function Call Timeouts and Retries: Reliability Design for Production-Grade Tool Calling

Lao Zhang · 2026/4/30 · about 6 min read

Intended audience: Java engineers running LLM tool-calling systems in production | Reading time: about 17 minutes

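Opening Story

When I first put the tool-calling system into production, I hit a problem that gave me a headache: the weather lookup tool depends on an external API that is sometimes slow (occasionally taking more than 10 seconds) or times out outright. Whenever the tool timed out, the whole conversation failed and the user saw an error page.

The user had no idea what had happened, and the experience was terrible.

Later I added a retry mechanism: the first timeout triggers an automatic retry, and the LLM is told "the tool call timed out and was retried". For tools that are truly unavailable, the system returns degraded data instead ("Unable to fetch the latest weather right now; the last result was ...").

This change raised system stability from 95% to 99.5%. Today I am sharing the complete reliability design.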

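1. Failure Categories in Tool Calling

Timeouts and network errors are retryable, provided the operation is idempotent. Business errors usually should not be retried: a retry will not succeed either, so it is better to pass the error message to the LLM and let it decide what to do next.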
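The classification above could be sketched roughly as follows. This is a minimal illustration, not the article's production code: `FailureKind`, `FailureClassifier`, and the use of `IllegalArgumentException` as a stand-in for a business exception are all hypothetical names chosen for the example.

```java
import java.io.IOException;
import java.net.SocketTimeoutException;
import java.util.concurrent.TimeoutException;

/** Hypothetical failure categories; the names are illustrative only. */
enum FailureKind { TIMEOUT, NETWORK_ERROR, BUSINESS_ERROR, UNKNOWN }

final class FailureClassifier {
    private FailureClassifier() {}

    /** Maps an exception to a failure category. */
    static FailureKind classify(Throwable t) {
        // Check timeouts first: SocketTimeoutException is also an IOException
        if (t instanceof TimeoutException || t instanceof SocketTimeoutException) {
            return FailureKind.TIMEOUT;
        }
        if (t instanceof IOException) {
            return FailureKind.NETWORK_ERROR;
        }
        // Assumption: IllegalArgumentException stands in for a business exception
        if (t instanceof IllegalArgumentException) {
            return FailureKind.BUSINESS_ERROR;
        }
        return FailureKind.UNKNOWN;
    }

    /** Retries only timeouts and network errors, and only for idempotent tools. */
    static boolean shouldRetry(FailureKind kind, boolean idempotent) {
        return idempotent
                && (kind == FailureKind.TIMEOUT || kind == FailureKind.NETWORK_ERROR);
    }
}
```

With this shape, a business error is never retried regardless of the tool, and a timeout on a non-idempotent tool is reported rather than retried.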

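2. The Complete Reliability Design

2.1 Layered Timeout Control

import java.util.Map;
import org.springframework.context.annotation.Configuration;

// Timeout configuration: each tool gets its own timeout
@Configuration
public class ToolTimeoutConfig {

    // Per-tool timeouts, in milliseconds
    public static final Map<String, Long> TOOL_TIMEOUTS = Map.of(
            "get_weather", 5_000L,           // weather API: 5 s
            "execute_readonly_sql", 30_000L, // SQL query: 30 s (can be slow)
            "search_products", 3_000L,       // product search: 3 s
            "send_notification", 10_000L,    // send notification: 10 s
            "generate_report", 60_000L       // report generation: 60 s (fairly slow)
    );

    public static final long DEFAULT_TIMEOUT = 10_000L;  // default: 10 s

    // Returns the configured timeout, or the default for unlisted tools
    public static long getTimeout(String toolName) {
        return TOOL_TIMEOUTS.getOrDefault(toolName, DEFAULT_TIMEOUT);
    }
}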

2.2 A Tool Executor with Retries

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// ToolRetryableException, ToolNonRetryableException and BusinessException are
// assumed to be unchecked exceptions defined elsewhere in the project.
@Component
public class ReliableToolExecutor {

    private static final Logger log = LoggerFactory.getLogger(ReliableToolExecutor.class);

    @Autowired
    private ObjectMapper objectMapper;
    
    @Autowired
    private ToolFallbackRegistry fallbackRegistry;

    // Maximum number of retries
    private static final int MAX_RETRIES = 2;
    // Base retry delay, in milliseconds
    private static final long RETRY_BASE_DELAY_MS = 1000L;

    public ToolExecutionResult executeWithReliability(String toolName, 
            FunctionCallback tool, String arguments) {
        
        long timeout = ToolTimeoutConfig.getTimeout(toolName);
        Exception lastException = null;
        
        for (int attempt = 0; attempt <= MAX_RETRIES; attempt++) {
            if (attempt > 0) {
                // Exponential backoff: 1 s, 2 s (doubling on each retry)
                long delay = RETRY_BASE_DELAY_MS * (1L << (attempt - 1));
                log.info("[Tool] Retrying {} (attempt {}/{}), waiting {}ms", 
                        toolName, attempt, MAX_RETRIES, delay);
                try {
                    Thread.sleep(delay);
                } catch (InterruptedException e) {
                    // Restore the interrupt flag and stop retrying
                    Thread.currentThread().interrupt();
                    break;
                }
            }
            
            try {
                String result = executeWithTimeout(tool, arguments, timeout);
                
                if (attempt > 0) {
                    log.info("[Tool] {} succeeded on retry {}", toolName, attempt);
                }
                
                return ToolExecutionResult.success(toolName, result);
                
            } catch (TimeoutException e) {
                lastException = e;
                log.warn("[Tool] {} timed out (attempt {}/{}), timeout={}ms", 
                        toolName, attempt + 1, MAX_RETRIES + 1, timeout);
                // Timeouts can be retried
                
            } catch (ToolRetryableException e) {
                lastException = e;
                log.warn("[Tool] {} retryable error: {}", toolName, e.getMessage());
                // Errors marked as retryable (for example, network errors)
                
            } catch (ToolNonRetryableException e) {
                // Business error: do not retry, return the error to the LLM
                log.info("[Tool] {} non-retryable error: {}", toolName, e.getMessage());
                return ToolExecutionResult.businessError(toolName, e.getMessage());
            }
        }
        
        // Every attempt failed, so try the fallback
        log.error("[Tool] {} failed after {} attempts", toolName, MAX_RETRIES + 1);
        
        ToolFallback fallback = fallbackRegistry.getFallback(toolName);
        if (fallback != null) {
            log.info("[Tool] Using fallback for {}", toolName);
            try {
                String fallbackResult = fallback.execute(arguments);
                return ToolExecutionResult.fallback(toolName, fallbackResult);
            } catch (Exception e) {
                log.error("[Tool] Fallback also failed for {}: {}", toolName, e.getMessage());
            }
        }
        
        // Total failure: return a description of the error
        String errorMsg = lastException != null ? lastException.getMessage() : "Unknown error";
        return ToolExecutionResult.systemError(toolName, 
                "The tool is temporarily unavailable; please try again later. Error: " + errorMsg);
    }

    private String executeWithTimeout(FunctionCallback tool, String arguments, 
            long timeoutMs) throws TimeoutException {
        CompletableFuture<String> future = CompletableFuture.supplyAsync(() -> {
            try {
                return tool.call(arguments);
            } catch (Exception e) {
                // Wrap the exception to separate retryable from non-retryable failures
                if (e instanceof BusinessException) {
                    throw new ToolNonRetryableException(e.getMessage(), e);
                }
                throw new ToolRetryableException(e.getMessage(), e);
            }
        });
        
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Note: CompletableFuture.cancel does not interrupt the running task;
            // the underlying call may keep running in the background.
            future.cancel(true);
            throw new TimeoutException("Tool call timed out after " + timeoutMs + "ms");
        } catch (ExecutionException e) {
            Throwable cause = e.getCause();
            if (cause instanceof ToolNonRetryableException) {
                throw (ToolNonRetryableException) cause;
            }
            if (cause instanceof ToolRetryableException) {
                throw (ToolRetryableException) cause;  // avoid double wrapping
            }
            throw new ToolRetryableException(cause.getMessage(), cause);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new ToolRetryableException("Interrupted", e);
        }
    }
}

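2.3 Registering Fallback Strategies

@Component
public class ToolFallbackRegistry {

    @Autowired
    private ObjectMapper objectMapper;

    // Assumption: cached weather is stored as JSON strings, so a StringRedisTemplate is used
    @Autowired
    private StringRedisTemplate redisTemplate;
    
    private final Map<String, ToolFallback> fallbacks = new HashMap<>();
    
    @PostConstruct
    public void registerFallbacks() {
        // Weather fallback: return the cached weather, or say it cannot be fetched
        fallbacks.put("get_weather", arguments -> {
            try {
                JsonNode args = objectMapper.readTree(arguments);
                // path() avoids a NullPointerException when "location" is missing
                String location = args.path("location").asText();
                
                // Try the cache (the last result stored in Redis)
                String cached = redisTemplate.opsForValue()
                        .get("weather:cache:" + location);
                if (cached != null) {
                    // Tag the cached JSON object instead of splicing strings
                    ObjectNode node = (ObjectNode) objectMapper.readTree(cached);
                    node.put("source", "cache");
                    return objectMapper.writeValueAsString(node);
                }
                
                return objectMapper.writeValueAsString(Map.of(
                        "error", "weather_service_unavailable",
                        "message", "The weather service is temporarily unavailable; suggest the user try again later"
                ));
            } catch (Exception e) {
                return "{\"error\": \"fallback_failed\"}";
            }
        });
        
        // The SQL query tool has no fallback (stale data cannot stand in for a query)
        // fallbacks.put("execute_readonly_sql", ...) -- not registered, meaning no fallback
    }
    
    // Returns null when the tool has no registered fallback
    public ToolFallback getFallback(String toolName) {
        return fallbacks.get(toolName);
    }
}

@FunctionalInterface
public interface ToolFallback {
    String execute(String arguments) throws Exception;
}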

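2.4 Feeding the Execution Result Back to the LLM

import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.util.Map;

// The type of the tool execution result shapes what the LLM does next
public record ToolExecutionResult(
        String callId,
        ResultType type,
        String content
) {
    public enum ResultType { SUCCESS, BUSINESS_ERROR, FALLBACK, SYSTEM_ERROR }

    // Static factories cannot use an injected bean, so share one static mapper
    private static final ObjectMapper MAPPER = new ObjectMapper();
    
    public static ToolExecutionResult success(String callId, String content) {
        return new ToolExecutionResult(callId, ResultType.SUCCESS, content);
    }
    
    public static ToolExecutionResult businessError(String callId, String errorMsg) {
        // Tell the LLM this is a business error (bad arguments, etc.) it can relay to the user
        return new ToolExecutionResult(callId, ResultType.BUSINESS_ERROR,
                toJson(Map.of(
                        "success", false,
                        "error_type", "business_error",
                        "message", errorMsg
                )));
    }
    
    public static ToolExecutionResult fallback(String callId, String fallbackContent) {
        // Tell the LLM this is fallback data, not live data.
        // fallbackContent must already be valid JSON.
        return new ToolExecutionResult(callId, ResultType.FALLBACK,
                "{\"is_cached\": true, \"data\": " + fallbackContent + "}");
    }
    
    public static ToolExecutionResult systemError(String callId, String errorMsg) {
        // Tell the LLM the tool failed at the system level so it can inform the user
        return new ToolExecutionResult(callId, ResultType.SYSTEM_ERROR,
                toJson(Map.of(
                        "success", false,
                        "error_type", "service_unavailable",
                        "message", errorMsg
                )));
    }

    // Wraps the checked Jackson exception so the factories stay simple
    private static String toJson(Map<String, Object> payload) {
        try {
            return MAPPER.writeValueAsString(payload);
        } catch (JsonProcessingException e) {
            throw new IllegalStateException("Failed to serialize tool result", e);
        }
    }
}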

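4. Pitfalls from the Field

Pitfall 1: Retries break idempotency

For tools with side effects (sending email, creating orders), retrying after a timeout can repeat the operation.

Rule: enable retries only for idempotent tools (queries). For non-idempotent tools (writes), only log the timeout and let the LLM know that "the operation may have executed, but its status is unconfirmed".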
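One way to apply this rule is a small per-tool retry policy, sketched below. The class name `ToolRetryPolicy`, the set of tools treated as idempotent, and the `unknown_outcome` error type are assumptions made for illustration; which tools are actually idempotent depends on your own system.

```java
import java.util.Set;

final class ToolRetryPolicy {
    private ToolRetryPolicy() {}

    // Assumption: these read-only tools are safe to call more than once
    private static final Set<String> IDEMPOTENT_TOOLS =
            Set.of("get_weather", "search_products", "execute_readonly_sql");

    private static final int MAX_RETRIES = 2;

    /** Idempotent tools get the full retry budget; all others get none. */
    static int maxRetriesFor(String toolName) {
        return IDEMPOTENT_TOOLS.contains(toolName) ? MAX_RETRIES : 0;
    }

    /** Builds the JSON reported to the LLM when a tool call times out. */
    static String timeoutPayloadFor(String toolName) {
        if (IDEMPOTENT_TOOLS.contains(toolName)) {
            return "{\"success\":false,\"error_type\":\"timeout\"}";
        }
        // Hypothetical error type: the write may or may not have happened
        return "{\"success\":false,\"error_type\":\"unknown_outcome\","
                + "\"message\":\"The operation may have executed, but its status is unconfirmed\"}";
    }
}
```

A retry loop would then use `maxRetriesFor(toolName)` as its upper bound instead of a single global constant, so that a tool such as `send_notification` is attempted exactly once.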

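Pitfall 2: A fixed timeout does not fit every scenario

A SQL query sometimes takes 0.1 seconds and sometimes 30 seconds, so a fixed timeout kills legitimate slow queries.

A better approach is to set the timeout dynamically from the historical P99 response time:

// Collect tool latency with Micrometer; percentiles must be enabled via publishPercentiles
Timer timer = Timer.builder("tool.execution").tag("tool", toolName)
        .publishPercentiles(0.99).register(meterRegistry);
double p99 = Arrays.stream(timer.takeSnapshot().percentileValues())
        .filter(v -> v.percentile() == 0.99)
        .mapToDouble(v -> v.value(TimeUnit.MILLISECONDS))  // already in milliseconds
        .findFirst().orElse(0);
long dynamicTimeout = (long) Math.max(p99 * 3, MIN_TIMEOUT_MS);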
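To make the arithmetic concrete, here is a library-free sketch of the same idea: take the P99 of recent latency samples, multiply by 3, and clamp the result. The class name, the floor and ceiling values, and the nearest-rank percentile method are assumptions for illustration; the ceiling in particular is an addition not present in the snippet above, included so that one pathological sample cannot push the timeout arbitrarily high.

```java
import java.util.Arrays;

final class DynamicTimeout {
    private DynamicTimeout() {}

    static final long MIN_TIMEOUT_MS = 1_000L;      // assumed floor
    static final long MAX_TIMEOUT_MS = 60_000L;     // assumed ceiling
    static final long DEFAULT_TIMEOUT_MS = 10_000L; // used when there is no history

    /** Nearest-rank percentile over the samples; p must be in (0, 1]. */
    static long percentile(long[] samplesMs, double p) {
        if (samplesMs.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        long[] sorted = samplesMs.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }

    /** Timeout = 3 x P99, clamped to [MIN_TIMEOUT_MS, MAX_TIMEOUT_MS]. */
    static long timeoutFor(long[] samplesMs) {
        if (samplesMs.length == 0) {
            return DEFAULT_TIMEOUT_MS;
        }
        long p99 = percentile(samplesMs, 0.99);
        return Math.min(Math.max(p99 * 3, MIN_TIMEOUT_MS), MAX_TIMEOUT_MS);
    }
}
```

For example, with samples of 100, 200, 300 and 400 ms the P99 is 400 ms, so the timeout becomes 1,200 ms. A tool whose P99 is 10 ms is raised to the 1,000 ms floor, and one whose P99 is 30 s is capped at 60 s.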

Pitfall 3: Retry storms, where all tools time out and retry at once

When a downstream service is unstable, many tool calls time out and retry at the same time, which puts even more pressure on the downstream and can cascade into an avalanche.

Fix: add a circuit breaker:

@Component
public class ToolCircuitBreaker {
    
    // Uses Resilience4j; one breaker per tool
    private final Map<String, CircuitBreaker> breakers = new ConcurrentHashMap<>();
    
    public CircuitBreaker getBreaker(String toolName) {
        return breakers.computeIfAbsent(toolName, name -> 
                CircuitBreaker.of(name, CircuitBreakerConfig.custom()
                        .failureRateThreshold(50)      // open the circuit at a 50% failure rate
                        .waitDurationInOpenState(Duration.ofSeconds(30))  // stay open for 30 s
                        .slidingWindowSize(10)          // measure over the last 10 calls
                        .build()));
    }
    
    public String executeWithBreaker(String toolName, Supplier<String> execution) {
        CircuitBreaker breaker = getBreaker(toolName);
        try {
            return breaker.executeSupplier(execution);
        } catch (CallNotPermittedException e) {
            // The circuit is open: fail fast without calling the downstream service
            return "{\"error\": \"service_circuit_open\", \"message\": \"The circuit breaker is open; please try again later\"}";
        }
    }
}

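Pitfall 4: The LLM hallucinates after seeing "tool_error"

When a tool returns an error, the LLM may "guess" the result (hallucinate) instead of telling the user truthfully that the tool failed.

The system prompt needs to state this explicitly:

When a tool returns an error (the error_type field is not null), you must tell the user directly that the tool call failed.
Do not guess or fabricate tool results. Describe the problem truthfully and offer alternative suggestions.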

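5. Summary and Further Reading

Key points for production-grade tool-calling reliability:

Failure type	Strategy
Timeout	Retry with exponential backoff (idempotent tools only)
Network error	Retry
Business error	Do not retry; return the error description to the LLM
System error	Fall back → alert
Service unavailable	Circuit breaker protection

Five layers of protection: timeout control → retry strategy → fallback strategy → circuit breaking → error alerting

Next up is the final technical article: integrating Function Call with database transactions, and rollback strategies for when tool execution fails.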