Post #1717: Designing Test Doubles for AI Calls: Choosing Between Mock, Stub, and Fake

Lao Zhang · 2026/4/30 · about a 9-minute read

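When I started out, a senior colleague told me something I have never forgotten: "How good a test is depends half on its assertions and half on how you isolate its dependencies."

In AI applications, the LLM is the hardest dependency to isolate. It is slow, expensive, non-deterministic, and prone to timeouts. If every unit test makes a real LLM call, the test suite turns into a nightmare.

So today let's take a careful look at how to design test doubles for AI calls, and where Mock, Stub, and Fake each belong.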

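1. The Test Double Family Tree

Let's separate these concepts first, because many people use them interchangeably:

Dummy: a placeholder that is never actually called. For example, a method signature requires an LlmClient but the test never reaches that code path, so passing null or an empty implementation is enough.
Stub: returns a canned response and verifies nothing. "Whatever you ask, I return this."
Mock: returns values and also verifies the interaction (whether it was called, how many times, and with which arguments).
Fake: a lightweight implementation with real logic, usually a simplified version written specifically to make testing easier.
Spy: wraps a real object as a transparent proxy and records call information along the way. This is the Mockito sense of the word; in Meszaros's definition, a Test Spy is a stub that also records how it was called.

In xUnit Test Patterns, Gerard Meszaros groups these five under the name Test Double. Mock is only one of them, but in practice people tend to call all of them "mocks".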
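To make the distinction concrete, here is a minimal hand-written sketch of four of these doubles in plain Java, without any mocking library. The LlmClient interface below is a simplified stand-in with a single method, and all class names are made up for illustration; a real client interface will look different.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for an LLM client interface (assumption, for illustration only).
interface LlmClient {
    String complete(String systemPrompt, String userPrompt);
}

// Dummy: fills a parameter slot and must never be called.
class DummyLlmClient implements LlmClient {
    public String complete(String systemPrompt, String userPrompt) {
        throw new AssertionError("a Dummy should never be called");
    }
}

// Stub: always returns the same canned response, verifies nothing.
class StubLlmClient implements LlmClient {
    private final String canned;

    StubLlmClient(String canned) {
        this.canned = canned;
    }

    public String complete(String systemPrompt, String userPrompt) {
        return canned;
    }
}

// Recording double: returns a canned response and remembers every user prompt,
// so the test can inspect the interaction afterwards.
class RecordingLlmClient implements LlmClient {
    final List<String> userPrompts = new ArrayList<>();
    private final String canned;

    RecordingLlmClient(String canned) {
        this.canned = canned;
    }

    public String complete(String systemPrompt, String userPrompt) {
        userPrompts.add(userPrompt);
        return canned;
    }
}

// Fake: a tiny but real piece of logic that derives the response from the input.
class KeywordFakeLlmClient implements LlmClient {
    public String complete(String systemPrompt, String userPrompt) {
        return userPrompt.contains("great") ? "positive" : "neutral";
    }
}
```

The only difference between these classes is how much behavior they carry, which is exactly the axis along which you choose a double.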

In AI applications, each kind of double fits a different situation.

2. Stub: The Simplest LLM Replacement

A Stub fits when all you care about is what a particular call returns, and nothing else.

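// Basic Stub usage with Mockito
import static org.assertj.core.api.Assertions.assertThat;
import static org.mockito.ArgumentMatchers.any;
import static org.mockito.Mockito.when;

import org.junit.jupiter.api.Test;
import org.junit.jupiter.api.extension.ExtendWith;
import org.mockito.InjectMocks;
import org.mockito.Mock;
import org.mockito.junit.jupiter.MockitoExtension;

@ExtendWith(MockitoExtension.class)
class SentimentServiceStubTest {
    
    @Mock
    private LlmClient llmClient;  // Used as a Stub here; Mockito calls it a Mock either way
    
    @InjectMocks
    private SentimentAnalysisService service;
    
    @Test
    void testPositiveSentimentParsing() {
        // Set up the Stub: return a value, verify nothing
        when(llmClient.complete(any(), any()))
            .thenReturn("""
                {
                    "sentiment": "positive",
                    "score": 0.92,
                    "keywords": ["好用", "推荐"],
                    "reasoning": "用户表达了强烈的满意情绪"
                }
                """);
        
        // Input: a clearly positive review ("This product is really great!")
        SentimentResult result = service.analyze("这产品真的太好了！");
        
        assertThat(result.getLabel()).isEqualTo("positive");
        assertThat(result.getScore()).isEqualTo(0.92);
    }
    
    @Test
    void testServiceHandlesLlmTimeout() {
        // Stub the timeout scenario
        when(llmClient.complete(any(), any()))
            .thenThrow(new LlmTimeoutException("LLM response timeout"));
        
        SentimentResult result = service.analyze("任意文本");
        
        // Verify the fallback (degraded) behavior
        assertThat(result.isFromCache()).isFalse();
        assertThat(result.getLabel()).isEqualTo("unknown");
        assertThat(result.getErrorMessage()).contains("timeout");
    }
}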

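When to use a Stub:

Testing how business logic handles a specific LLM response
Testing exception handling and fallback logic
Testing the response parser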
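A Stub does not have to come from Mockito. If the client interface has a single abstract method, a lambda is already a Stub. The sketch below assumes a simplified LlmClient and a made-up LabelService that falls back to "unknown" when the call fails; neither is the real service from this post.

```java
// Simplified stand-in for the client interface (assumption): one abstract method,
// so a lambda can act as a Stub.
@FunctionalInterface
interface LlmClient {
    String complete(String systemPrompt, String userPrompt);
}

// Hypothetical service: normalizes the LLM answer and degrades to "unknown" on failure.
class LabelService {
    private final LlmClient client;

    LabelService(LlmClient client) {
        this.client = client;
    }

    String label(String text) {
        try {
            String raw = client.complete("Classify the sentiment.", text);
            return raw.trim().toLowerCase();
        } catch (RuntimeException e) {
            return "unknown"; // fallback result when the LLM call fails
        }
    }
}

class StubDemo {
    public static void main(String[] args) {
        // Stub 1: a fixed, slightly messy response to exercise the parsing step.
        LlmClient happyStub = (system, user) -> " Positive \n";
        // Stub 2: always fails, to exercise the fallback path.
        LlmClient failingStub = (system, user) -> {
            throw new IllegalStateException("simulated timeout");
        };

        System.out.println(new LabelService(happyStub).label("any text"));   // prints "positive"
        System.out.println(new LabelService(failingStub).label("any text")); // prints "unknown"
    }
}
```

Neither stub checks how it was called. That is the point: the test only asserts on what the service does with the response.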

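3. Mock: Verifying How the AI Is Called

A Mock returns values and also verifies the interaction. In an AI application you might care about:

Does the prompt include the necessary context?
Is the timeout parameter set correctly?
Does the retry logic run as expected?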

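import static org.assertj.core.api.Assertions.assertThat;
import static org.assertj.core.api.Assertions.assertThatThrownBy;
import static org.mockito.ArgumentMatchers.any;
import static org.mockito.ArgumentMatchers.anyString;
import static org.mockito.ArgumentMatchers.argThat;
import static org.mockito.ArgumentMatchers.eq;
import static org.mockito.Mockito.times;
import static org.mockito.Mockito.verify;
import static org.mockito.Mockito.verifyNoInteractions;
import static org.mockito.Mockito.when;

import java.util.Optional;

@ExtendWith(MockitoExtension.class)
class LlmCallBehaviorTest {
    
    @Mock
    private LlmClient llmClient;
    
    @Mock
    private PromptBuilder promptBuilder;
    
    // Needed by the cache-hit test below; the original listing used it without declaring it
    @Mock
    private CacheService cacheService;
    
    @InjectMocks
    private AiAnalysisService service;
    
    // Verify how the prompt is built
    @Test
    void testPromptIncludesUserContext() {
        UserContext userContext = new UserContext("user-123", "zh", "professional");
        
        when(promptBuilder.buildAnalysisPrompt(any(), any())).thenReturn("built-prompt");
        when(llmClient.complete(any(), eq("built-prompt")))
            .thenReturn("{\"result\": \"ok\"}");
        
        service.analyze("分析这段文字", userContext);
        
        // Verify promptBuilder received the user context
        verify(promptBuilder).buildAnalysisPrompt(
            argThat(req -> req.getLanguage().equals("zh") 
                         && req.getTone().equals("professional")),
            any()
        );
    }
    
    // Verify the retry logic
    @Test
    void testRetryOnTransientFailure() {
        // Fail twice, succeed on the third attempt
        when(llmClient.complete(any(), any()))
            .thenThrow(new LlmTransientException("transient error"))
            .thenThrow(new LlmTransientException("transient error"))
            .thenReturn("{\"sentiment\": \"positive\", \"score\": 0.8}");
        
        SentimentResult result = service.analyze("测试文本");
        
        // Verify 3 calls in total (2 failures + 1 success)
        verify(llmClient, times(3)).complete(any(), any());
        assertThat(result.getLabel()).isEqualTo("positive");
    }
    
    // Verify the service gives up once the maximum number of attempts is reached
    @Test
    void testGivesUpAfterMaxRetries() {
        when(llmClient.complete(any(), any()))
            .thenThrow(new LlmTransientException("persistent failure"));
        
        assertThatThrownBy(() -> service.analyze("测试文本"))
            .isInstanceOf(AiServiceUnavailableException.class);
        
        // Verify at most 3 attempts were made (the number depends on configuration)
        verify(llmClient, times(3)).complete(any(), any());
    }
    
    // Verify that non-transient errors are not retried
    @Test
    void testNoPermanentErrorRetry() {
        when(llmClient.complete(any(), any()))
            .thenThrow(new LlmAuthException("authentication failed"));
        
        assertThatThrownBy(() -> service.analyze("测试文本"))
            .isInstanceOf(LlmAuthException.class);
        
        // An authentication error must not be retried
        verify(llmClient, times(1)).complete(any(), any());
    }
    
    // Verify that a cache hit skips the LLM call
    @Test
    void testCacheHitSkipsLlmCall() {
        // Make the cache return a result; cachedResult() is a test fixture helper (not shown)
        when(cacheService.get(anyString()))
            .thenReturn(Optional.of(cachedResult()));
        
        service.analyze("已缓存的文本");
        
        // Verify the LLM was never called
        verifyNoInteractions(llmClient);
    }
}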
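The retry tests above only make sense against a service that retries transient errors, rethrows permanent ones immediately, and gives up after a fixed number of attempts. As a rough sketch of the behavior those tests pin down (not the actual AiAnalysisService), the loop could look like this. The exception classes are local stand-ins that reuse the names from the tests, and RetryingCaller and the IllegalStateException it throws are assumptions for illustration.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Simplified stand-in for the client interface (assumption).
interface LlmClient {
    String complete(String systemPrompt, String userPrompt);
}

// Local stand-ins mirroring the exception names used in the tests.
class LlmTransientException extends RuntimeException {
    LlmTransientException(String message) { super(message); }
}

class LlmAuthException extends RuntimeException {
    LlmAuthException(String message) { super(message); }
}

// Hypothetical retry wrapper: retries transient errors up to maxAttempts in total.
class RetryingCaller {
    private final LlmClient client;
    private final int maxAttempts;

    RetryingCaller(LlmClient client, int maxAttempts) {
        this.client = client;
        this.maxAttempts = maxAttempts;
    }

    String call(String systemPrompt, String userPrompt) {
        LlmTransientException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return client.complete(systemPrompt, userPrompt);
            } catch (LlmTransientException e) {
                last = e; // transient: try again
            }
            // Any other exception (for example LlmAuthException) is not caught
            // and propagates on the first attempt.
        }
        throw new IllegalStateException("LLM unavailable after " + maxAttempts + " attempts", last);
    }
}

class RetryDemo {
    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();
        // Fails on calls 1 and 2, succeeds on call 3.
        LlmClient flaky = (system, user) -> {
            if (calls.incrementAndGet() < 3) {
                throw new LlmTransientException("temporary");
            }
            return "ok";
        };
        String result = new RetryingCaller(flaky, 3).call("sys", "text");
        System.out.println(result + " after " + calls.get() + " calls"); // prints "ok after 3 calls"
    }
}
```

Counting calls is meaningful here because the number of attempts is the behavior under test, not an incidental implementation detail.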

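4. Fake: A More Realistic LLM Replacement

A Fake is a lightweight implementation with real logic. Unlike a Stub, which returns the same thing no matter what, a Fake actually processes its input. For AI applications, a good Fake LLM can:

Return plausible responses based on keywords in the input
Simulate streaming output
Simulate randomness (in a controlled, reproducible form)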

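// A reasonably well-designed Fake LLM client
import java.time.Duration;
import java.time.Instant;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.regex.Pattern;

import lombok.AllArgsConstructor;
import lombok.Data;
import reactor.core.publisher.Flux;

public class FakeLlmClient implements LlmClient {
    
    // Keyword pattern -> response rules.
    // LinkedHashMap keeps insertion order, so rule matching is deterministic.
    private final Map<String, String> keywordResponses = new LinkedHashMap<>();
    
    // Injectable delay (simulates real network latency)
    private final Duration simulatedLatency;
    
    // Failure rate (0.0-1.0, simulates random failures)
    private final double failureRate;
    
    // Fixed seed so that "random" failures are reproducible across test runs
    private final Random random = new Random(42L);
    
    // Call log for test assertions (thread-safe, since services may call concurrently)
    private final List<LlmCall> callHistory = new CopyOnWriteArrayList<>();
    
    public FakeLlmClient(Duration latency, double failureRate) {
        this.simulatedLatency = latency;
        this.failureRate = failureRate;
        initDefaultResponses();
    }
    
    private void initDefaultResponses() {
        // Canned responses for the sentiment-analysis scenario (keywords are Chinese sentiment words)
        keywordResponses.put("积极|好|推荐|棒|优秀", 
            "{\"sentiment\":\"positive\",\"score\":0.85,\"keywords\":[\"好\"],\"reasoning\":\"积极词汇\"}");
        keywordResponses.put("差|烂|失望|退货|骗人", 
            "{\"sentiment\":\"negative\",\"score\":0.15,\"keywords\":[\"差\"],\"reasoning\":\"消极词汇\"}");
        // Default response
        keywordResponses.put(".*", 
            "{\"sentiment\":\"neutral\",\"score\":0.5,\"keywords\":[],\"reasoning\":\"无明显倾向\"}");
    }
    
    @Override
    public String complete(String systemPrompt, String userPrompt) {
        // Record the call
        LlmCall call = new LlmCall(systemPrompt, userPrompt, Instant.now());
        callHistory.add(call);
        
        // Simulate latency
        if (simulatedLatency != null && !simulatedLatency.isZero()) {
            try {
                Thread.sleep(simulatedLatency.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        
        // Simulate random failures (seeded, therefore repeatable)
        if (failureRate > 0 && random.nextDouble() < failureRate) {
            throw new LlmTransientException("Fake LLM simulated random failure");
        }
        
        // Pick a response based on the input
        return findBestResponse(userPrompt);
    }
    
    @Override
    public Flux<String> stream(String systemPrompt, String userPrompt) {
        String fullResponse = complete(systemPrompt, userPrompt);
        
        // Simulate streaming: emit one character at a time.
        // Splitting by code point keeps surrogate pairs (for example emoji) intact.
        String[] chars = fullResponse.codePoints()
            .mapToObj(cp -> new String(Character.toChars(cp)))
            .toArray(String[]::new);
        return Flux.fromArray(chars)
            .delayElements(Duration.ofMillis(10)); // 10ms delay per character
    }
    
    private String findBestResponse(String input) {
        String text = input == null ? "" : input;
        for (Map.Entry<String, String> entry : keywordResponses.entrySet()) {
            // find() matches anywhere in the prompt, including multi-line prompts,
            // which String.matches(".*(...).*") would miss
            if (!entry.getKey().equals(".*") && 
                Pattern.compile(entry.getKey()).matcher(text).find()) {
                return entry.getValue();
            }
        }
        return keywordResponses.get(".*"); // Fall back to the default response
    }
    
    // Test helper methods
    public List<LlmCall> getCallHistory() {
        return Collections.unmodifiableList(callHistory);
    }
    
    public void clearHistory() {
        callHistory.clear();
    }
    
    public int getCallCount() {
        return callHistory.size();
    }
    
    // Add a response rule at runtime
    public void addResponseRule(String keywordPattern, String response) {
        keywordResponses.put(keywordPattern, response);
    }
    
    // DTO for one recorded call
    @Data
    @AllArgsConstructor
    public static class LlmCall {
        private String systemPrompt;
        private String userPrompt;
        private Instant timestamp;
    }
}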
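One way to get the "controlled randomness" mentioned above is to drive failures from a seeded java.util.Random instead of Math.random(). A test run that fails can then be replayed exactly. The class below is a made-up helper for illustration, not part of the Fake shown in this post.

```java
import java.util.Random;

// Hypothetical helper: decides whether the next simulated call fails,
// using a seeded generator so the sequence is repeatable.
class SeededFailureSource {
    private final Random random;
    private final double failureRate;

    SeededFailureSource(long seed, double failureRate) {
        this.random = new Random(seed);
        this.failureRate = failureRate;
    }

    boolean nextCallFails() {
        // nextDouble() is in [0.0, 1.0), so a rate of 0.0 never fails and 1.0 always fails
        return random.nextDouble() < failureRate;
    }
}

class SeededDemo {
    public static void main(String[] args) {
        SeededFailureSource first = new SeededFailureSource(7L, 0.3);
        SeededFailureSource second = new SeededFailureSource(7L, 0.3);
        for (int i = 0; i < 5; i++) {
            // Same seed, same sequence: prints "true" five times
            System.out.println(first.nextCallFails() == second.nextCallFails());
        }
    }
}
```

When a flaky-looking test fails in CI, logging the seed is enough to reproduce the same failure pattern locally.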

Using the Fake in tests:

@SpringBootTest
class ServiceWithFakeLlmTest {
    
    // Replace the bean through Spring
    @TestConfiguration
    static class FakeLlmConfig {
        @Bean
        @Primary  // Overrides the real LlmClient bean
        public LlmClient fakeLlmClient() {
            return new FakeLlmClient(
                Duration.ofMillis(50),  // 50ms latency
                0.0                     // no simulated failures
            );
        }
    }
    
    @Autowired
    private SentimentAnalysisService service;
    
    @Autowired
    private LlmClient llmClient;  // The injected instance is the FakeLlmClient
    
    // The Fake is a shared bean, so reset its call log before each test;
    // otherwise the call-count assertion depends on test execution order
    @BeforeEach
    void resetFake() {
        ((FakeLlmClient) llmClient).clearHistory();
    }
    
    @Test
    void testWithFakeLlm() {
        SentimentResult result = service.analyze("这个产品真的太好了！");
        
        assertThat(result.getLabel()).isEqualTo("positive");
        
        // Verify behavior through the Fake's call history
        FakeLlmClient fake = (FakeLlmClient) llmClient;
        assertThat(fake.getCallCount()).isEqualTo(1);
        assertThat(fake.getCallHistory().get(0).getUserPrompt())
            .contains("这个产品真的太好了！");
    }
    
    @Test
    void testHighLatencyScenario() {
        // Test for the high-latency scenario
        FakeLlmClient slowFake = new FakeLlmClient(Duration.ofSeconds(5), 0.0);
        // Verify the timeout logic...
    }
}
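The high-latency test above leaves the timeout check open. As one possible shape for it, the sketch below runs a slow double behind a timeout and falls back to a default value. SlowLlmClient, TimeoutGuard, and the fallback string are all assumptions for illustration; how the real service enforces timeouts is not shown in this post.

```java
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Simplified stand-in for the client interface (assumption).
interface LlmClient {
    String complete(String systemPrompt, String userPrompt);
}

// Hypothetical slow double: sleeps for a fixed latency, then answers.
class SlowLlmClient implements LlmClient {
    private final Duration latency;

    SlowLlmClient(Duration latency) {
        this.latency = latency;
    }

    public String complete(String systemPrompt, String userPrompt) {
        try {
            Thread.sleep(latency.toMillis());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return "slow-response";
    }
}

// Hypothetical guard: returns the fallback if the call does not finish in time.
class TimeoutGuard {
    static String callWithTimeout(LlmClient client, String prompt,
                                  Duration timeout, String fallback) {
        CompletableFuture<String> future =
            CompletableFuture.supplyAsync(() -> client.complete("sys", prompt));
        try {
            return future.get(timeout.toMillis(), TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // stop waiting; the background task may still finish
            return fallback;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return fallback;
        } catch (ExecutionException e) {
            return fallback; // the call itself failed
        }
    }
}

class TimeoutDemo {
    public static void main(String[] args) {
        LlmClient slow = new SlowLlmClient(Duration.ofMillis(500));
        // 500ms latency against a 50ms budget: prints "fallback"
        System.out.println(TimeoutGuard.callWithTimeout(slow, "hi", Duration.ofMillis(50), "fallback"));
    }
}
```

Keeping the simulated latency only a little above the timeout keeps the test fast while still exercising the timeout path.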

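5. Spy: Wrapping Real LLM Calls

A Spy fits when you need to observe real calls while intercepting part of them. For example, you want to exercise the real LLM but record the token usage of every call: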

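@Component
public class ObservableLlmClient implements LlmClient {
    
    private final LlmClient delegate;
    private final MetricsCollector metrics;
    private final CallRecorder recorder;
    
    // Constructor injection; the final fields do not compile without it.
    // The qualifier picks the real client bean (named "realLlmClient" in the test below).
    public ObservableLlmClient(@Qualifier("realLlmClient") LlmClient delegate,
                               MetricsCollector metrics,
                               CallRecorder recorder) {
        this.delegate = delegate;
        this.metrics = metrics;
        this.recorder = recorder;
    }
    
    @Override
    public String complete(String systemPrompt, String userPrompt) {
        long start = System.currentTimeMillis();
        
        try {
            String response = delegate.complete(systemPrompt, userPrompt);
            
            long elapsed = System.currentTimeMillis() - start;
            metrics.recordSuccess(elapsed, countTokens(response));
            recorder.record(systemPrompt, userPrompt, response, elapsed);
            
            return response;
        } catch (Exception e) {
            metrics.recordFailure(e.getClass().getSimpleName());
            throw e;
        }
    }
    
    @Override
    public Flux<String> stream(String systemPrompt, String userPrompt) {
        // Streaming is passed through without recording in this sketch
        return delegate.stream(systemPrompt, userPrompt);
    }
    
    // Placeholder estimate (assumption: roughly 4 characters per token);
    // replace with the provider's tokenizer if you need accurate counts
    private int countTokens(String text) {
        return text == null ? 0 : Math.max(1, text.length() / 4);
    }
}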

// Partial interception in a test: real call plus recording
@Test
void testWithSpyLlm() {
    LlmClient realClient = applicationContext.getBean("realLlmClient", LlmClient.class);
    LlmClient spy = spy(realClient);
    
    // The call goes to the real client, and the spy records it
    service.setLlmClient(spy);
    service.analyze("测试文本");
    
    // Verify the real call happened
    verify(spy).complete(any(), contains("测试文本"));
}
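If you would rather not hand-write a wrapper per interface, the JDK's dynamic proxies can act as a small generic spy: every call is forwarded to the real object and logged. This is only a sketch under simplified assumptions (a single-method LlmClient stand-in and a plain string log); RecordingProxy is a made-up name.

```java
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Proxy;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified stand-in for the client interface (assumption).
interface LlmClient {
    String complete(String systemPrompt, String userPrompt);
}

// Hypothetical spy built on java.lang.reflect.Proxy.
class RecordingProxy {
    // Wraps a real client; each call is logged as "methodName[args]" and then forwarded.
    static LlmClient wrap(LlmClient real, List<String> log) {
        return (LlmClient) Proxy.newProxyInstance(
            LlmClient.class.getClassLoader(),
            new Class<?>[] { LlmClient.class },
            (proxy, method, args) -> {
                log.add(method.getName() + Arrays.toString(args));
                try {
                    return method.invoke(real, args);
                } catch (InvocationTargetException e) {
                    throw e.getCause(); // rethrow the real client's own exception
                }
            });
    }
}

class ProxySpyDemo {
    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        LlmClient real = (system, user) -> "real:" + user; // stands in for the real LLM client
        LlmClient spy = RecordingProxy.wrap(real, log);

        System.out.println(spy.complete("sys", "hello")); // prints "real:hello"
        System.out.println(log);                          // prints "[complete[sys, hello]]"
    }
}
```

Dynamic proxies only work for interfaces. For concrete classes, Mockito's spy() from the test above is the simpler route.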

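6. Choosing a Double for Each Scenario

Selection guidelines in practice:

Scenario	Recommended double	Reason
Testing response-parsing logic	Stub	Only fixed response data is needed
Testing whether the prompt is built correctly	Mock	Call arguments must be verified
Testing retry/fallback logic	Mock/Stub	Failure timing must be controlled precisely
Integration tests (without a real LLM)	Fake	Responses need some logic behind them
Performance tests	Fake	Latency and behavior are controllable
Debugging production issues	Spy	Real calls plus recording are needed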

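7. Avoiding Over-Mocking

Too much mocking does harm, especially in AI applications. Anti-patterns I have run into:

Anti-pattern 1: Mocking internal implementation details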

// Bad: mocks too many internal details
@Test
void badTest() {
    when(tokenizer.tokenize(any())).thenReturn(List.of("token1", "token2"));
    when(promptFormatter.format(any())).thenReturn("formatted-prompt");
    when(llmClient.complete(eq("system"), eq("formatted-prompt")))
        .thenReturn("response");
    
    service.analyze("text");
    
    verify(tokenizer).tokenize("text");
    verify(promptFormatter).format(any());
    verify(llmClient).complete(eq("system"), eq("formatted-prompt"));
    // This test checks the implementation, not the behavior
    // Any refactoring breaks it, even when the behavior is unchanged
}

// Better: mock only the boundary
@Test
void goodTest() {
    when(llmClient.complete(any(), any()))
        .thenReturn("{\"sentiment\":\"positive\",\"score\":0.9}");
    
    SentimentResult result = service.analyze("text");
    
    assertThat(result.getLabel()).isEqualTo("positive");
    // Only the externally visible behavior is verified, not how it is implemented
}

Anti-pattern 2: Being oversensitive to the number of LLM calls

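// Bad: verifying an exact call count can over-constrain the test
verify(llmClient, times(1)).complete(any(), any());
// If caching is added later and the call count changes, this test breaks
// even though the behavior is still correct

// Better: verify counts only when the count itself is the behavior
// For example, "must not be called"
verifyNoInteractions(llmClient);
// Or "called at least once"
verify(llmClient, atLeastOnce()).complete(any(), any());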

8. Managing and Reusing Test Doubles

As a project grows, so does the amount of test-double code, and it needs to be managed in one place:

// Test double factory
import static org.mockito.ArgumentMatchers.any;
import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.when;

import java.time.Duration;
import java.util.concurrent.atomic.AtomicInteger;

public class TestDoubleFactory {
    
    // Standard Stub LLM (returns a fixed sentiment-analysis result)
    public static LlmClient standardSentimentStub() {
        LlmClient stub = mock(LlmClient.class);
        when(stub.complete(any(), any())).thenReturn(
            "{\"sentiment\":\"positive\",\"score\":0.75}"
        );
        return stub;
    }
    
    // Stub that always fails (for testing fallback)
    public static LlmClient alwaysFailingStub() {
        LlmClient stub = mock(LlmClient.class);
        when(stub.complete(any(), any()))
            .thenThrow(new LlmServiceException("simulated service outage"));
        return stub;
    }
    
    // Stub that fails intermittently (for testing retries):
    // fails until the given attempt number, then succeeds
    public static LlmClient intermittentFailureStub(int successOnAttempt) {
        LlmClient stub = mock(LlmClient.class);
        AtomicInteger callCount = new AtomicInteger(0);
        
        when(stub.complete(any(), any())).thenAnswer(inv -> {
            int attempt = callCount.incrementAndGet();
            if (attempt < successOnAttempt) {
                throw new LlmTransientException("attempt " + attempt + " failed");
            }
            return "{\"result\":\"success\"}";
        });
        return stub;
    }
    
    // Slow-response Stub (for testing timeouts)
    public static LlmClient slowStub(Duration delay) {
        LlmClient stub = mock(LlmClient.class);
        when(stub.complete(any(), any())).thenAnswer(inv -> {
            Thread.sleep(delay.toMillis());
            return "{\"result\":\"ok\"}";
        });
        return stub;
    }
    
    // Full-featured Fake (for integration tests)
    public static FakeLlmClient fullFeaturedFake() {
        return new FakeLlmClient(Duration.ofMillis(100), 0.0);
    }
}

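Summary

The principle for using test doubles in AI applications, in one sentence: use the simplest double that serves the purpose of the test, do not over-mock, and do not mock implementation details.

More specifically:

In most cases a Stub is enough
Use a Mock when you need to verify how something was called (prompt construction in particular)
Use a Fake for integration tests; it is more realistic than a Stub and far cheaper than a real LLM
Keep the Spy for debugging scenarios

The most important principle: test doubles exist to keep a test focused on the thing you actually want to test. Do not let them grow into yet another complex system you have to maintain.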