Article 1716: An End-to-End Testing Strategy for AI Applications: Integrating Real Models with Testcontainers

Lao Zhang · 2026/4/30 · about 8 min read

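Everyone talks about end-to-end testing, but for AI applications it comes with an awkward question: do you hook up a real LLM or not? If you do, the tests are expensive, slow, and flaky. If you don't, you are still testing against a mock, which isn't the real thing.

I have tried different approaches across several projects and ended up with a strategy that works reasonably well. The core idea: run a small local model under Testcontainers as a stand-in for the real LLM, and look for a balance between costing nothing, being repeatable, and being realistic enough.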

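1. What Makes End-to-End Testing of AI Applications Hard

Traditional end-to-end testing already has plenty of challenges: environment isolation, data setup, test speed. AI applications add a few more on top:

Challenge 1: LLM non-determinism. The same input can produce a different output on every call (sampling is controlled by the temperature parameter). The traditional "expected equals actual" assertion simply doesn't work.

Challenge 2: Cost. A single end-to-end test case that calls the real GPT-4 can cost a few cents. Run 1000 cases and that is tens of dollars. Run CI often enough and the bill gets ugly.

Challenge 3: Network dependency. The tests depend on an external API, so they fail at random when the network is unstable, and CI can no longer be trusted.

Challenge 4: Speed. A single GPT-4 request takes anywhere from a few seconds to tens of seconds, so 1000 end-to-end cases can take hours.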
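
To make the cost and speed arithmetic concrete, here is a rough sketch. The 3 cents per call and 10 seconds per call are numbers I picked for illustration, not measurements, and the class name `SuiteEstimate` is made up for this example.

```java
import java.time.Duration;

// Back-of-the-envelope estimate for an E2E suite that calls a hosted LLM sequentially.
final class SuiteEstimate {
    private SuiteEstimate() {}

    /** Total API cost in US cents, given an assumed per-call price in cents. */
    static long totalCostCents(int testCases, long centsPerCall) {
        return (long) testCases * centsPerCall;
    }

    /** Total wall-clock time if the cases run one after another. */
    static Duration totalDuration(int testCases, Duration perCall) {
        return perCall.multipliedBy(testCases);
    }

    public static void main(String[] args) {
        int cases = 1000;
        long cents = totalCostCents(cases, 3);                         // assumed: 3 cents per call
        Duration time = totalDuration(cases, Duration.ofSeconds(10));  // assumed: 10 s per call
        System.out.printf("cost: $%d.%02d, time: %d min%n",
                cents / 100, cents % 100, time.toMinutes());
    }
}
```

With these assumed numbers a single run of the suite costs about 30 dollars and takes a bit under three hours, which is why running it on every commit is not realistic.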

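The combination of Testcontainers and a small local model solves challenges 2, 3, and 4, and it also eases the non-determinism problem (a local model is relatively stable at temperature=0).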
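
As a minimal sketch of what "temperature=0" looks like on the wire, the helper below builds a request for Ollama's `/api/generate` endpoint with `temperature` set to 0 and a fixed `seed` in `options`. The class name `DeterministicOllamaRequest`, the seed value, and the hand-rolled JSON escaping are my own simplifications; a real project would use a JSON library.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Builds (but does not send) a generate request that pins the sampling parameters.
final class DeterministicOllamaRequest {
    private DeterministicOllamaRequest() {}

    /** JSON body with temperature 0 and a fixed seed to reduce run-to-run variation. */
    static String body(String model, String prompt, int seed) {
        return "{"
                + "\"model\":\"" + escape(model) + "\","
                + "\"prompt\":\"" + escape(prompt) + "\","
                + "\"stream\":false,"
                + "\"options\":{\"temperature\":0,\"seed\":" + seed + "}"
                + "}";
    }

    /** Minimal escaping: backslashes, double quotes, and newlines only. */
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"").replace("\n", "\\n");
    }

    /** POST request against baseUrl + /api/generate, ready for java.net.http.HttpClient. */
    static HttpRequest build(String baseUrl, String model, String prompt, int seed) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/api/generate"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body(model, prompt, seed)))
                .build();
    }
}
```

Even with both parameters pinned, I would still treat the output as "mostly stable" rather than byte-for-byte identical across machines, which is why the assertions later in this article check structure and ranges instead of exact text.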

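2. The Testcontainers + Ollama Approach

Ollama is a tool for running open-source LLMs locally. It ships an official Docker image, which makes it a good fit for integration tests.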

<!-- Dependency configuration -->
<dependency>
    <groupId>org.testcontainers</groupId>
    <artifactId>testcontainers</artifactId>
    <version>1.19.7</version>
    <scope>test</scope>
</dependency>
<dependency>
    <groupId>org.testcontainers</groupId>
    <artifactId>junit-jupiter</artifactId>
    <version>1.19.7</version>
    <scope>test</scope>
</dependency>
<!-- Ollama module (part of the Testcontainers project; needs 1.19.7 or later) -->
<dependency>
    <groupId>org.testcontainers</groupId>
    <artifactId>ollama</artifactId>
    <version>1.19.7</version>
    <scope>test</scope>
</dependency>

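3. Basic Ollama Container Setup

import java.util.Map;
import org.junit.jupiter.api.BeforeAll;
import org.springframework.web.client.RestTemplate;
import org.testcontainers.containers.Container.ExecResult;
import org.testcontainers.ollama.OllamaContainer;

// Ollama container shared by every E2E test class, so it is not restarted per class
public abstract class AbstractE2ETest {

    // Singleton-container pattern: the container is started manually rather than via @Container.
    // The JUnit extension stops @Container fields after each test class, which defeats withReuse(true).
    // Reuse across runs also needs testcontainers.reuse.enable=true in ~/.testcontainers.properties.
    static final OllamaContainer ollama = new OllamaContainer("ollama/ollama:latest")
            .withReuse(true);

    static {
        ollama.start();
    }

    // Runs once per test class: pull the model (a no-op if it is already present) and warm it up
    @BeforeAll
    static void initModel() throws Exception {
        // Pull a lightweight model that suits end-to-end tests
        // llama3.2:3b is roughly 2GB and fast at inference
        ExecResult result = ollama.execInContainer("ollama", "pull", "llama3.2:3b");
        if (result.getExitCode() != 0) {
            throw new RuntimeException("Failed to pull model: " + result.getStderr());
        }

        // Warm up the model (the first inference loads it into memory and is slow)
        warmupModel(getOllamaBaseUrl());
    }

    private static void warmupModel(String baseUrl) {
        // Send one trivial request so later tests don't pay the model-loading cost
        RestTemplate rt = new RestTemplate();
        Map<String, Object> req = Map.of(
            "model", "llama3.2:3b",
            "prompt", "hello",
            "stream", false
        );
        rt.postForObject(baseUrl + "/api/generate", req, Map.class);
    }

    protected static String getOllamaBaseUrl() {
        return "http://" + ollama.getHost() + ":" + ollama.getMappedPort(11434);
    }
}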

4. Integrating Testcontainers with Spring Boot

@SpringBootTest(webEnvironment = SpringBootTest.WebEnvironment.RANDOM_PORT)
@Testcontainers
class AiServiceE2ETest extends AbstractE2ETest {

    @LocalServerPort
    private int port;
    
    @DynamicPropertySource
    static void configureProperties(DynamicPropertyRegistry registry) {
        // Point the application's LLM configuration at the Ollama container
        registry.add("ai.llm.base-url", 
            () -> "http://" + ollama.getHost() + ":" + ollama.getMappedPort(11434));
        registry.add("ai.llm.model", () -> "llama3.2:3b");
        registry.add("ai.llm.temperature", () -> "0.0");  // fixed temperature for more deterministic output
    }
    
    @Container
    static PostgreSQLContainer<?> postgres = new PostgreSQLContainer<>("postgres:16")
            .withDatabaseName("testdb")
            .withUsername("test")
            .withPassword("test");
    
    @Container
    static GenericContainer<?> redis = new GenericContainer<>("redis:7-alpine")
            .withExposedPorts(6379);
    
    @DynamicPropertySource
    static void configureDbAndCache(DynamicPropertyRegistry registry) {
        registry.add("spring.datasource.url", postgres::getJdbcUrl);
        registry.add("spring.datasource.username", postgres::getUsername);
        registry.add("spring.datasource.password", postgres::getPassword);
        // Spring Boot 2.x property names; Spring Boot 3.x uses spring.data.redis.host / spring.data.redis.port
        registry.add("spring.redis.host", redis::getHost);
        registry.add("spring.redis.port", () -> redis.getMappedPort(6379));
    }
    
    @Autowired
    private TestRestTemplate restTemplate;

    // Full end-to-end test: HTTP request, then the LLM call, then database storage
    @Test
    void testFullSentimentAnalysisFlow() {
        // 1. Submit the analysis request
        AnalysisRequest request = new AnalysisRequest();
        // Input text, roughly: "This AI tool really saved me a lot of time, highly recommended!"
        request.setText("这个AI工具真的帮我省了很多时间，强烈推荐！");
        request.setLanguage("zh");
        
        ResponseEntity<AnalysisResponse> response = restTemplate.postForEntity(
            "/api/v1/analyze",
            request,
            AnalysisResponse.class
        );
        
        // 2. Verify the structure of the HTTP response
        assertThat(response.getStatusCode()).isEqualTo(HttpStatus.OK);
        AnalysisResponse body = response.getBody();
        assertThat(body).isNotNull();
        assertThat(body.getRequestId()).isNotBlank();
        
        // 3. Verify constraints on the AI result (structure and ranges, not exact values)
        assertThat(body.getSentiment()).isIn("positive", "negative", "neutral");
        assertThat(body.getScore()).isBetween(0.0, 1.0);
        assertThat(body.getSummary()).isNotBlank();
        
        // 4. Verify persistence (the result can be fetched back from the database)
        ResponseEntity<AnalysisResponse> fetchResponse = restTemplate.getForEntity(
            "/api/v1/analyze/" + body.getRequestId(),
            AnalysisResponse.class
        );
        assertThat(fetchResponse.getStatusCode()).isEqualTo(HttpStatus.OK);
        assertThat(fetchResponse.getBody()).isNotNull();
        assertThat(fetchResponse.getBody().getRequestId()).isEqualTo(body.getRequestId());
    }
}
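
The "check constraints, not values" idea from step 3 can also be pulled into a small reusable checker, so a failing test reports every broken constraint at once. This is only a sketch: `AnalysisContract` is a hypothetical helper, and the three rules simply mirror the assertions above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Structural contract for a sentiment-analysis result: checks shape and ranges, never exact wording.
final class AnalysisContract {
    private static final Set<String> SENTIMENTS = Set.of("positive", "negative", "neutral");

    private AnalysisContract() {}

    /** Returns a human-readable list of violated constraints; empty means the result is well-formed. */
    static List<String> violations(String sentiment, double score, String summary) {
        List<String> problems = new ArrayList<>();
        if (sentiment == null || !SENTIMENTS.contains(sentiment)) {
            problems.add("sentiment must be positive, negative, or neutral but was: " + sentiment);
        }
        if (Double.isNaN(score) || score < 0.0 || score > 1.0) {
            problems.add("score must be within [0.0, 1.0] but was: " + score);
        }
        if (summary == null || summary.isBlank()) {
            problems.add("summary must not be blank");
        }
        return problems;
    }
}
```

In a test this would be used as something like `assertThat(AnalysisContract.violations(body.getSentiment(), body.getScore(), body.getSummary())).isEmpty()`, which prints the full list of problems when a small model returns something malformed.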

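5. A More Complex End-to-End Scenario: the RAG Flow

End-to-end testing of RAG (retrieval-augmented generation) is considerably more involved, because a vector database enters the picture: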

@SpringBootTest(webEnvironment = SpringBootTest.WebEnvironment.RANDOM_PORT)
@Testcontainers
class RagE2ETest extends AbstractE2ETest {

    // Use pgvector as the vector database
    @Container
    static PostgreSQLContainer<?> pgvector = new PostgreSQLContainer<>("pgvector/pgvector:pg16")
            .withDatabaseName("ragdb")
            .withUsername("test")
            .withPassword("test");
    
    @DynamicPropertySource
    static void configure(DynamicPropertyRegistry registry) {
        registry.add("ai.llm.base-url", 
            () -> "http://" + ollama.getHost() + ":" + ollama.getMappedPort(11434));
        registry.add("ai.llm.model", () -> "llama3.2:3b");
        registry.add("spring.datasource.url", pgvector::getJdbcUrl);
        // other settings...
    }
    
    @Autowired
    private TestRestTemplate restTemplate;
    
    @Autowired
    private DocumentIndexer documentIndexer;
    
    // Prepare the test data first: build the knowledge base
    @BeforeEach
    void setupKnowledgeBase() {
        // Index a couple of test documents
        // doc-001, "Company AI policy": external AI services must not process sensitive data
        // without approval, and every AI application must pass a security review
        documentIndexer.index(TestDocument.builder()
            .id("doc-001")
            .title("公司AI政策")
            .content("公司禁止在未经审批的情况下使用外部AI服务处理敏感数据。" +
                     "所有AI应用必须通过安全评审。")
            .build());
            
        // doc-002, "Data classification": levels L1 (public), L2 (internal), L3 (confidential);
        // L3 data must not leave the country
        documentIndexer.index(TestDocument.builder()
            .id("doc-002")
            .title("数据分类规范")
            .content("数据按敏感程度分为L1（公开）、L2（内部）、L3（机密）三级。" +
                     "L3数据不得出境。")
            .build());
        
        // Wait until both documents are indexed
        await().atMost(10, SECONDS).until(() ->
            documentIndexer.isIndexed("doc-001") && documentIndexer.isIndexed("doc-002"));
    }
    
    @Test
    void testRagQueryFlow() {
        // Send a RAG query, roughly: "What are the company's rules on using AI tools?"
        RagQueryRequest request = RagQueryRequest.builder()
                .query("公司对AI工具的使用有什么规定？")
                .topK(3)
                .build();
        
        ResponseEntity<RagQueryResponse> response = restTemplate.postForEntity(
            "/api/v1/rag/query",
            request,
            RagQueryResponse.class
        );
        
        assertThat(response.getStatusCode()).isEqualTo(HttpStatus.OK);
        
        RagQueryResponse body = response.getBody();
        assertThat(body.getAnswer()).isNotBlank();
        
        // Verify the cited sources (doc-001 should be among them)
        assertThat(body.getSources())
                .extracting(SourceDocument::getDocumentId)
                .contains("doc-001");
        
        // Verify relevance: the answer should mention at least one key concept
        // ("审批" = approval, "安全" = security)
        String answer = body.getAnswer().toLowerCase();
        boolean containsRelevantInfo = answer.contains("审批") || 
                                        answer.contains("安全") || 
                                        answer.contains("ai");
        assertThat(containsRelevantInfo)
                .as("The RAG answer should contain content related to the AI usage policy")
                .isTrue();
    }
    
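    @Test
    void testRagWithIrrelevantQuery() {
        // Ask about something the knowledge base does not cover ("How is the weather today?")
        RagQueryRequest request = RagQueryRequest.builder()
                .query("今天天气怎么样？")
                .topK(3)
                .build();
        
        ResponseEntity<RagQueryResponse> response = restTemplate.postForEntity(
            "/api/v1/rag/query",
            request,
            RagQueryResponse.class
        );
        
        assertThat(response.getStatusCode()).isEqualTo(HttpStatus.OK);
        
        RagQueryResponse body = response.getBody();
        
        // For questions outside the knowledge base the system should say so gracefully (no making things up).
        // The empty-sources check only holds if the application drops chunks below a similarity threshold;
        // a plain top-K search returns K results no matter how irrelevant they are.
        assertThat(body.getConfidence()).isLessThan(0.5);
        assertThat(body.getSources()).isEmpty();
    }
}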
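
The keyword check in `testRagQueryFlow` can be made a little easier to maintain with a helper that also reports which keywords matched. A minimal sketch, assuming a hypothetical `KeywordCheck` class; keyword matching is a crude proxy for relevance, not a quality metric.

```java
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

// Case-insensitive keyword matching for loose relevance assertions on LLM answers.
final class KeywordCheck {
    private KeywordCheck() {}

    /** Returns the expected keywords that occur in the answer, in the order they were given. */
    static List<String> matched(String answer, List<String> keywords) {
        String haystack = answer.toLowerCase(Locale.ROOT);
        return keywords.stream()
                .filter(k -> haystack.contains(k.toLowerCase(Locale.ROOT)))
                .collect(Collectors.toList());
    }

    /** True if at least one expected keyword occurs in the answer. */
    static boolean containsAny(String answer, List<String> keywords) {
        return !matched(answer, keywords).isEmpty();
    }
}
```

In the test above the call would look like `KeywordCheck.containsAny(body.getAnswer(), List.of("审批", "安全", "ai"))`. Note that a keyword such as "ai" is a weak signal here, because the model tends to echo it from the question.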

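6. A Layered Strategy for End-to-End Tests

Not every scenario needs an end-to-end test, so the suite is split into layers (L1 to L4).

The corresponding JUnit tags and CI strategy:

// Mark each layer of tests with its own tag
@Tag("unit")       // L1: runs on every commit
@Tag("integration") // L2: runs before a PR is merged
@Tag("e2e")        // L3: runs in the daily build
@Tag("acceptance") // L4: triggered manually before a release

// Filter by tag in CI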
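# GitHub Actions: run the layers in sequence
name: CI

# Triggers are assumed here; adjust them to your branching model
on: [push, pull_request]

jobs:
  unit-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run Unit Tests
        run: mvn test -Dgroups="unit"
  
  integration-tests:
    runs-on: ubuntu-latest
    needs: unit-tests
    steps:
      - uses: actions/checkout@v4
      - name: Run Integration Tests
        run: mvn test -Dgroups="integration"
  
  e2e-tests:
    runs-on: ubuntu-latest
    needs: integration-tests
    if: github.ref == 'refs/heads/main'  # run only on the main branch
    steps:
      - uses: actions/checkout@v4
      
      - name: Set up Docker
        uses: docker/setup-buildx-action@v3
      
      # Pull the Ollama image up front so the test run itself starts faster
      - name: Pull Ollama Image
        run: docker pull ollama/ollama:latest
      
      - name: Run E2E Tests with Ollama
        run: mvn test -Dgroups="e2e"
        env:
          # Testcontainers reads the reuse flag from the environment or from
          # ~/.testcontainers.properties, not from a Maven -D property.
          # Reuse only pays off on runners that persist between jobs.
          TESTCONTAINERS_REUSE_ENABLE: "true"
          TESTCONTAINERS_RYUK_DISABLED: "true"  # disable Ryuk in CI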

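7. Caching Container Images and Models

Downloading several GB of model files on every CI run is far too slow. Mounting a model cache directory from the host into the container solves it: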

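import com.github.dockerjava.api.command.InspectContainerResponse;
import java.util.Arrays;
import java.util.List;
import org.testcontainers.containers.BindMode;
import org.testcontainers.containers.Container.ExecResult;
import org.testcontainers.ollama.OllamaContainer;

// Custom Ollama container that makes sure a set of models is available after startup
public class PreloadedOllamaContainer extends OllamaContainer {
    
    private final List<String> modelsToPreload;
    
    public PreloadedOllamaContainer(String... models) {
        super("ollama/ollama:latest");
        this.modelsToPreload = Arrays.asList(models);
        
        // Mount the local model cache directory (it must already exist on the host)
        String modelCacheDir = System.getenv().getOrDefault(
            "OLLAMA_MODEL_CACHE", 
            System.getProperty("user.home") + "/.ollama/models"
        );
        
        withFileSystemBind(
            modelCacheDir,
            "/root/.ollama/models",
            BindMode.READ_WRITE
        );
    }
    
    @Override
    protected void containerIsStarted(InspectContainerResponse containerInfo) {
        super.containerIsStarted(containerInfo);
        // After startup, pull only the models that the mounted cache does not already contain.
        // "ollama run" is not a reliable probe here: it pulls missing models on its own.
        try {
            ExecResult list = execInContainer("ollama", "list");
            for (String model : modelsToPreload) {
                if (list.getStdout().contains(model)) {
                    continue;  // already in the cache, no download needed
                }
                ExecResult pull = execInContainer("ollama", "pull", model);
                if (pull.getExitCode() != 0) {
                    throw new IllegalStateException("Failed to pull model " + model + ": " + pull.getStderr());
                }
            }
        } catch (Exception e) {
            throw new RuntimeException("Model preloading failed: " + modelsToPreload, e);
        }
    }
}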

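// Using the container with preloaded models
public abstract class AbstractE2ETest {
    
    static final PreloadedOllamaContainer ollama = new PreloadedOllamaContainer(
        "llama3.2:3b",     // main model
        "nomic-embed-text" // embedding model (used for RAG)
    );

    static {
        // withReuse(...) is declared to return OllamaContainer, so it cannot be chained
        // into a PreloadedOllamaContainer field; call it separately and start the container once.
        ollama.withReuse(true);
        ollama.start();
    }
}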
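
Deciding whether a model is already in the cache comes down to reading the output of `ollama list`. Here is a small sketch of a stricter check than a substring match. It assumes the output is a table whose header row starts with `NAME` and whose first column is `name:tag`; `OllamaListParser` is a hypothetical helper, not part of Testcontainers or Ollama.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Parses the first column of `ollama list` output to find out which models are installed.
final class OllamaListParser {
    private OllamaListParser() {}

    /** Returns the model names (including tag) found in the first column, skipping the header. */
    static Set<String> installedModels(String listOutput) {
        Set<String> names = new LinkedHashSet<>();
        for (String rawLine : listOutput.split("\\R")) {
            String line = rawLine.trim();
            if (line.isEmpty()) {
                continue;
            }
            String firstColumn = line.split("\\s+")[0];
            if (!firstColumn.equals("NAME")) {  // assumed header label
                names.add(firstColumn);
            }
        }
        return names;
    }

    /** A model given without a tag is treated as name:latest, which is how Ollama resolves it. */
    static boolean isInstalled(Set<String> installed, String model) {
        String wanted = model.contains(":") ? model : model + ":latest";
        return installed.contains(wanted);
    }
}
```

An exact match avoids false positives such as `llama3.2:3b` matching a cached `llama3.2:3b-instruct` variant, which a plain `contains` check would accept.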

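8. Test Isolation and Data Cleanup

One of the most annoying problems in end-to-end testing is data pollution, where data left behind by one test affects the next: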

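@SpringBootTest(webEnvironment = SpringBootTest.WebEnvironment.RANDOM_PORT)
// No @Transactional here: with RANDOM_PORT the HTTP requests are handled on server threads
// in their own transactions, so a test-managed rollback would not undo what they write.
class IsolatedE2ETest extends AbstractE2ETest {
    
    @Autowired
    private DatabaseCleaner databaseCleaner;
    
    @Autowired
    private CacheManager cacheManager;
    
    // Hypothetical: assumes indexed documents carry a "test_run" metadata field with this ID
    private final String currentTestRunId = UUID.randomUUID().toString();
    
    @BeforeEach
    void cleanUp() {
        // Reset the database and caches explicitly before every test
        databaseCleaner.cleanAllTables();
        cacheManager.getCacheNames().forEach(name -> 
            cacheManager.getCache(name).clear()
        );
    }
    
    // The vector store has no transactions to roll back, so clean it up by hand
    @AfterEach
    void cleanVectorStore(@Autowired VectorStore vectorStore) {
        // Assumes Spring AI's VectorStore, whose delete(Filter.Expression) removes matching documents
        vectorStore.delete(new FilterExpressionBuilder().eq("test_run", currentTestRunId).build());
    }
}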

// Data cleanup utility
@Component
@Profile("test")
public class DatabaseCleaner {
    
    @Autowired
    private JdbcTemplate jdbcTemplate;
    
    private static final List<String> TABLES_TO_CLEAN = List.of(
        "ai_analysis_results",
        "conversation_history",
        "document_chunks",
        "user_sessions"
    );
    
    @Transactional(propagation = Propagation.REQUIRES_NEW)
    public void cleanAllTables() {
        // One TRUNCATE over all tables; CASCADE already takes care of foreign keys,
        // so toggling SET CONSTRAINTS around it is not needed (PostgreSQL syntax).
        jdbcTemplate.execute(
            "TRUNCATE TABLE " + String.join(", ", TABLES_TO_CLEAN) + " RESTART IDENTITY CASCADE"
        );
    }
}
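
Because the TRUNCATE statement is assembled by string concatenation, it is worth validating the table names if the list ever comes from configuration rather than a constant. A minimal sketch, assuming PostgreSQL and a hypothetical `TruncateSql` helper that only accepts plain lowercase identifiers:

```java
import java.util.List;
import java.util.regex.Pattern;

// Builds a single PostgreSQL TRUNCATE statement from a whitelist-checked list of table names.
final class TruncateSql {
    // Plain unquoted identifiers only: lowercase letters, digits, underscores
    private static final Pattern IDENTIFIER = Pattern.compile("[a-z_][a-z0-9_]*");

    private TruncateSql() {}

    static String forTables(List<String> tables) {
        if (tables.isEmpty()) {
            throw new IllegalArgumentException("at least one table is required");
        }
        for (String table : tables) {
            if (!IDENTIFIER.matcher(table).matches()) {
                throw new IllegalArgumentException("not a plain table name: " + table);
            }
        }
        return "TRUNCATE TABLE " + String.join(", ", tables) + " RESTART IDENTITY CASCADE";
    }
}
```

`DatabaseCleaner` could then call `jdbcTemplate.execute(TruncateSql.forTables(TABLES_TO_CLEAN))` and fail fast on a malformed name instead of sending it to the database.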

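9. Limitations of Small Local Models

End-to-end testing with Ollama and a small model has one fundamental limitation that needs to be stated plainly: small models (3B/7B) are far less capable than GPT-4.

When this limitation is acceptable:

You are testing system flow and integration, not the AI's intelligence
The input and output formats are simple enough for a small model to handle
You only need to verify structural correctness, not content quality

When a real model is required:

Verifying complex reasoning (multi-step logic, math)
Verifying code generation quality
Verifying context understanding in multi-turn conversations

So my advice is: let the Ollama end-to-end tests cover flow correctness, and let the tests against the real model (L4) cover capability and quality. The two complement each other; neither replaces the other.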

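Summary

There is no silver bullet for end-to-end testing of AI applications, but the Testcontainers + Ollama combination has held up well in practice:

Free: it consumes no tokens
Repeatable: results are reasonably stable at temperature=0
Predictable speed: no remote API latency or rate limits, and typically much faster than calling a hosted LLM
Well isolated: no dependency on external services

The key is to be clear about what you are testing: system integration and flow, or the AI capability itself? A local model is enough for the former; the latter needs the real model.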